Closed Bug 1251580 Opened 10 years ago Closed 8 years ago

[meta] Firefox Data Platform

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

Details

User Story

This bug provides a big-picture view of work happening on various fronts. Note: most of the items mentioned in the User Story have landed. This meta bug is now mostly used to track refinements.

- Define a schema for Telemetry pings and use it to validate incoming data.
This will benefit all downstream analyses and ETL jobs, since data currently has to be cleaned over and over again.
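
A minimal sketch of validating pings at ingestion, assuming a heavily simplified schema. The field names and types in `PING_SCHEMA`, and the helpers `validate_ping` and `partition_pings`, are illustrative assumptions rather than the actual Telemetry schema or pipeline code:

```python
# Hypothetical, simplified ping schema: field name -> required Python type.
# The real Telemetry ping schema is not given here; these fields are illustrative.
PING_SCHEMA = {
    "type": str,
    "id": str,
    "creationDate": str,
    "payload": dict,
}


def validate_ping(ping, schema=PING_SCHEMA):
    """Return a list of problems found in `ping`; an empty list means it passed."""
    if not isinstance(ping, dict):
        return ["ping is not a JSON object"]
    errors = []
    for field, expected in schema.items():
        if field not in ping:
            errors.append(f"missing field: {field}")
        elif not isinstance(ping[field], expected):
            errors.append(
                f"field {field} should be {expected.__name__}, "
                f"got {type(ping[field]).__name__}"
            )
    return errors


def partition_pings(pings):
    """Split incoming pings into (valid, rejected) so downstream jobs see clean data."""
    valid, rejected = [], []
    for ping in pings:
        errors = validate_ping(ping)
        if errors:
            rejected.append((ping, errors))
        else:
            valid.append(ping)
    return valid, rejected
```

Rejecting malformed pings once, at the edge, is what would spare every downstream job from repeating the same cleaning step.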

- Use Spark to process raw data
We will keep storing raw data as JSON, and our moztelemetry API will remain the go-to solution for analyses that can’t be run on our derived datasets. Some work is needed to make the API more user-friendly and to smooth out its rough edges.

- Use Parquet as the common representation for our derived datasets.
Parquet is a columnar storage format that can be read efficiently by various engines in the big-data ecosystem, in particular Presto and Spark. Of particular note is the derived longitudinal dataset, which will let users run analyses that were previously based on FHR data.
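
To illustrate why a columnar layout suits analytical reads, here is a small pure-Python sketch. It is not Parquet itself and uses no Parquet library; the records, field names, and the `to_columns` helper are made up for the example:

```python
# Row-oriented records, as they might come out of raw JSON pings (made-up values).
rows = [
    {"client_id": "a", "os": "Linux", "uptime": 120},
    {"client_id": "b", "os": "Windows", "uptime": 45},
    {"client_id": "c", "os": "Linux", "uptime": 300},
]


def to_columns(records):
    """Pivot a list of uniform records into one list per field (a columnar layout)."""
    return {name: [r[name] for r in records] for name in records[0]}


columns = to_columns(rows)

# A query that needs only `uptime` reads a single list
# instead of scanning every field of every record.
mean_uptime = sum(columns["uptime"]) / len(columns["uptime"])
```

A real columnar format applies the same idea on disk, which is why an engine can skip the columns a query does not mention.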

- Use Presto to run SQL queries on derived datasets
SQL access should be provided to our users to lower the skill and time needed to access our data. Presto can read our derived Parquet datasets on S3 directly and can handle nested structs, arrays and maps, so we can easily query things like keyed histograms or add-on data.

- Use Redash to build dashboards
Maintaining N different custom dashboards built by K different people isn’t scalable. There will always be a need for some very specific custom-tailored dashboards, but there should be a standard way to build simple dashboards that display a few plots and get updated automatically over time. Redash is an excellent candidate for this.

- Extend our analysis dashboard
The analysis dashboard (a.t.m.o) should be the gateway to our data for our users. Everything we provide should eventually be accessible from that dashboard. For instance, users should be able to launch Spark clusters to analyse raw data or run SQL queries from their browser on our derived datasets and create plots and dashboards they can share with others.

- Use a proper job scheduling mechanism
The way we currently schedule various ETL and analysis jobs is simplistic and does not scale. We should move to Luigi/Airflow and come up with a single unified way of scheduling workflows.
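
The core idea behind workflow schedulers of this kind is running jobs in dependency order. A minimal sketch using only the Python standard library follows; it does not use the Luigi or Airflow APIs, and the job names in `WORKFLOW` and the `run_workflow` helper are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each maps to the set of jobs that must finish first.
WORKFLOW = {
    "validate_pings": set(),
    "build_longitudinal": {"validate_pings"},
    "refresh_presto_tables": {"build_longitudinal"},
    "update_dashboards": {"refresh_presto_tables"},
}


def run_workflow(workflow, run_job):
    """Run every job once, only after all of its dependencies have run."""
    order = []
    for job in TopologicalSorter(workflow).static_order():
        run_job(job)
        order.append(job)
    return order
```

Calling `run_workflow(WORKFLOW, print)` prints the job names in an order that respects the dependencies. A dedicated scheduler layers time-based triggering, retries and monitoring on top of this ordering, which is the part a hand-rolled setup tends to lack.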

Roberto Agostino Vitillo (:rvitillo)

Reporter

Description

10 years ago

No description provided.
