Closed Bug 1251580 Opened 10 years ago Closed 8 years ago

[meta] Firefox Data Platform

Categories

(Data Platform and Tools :: General, defect, P3)

Points: 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

References

Details

User Story

This bug provides a big-picture view of work happening on various fronts. Note: most of the items mentioned in the User Story have landed. This meta bug is now mostly used to track refinements.

- Define a schema for Telemetry pings and use it to validate incoming data.
This will benefit all downstream analyses and ETL jobs, which currently have to clean the same data over and over again.

- Use Spark to process raw data
We will keep storing raw data in its original JSON form, and our moztelemetry API will remain the go-to solution for analyses that can’t be run on our derived datasets. The API still needs some work to make it more user-friendly and to remove some rough edges.

- Use Parquet as the common representation for our derived datasets.
Parquet is a columnar storage format that can be read efficiently by various engines in the big-data ecosystem, in particular Presto and Spark. Of particular note is the derived longitudinal dataset, which will let users run analyses that were previously based on FHR data.

- Use Presto to run SQL queries on derived datasets
SQL access should be provided to our users to lower the skill and time needed to access our data. Presto can read our derived Parquet datasets directly from S3 and can handle nested structs, arrays and maps, so we can easily query things like keyed histograms or add-on data.

- Use Redash to build dashboards
Maintaining N different custom dashboards built by K different people isn’t scalable. There will always be a need for some very specific custom-tailored dashboards, but there should be a standard way to build simple dashboards that display a few plots and get updated automatically over time. Redash is an excellent candidate for this.

- Extend our analysis dashboard
The analysis dashboard (a.t.m.o) should be the gateway to our data for our users. Everything we provide should eventually be accessible from that dashboard. For instance, users should be able to launch Spark clusters to analyse raw data or run SQL queries from their browser on our derived datasets and create plots and dashboards they can share with others.

- Use a proper job scheduling mechanism
The way we currently schedule our various ETL and analysis jobs is simplistic and does not scale. We should move to Luigi/Airflow and come up with a single, unified way of scheduling workflows.
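
As a rough illustration of the first item (validating incoming pings against a schema so downstream jobs don't each have to clean the data), here is a minimal sketch in plain Python. The schema below is a hypothetical, heavily simplified stand-in: the field names, the PING_SCHEMA dict and the validate_ping/split_valid helpers are assumptions made for this example, not the actual Telemetry schema or pipeline code.

```python
import json

# Hypothetical, simplified schema: field name -> required Python type.
# The real ping schema is not defined in this bug; these fields are
# illustrative assumptions only.
PING_SCHEMA = {
    "type": str,
    "id": str,
    "creationDate": str,
    "payload": dict,
}


def validate_ping(raw, schema=PING_SCHEMA):
    """Return a list of problems found in one raw JSON ping (empty if valid)."""
    try:
        ping = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return ["not valid JSON: %s" % exc]
    if not isinstance(ping, dict):
        return ["top-level value is not an object"]
    errors = []
    for field, expected in schema.items():
        if field not in ping:
            errors.append("missing field: " + field)
        elif not isinstance(ping[field], expected):
            errors.append("wrong type for field: " + field)
    return errors


def split_valid(raw_pings):
    """Separate parsed, valid pings from rejected (raw, errors) pairs."""
    valid, rejected = [], []
    for raw in raw_pings:
        errors = validate_ping(raw)
        if errors:
            rejected.append((raw, errors))
        else:
            valid.append(json.loads(raw))
    return valid, rejected
```

Validating once at ingestion means rejected pings can be counted and inspected in one place, while every downstream job only ever sees records that already match the schema. A real deployment would more likely use a full schema language than a hand-rolled type map like this one.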
No description provided.
User Story: (updated)
User Story: (updated)
Summary: Telemetry overhaul → [meta] Telemetry overhaul
Depends on: 1251625
Depends on: 1251626
Depends on: 1251628
Depends on: 1251630
Depends on: 1251631
Depends on: 1251637
Depends on: 1251648
Depends on: 1251653
Depends on: 1251663
Depends on: 1175115
Depends on: 1251746
Depends on: 1251747
Depends on: 1251750
Depends on: 1251756
Depends on: 1252232
Depends on: 1252567
Depends on: 1252825
Depends on: 1252826
Points: --- → 3
Priority: -- → P1
Depends on: 1253675
Depends on: 1254547
Depends on: 1254650
Depends on: 1255457
Depends on: 1255738
Depends on: 1255739
Depends on: 1255741
Depends on: 1255748
No longer depends on: 1251631
No longer depends on: 1251637
No longer depends on: 1254650
No longer depends on: 1255738
Depends on: 1255751
No longer depends on: 1255457
No longer depends on: 1251663
No longer depends on: 1251648
Depends on: 1255752
No longer depends on: 1252826
No longer depends on: 1255739
No longer depends on: 1255741
No longer depends on: 1175115
No longer depends on: 1232642
No longer depends on: 1243528
No longer depends on: 1245569
Depends on: 1255754
No longer depends on: 1251625
No longer depends on: 1251747
Depends on: 1255755
No longer depends on: 1245490
No longer depends on: 1249658
No longer depends on: 1252567
No longer depends on: 1253675
Priority: P1 → P3
Depends on: 1269754
Depends on: 1283447
User Story: (updated)
Summary: [meta] Telemetry overhaul → [meta] Firefox Data Platform overhaul
Depends on: 1284522
No longer depends on: 1283447
No longer depends on: 1255748
Depends on: 1286215
Depends on: 1286269
Summary: [meta] Firefox Data Platform overhaul → [meta] Firefox Data Platform
Depends on: 1305432
Depends on: 1306560
Depends on: 1307092
Depends on: 1309010
Depends on: 1315282
Depends on: 1315799
Depends on: 1290120
Depends on: 1325391
Depends on: 1325561
Depends on: 1331871
Depends on: 1340640
Depends on: 1353101
Component: Metrics: Pipeline → General
Product: Cloud Services → Data Platform and Tools
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED