Closed
Bug 1251580
Opened 10 years ago
Closed 8 years ago
[meta] Firefox Data Platform
Categories
(Data Platform and Tools :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Unassigned)
References
Details
User Story
This bug provides a big-picture view of work happening on various fronts.

Note: most of the items mentioned in the User Story have landed. This meta bug is now mostly used to track refinements.

- Define a schema for Telemetry pings and use it to validate incoming data.
  This will benefit all downstream analyses and ETL jobs, which currently have to clean the same data over and over again.

- Use Spark to process raw data.
  We will keep storing raw data in its raw JSON form, and our moztelemetry API will continue to be the go-to solution for analyses that can't be run on our derived datasets. Some work is still needed to make the API more user-friendly and remove some rough edges.

- Use Parquet as the common representation for our derived datasets.
  Parquet is a columnar storage format that can be read efficiently by various engines in the big-data ecosystem, in particular by Presto and Spark. Worth a particular mention is the derived longitudinal dataset, which will let users run analyses that were previously based on FHR data.

- Use Presto to run SQL queries on derived datasets.
  SQL access should be provided to our users to lower the skills and time needed to access our data. Presto can access our derived Parquet datasets on S3 directly and can deal with nested structs, arrays and maps, so that we can easily query things like keyed histograms or add-on data.

- Use Redash to build dashboards.
  Maintaining N different custom dashboards built by K different people isn't scalable. There will always be a need for some very specific custom-tailored dashboards, but there should be a standard way to build simple dashboards that display a few plots and get updated automatically over time. Redash is an excellent candidate for this.

- Extend our analysis dashboard.
  The analysis dashboard (a.t.m.o) should be the gateway to our data for our users. Everything we provide should eventually be accessible from that dashboard. For instance, users should be able to launch Spark clusters to analyse raw data, or run SQL queries from their browser on our derived datasets and create plots and dashboards they can share with others.

- Use a proper job scheduling mechanism.
  The way we currently schedule various ETL and analysis jobs is simple and not scalable. We should move to Luigi/Airflow and come up with a single unified way of scheduling workflows.
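To illustrate the first item (validating incoming pings against a schema so that downstream jobs do not have to re-clean the data), here is a minimal sketch using only the Python standard library. The schema, the field names and the helper functions `validate_ping` and `partition_pings` are hypothetical and heavily simplified; they are not the schema or code used by the pipeline, which would more plausibly rely on a full schema language such as JSON Schema.

```python
import json

# Hypothetical, heavily simplified schema: field name -> required Python type.
# Real Telemetry pings carry many more fields; these names are illustrative only.
PING_SCHEMA = {
    "type": str,
    "id": str,
    "creationDate": str,
    "payload": dict,
}


def validate_ping(raw):
    """Parse one raw JSON document and return (ping, errors)."""
    try:
        ping = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, ["invalid JSON: %s" % exc.msg]
    if not isinstance(ping, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for field, expected in PING_SCHEMA.items():
        if field not in ping:
            errors.append("missing field: " + field)
        elif not isinstance(ping[field], expected):
            errors.append("wrong type for field: " + field)
    return (ping if not errors else None), errors


def partition_pings(raw_pings):
    """Split raw documents into validated pings and rejected (raw, errors) pairs."""
    valid, rejected = [], []
    for raw in raw_pings:
        ping, errors = validate_ping(raw)
        if errors:
            rejected.append((raw, errors))
        else:
            valid.append(ping)
    return valid, rejected
```

Keeping the rejected documents together with their error messages, rather than silently dropping them, would let the cleaning happen once at ingestion while still making it possible to inspect what was filtered out.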
No description provided.
Reporter, updated 10 years ago

Reporter, updated 10 years ago
User Story: (updated)

Reporter, updated 10 years ago
User Story: (updated)

Reporter, updated 10 years ago
Summary: Telemetry overhaul → [meta] Telemetry overhaul

Reporter, updated 10 years ago
Points: --- → 3

Updated 10 years ago
Priority: -- → P1

Updated 9 years ago
Priority: P1 → P3

Reporter, updated 9 years ago
User Story: (updated)

Reporter, updated 9 years ago
Summary: [meta] Telemetry overhaul → [meta] Firefox Data Platform overhaul

Reporter, updated 9 years ago
Summary: [meta] Firefox Data Platform overhaul → [meta] Firefox Data Platform

Updated 8 years ago
Component: Metrics: Pipeline → General
Product: Cloud Services → Data Platform and Tools

Updated 8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED