Closed Bug 1286277 Opened 9 years ago Closed 9 years ago

Document Spark & Presto datasets (MVP)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: harter)


User Story

All datasets available from Spark & Presto should be documented on our wiki. Note that there are other datasets available from Redshift, PostgreSQL, etc. Fresh bugs should be filed for those if needed.

The documentation should contain, at the very minimum:
- a text that explains which queries the dataset is particularly suited to answer
- a pointer to a list of documented fields that the dataset contains
- some example queries (both for Spark & Presto)
- caveats (like "dataset contains only 1% of the population" or "json parsing can be very slow")

References
- http://www.slideshare.net/RobertoAgostinoVitil/telemetry-datasets
- https://github.com/mozilla/emr-bootstrap-spark/tree/master/examples
- https://wiki.mozilla.org/Telemetry/LongitudinalExamples

Roberto Agostino Vitillo (:rvitillo)

Reporter

Description

•

9 years ago

No description provided.

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

9 years ago

Priority: -- → P2

Ryan Harter [:harter]

Assignee

Updated

•

9 years ago

Blocks: 1286273

Ryan Harter [:harter]

Assignee

Comment 1

•

9 years ago

This article will be the head node for this documentation: https://wiki.mozilla.org/Telemetry/Available_Telemetry_Datasets_and_their_Applications

Ryan Harter [:harter]

Assignee

Comment 2

•

9 years ago

Roberto, the Churn and E10sExperiment views appear to be unavailable in STMO. Should we include these in the documentation? Similarly, there are a number of tables in STMO/presto which aren't generated by the batch view repo. Do you have links to these definitions or owners I can consult? Are these in scope for this project?

Status: NEW → ASSIGNED

Flags: needinfo?(rvitillo)

Ryan Harter [:harter]

Assignee

Comment 3

•

9 years ago

I've added some notes to the Longitudinal Dataset documentation [0] and I have a proposal. I'd like to move this documentation into a .md file in the batch view repository, possibly replacing [1].

I'm a fan of keeping documentation close to the code to encourage us to keep the two in sync. More so, as I dug through some of these examples to understand their purpose, I wished I had a blame view so I knew who to contact and about how old each snippet was.

Any objections or concerns? Georg, I'm ni? you since you have some recent edits on this page.

[0] https://wiki.mozilla.org/Telemetry/LongitudinalExamples
[1] https://github.com/mozilla/telemetry-batch-view/blob/master/docs/Longitudinal.md

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 4

•

9 years ago

(Commenting on User Story)
> The documentation should contain, at the very minimum:
> - a text that explains which queries the dataset is particularly suited to answer
> - a pointer to a list of documented fields that the dataset contains
> - some example queries (both for Spark & Presto)
> - caveats (like "dataset contains only 1% of the population" or "json parsing can be very slow")

What about a "source" link for all of them? It can be hard to actually find the repository & code that generates the various data sets.

Georg Fritzsche [:gfritzsche]

Comment 5

•

9 years ago

(In reply to Ryan Harter [:harter] from comment #3)
> I've added some notes to the Longitudinal Dataset documentation [0] and I have a proposal. I'd like to move this documentation into a .md file in the batch view repository, possibly replacing [1]
>
> I'm a fan of keeping documentation close to the code to encourage us to keep the two in sync. More so, as I dug through some of these examples to understand their purpose, I wished I had a blame view so I knew who to contact and about how old each snippet was.
>
> Any objections or concerns? Georg, I'm ni? you since you have some recent edits on this page.

I agree, having it side-by-side with the source is good (I only started the LongitudinalExamples page to have any location to point people to).

The current documentation in Longitudinal.md is very data engineer & pipeline focused though. I think we should keep the "user" documentation in a separate document from that to avoid confusion.

Can we redirect the LongitudinalExamples article to Available_Telemetry_Datasets_and_their_Applications#Longitudinal or so for people who still use the link?

Flags: needinfo?(gfritzsche)

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 6

•

9 years ago

(In reply to Ryan Harter [:harter] from comment #2)
> Roberto, the Churn and E10sExperiment views appear to be unavailable in STMO. Should we include these in the documentation?

The following tables should be included in the documentation:
- main_summary (mreid)
- longitudinal (rvitillo)
- client_count (rvitillo)
- crash_aggregates (mdoglio)
- cross_sectional (harter)

You can safely ignore Churn and E10sExperiment.

> Similarly, there are a number of tables in STMO/presto which aren't generated by the batch view repo. Do you have links to these definitions or owners I can consult? Are these in scope for this project?

I would say we should add the mobile datasets to the documentation as well; they don't necessarily have to be documented in detail though:
- android_events (barbara)
- android_clients (barbara)
- android_addons (barbara)
- mobile_clients (barbara)

Flags: needinfo?(rvitillo)

Georg Fritzsche [:gfritzsche]

Comment 7

•

9 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #6)
> I would say we should add the mobile datasets to the documentation as well; they don't necessarily have to be documented in detail though:
> - android_events (barbara)
> - android_clients (barbara)
> - android_addons (barbara)
> - mobile_clients (barbara)

Documentation for these currently lives here: https://wiki.mozilla.org/Mobile/Metrics/Redash

Ryan Harter [:harter]

Assignee

Comment 8

•

9 years ago

Cross Sectional and Main Summary are now documented at https://wiki.mozilla.org/Telemetry/Available_Telemetry_Datasets_and_their_Applications

Frank Bertsch [:frank]

Comment 9

•

9 years ago

We have words for each of the datasets, but I'm not completely happy with all of it. We can perhaps deem this bug complete, but I would prefer that the documentation for the datasets be more uniform.

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

9 years ago

Summary: Document Spark & Presto datasets → Document Spark & Presto datasets (MVP)

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

9 years ago

Points: --- → 3

Priority: P2 → P1

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 10

•

9 years ago

This is OK for a first iteration. We will keep improving the documentation over the coming quarters.

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

7 years ago

Product: Cloud Services → Cloud Services Graveyard


Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


