Closed
Bug 1286277
Opened 9 years ago
Closed 9 years ago
Document Spark & Presto datasets (MVP)
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Assigned: harter)
References
Details
User Story
All datasets available from Spark & Presto should be documented on our wiki. Note that there are other datasets available from Redshift, PostgreSQL, etc. Fresh bugs should be filed for those if needed.
The documentation should contain to the very minimum:
- a text that explains which queries the dataset is particularly suited to answer
- a pointer to a list of documented fields that the dataset contains
- some example queries (both for Spark & Presto)
- caveats (like "dataset contains only 1% of the population" or "json parsing can be very slow")
References
- http://www.slideshare.net/RobertoAgostinoVitil/telemetry-datasets
- https://github.com/mozilla/emr-bootstrap-spark/tree/master/examples
- https://wiki.mozilla.org/Telemetry/LongitudinalExamples
No description provided.
Reporter | ||
Updated•9 years ago
|
Priority: -- → P2
Assignee | ||
Comment 1•9 years ago
|
||
This article will be the head node for this documentation:
https://wiki.mozilla.org/Telemetry/Available_Telemetry_Datasets_and_their_Applications
Assignee | ||
Comment 2•9 years ago
|
||
Roberto, the Churn and E10sExperiment views appear to be unavailable in STMO. Should we include these in the documentation?
Similarly, there are a number of tables in STMO/presto which aren't generated by the batch view repo. Do you have links to these definitions or owners I can consult? Are these in scope for this project?
Status: NEW → ASSIGNED
Flags: needinfo?(rvitillo)
Assignee | ||
Comment 3•9 years ago
|
||
I've added some notes to the Longitudinal Dataset documentation [0] and I have a proposal. I'd like to move this documentation into a .md file in the batch view repository, possibly replacing [1].
I'm a fan of keeping documentation close to the code to encourage us to keep the two in sync. More so, as I dug through some of these examples to understand their purpose, I wished I had a blame view so I knew who to contact and about how old each snippet was.
Any objections or concerns? Georg, I'm ni? you since you have some recent edits on this page.
[0] https://wiki.mozilla.org/Telemetry/LongitudinalExamples
[1] https://github.com/mozilla/telemetry-batch-view/blob/master/docs/Longitudinal.md
Flags: needinfo?(gfritzsche)
Comment 4•9 years ago
|
||
(Commenting on User Story)
> The documentation should contain to the very minimum:
> - a text that explains which queries the dataset is particularly suited to
> answer
> - a pointer to a list of documented fields that the dataset contains
> - some example queries (both for Spark & Presto)
> - caveats (like "dataset contains only 1% of the population" or "json
> parsing can be very slow")
What about a "source" link for all of them?
It can be hard to actually find the repository & code that generates the various data sets.
Comment 5•9 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #3)
> I've added some notes to the Longitudinal Dataset documentation [0] and I
> have a proposal. I'd like to move this documentation into a .md file in the
> batch view repository, possibly replacing [1]
>
> I'm a fan of keeping documentation close to the code to encourage us to keep
> the two in sync. More so, as I dug through some of these examples to
> understand their purpose, I wished I had a blame view so I knew who to
> contact and about how old each snippet was.
>
> Any objections or concerns? Georg, I'm ni? you since you have some recent
> edits on this page.
I agree, having it side-by-side with the source is good (I only started the LongitudinalExamples page to have any location to point people to).
The current documentation in Longitudinal.md is very data engineer & pipeline focused though. I think we should keep the "user" documentation in a separate document from that to avoid confusion.
Can we redirect the LongitudinalExamples article to Available_Telemetry_Datasets_and_their_Applications#Longitudinal or so for people who still use the link?
Flags: needinfo?(gfritzsche)
Reporter | ||
Comment 6•9 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #2)
> Roberto, the Churn and E10sExperiment views appear to be unavailable in
> STMO. Should we include these in the documentation?
The following tables should be included in the documentation:
- main_summary (mreid)
- longitudinal (rvitillo)
- client_count (rvitillo)
- crash_aggregates (mdoglio)
- cross_sectional (harter)
You can safely ignore Churn and E10sExperiment
> Similarly, there are a number of tables in STMO/presto which aren't
> generated by the batch view repo. Do you have links to these definitions or
> owners I can consult? Are these in scope for this project?
I would say we should add the mobile datasets to the documentation as well; they don't necessarily have to be documented in detail though:
- android_events (barbara)
- android_clients (barbara)
- android_addons (barbara)
- mobile_clients (barbara)
Flags: needinfo?(rvitillo)
Comment 7•9 years ago
|
||
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #6)
> I would say we should add the mobile datasets to the documentation as well;
> they don't necessarily have to be documented in detail though:
> - android_events (barbara)
> - android_clients (barbara)
> - android_addons (barbara)
> - mobile_clients (barbara)
Documentation for these currently lives here: https://wiki.mozilla.org/Mobile/Metrics/Redash
Assignee | ||
Comment 8•9 years ago
|
||
Cross Sectional and Main Summary are now documented at https://wiki.mozilla.org/Telemetry/Available_Telemetry_Datasets_and_their_Applications
Comment 9•9 years ago
|
||
We have words for each of the datasets, but I'm not completely happy with all of it. We can perhaps deem this bug complete, but I would prefer that the documentation for the datasets be more uniform.
Reporter | ||
Updated•9 years ago
|
Summary: Document Spark & Presto datasets → Document Spark & Presto datasets (MVP)
Reporter | ||
Updated•9 years ago
|
Points: --- → 3
Priority: P2 → P1
Reporter | ||
Comment 10•9 years ago
|
||
This is OK for a first iteration. We will keep improving the documentation in the coming quarters.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Product: Cloud Services → Cloud Services Graveyard