Closed Bug 1315282 Opened 9 years ago Closed 8 years ago

Proposal: Website to download data samples

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P5)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: frank, Unassigned)


Our users often want to do small-scale analysis on a subset of data before moving to large-scale analysis using ATMO or STMO. We can support this by providing a simple website that provides the following:

- Downloads a subset of telemetry pings (some max count) as a JSON document
- Choose subset (obvious dimensions: release, build, application, date, etc.)
- Choose JSON paths

The resulting data would be aggregated into a single JSON doc, which is easily analyzed using tools locally. We would set some maximum number of resulting pings, maybe 1000, with the usual caveats about small sample sizes for large populations (mainly, release).

This would have the added benefit of reducing load on ATMO and STMO, and since pings are self-documenting, would make it easier for users to examine them to find what they might need.

Finally, we would want to consider how such a tool could be misused, mainly to find correlations that aren't really there; but this is possible in our current frameworks as well.
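
For illustration only, the three features above (filter on dimensions, pick JSON paths, cap the number of pings) could be sketched roughly as follows. Nothing here comes from an existing service: `get_path`, `sample_pings`, the slash-separated path syntax, and the ping field names are all hypothetical, and the 1000 cap is just the figure floated above.

```python
import json
import random

MAX_PINGS = 1000  # the "maybe 1000" cap from the proposal; an assumption, not a decided value


def get_path(doc, path):
    """Follow a slash-separated path such as 'application/channel'.

    Returns None when any step of the path is missing.
    """
    node = doc
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def sample_pings(pings, filters, paths, max_count=MAX_PINGS, seed=None):
    """Filter pings by exact-match dimensions, cap the count, keep only chosen paths."""
    matching = [
        p for p in pings
        if all(get_path(p, dim) == value for dim, value in filters.items())
    ]
    if len(matching) > max_count:
        # Uniform random sample without replacement when over the cap.
        matching = random.Random(seed).sample(matching, max_count)
    return [{path: get_path(p, path) for path in paths} for p in matching]


# Toy input; the field names are illustrative, not a guaranteed ping schema.
pings = [
    {"application": {"channel": "release", "name": "Firefox"},
     "payload": {"info": {"subsessionLength": 120}}},
    {"application": {"channel": "nightly", "name": "Firefox"},
     "payload": {"info": {"subsessionLength": 45}}},
    {"application": {"channel": "release", "name": "Firefox"},
     "payload": {"info": {"subsessionLength": 300}}},
]

subset = sample_pings(
    pings,
    filters={"application/channel": "release"},
    paths=["application/name", "payload/info/subsessionLength"],
)
single_doc = json.dumps(subset, indent=2)  # the "single JSON doc" a user would download
```

A real service would read from the pipeline's storage rather than an in-memory list, and would need to decide how to sample (uniform, per-client, per-day) before any population-level conclusions could be drawn from the result.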
I like this idea. To be clear, the example data would be used to understand the ping format, not necessarily to produce population estimates, correct?

If we can anonymize the data, then we have effectively built the ping-fuzzer[0], which would be generally useful when developing batch views.

[0] https://bugzilla.mozilla.org/show_bug.cgi?id=1310324
(In reply to Ryan Harter [:harter] from comment #1)
> I like this idea. To be clear, the example data would be used to understand
> the ping format not necessarily produce population estimates, correct?

Hindsight's UI can be used to explore data in real-time as it flows through our pipeline.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #2)
> (In reply to Ryan Harter [:harter] from comment #1)
> > I like this idea. To be clear, the example data would be used to understand
> > the ping format not necessarily produce population estimates, correct?
>
> Hindsight's UI can be used to explore data in real-time as it flows through
> our pipeline.

Yes, I think the idea was to do some basic analysis for population estimates.

rvitillo mentioned the following points in IRC:
"- if they do their own thing on a sample they might get wrong conclusions as they are not proficient with stats
- since they are doing their own thing, they could analyze their data in whatever form which is not something that resembles a notebook"

This is part of a discussion about the tension between providing tools that are statistically sound and reviewable, and the difficulty engineers face in both learning those tools and context-switching to them. This problem probably lies more with ATMO than STMO. Better documentation is probably part of the solution.
Priority: -- → P5
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Product: Cloud Services → Cloud Services Graveyard