Closed Bug 1255543 Opened 10 years ago Closed 8 years ago

Create longitudinal dataset for Telemetry Experiments

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P5)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: mreid, Unassigned)

References

Details

We should create a dataset that contains all submissions running one or more Telemetry Experiments. This would involve:
- Adding a Heka field that flags records that are running an experiment (a rough sketch of the flagging check is below)
- Sending these records to a separate location in S3
- A periodic job for reorganizing this data into a longitudinal structure using the same code as bug 1242039
- Making this data available in all the usual ways (Spark, Presto, etc.)
This might need to be broken into sub-bugs.
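As a rough illustration of the flagging check (a sketch only; the real implementation would live in the Heka pipeline, and the field path assumes the main-ping environment layout with addons.activeExperiment):

    def active_experiment_id(ping):
        """Return the active experiment id from a decoded main ping, or None."""
        addons = ping.get("environment", {}).get("addons", {})
        experiment = addons.get("activeExperiment") or {}
        return experiment.get("id")

    # A record would be flagged (and routed to the experiments S3
    # location) whenever this returns a non-None id.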
Points: --- → 3
Priority: -- → P3
jjensen, mreid, and rvitillo are proposing a dataset to support analysis of telemetry experiments. Let us know if there are any other requirements for search test experiments that are not covered in https://bugzilla.mozilla.org/show_bug.cgi?id=1258529. (If there are other search test experiments spec'd out, it would be good to know about them.) Un-prioritizing to retriage, as we need this for search.
Flags: needinfo?(jjensen)
Priority: P3 → --
I will punt this back a little bit, but from what I understand from this bug and bug 1258529, I think we're OK.

Simply put, we need longitudinal data for profiles in the experiment, where "in the experiment" also means "in the control group". By "longitudinal data", I'm typically referring to usage data:
- # sessions
- session duration
- # searches, by SAP
- # crashes, by type
- etc.

In short, many/most of the "non-Telemetry" pings in the UT payloads. If this approach provides this, then I think we are OK. To be safe I'm going to needinfo my talented colleague Sam Penrose about this as well.
Flags: needinfo?(jjensen) → needinfo?(spenrose)
(In reply to John Jensen from comment #3)
> I will punt this back a little bit, but from what I understand from this bug
> and bug 1258529, I think we're OK.
>
> Simply put, we need longitudinal data for profiles in the experiment, where
> "in the experiment" also means "in the control group". By "longitudinal
> data", I'm typically referring to usage data:
> - # sessions
> - session duration
> - # searches, by SAP
> - # crashes, by type
> - etc.
>
> In short, many/most of the "non-Telemetry" pings in the UT payloads. If this
> approach provides this, then I think we are OK. To be safe I'm going to
> needinfo my talented colleague Sam Penrose about this as well.

This description sounds like enough to get started on. It does not look precise enough to vet a solution, even allowing for shifts in which fields are included. What does the analysis process look like, for example? Are the people who will do the analysis trained on a toolchain that can answer their questions with acceptable latency? (Has that toolchain been chosen?) I have requested permission to access the problem statement document in the other bug, which may answer those questions.
Flags: needinfo?(spenrose)
> This description sounds like enough to get started on. It does not look
> precise enough to vet a solution, even allowing for shifts in which fields
> are included.

As suggested in Comment 1, the proposed solution is to use the same job that generates the longitudinal dataset to generate the per-experiment datasets. That will allow the datasets to be queried with SQL as well.

> What does the analysis process look like, for example?

See [1] for an example analysis of an A/B experiment.

> Are the people who will do the analysis trained on a toolchain that can
> answer their questions with acceptable latency? (Has that toolchain been
> chosen?)

I can't speak for everyone else, but I am fairly confident that our engineers can use, or learn how to use, our toolchain with acceptable latency.

[1] https://github.com/vitillo/e10s_analyses
Assignee: nobody → mreid
Priority: -- → P1
(In reply to Mark Reid [:mreid] from comment #0)
> - Adding a Heka field that flags records that are running an experiment

See https://github.com/mozilla-services/data-pipeline/pull/200

r? trink
Flags: needinfo?(mtrinkala)
(In reply to Mark Reid [:mreid] from comment #0)
> - Sending these records to a separate location in S3

See https://github.com/mozilla-services/puppet-config/pull/1917

r? whd
Flags: needinfo?(whd)
For the separate location in S3, I propose to store data using the following dimensions (illustrated below):
- submissionDate
- docType
- activeExperimentId
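For illustration, these dimensions could compose an S3 key prefix like this (a hypothetical helper; the dimension ordering and path format are assumptions, not the actual production layout):

    def experiment_prefix(submission_date, doc_type, experiment_id):
        # e.g. "20160601/main/some-experiment@mozilla.org/"
        return "{}/{}/{}/".format(submission_date, doc_type, experiment_id)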
PR 200 merged.
Flags: needinfo?(mtrinkala)
Puppet PR 1917 has also been merged. Thanks!
Flags: needinfo?(whd)
Remaining work:
- A periodic job for reorganizing this data into a longitudinal structure using the same code as bug 1242039 (a rough sketch of that structure is below)
- Making this data available in all the usual ways (Spark, Presto, etc.)

Roberto, does your team have capacity for this?
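Here "longitudinal structure" means roughly one row per client, with that client's submissions collected in date order. A minimal PySpark sketch, assuming a hypothetical records_df DataFrame of experiment submissions with clientId, submissionDate, and payload columns:

    from pyspark.sql import functions as F

    longitudinal = (records_df
        .groupBy("clientId")
        .agg(F.sort_array(F.collect_list(F.struct("submissionDate", "payload")))
              .alias("submissions")))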
Flags: needinfo?(rvitillo)
(In reply to Mark Reid [:mreid] from comment #11)
> Roberto, does your team have capacity for this?

This is probably not going to happen this quarter. The data in its current form should be accessible through Spark though (get_records), right?
Flags: needinfo?(rvitillo)
Yes, the data should be accessible via get_records in Spark. Just use "telemetry-experiments" as the name of the data source. The fields available for filtering the data are listed in Comment 8.
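For example, a minimal sketch (the import path and filter keyword spellings are assumptions; the filters mirror the dimensions from Comment 8):

    from moztelemetry import get_records  # import path may differ

    records = get_records(sc, "telemetry-experiments",
                          submissionDate="20160601",
                          docType="main",
                          activeExperimentId="some-experiment@mozilla.org")
    print(records.count())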
Assignee: mreid → nobody
Priority: P1 → --
Priority: -- → P3
Due to e10s not needing this, we'll accept a patch, but it is not being actively worked on.
Priority: P3 → P5
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Product: Cloud Services → Cloud Services Graveyard