Closed
Bug 1226379
Opened 10 years ago
Closed 10 years ago
Port v2 desktop churn/retention analysis cohort reports to unified telemetry
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: thuelbert, Assigned: mreid)
References
(Blocks 1 open bug)
Details
(Whiteboard: [fxgrowth])
Attachments
(1 file)
3.59 KB, text/csv
The v2 bug that covered the work: https://bugzilla.mozilla.org/show_bug.cgi?id=1198537
From our meeting with the churn guys:
- drops every Monday (previous Saturday through Sunday)
- dump to Redshift
Comment 1•10 years ago
Chris, we could use pointers to the existing code and outputs.
Flags: needinfo?(chrismore.bugzilla)
Updated•10 years ago
Priority: -- → P1
Comment 2•10 years ago
dzeber: can you drop in the location of your v2 scripts?
Katie: the output of the scripts is here, but Dave is moving it to a MySQL db: https://metrics.mozilla.com/protected/dzeber/cohort-activity/
Flags: needinfo?(chrismore.bugzilla) → needinfo?(dzeber)
Comment 3•10 years ago
The code lives here: https://github.com/mozilla/churn-analysis/tree/master/cohort-report
It runs on hala every Thursday evening, and dumps the new data to the MySQL table described in bug 1221331. Connection info is given at https://github.com/mozilla/churn-analysis/blob/master/cohort-report/pull-cohort-data.R#L244 (ping me on IRC for the password).
Flags: needinfo?(dzeber)
Comment 4•10 years ago
I get a 404 on that repo
Comment 5•10 years ago
I will not be able to address this request before the holiday. I still think this is a very good opportunity for the metrics team to learn how to use the new infrastructure. We have been bootstrapping a bunch of existing v2 reports for months and the core data pipeline work has been suffering; it would be beneficial to move to a self-service model sooner rather than later.
There are two ways this can be implemented.
1) Using the core pipeline infrastructure and a Lua sandbox (the solution I would provide)
Pros
- the computation can be performed while the data is being read (faster, less expensive, no intermediate output or storage costs)
- all the business logic exists in a single system
Cons
- everything we write against the pipeline seems to be owned by us forever
- depending on the requirements the sandboxes usually contain more code since they are more generalized
- when we need higher performance we switch to custom C Lua modules, which are not as approachable for many
2) Using a Spark cluster, in which case R can still be used
Pros
- metrics team can continue using R and self-serve in the future
- more external resources, documentation and help
Cons
- most likely still requires a Lua sandbox to transform/pre-digest the data into something Spark can consume directly (additional time, costs, code maintenance)
Assignee
Updated•10 years ago
Assignee: mtrinkala → mreid
Assignee
Comment 6•10 years ago
:Dexter and I are going to work on this.
Comment 7•10 years ago
What's the timeline here? For the marketing and product team, understanding and measuring churn/retention is a top-line KPI.
I have a number of retention cohort tests that I need to deploy in Q1 2016, and it makes me feel a bit uncomfortable that I don't have a way to measure retention based on Unified Telemetry data.
When in Q1 can I expect to see a draft of this new dashboard? It doesn't have to be perfect to start. Just something that can be iterated on over time.
Thanks!
Assignee
Comment 8•10 years ago
I'm currently working on a derived data set to power the churn analysis from Unified Telemetry. I expect to have that data set ready this week.
Once that data set is available, I don't anticipate too much more work to build a drop-in replacement for the current churn aggregates. I have functioning code to calculate the "presence" signal of clientIds we observed; the small data set will make it easier to infer the "absence" signal of clientIds we observed earlier but not in the current period (for calculating inactive clientIds).
I will be out of the office a couple of days this week, so I expect to have something tangible to look at by mid next week.
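For illustration, a minimal sketch of the presence/absence idea described above (the input shape, names, and week boundaries are assumptions for the sketch, not the actual churn dataset schema or code):

# Sketch only: assumes a dict mapping each clientId to the set of week labels
# in which it sent a ping; names are illustrative, not the real schema.
from datetime import date, timedelta

def week_label(d):
    # Label a date by the Sunday that starts its week (an assumption for this sketch).
    return (d - timedelta(days=(d.weekday() + 1) % 7)).isoformat()

def churn_counts(client_weeks, weeks):
    # For each week: "active" = clients seen that week (presence),
    # "inactive" = clients seen in an earlier week but not this one (absence).
    counts = {}
    for i, week in enumerate(weeks):
        earlier = set(weeks[:i])
        active = inactive = 0
        for weeks_seen in client_weeks.values():
            if week in weeks_seen:
                active += 1
            elif weeks_seen & earlier:
                inactive += 1
        counts[week] = {"active": active, "inactive": inactive}
    return counts

weeks = [week_label(date(2015, 11, 1) + timedelta(days=7 * i)) for i in range(4)]
client_weeks = {"a": {weeks[0], weeks[1]}, "b": {weeks[0]}, "c": {weeks[2]}}
print(churn_counts(client_weeks, weeks))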
Assignee
Comment 9•10 years ago
I've landed the code to generate the churn data set:
https://github.com/vitillo/telemetry-batch-view/pull/6
I'm backfilling the data now.
Comment 10•10 years ago
(In reply to Mark Reid [:mreid] from comment #9)
> I've landed the code to generate the churn data set:
> https://github.com/vitillo/telemetry-batch-view/pull/6
>
> I'm backfilling the data now.
Hey Mark. Any update here?
Once you get something re-built with UT, where will the data go? CSVs? redshift? MySQL as mentioned in bug 1221331?
Just want to get something up soon to review, even if it isn't right, just to make sure we're going in the right direction. There are many Firefox and marketing initiatives in 2016 that are tied to retention, and our old model and method of using FHR isn't valid now that UT is rolled out.
Thanks!
Flags: needinfo?(mreid)
Comment 11•10 years ago
Mark: Just want to make sure wherever you place the data that Josephine has access to it.
Josephine: If Mark places the unified telemetry data in Redshift, will you be able to access it there for Tableau? Will we still need MySQL?
Flags: needinfo?(jtanumijaya)
Comment 12•10 years ago
Hi Chris,
Yes, Tableau can access Redshift and we don't need MySQL for UT data.
Question: do you still need the old model and method using FHR as your reference? If not, and you're planning to start fresh using UT, then we don't have to do the MySQL for FHR data anymore.
Flags: needinfo?(jtanumijaya)
Assignee
Comment 13•10 years ago
(In reply to Chris More [:cmore] from comment #10)
> (In reply to Mark Reid [:mreid] from comment #9)
> > I've landed the code to generate the churn data set:
> > https://github.com/vitillo/telemetry-batch-view/pull/6
> >
> > I'm backfilling the data now.
>
> Hey Mark. Any update here?
I ran into some complications with the data set generation, so I'm continuing to work on that part. I anticipate another day or two for that, then back to the main logic of the churn aggregations.
> Once you get something re-built with UT, where will the data go? CSVs?
> redshift? MySQL as mentioned in bug 1221331?
CSV (to S3) or redshift output should both be straightforward.
>
> Just want to get up something soon to review, even if it isn't right, just
> to make sure we're going in the right direction. There are many Firefox and
> marketing initiatives in 2016 that are tied to retention and our old model
> and method of using FHR isn't valid now that UT is rolled out.
Yes, this makes sense! Getting churn data from UT is my #1 priority, and I'll try to get something reviewable asap.
Flags: needinfo?(mreid)
Comment 14•10 years ago
Hi Mark,
I was just looking at what I think is the PR for this:
https://github.com/vitillo/telemetry-batch-view/pull/6/files#diff-ea98ba9cfc5c794f0e9dd4b68900a97fR17
Assuming that this is the full schema for the resulting churn dataset, do we not need to include the profile build id as well? Isn't that what is used for identifying funnelcakes? Or am I misunderstanding?
Comment 15•10 years ago
Never mind, I see it's in the release channel.
Comment 16•10 years ago
(In reply to Josephine Tanumijaya from comment #12)
> Hi Chris,
>
> Yes, Tableau can access redshift and we don't need MySQL for UT data.
>
> Question: do you still need the old model and method using FHR as your
> reference. If not and you're thinking to start fresh using UT then we don't
> have to do the MySQL for FHR data anymore.
I would prefer having the old FHR report around as long as possible for reference. Likely the numbers will change somewhat, and we'll learn something new from the UT version.
Assignee
Comment 17•10 years ago
Quick update - I believe I've solved the complications I mentioned in Comment 13 with this PR:
https://github.com/vitillo/telemetry-batch-view/pull/10
I should be able to go back to working on the churn aggregation logic presently.
Assignee
Updated•10 years ago
Points: --- → 3
Assignee
Comment 18•10 years ago
Per discussion this morning, the initial output will be CSV stored on S3.
Assignee
Comment 19•10 years ago
Where would be a good place to put sample data from the new aggregation code? Josephine, if I share a CSV file on GDrive, would that be convenient for you?
Flags: needinfo?(jtanumijaya)
Comment 20•10 years ago
Yeah, GDrive is good if it's only for sample data - not a huge amount of data.
For the real data, dzeber put a weekly file here:
https://metrics.mozilla.com/protected/dzeber/cohort-activity/
Flags: needinfo?(jtanumijaya)
Assignee
Comment 21•10 years ago
It's pretty small so far, about 5MB compressed. I've shared the folder with you.
Assignee
Comment 22•10 years ago
Note that this initial version only contains counts where "is_active" == True, so the overall counts are not expected to match up. I found that the active counts are about 10x higher in the CSV created from FHR v2, so I will investigate that further.
Assignee
Comment 23•10 years ago
Minor update - I've generated and backfilled the derived dataset that's being used for churn. Data has been updated going back to November 1, 2015.
I've also deployed a scheduled job to update the dataset each day.
Assignee
Comment 24•10 years ago
I've started computing the CSV aggregates including both "active" and "inactive" signal, and have uploaded some updated sample data to gdrive. So far I have processed November and part of December. I will continue to add more data as it completes.
Assignee
Comment 25•10 years ago
I've run into an issue scaling up to larger datasets. Roberto, could you take a look at the notebook and let me know if you have any suggestions?
https://gist.github.com/mreid-moz/3cc075448f796caca658
Flags: needinfo?(rvitillo)
Comment 26•10 years ago
As discussed on IRC, replacing the countByKey with a reduceByKey might help.
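For reference, a rough PySpark sketch of that substitution (the RDD contents and key format here are made up, not taken from the notebook above):

# countByKey() materialises the full per-key count map on the driver, which
# stops scaling once there are many records/keys; reduceByKey() keeps the
# aggregation distributed across the cluster.
from pyspark import SparkContext

sc = SparkContext(appName="churn-aggregation-sketch")
records = sc.parallelize([("release|US", 1), ("beta|DE", 1), ("release|US", 1)])

driver_side = records.countByKey()                     # dict built on the driver
distributed = records.reduceByKey(lambda a, b: a + b)  # stays distributed
print(dict(driver_side), distributed.collect())
sc.stop()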
Flags: needinfo?(rvitillo)
Comment 27•10 years ago
Here are instructions on how to unzip and combine all the CSVs into a single CSV to use within Tableau:
These are for the *.actives.csv.gz files:
1) Download all the .gz files into a folder on your desktop (call it "cohorts", for example)
2) Open up the Terminal
3) cd Desktop/
4) cd cohorts/
5) On the command line in the cohorts directory, type the following and hit Enter:
for f in *.actives.csv.gz; do gunzip $f; done
6) On the command line, type the following and hit Enter:
echo "channel,geo,is_funnelcake,acquisition_period,start_version,sync_usage,current_version,current_week,is_active,n_profiles" > combined-actives.csv
7) On the command line, type the following and hit Enter:
tail -qn +2 churn-*.actives.csv >> combined-actives.csv
You should now have a single combined-actives.csv file to load into tableau for further analysis.
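If it's easier than the shell steps, here's a rough Python equivalent of steps 5-7 (a sketch that assumes the same churn-*.actives.csv.gz naming and that each file carries its own header row):

# Decompress the weekly churn-*.actives.csv.gz files and concatenate them
# (skipping each file's header row) into a single combined-actives.csv.
import glob
import gzip

HEADER = ("channel,geo,is_funnelcake,acquisition_period,start_version,"
          "sync_usage,current_version,current_week,is_active,n_profiles")

with open("combined-actives.csv", "w") as out:
    out.write(HEADER + "\n")
    for path in sorted(glob.glob("churn-*.actives.csv.gz")):
        with gzip.open(path, "rt") as f:
            next(f)  # skip the per-file header row
            for line in f:
                out.write(line)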
Assignee
Comment 28•10 years ago
For now, I've uploaded sample data files containing only the "active" users.
Comment 29•10 years ago
:mreid: Given what we learned in bug 1246722, we will need to include another field in the CSVs for the distributionID. Without that field, I won't be able to compare the funnelcakes, and that's a must-have for retention. I had it with FHR because the funnelcake IDs were part of the channel name. Sorry for the change in scope, but that was unknown until now and I had no idea they changed it.
Assignee
Comment 30•10 years ago
I'll update the Churn dataset accordingly. While we're still in "validation" mode, I can reconstruct the funnelcake-as-channel format from the old CSV if that makes things easier.
Flags: needinfo?(chrismore.bugzilla)
Comment 31•10 years ago
(In reply to Chris More [:cmore] from comment #29)
> :mreid: Given what we learned in bug 1246722, we will need to include
> another field in the CSVs for the distributionID field. Without that field,
> I won't be able to compare the funnelcakes and that's a must have for
> retention. I had it with FHR because the funnelcake IDs were part of the
> channel name. Sorry for the change in scope, but that was unknown until now
> and I had no idea they changed it.
Sorry if this is already clear, but just in case - we'd only want the distributionID for funnelcake builds. Non-funnelcake release channel profiles have a very wide variety of distributionID strings that are not relevant here.
Maybe we could add a 'funnelcakeID' column which gives the ID number (XX in mozillaXX) for funnelcake builds and is null for non-funnelcake, or something like that.
Comment 32•10 years ago
Is the harm of including non-funnelcake builds in the distributionID that the data set just gets much bigger, given all of the partner builds? I can see use-cases where it would be nice to know the retention of partner builds beyond funnelcakes, but I guess to limit scope, mozillaXX (^mozilla[0-4]+$) is sufficient for now.
Flags: needinfo?(chrismore.bugzilla)
Comment 33•10 years ago
Yes, it gets a lot bigger, and I think it would mean a change in scope.
(In reply to Chris More [:cmore] from comment #32)
> (^mozilla[0-4]+$)
Did you mean ^mozilla[0-9]+$ ?
Comment 34•10 years ago
(In reply to Dave Zeber [:dzeber] from comment #33)
> Yes, it gets a lot bigger, and I think it would mean a change in scope.
>
> (In reply to Chris More [:cmore] from comment #32)
> > (^mozilla[0-4]+$)
>
> Did you mean ^mozilla[0-9]+$ ?
Yes, sorry long days.
Assignee
Comment 35•10 years ago
For this round, I've detected the funnelcake values from distributionId and used them to re-create the normalized[1] channel identifier we were using previously.
So if the distributionId field matches ^mozilla[0-9]+$, then update the channel name to "<channel>-cck-<distributionId>", otherwise use the channel name directly.
It should be a drop-in replacement on the dashboard we've been using for validation.
[1] The normalized channel looks at variants of a channel (ccks, mostly) and uses the base channel, so things like "release-cck-mozillaonline" become simply "release".
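For clarity, a small sketch of that renaming rule (illustrative Python, not the actual telemetry-batch-view code):

# Re-create the funnelcake-as-channel identifier described above: if the
# distributionId looks like mozillaXX, rename the channel to
# "<channel>-cck-<distributionId>"; otherwise keep the channel as-is.
import re

FUNNELCAKE_RE = re.compile(r"^mozilla[0-9]+$")

def channel_with_funnelcake(channel, distribution_id):
    if distribution_id and FUNNELCAKE_RE.match(distribution_id):
        return "%s-cck-%s" % (channel, distribution_id)
    return channel

print(channel_with_funnelcake("release", "mozilla43"))      # release-cck-mozilla43
print(channel_with_funnelcake("release", "mozillaonline"))  # release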
Comment 36•10 years ago
(In reply to Mark Reid [:mreid] from comment #35)
> For this round, I've detected the funnelcake values from distributionId and
> used them to re-create the normalized[1] channel identifier we were using
> previously.
>
> So if the distributionId field matches ^mozilla[0-9]+$, then update the
> channel name to "<channel>-cck-<distributionId>", otherwise use the channel
> name directly.
>
> It should be a drop-in replacement on the dashboard we've been using for
> validation.
>
> [1] The normalized channel looks at variants of a channel (ccks, mostly) and
> uses the base channel, so things like "release-cck-mozillaonline" become
> simply "release".
Sounds good! Jean and I will get it plugged in and see what it looks like.
Comment 37•10 years ago
Here are the new instructions to combine everything into one CSV, given the filename changes for the funnelcakes:
1) Download all the .gz files that you need (funnelcake or not) into a folder on your desktop (call it "cohorts", for example)
2) Open up the Terminal
3) cd Desktop/
4) cd cohorts/
5) On the command line in the cohorts directory, type the following and hit Enter:
for f in *.actives*.csv.gz; do gunzip $f; done
6) On the command line, type the following and hit Enter:
echo "channel,geo,is_funnelcake,acquisition_period,start_version,sync_usage,current_version,current_week,is_active,n_profiles" > combined-actives.csv
7) On the command line, type the following and hit Enter:
tail -qn +2 churn-*.actives.csv >> combined-actives.csv
You should now have a single combined-actives.csv file to load into tableau for further analysis.
Comment 38•10 years ago
Oops, needed to make another change:
Here are the new instructions to combine everything into one CSV, given the filename changes for the funnelcakes:
1) Download all the .gz files that you need (funnelcake or not) into a folder on your desktop (call it "cohorts", for example)
2) Open up the Terminal
3) cd Desktop/
4) cd cohorts/
5) On the command line in the cohorts directory, type the following and hit Enter:
for f in *.actives*.csv.gz; do gunzip $f; done
6) On the command line, type the following and hit Enter:
echo "channel,geo,is_funnelcake,acquisition_period,start_version,sync_usage,current_version,current_week,is_active,n_profiles" > combined-actives.csv
7) On the command line, type the following and hit Enter:
tail -qn +2 *.actives*.csv >> combined-actives.csv
You should now have a single combined-actives.csv file to load into tableau for further analysis.
Comment 39•10 years ago
Hi Mark. Once we wrap this one, we can probably move on to bug 1253751 as this will just add another column of data to the data source that we can display with Tableau on separate tabs. Thanks in advance!
Comment 40•10 years ago
mreid: what's the status of this data source with CSVs on S3?
Also, do we have a bug filed to move this to Redshift?
Flags: needinfo?(mreid)
Assignee
Comment 41•10 years ago
The CSV output is ready to go - I've been talking with Josephine about the most convenient place to output them.
I just filed bug 1257506 to migrate the output to Redshift.
Flags: needinfo?(mreid)
Comment 42•10 years ago
:mreid: any idea why the multi-device FxA cohort data is showing 100% churn (or 0% retention)? The single-device and no-account cohorts look fine. Is there a problem with the source CSV file for multiple devices?
https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/BySyncType#1
Josephine, any idea?
Flags: needinfo?(mreid)
Flags: needinfo?(jtanumijaya)
Assignee
Comment 43•10 years ago
(In reply to Chris More [:cmore] from comment #42)
> :mreid: any idea why the multi devices FxA cohort data is showing 100% churn
> (or 0% retention). The single and no account look fine. Is there a problem
> with the source CSV file for the multiple devices?
>
> https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/
> BySyncType#1
>
> Josephine, any idea?
I suspect the measure of how many devices a sync account is connected to (WEAVE_DEVICE_COUNT_*) hasn't made its way to release yet. If you change to the beta channel, it appears non-zero.
Flags: needinfo?(mreid)
Comment 44•10 years ago
(In reply to Mark Reid [:mreid] from comment #43)
> (In reply to Chris More [:cmore] from comment #42)
> > :mreid: any idea why the multi devices FxA cohort data is showing 100% churn
> > (or 0% retention). The single and no account look fine. Is there a problem
> > with the source CSV file for the multiple devices?
> >
> > https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/
> > BySyncType#1
> >
> > Josephine, any idea?
>
> I suspect the measure for how many devices a sync account is connected to
> (WEAVE_DEVICE_COUNT_*) haven't made their way to release yet. If you change
> to the beta channel, it appears non-zero.
Ah yes, it was supposed to make it into 45, but it looks like the target milestone is 46: https://bugzilla.mozilla.org/show_bug.cgi?id=1232050
Comment 45•10 years ago
Chris, it looks like you got your answer?
I have attached the data for multiple devices (exported from Tableau) for you to have a look. I don't think there is enough data to draw conclusions about the retention rate for multiple devices.
Flags: needinfo?(jtanumijaya)
Assignee
Comment 46•10 years ago
Calling this bug "done" - the last mile of making data available to Tableau will be done over in bug 1257506.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 47•10 years ago
Mark, are you still continuously adding the weekly data to the Google Drive folder you provided?
Thanks,
Josephine
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard