Closed
Bug 1226379
Opened 10 years ago
Closed 10 years ago
Port v2 desktop churn/retention analysis cohort reports to unified telemetry
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: thuelbert, Assigned: mreid)
References
(Blocks 1 open bug)
Details
(Whiteboard: [fxgrowth])
Attachments
(1 file)
3.59 KB, text/csv
The v2 bug that covered the work: https://bugzilla.mozilla.org/show_bug.cgi?id=1198537
From our meeting with the churn guys:
- drops every Monday (previous Saturday through Sunday)
- dump to Redshift
Comment 1•10 years ago
Chris, we could use pointers to the existing code and outputs.
Flags: needinfo?(chrismore.bugzilla)
Updated•10 years ago
Priority: -- → P1
Comment 2•10 years ago
dzeber: can you drop in the location of your v2 scripts?
Katie: the output of the scripts is here, but Dave is moving it to a MySQL db: https://metrics.mozilla.com/protected/dzeber/cohort-activity/
Flags: needinfo?(chrismore.bugzilla) → needinfo?(dzeber)
Comment 3•10 years ago
The code lives here: https://github.com/mozilla/churn-analysis/tree/master/cohort-report
It runs on hala every Thursday evening, and dumps the new data to the MySQL table described in bug 1221331. Connection info is given at https://github.com/mozilla/churn-analysis/blob/master/cohort-report/pull-cohort-data.R#L244 (ping me on IRC for the password).
Flags: needinfo?(dzeber)
Comment 4•10 years ago
I get a 404 on that repo
Comment 5•10 years ago
I will not be able to address this request before the holiday. I still think this is a very good opportunity for the metrics team to learn how to use the new infrastructure. We have been bootstrapping a bunch of existing v2 reports for months and the core data pipeline work has been suffering; it would be beneficial to move to a self-service model sooner rather than later.
There are two ways this can be implemented.
1) Using the core pipeline infrastructure and a Lua sandbox (the solution I would provide)
Pros
- the computation can be performed while the data is being read (faster, less expensive, no intermediate output or storage costs)
- all the business logic exists in a single system
Cons
- everything we write against the pipeline seems to be owned by us forever
- depending on the requirements the sandboxes usually contain more code since they are more generalized
- when we need higher performance we switch to custom C Lua modules, which are not as approachable for many
2) Using a Spark cluster, in which case R can still be used
Pros
- metrics team can continue using R and self-serve in the future
- more external resources, documentation and help
Cons
- most likely still requires a Lua sandbox to transform/pre-digest the data into something Spark can consume directly (additional time, costs, code maintenance)
Assignee
Updated•10 years ago
Assignee: mtrinkala → mreid
Assignee
Comment 6•10 years ago
:Dexter and I are going to work on this.
Comment 7•10 years ago
What's the timeline here? For the marketing and product team, understanding and measuring churn/retention is a top-line KPI.
I have a number of retention cohort tests that I need to deploy in Q1 2016, and it makes me feel a bit uncomfortable that I don't have a way to measure retention based on Unified Telemetry data.
When in Q1 can I expect to see a draft of this new dashboard? It doesn't have to be perfect to start. Just something that can be iterated on over time.
Thanks!
Assignee
Comment 8•10 years ago
I'm currently working on a derived data set to power the churn analysis from Unified Telemetry. I expect to have that data set ready this week.
Once that data set is available, I don't anticipate too much more work to build a drop-in replacement for the current churn aggregates. I have functioning code to calculate the "presence" signal of clientIds we observed; the small data set will make it easier to infer the "absence" signal of clientIds we observed earlier but not in the current period (for calculating inactive clientIds).
I will be out of the office a couple of days this week, so I expect to have something tangible to look at by mid next week.
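For illustration, a minimal sketch of the presence/absence idea described above (the input shape, names, and week boundaries are assumptions for the sketch, not the actual churn dataset schema or code):

# Sketch only: assumes a dict mapping each clientId to the set of week labels
# in which it sent a ping; names are illustrative, not the real schema.
from datetime import date, timedelta

def week_label(d):
    # Label a date by the Sunday that starts its week (an assumption for this sketch).
    return (d - timedelta(days=(d.weekday() + 1) % 7)).isoformat()

def churn_counts(client_weeks, weeks):
    # For each week: "active" = clients seen that week (presence),
    # "inactive" = clients seen in an earlier week but not this one (absence).
    counts = {}
    for i, week in enumerate(weeks):
        earlier = set(weeks[:i])
        active = inactive = 0
        for weeks_seen in client_weeks.values():
            if week in weeks_seen:
                active += 1
            elif weeks_seen & earlier:
                inactive += 1
        counts[week] = {"active": active, "inactive": inactive}
    return counts

weeks = [week_label(date(2015, 11, 1) + timedelta(days=7 * i)) for i in range(4)]
client_weeks = {"a": {weeks[0], weeks[1]}, "b": {weeks[0]}, "c": {weeks[2]}}
print(churn_counts(client_weeks, weeks))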
Assignee
Comment 9•10 years ago
I've landed the code to generate the churn data set:
https://github.com/vitillo/telemetry-batch-view/pull/6
I'm backfilling the data now.
Comment 10•10 years ago
(In reply to Mark Reid [:mreid] from comment #9)
> I've landed the code to generate the churn data set:
> https://github.com/vitillo/telemetry-batch-view/pull/6
>
> I'm backfilling the data now.
Hey Mark. Any update here?
Once you get something re-built with UT, where will the data go? CSVs? redshift? MySQL as mentioned in bug 1221331?
Just want to get something up soon to review, even if it isn't right, just to make sure we're going in the right direction. There are many Firefox and marketing initiatives in 2016 that are tied to retention, and our old model and method of using FHR isn't valid now that UT is rolled out.
Thanks!
Flags: needinfo?(mreid)
Comment 11•10 years ago
Mark: Just want to make sure wherever you place the data that Josephine has access to it.
Josephine: If Mark places the unified telemetry data in Redshift, will you be able to access it there for Tableau? Will we still need MySQL?
Flags: needinfo?(jtanumijaya)
Comment 12•10 years ago
Hi Chris,
Yes, Tableau can access Redshift and we don't need MySQL for UT data.
Question: do you still need the old model and method using FHR as your reference? If not, and you're planning to start fresh using UT, then we don't have to do the MySQL for FHR data anymore.
Flags: needinfo?(jtanumijaya)
Assignee
Comment 13•10 years ago
(In reply to Chris More [:cmore] from comment #10)
> (In reply to Mark Reid [:mreid] from comment #9)
> > I've landed the code to generate the churn data set:
> > https://github.com/vitillo/telemetry-batch-view/pull/6
> >
> > I'm backfilling the data now.
>
> Hey Mark. Any update here?
I ran into some complications with the data set generation, so I'm continuing to work on that part. I anticipate another day or two for that, then back to the main logic of the churn aggregations.
> Once you get something re-built with UT, where will the data go? CSVs?
> redshift? MySQL as mentioned in bug 1221331?
CSV (to S3) or redshift output should both be straightforward.
>
> Just want to get up something soon to review, even if it isn't right, just
> to make sure we're going in the right direction. There are many Firefox and
> marketing initiatives in 2016 that are tied to retention and our old model
> and method of using FHR isn't valid now that UT is rolled out.
Yes, this makes sense! Getting churn data from UT is my #1 priority, and I'll try to get something reviewable asap.
Flags: needinfo?(mreid)
Comment 14•10 years ago
Hi Mark,
I was just looking at what I think is the PR for this:
https://github.com/vitillo/telemetry-batch-view/pull/6/files#diff-ea98ba9cfc5c794f0e9dd4b68900a97fR17
Assuming that this is the full schema for the resulting churn dataset, do we not need to include the profile build id as well? Isn't that what is used for identifying funnelcakes? Or am I misunderstanding?
Comment 15•10 years ago
Never mind, I see it's in the release channel.
Comment 16•10 years ago
(In reply to Josephine Tanumijaya from comment #12)
> Hi Chris,
>
> Yes, Tableau can access redshift and we don't need MySQL for UT data.
>
> Question: do you still need the old model and method using FHR as your
> reference. If not and you're thinking to start fresh using UT then we don't
> have to do the MySQL for FHR data anymore.
I would prefer having the old FHR report around as long as possible for reference. Likely the numbers will change somewhat, and we'll learn something new from the UT version.
Assignee
Comment 17•10 years ago
Quick update - I believe I've solved the complications I mentioned in Comment 13 with this PR:
https://github.com/vitillo/telemetry-batch-view/pull/10
I should be able to go back to working on the churn aggregation logic presently.
Assignee
Updated•10 years ago
Points: --- → 3
Assignee
Comment 18•10 years ago
Per discussion this morning, the initial output will be CSV stored on S3.
Assignee
Comment 19•10 years ago
Where would be a good place to put sample data from the new aggregation code? Josephine, if I share a CSV file on GDrive, would that be convenient for you?
Flags: needinfo?(jtanumijaya)
Comment 20•10 years ago
Yeah, GDrive is good if it's only for sample data - not a huge amount of data.
For the real data, dzeber put a weekly file here:
https://metrics.mozilla.com/protected/dzeber/cohort-activity/
Flags: needinfo?(jtanumijaya)
Assignee
Comment 21•10 years ago
It's pretty small so far, about 5MB compressed. I've shared the folder with you.
Assignee
Comment 22•10 years ago
Note that this initial version only contains counts where "is_active" == True, so the overall counts are not expected to match up. I found that the active counts are about 10x higher in the CSV created from FHR v2, so I will investigate that further.
Assignee
Comment 23•10 years ago
Minor update - I've generated and backfilled the derived dataset that's being used for churn. Data has been updated going back to November 1, 2015.
I've also deployed a scheduled job to update the dataset each day.
Assignee
Comment 24•10 years ago
I've started computing the CSV aggregates including both "active" and "inactive" signal, and have uploaded some updated sample data to gdrive. So far I have processed November and part of December. I will continue to add more data as it completes.
Assignee
Comment 25•10 years ago
I've run into an issue scaling up to larger datasets. Roberto, could you take a look at the notebook and let me know if you have any suggestions?
https://gist.github.com/mreid-moz/3cc075448f796caca658
Flags: needinfo?(rvitillo)
Comment 26•10 years ago
As discussed on IRC, replacing the countByKey with a reduceByKey might help.
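For reference, a rough PySpark sketch of that substitution (the RDD contents and key format here are made up, not taken from the notebook above):

# countByKey() materialises the full per-key count map on the driver, which
# stops scaling once there are many records/keys; reduceByKey() keeps the
# aggregation distributed across the cluster.
from pyspark import SparkContext

sc = SparkContext(appName="churn-aggregation-sketch")
records = sc.parallelize([("release|US", 1), ("beta|DE", 1), ("release|US", 1)])

driver_side = records.countByKey()                     # dict built on the driver
distributed = records.reduceByKey(lambda a, b: a + b)  # stays distributed
print(dict(driver_side), distributed.collect())
sc.stop()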
Flags: needinfo?(rvitillo)
Comment 27•10 years ago
Here are instructions on how to unzip and combine all the CSVs into a single CSV to use within Tableau:
These are for the *.actives.csv.gz files:
1) Download all the .gz files into a folder on your desktop (call it "cohorts", for example)
2) Open up the Terminal
3) cd Desktop/
4) cd cohorts/
5) On the command line in the cohorts directory, type the following and hit Enter:
for f in *.actives.csv.gz; do gunzip $f; done
6) On the command line, type the following and hit Enter:
echo "channel,geo,is_funnelcake,acquisition_period,start_version,sync_usage,current_version,current_week,is_active,n_profiles" > combined-actives.csv
7) On the command line, type the following and hit Enter:
tail -qn +2 churn-*.actives.csv >> combined-actives.csv
You should now have a single combined-actives.csv file to load into tableau for further analysis.
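If it's easier than the shell steps, here's a rough Python equivalent of steps 5-7 (a sketch that assumes the same churn-*.actives.csv.gz naming and that each file carries its own header row):

# Decompress the weekly churn-*.actives.csv.gz files and concatenate them
# (skipping each file's header row) into a single combined-actives.csv.
import glob
import gzip

HEADER = ("channel,geo,is_funnelcake,acquisition_period,start_version,"
          "sync_usage,current_version,current_week,is_active,n_profiles")

with open("combined-actives.csv", "w") as out:
    out.write(HEADER + "\n")
    for path in sorted(glob.glob("churn-*.actives.csv.gz")):
        with gzip.open(path, "rt") as f:
            next(f)  # skip the per-file header row
            for line in f:
                out.write(line)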
Assignee
Comment 28•10 years ago
For now, I've uploaded sample data files containing only the "active" users.
Comment 29•10 years ago
:mreid: Given what we learned in bug 1246722, we will need to include another field in the CSVs for the distributionID. Without that field, I won't be able to compare the funnelcakes, and that's a must-have for retention. I had it with FHR because the funnelcake IDs were part of the channel name. Sorry for the change in scope, but that was unknown until now and I had no idea they changed it.
Assignee
Comment 30•10 years ago
I'll update the Churn dataset accordingly. While we're still in "validation" mode, I can reconstruct the funnelcake-as-channel format from the old CSV if that makes things easier.
Flags: needinfo?(chrismore.bugzilla)
Comment 31•10 years ago
(In reply to Chris More [:cmore] from comment #29)
> :mreid: Given what we learned in bug 1246722, we will need to include
> another field in the CSVs for the distributionID field. Without that field,
> I won't be able to compare the funnelcakes and that's a must have for
> retention. I had it with FHR because the funnelcake IDs were part of the
> channel name. Sorry for the change in scope, but that was unknown until now
> and I had no idea they changed it.
Sorry if this is already clear, but just in case - we'd only want the distributionID for funnelcake builds. Non-funnelcake release channel profiles have a very wide variety of distributionID strings that are not relevant here.
Maybe we could add a 'funnelcakeID' column which gives the ID number (XX in mozillaXX) for funnelcake builds and is null for non-funnelcake, or something like that.
Comment 32•10 years ago
Is the harm of including non-funnelcake builds in the distributionID that the data set just gets much bigger, given all of the partner builds? I can see use-cases where it would be nice to know the retention of partner builds beyond funnelcakes, but I guess to limit scope, mozillaXX (^mozilla[0-4]+$) is sufficient for now.
Flags: needinfo?(chrismore.bugzilla)
Comment 33•10 years ago
Yes, it gets a lot bigger, and I think it would mean a change in scope.
(In reply to Chris More [:cmore] from comment #32)
> (^mozilla[0-4]+$)
Did you mean ^mozilla[0-9]+$ ?
Comment 34•10 years ago
(In reply to Dave Zeber [:dzeber] from comment #33)
> Yes, it gets a lot bigger, and I think it would mean a change in scope.
>
> (In reply to Chris More [:cmore] from comment #32)
> > (^mozilla[0-4]+$)
>
> Did you mean ^mozilla[0-9]+$ ?
Yes, sorry long days.
Assignee
Comment 35•10 years ago
For this round, I've detected the funnelcake values from distributionId and used them to re-create the normalized[1] channel identifier we were using previously.
So if the distributionId field matches ^mozilla[0-9]+$, then update the channel name to "<channel>-cck-<distributionId>", otherwise use the channel name directly.
It should be a drop-in replacement on the dashboard we've been using for validation.
[1] The normalized channel looks at variants of a channel (ccks, mostly) and uses the base channel, so things like "release-cck-mozillaonline" become simply "release".
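For clarity, a small sketch of that renaming rule (illustrative Python, not the actual telemetry-batch-view code):

# Re-create the funnelcake-as-channel identifier described above: if the
# distributionId looks like mozillaXX, rename the channel to
# "<channel>-cck-<distributionId>"; otherwise keep the channel as-is.
import re

FUNNELCAKE_RE = re.compile(r"^mozilla[0-9]+$")

def channel_with_funnelcake(channel, distribution_id):
    if distribution_id and FUNNELCAKE_RE.match(distribution_id):
        return "%s-cck-%s" % (channel, distribution_id)
    return channel

print(channel_with_funnelcake("release", "mozilla43"))      # release-cck-mozilla43
print(channel_with_funnelcake("release", "mozillaonline"))  # release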
Comment 36•10 years ago
(In reply to Mark Reid [:mreid] from comment #35)
> For this round, I've detected the funnelcake values from distributionId and
> used them to re-create the normalized[1] channel identifier we were using
> previously.
>
> So if the distributionId field matches ^mozilla[0-9]+$, then update the
> channel name to "<channel>-cck-<distributionId>", otherwise use the channel
> name directly.
>
> It should be a drop-in replacement on the dashboard we've been using for
> validation.
>
> [1] The normalized channel looks at variants of a channel (ccks, mostly) and
> uses the base channel, so things like "release-cck-mozillaonline" become
> simply "release".
Sounds good! Jean and I will get it plugged in and see what it looks like.
Comment 37•10 years ago
Here are the new instructions to combine everything into one CSV, given the filename changes for the funnelcakes:
1) Download all the .gz files that you need (funnelcake or not) into a folder on your desktop (call it "cohorts", for example)
2) Open up the Terminal
3) cd Desktop/
4) cd cohorts/
5) On the command line in the cohorts directory, type the following and hit Enter:
for f in *.actives*.csv.gz; do gunzip $f; done
6) On the command line, type the following and hit Enter:
echo "channel,geo,is_funnelcake,acquisition_period,start_version,sync_usage,current_version,current_week,is_active,n_profiles" > combined-actives.csv
7) On the command line, type the following and hit Enter:
tail -qn +2 churn-*.actives.csv >> combined-actives.csv
You should now have a single combined-actives.csv file to load into tableau for further analysis.
Comment 38•10 years ago
Oops, needed to make another change:
Here are the new instructions to combine everything into one CSV, given the filename changes for the funnelcakes:
1) Download all the .gz files that you need (funnelcake or not) into a folder on your desktop (call it "cohorts", for example)
2) Open up the Terminal
3) cd Desktop/
4) cd cohorts/
5) On the command line in the cohorts directory, type the following and hit Enter:
for f in *.actives*.csv.gz; do gunzip $f; done
6) On the command line, type the following and hit Enter:
echo "channel,geo,is_funnelcake,acquisition_period,start_version,sync_usage,current_version,current_week,is_active,n_profiles" > combined-actives.csv
7) On the command line, type the following and hit Enter:
tail -qn +2 *.actives*.csv >> combined-actives.csv
You should now have a single combined-actives.csv file to load into tableau for further analysis.
Comment 39•10 years ago
Hi Mark. Once we wrap this one, we can probably move on to bug 1253751 as this will just add another column of data to the data source that we can display with Tableau on separate tabs. Thanks in advance!
Comment 40•10 years ago
mreid: what's the status of this data source with CSVs on S3?
Also, do we have a bug filed to move this to Redshift?
Flags: needinfo?(mreid)
Assignee
Comment 41•10 years ago
The CSV output is ready to go - I've been talking with Josephine about the most convenient place to output them.
I just filed bug 1257506 to migrate the output to Redshift.
Flags: needinfo?(mreid)
Comment 42•10 years ago
:mreid: any idea why the multi-device FxA cohort data is showing 100% churn (or 0% retention)? The single-device and no-account cohorts look fine. Is there a problem with the source CSV file for multiple devices?
https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/BySyncType#1
Josephine, any idea?
Flags: needinfo?(mreid)
Flags: needinfo?(jtanumijaya)
Assignee
Comment 43•10 years ago
(In reply to Chris More [:cmore] from comment #42)
> :mreid: any idea why the multi devices FxA cohort data is showing 100% churn
> (or 0% retention). The single and no account look fine. Is there a problem
> with the source CSV file for the multiple devices?
>
> https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/
> BySyncType#1
>
> Josephine, any idea?
I suspect the measure of how many devices a sync account is connected to (WEAVE_DEVICE_COUNT_*) hasn't made its way to release yet. If you change to the beta channel, it appears non-zero.
Flags: needinfo?(mreid)
Comment 44•10 years ago
(In reply to Mark Reid [:mreid] from comment #43)
> (In reply to Chris More [:cmore] from comment #42)
> > :mreid: any idea why the multi devices FxA cohort data is showing 100% churn
> > (or 0% retention). The single and no account look fine. Is there a problem
> > with the source CSV file for the multiple devices?
> >
> > https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/
> > BySyncType#1
> >
> > Josephine, any idea?
>
> I suspect the measure for how many devices a sync account is connected to
> (WEAVE_DEVICE_COUNT_*) haven't made their way to release yet. If you change
> to the beta channel, it appears non-zero.
Ah yes, it was supposed to make it into 45, but it looks like the target milestone is 46: https://bugzilla.mozilla.org/show_bug.cgi?id=1232050
Comment 45•10 years ago
Chris, it looks like you got your answer?
I have attached the data for multiple devices (exported from Tableau) for you to have a look. I don't think there is enough data to draw conclusions about the retention rate for multiple devices.
Flags: needinfo?(jtanumijaya)
Assignee
Comment 46•10 years ago
Calling this bug "done" - the last mile of making data available to Tableau will be done over in bug 1257506.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 47•10 years ago
Mark, are you still continuously adding the weekly data to the Google Drive folder you provided?
Thanks,
Josephine
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard