Closed
Bug 1221331
Opened 10 years ago
Closed 10 years ago
Create MySQL table for cohort retention report
Categories
(Data & BI Services Team :: DB: MySQL, task)
Data & BI Services Team
DB: MySQL
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dzeber, Assigned: team73)
Details
Currently the cohort retention report (bug 1198537) is a CSV table that gets generated on a weekly basis. There is a request (bug 1204951) to get this output dumped into MySQL so that the data can be loaded into Tableau.
I would like to create a MySQL database with a table set up to contain this data, and I will be performing a weekly upsert on the table as a part of the job that generates the new dataset.
Here is the schema for the table, with the proposed data type and a short description of each column:
 # | column             | type              | description
 1 | channel            | varchar(50)       | update channel (sanitized)
 2 | geo                | varchar(2)        | 2-letter country code
 3 | is_funnelcake      | boolean           | whether or not the channel is on funnelcake
 4 | acquisition_period | date              | the week of profile creation
 5 | start_version      | varchar(15)       | the version on which the profile was created
 6 | sync_usage         | 3 possible values | whether the profile has sync set up on no devices, a single device, or multiple devices
 7 | current_version    | varchar(15)       | the version the profile was on during the current week
 8 | current_week       | int               | the current week, numbered relative to acquisition week (week 0)
 9 | is_active          | boolean           | whether the profile was active during the current week
10 | n_profiles         | int               | number of profiles in the segment described by the other columns
I'm not sure what the appropriate type is for 'sync_usage', a categorical variable with 3 possible values (e.g. maybe int {0,1,2}).
The actual CSVs are available behind LDAP at https://metrics.mozilla.com/protected/dzeber/cohort-activity/.
The key for the upsert will be columns 1-9. For each such key in the new dataset, if it already exists in the database its count should be overwritten with the new value, otherwise a new row should be added. Existing keys that are not in the new dataset should be left as-is.
Also, it would be great if I could get a code snippet to perform the upsert that includes the right paths and SQL commands.
Comment 1•10 years ago
sync_usage could be an ENUM (enumerated list): it looks like text, but the underlying storage is numeric, so it's small and fast while still queryable using text. E.g. an ENUM list of {yes, no, maybe} corresponds to {1, 2, 3}, but you query using "yes"/"no"/"maybe", and that's what it looks like in the DB, so there's no need to memorize that 1 is yes and 2 is no.
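Applied to this table, a minimal sketch of what that could look like (the three labels here are taken from the sync_usage description above and are only illustrative; the actual label names would be up to Dave):

```sql
-- Stored internally as 1/2/3, but compared and displayed as text.
sync_usage ENUM('no', 'single', 'multiple') NOT NULL
-- e.g. SELECT ... WHERE sync_usage = 'multiple';
```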
Pythian:
This table should be called cohort_retention and be created in a new database called exec_dashboards on the metrics cluster, master is metrics2.db.scl3.mozilla.com.
The table should be InnoDB, and have an auto increment primary key. Please also add a unique key of columns 1-9.
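Putting the schema and the requirements above together, a possible CREATE statement could look like the following sketch (the NOT NULL constraints and the ENUM labels for sync_usage are assumptions; the unique key name matches the one referenced later in this thread and can be adjusted as needed):

```sql
CREATE DATABASE IF NOT EXISTS exec_dashboards;

CREATE TABLE exec_dashboards.cohort_retention (
    id                 INT UNSIGNED NOT NULL AUTO_INCREMENT,
    channel            VARCHAR(50)  NOT NULL,
    geo                VARCHAR(2)   NOT NULL,
    is_funnelcake      BOOLEAN      NOT NULL,
    acquisition_period DATE         NOT NULL,
    start_version      VARCHAR(15)  NOT NULL,
    sync_usage         ENUM('no', 'single', 'multiple') NOT NULL,
    current_version    VARCHAR(15)  NOT NULL,
    current_week       INT          NOT NULL,
    is_active          BOOLEAN      NOT NULL,
    n_profiles         INT          NOT NULL,
    PRIMARY KEY (id),
    -- unique key over columns 1-9 so the weekly upsert can match on them
    UNIQUE KEY unq_cohort_retention (channel, geo, is_funnelcake,
        acquisition_period, start_version, sync_usage,
        current_version, current_week, is_active)
) ENGINE=InnoDB;
```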
Please create a new user, exec_dash, with read/write on all tables in the exec_dashboards database. Please send me the password via secure mechanisms (a private IRC message works, as I connect via IRC over SSL).
Please update the tableau_read user to be able to read all tables in the exec_dashboards database.
Please give Dave some sample upsert statements for this, using INSERT ... ON DUPLICATE KEY UPDATE.
Thanx!
Comment 2•10 years ago
It would be great if this could get done by end of day Thursday (even better if sooner)!
Assignee
Comment 3•10 years ago
Hi,
We have created an internal CR, https://secure.pythian.com/track/cr/1002996, and have started working on it.
Thanks,
Rohit Kumar
Assignee
Comment 4•10 years ago
Hi,
We have executed the changes:
- Created database exec_dashboards on metrics2.db.scl3.mozilla.com
- Created table cohort_retention
- Created a new user, exec_dash, with read/write on all tables in the exec_dashboards database
- Updated the tableau_read user to be able to read all tables in the exec_dashboards database
- Sent the password to Sheeri via secure mechanisms (private IRC message, as she connects via IRC over SSL)
There are a few steps to insert data as per the requirement. Below are sample upsert statements for uploading the data; we can create a script around these to upload efficiently.
a) Create a new temporary table.
mysql> CREATE TEMPORARY TABLE tmp_cohort_retention LIKE cohort_retention;
b) Optionally, drop all indices from the temporary table to speed things up.
mysql> SHOW INDEX FROM tmp_cohort_retention;
mysql> DROP INDEX `PRIMARY` ON tmp_cohort_retention;
mysql> DROP INDEX `unq_cohort_retention` ON tmp_cohort_retention;
c) Load the CSV into the temporary table.
mysql> LOAD DATA INFILE 'your_file.csv'
    -> INTO TABLE tmp_cohort_retention
    -> FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    -> (channel, geo, is_funnelcake, acquisition_period, start_version, sync_usage, current_version, current_week, is_active, n_profiles);
d) Copy the data across using ON DUPLICATE KEY UPDATE.
mysql> SHOW COLUMNS FROM cohort_retention;
mysql> INSERT INTO cohort_retention
    -> SELECT * FROM tmp_cohort_retention
    -> ON DUPLICATE KEY UPDATE
    ->   channel = VALUES(channel), geo = VALUES(geo), is_funnelcake = VALUES(is_funnelcake),
    ->   acquisition_period = VALUES(acquisition_period), start_version = VALUES(start_version),
    ->   sync_usage = VALUES(sync_usage), current_version = VALUES(current_version),
    ->   current_week = VALUES(current_week), is_active = VALUES(is_active),
    ->   n_profiles = VALUES(n_profiles);
e) Remove the temporary table
mysql> DROP TEMPORARY TABLE tmp_cohort_retention;
Please let us know if there are any issues.
We can also prepare a procedure for upsert.
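A minimal sketch of what such a procedure could look like (the procedure and parameter names are illustrative; it updates only n_profiles on a key collision, since the other nine columns make up the unique key and are therefore identical on a duplicate):

```sql
DELIMITER //
CREATE PROCEDURE upsert_cohort_retention(
    IN p_channel VARCHAR(50), IN p_geo VARCHAR(2), IN p_is_funnelcake BOOLEAN,
    IN p_acquisition_period DATE, IN p_start_version VARCHAR(15),
    IN p_sync_usage VARCHAR(10), IN p_current_version VARCHAR(15),
    IN p_current_week INT, IN p_is_active BOOLEAN, IN p_n_profiles INT)
BEGIN
    INSERT INTO cohort_retention
        (channel, geo, is_funnelcake, acquisition_period, start_version,
         sync_usage, current_version, current_week, is_active, n_profiles)
    VALUES
        (p_channel, p_geo, p_is_funnelcake, p_acquisition_period, p_start_version,
         p_sync_usage, p_current_version, p_current_week, p_is_active, p_n_profiles)
    -- only the count needs updating; the other columns matched the unique key
    ON DUPLICATE KEY UPDATE n_profiles = VALUES(n_profiles);
END //
DELIMITER ;
```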
We have an internal CR for reference:
https://secure.pythian.com/track/cr/1002996
Thanks,
Rohit Kumar
Comment 5•10 years ago
Those load commands are great, but we were looking for a simpler upsert:
INSERT INTO cohort_retention (channel, geo, is_funnelcake, acquisition_period, start_version, sync_usage, current_version, current_week, is_active, n_profiles)
VALUES (....)
ON DUPLICATE KEY UPDATE
  channel = VALUES(channel), geo = VALUES(geo), is_funnelcake = VALUES(is_funnelcake),
  acquisition_period = VALUES(acquisition_period), start_version = VALUES(start_version),
  sync_usage = VALUES(sync_usage), current_version = VALUES(current_version),
  current_week = VALUES(current_week), is_active = VALUES(is_active),
  n_profiles = VALUES(n_profiles);
Where VALUES (....) contains a comma-separated list of the actual values, like VALUES('channel1', 'geo1', ....)
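For instance, with concrete values filled in (the row here is purely illustrative; note that updating only n_profiles would also suffice, since the other nine columns form the unique key and are identical whenever a duplicate is hit):

```sql
INSERT INTO cohort_retention
    (channel, geo, is_funnelcake, acquisition_period, start_version,
     sync_usage, current_version, current_week, is_active, n_profiles)
VALUES
    ('aurora', '--', 0, '2014-12-07', '36.0', 'multiple', '36.0', 17, 0, 2)
-- on a key collision, only the count actually changes
ON DUPLICATE KEY UPDATE n_profiles = VALUES(n_profiles);
```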
Anyway, I think that's all that's needed, will leave it to the devs to decide if this is enough info.
Assignee
Comment 6•10 years ago
Hi Sheeri,
Thanks for the update. I provided it as an option (though a bit long :)), since we have a big key (covering 9 columns) which may affect insert performance.
Thanks,
Rohit Kumar
Comment 7•10 years ago
Thanx!
I am working on loading the data from https://metrics.mozilla.com/protected/dzeber/cohort-activity/ into the table now.
Comment 8•10 years ago
dzeber: When importing the data files, the boolean values are set as "yes" or "no". I can change the files themselves so "no" is 0 and "yes" is 1, but your script will need to be modified.
Also, I'm seeing a country code of ROW - what is that? (the import is complaining that the field only allows 2 chars but this is 3):
e.g.
"aurora","ROW",0,"2014-12-07","36.0","multiple","36.0",17,0,2
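Rather than rewriting the files, MySQL's LOAD DATA can also convert "yes"/"no" to 1/0 at load time, by reading those fields into user variables and converting them with a SET clause. A sketch against the temporary table from the earlier steps (file name as before):

```sql
LOAD DATA INFILE 'your_file.csv'
INTO TABLE tmp_cohort_retention
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(channel, geo, @is_funnelcake, acquisition_period, start_version,
 sync_usage, current_version, current_week, @is_active, n_profiles)
-- ('yes' comparison yields 1 for "yes", 0 for anything else)
SET is_funnelcake = (@is_funnelcake = 'yes'),
    is_active     = (@is_active = 'yes');
```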
Reporter
Comment 9•10 years ago
Rohit: thanks for setting this up!
Sheeri: yes, some values need to be modified. I need to adjust my script to prepare the data for dumping to the table. "ROW" means "rest of world", included so that we can get global totals. We can code it as "RW", or else as something like "--".
Comment 10•10 years ago
For now let's do "--", as there may be Firefox in Rwanda... I can change the files I have.
Reporter
Comment 11•10 years ago
Sounds good.
Comment 12•10 years ago
FYI, all the data has been imported (Rohit, actually for the initial import, your statements were super helpful).
Josephine, you should have data to work from now.
Dave, let me know the best way to transfer the password for the "exec_dash" user.
Reporter
Comment 13•10 years ago
Sheeri: Let's connect on IRC.
Comment 14•10 years ago
We connected on IRC, and Dave confirmed he has access.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED