Closed Bug 1221331 Opened 10 years ago Closed 10 years ago

Create MySQL table for cohort retention report

Categories

(Data & BI Services Team :: DB: MySQL, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dzeber, Assigned: team73)

Details

The cohort retention report (bug 1198537) is currently a CSV table that gets generated on a weekly basis. There is a request (bug 1204951) to have this output dumped into MySQL so that the data can be loaded into Tableau. I would like to create a MySQL database with a table set up to contain this data, and I will perform a weekly upsert on the table as part of the job that generates the new dataset.

Here is the schema for the table, with the proposed data type and a short description for each column:

1  | channel            | varchar(50)       | update channel (sanitized)
2  | geo                | varchar(2)        | 2-letter country code
3  | is_funnelcake      | boolean           | whether or not the channel is on funnelcake
4  | acquisition_period | date              | the week of profile creation
5  | start_version      | varchar(15)       | the version on which the profile was created
6  | sync_usage         | 3 possible values | whether the profile has sync set up on no devices, a single device, or multiple devices
7  | current_version    | varchar(15)       | the version the profile was on during the current week
8  | current_week       | int               | the current week, numbered relative to acquisition week (week 0)
9  | is_active          | boolean           | whether the profile was active during the current week
10 | n_profiles         | int               | number of profiles in the segment described by the other columns

I'm not sure what the appropriate type for 'sync_usage' is; it is a categorical variable with 3 possible values (e.g. maybe int {0,1,2}). The actual CSVs are available behind LDAP at https://metrics.mozilla.com/protected/dzeber/cohort-activity/.

The key for the upsert will be columns 1-9. For each such key in the new dataset, if it already exists in the database its count should be overwritten with the new value; otherwise a new row should be added. Existing keys that are not in the new dataset should be left as-is. Also, it would be great if I could get a code snippet to perform the upsert that includes the right paths and SQL commands.
sync_usage could be an ENUM (enumerated list): it looks like text but the underlying storage is numeric, so it's small and fast, yet you can query it using text. For example, an ENUM list of {yes, no, maybe} corresponds to {1, 2, 3}, but you query using "yes", "no", "maybe" and that's what it looks like in the DB, so there is no need to memorize that 1 is yes and 2 is no.

Pythian: this table should be called cohort_retention and be created in a new database called exec_dashboards on the metrics cluster; the master is metrics2.db.scl3.mozilla.com. The table should be InnoDB and have an auto-increment primary key. Please also add a unique key over columns 1-9.

Please create a new user, exec_dash, with read/write on all tables in the exec_dashboards database. Please send me the password via secure mechanisms (a private IRC message works, as I connect via IRC over SSL). Please update the tableau_read user to be able to read all tables in the exec_dashboards database. Please give Dave some sample upsert statements for this, using INSERT ... ON DUPLICATE KEY UPDATE. Thanx!
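[Editor's note] A minimal sketch of what such a table definition might look like, assuming an id column for the auto-increment primary key, the ENUM labels 'no'/'single'/'multiple' for sync_usage (only 'multiple' is confirmed by the sample data later in this thread), and the unique-key name unq_cohort_retention referenced in the load steps below:

CREATE TABLE cohort_retention (
    id                 INT UNSIGNED NOT NULL AUTO_INCREMENT,       -- auto-increment PK (column name assumed)
    channel            VARCHAR(50) NOT NULL,                       -- update channel (sanitized)
    geo                VARCHAR(2) NOT NULL,                        -- 2-letter country code
    is_funnelcake      BOOLEAN NOT NULL,                           -- whether the channel is on funnelcake
    acquisition_period DATE NOT NULL,                              -- week of profile creation
    start_version      VARCHAR(15) NOT NULL,                       -- version the profile was created on
    sync_usage         ENUM('no', 'single', 'multiple') NOT NULL,  -- labels assumed, not confirmed
    current_version    VARCHAR(15) NOT NULL,                       -- version during the current week
    current_week       INT NOT NULL,                               -- week number relative to acquisition week (week 0)
    is_active          BOOLEAN NOT NULL,                           -- active during the current week
    n_profiles         INT NOT NULL,                               -- profiles in this segment
    PRIMARY KEY (id),
    UNIQUE KEY unq_cohort_retention (channel, geo, is_funnelcake, acquisition_period,
        start_version, sync_usage, current_version, current_week, is_active)
) ENGINE=InnoDB;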
It would be great if this could get done by end of day Thursday (even better if sooner)!
Hi, We have created internal CR https://secure.pythian.com/track/cr/1002996 and started working on it. Thanks, Rohit Kumar
Hi,

We have executed the changes:
- Created database exec_dashboards on metrics2.db.scl3.mozilla.com
- Created table cohort_retention
- Created a new user, exec_dash, with read/write on all tables in the exec_dashboards database
- Updated the tableau_read user to be able to read all tables in the exec_dashboards database
- Password sent to Sheeri via secure mechanisms (private IRC message, as she connects via IRC over SSL)

There are a few steps to insert data as per the requirement. Below are sample upsert statements for uploading the data; we can create a script for this to upload efficiently.

a) Create a new temporary table.
mysql> CREATE TEMPORARY TABLE tmp_cohort_retention LIKE cohort_retention;

b) Optionally, drop all indices from the temporary table to speed things up (note that the PRIMARY key can only be dropped after removing the AUTO_INCREMENT attribute from its column).
mysql> SHOW INDEX FROM tmp_cohort_retention;
mysql> DROP INDEX `PRIMARY` ON tmp_cohort_retention;
mysql> DROP INDEX `unq_cohort_retention` ON tmp_cohort_retention;

c) Load the CSV into the temporary table.
LOAD DATA INFILE 'your_file.csv'
INTO TABLE tmp_cohort_retention
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(channel, geo, is_funnelcake, acquisition_period, start_version,
 sync_usage, current_version, current_week, is_active, n_profiles);

d) Copy the data into the real table using ON DUPLICATE KEY UPDATE.
mysql> SHOW COLUMNS FROM cohort_retention;
mysql> INSERT INTO cohort_retention
       SELECT * FROM tmp_cohort_retention
       ON DUPLICATE KEY UPDATE
           channel = VALUES(channel),
           geo = VALUES(geo),
           is_funnelcake = VALUES(is_funnelcake),
           acquisition_period = VALUES(acquisition_period),
           start_version = VALUES(start_version),
           sync_usage = VALUES(sync_usage),
           current_version = VALUES(current_version),
           current_week = VALUES(current_week),
           is_active = VALUES(is_active),
           n_profiles = VALUES(n_profiles);

e) Remove the temporary table.
mysql> DROP TEMPORARY TABLE tmp_cohort_retention;

Please let us know if there are any issues. We can also prepare a stored procedure for the upsert. Internal CR for reference: https://secure.pythian.com/track/cr/1002996

Thanks, Rohit Kumar
Those load commands are great, but we were looking for a simpler upsert:

INSERT INTO cohort_retention
    (channel, geo, is_funnelcake, acquisition_period, start_version,
     sync_usage, current_version, current_week, is_active, n_profiles)
VALUES (....)
ON DUPLICATE KEY UPDATE
    channel = VALUES(channel),
    geo = VALUES(geo),
    is_funnelcake = VALUES(is_funnelcake),
    acquisition_period = VALUES(acquisition_period),
    start_version = VALUES(start_version),
    sync_usage = VALUES(sync_usage),
    current_version = VALUES(current_version),
    current_week = VALUES(current_week),
    is_active = VALUES(is_active),
    n_profiles = VALUES(n_profiles);

where VALUES (....) contains a comma-separated list of the actual values, like VALUES ('channel1', 'geo1', ....). Anyway, I think that's all that's needed; I'll leave it to the devs to decide if this is enough info.
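[Editor's note] For illustration only, here is a single-row version of that statement with made-up values. Since the unique key spans columns 1-9, a duplicate-key hit means those nine columns already match, so updating only n_profiles would be sufficient:

INSERT INTO cohort_retention
    (channel, geo, is_funnelcake, acquisition_period, start_version,
     sync_usage, current_version, current_week, is_active, n_profiles)
VALUES
    ('release', 'US', 0, '2015-10-04', '41.0', 'multiple', '42.0', 4, 1, 1234)  -- hypothetical row
ON DUPLICATE KEY UPDATE
    n_profiles = VALUES(n_profiles);  -- only the count can differ for an existing key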
Hi Sheeri, Thanks for the update. I provided it as an option (though a bit long :)), as we have a big key (spanning 9 columns) which may affect insert performance. Thanks, Rohit Kumar
Thanx! I am working on loading the data from https://metrics.mozilla.com/protected/dzeber/cohort-activity/ into the table now.
dzeber: When importing the data files, the boolean values are given as "yes" or "no". I can change the files themselves so "no" is 0 and "yes" is 1, but your script will need to be modified. Also, I'm seeing a country code of ROW - what is that? The import is complaining that the field only allows 2 characters but this is 3, e.g.: "aurora","ROW",0,"2014-12-07","36.0","multiple","36.0",17,0,2
Rohit: thanks for setting this up! Sheeri: yes, some values need to be modified. I need to adjust my script to prepare the data for dumping to the table. "ROW" means "rest of world", included so that we can get global totals. We can code it as "RW", or else as something like "--".
For now let's do "--", as there may be Firefox in Rwanda... I can change the files I have.
Sounds good.
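[Editor's note] If it turns out to be easier to transform the values at load time than to edit the CSV files, LOAD DATA can read fields into user variables and convert them in a SET clause. A sketch along those lines, reusing the temporary-table workflow and file name from the steps above, with the "yes"/"no" to 1/0 and "ROW" to "--" mappings agreed in this thread:

LOAD DATA INFILE 'your_file.csv'
INTO TABLE tmp_cohort_retention
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(channel, @geo, @is_funnelcake, acquisition_period, start_version,
 sync_usage, current_version, current_week, @is_active, n_profiles)
SET geo           = IF(@geo = 'ROW', '--', @geo),   -- code "rest of world" as "--" to fit varchar(2)
    is_funnelcake = (@is_funnelcake = 'yes'),       -- map yes/no to 1/0
    is_active     = (@is_active = 'yes');           -- map yes/no to 1/0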
FYI, all the data has been imported (Rohit, actually for the initial import, your statements were super helpful). Josephine, you should have data to work from now. Dave, let me know the best way to transfer the password for the "exec_dash" user.
Sheeri: Let's connect on IRC.
We connected on IRC, and Dave confirmed he has access.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED