Closed Bug 1441382 (opened 8 years ago, closed 7 years ago)

Finalize plans to copy and reload main_summary after backfill

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points: 2


RESOLVED FIXED

People

(Reporter: bugzilla, Assigned: bugzilla)


Really hoping this is the last backfill where we copy data around! We need to move ~35 TB at the end of the backfill, overwriting /v4 from 20170701 on. While the data is transferring and we rerun p2h, we'll consider main_summary to be on scheduled downtime, which we'd like to minimize. If we run the transfer with multiple threads on a few of the 10 Gbps machines (c5.large, say), we should be able to transfer all the data within an hour or two: I've clocked 1 day of main_summary at ~15 minutes, and have transferred up to 6 days at a time on one of the atmo c4.xlarge machines without any loss of parallelism, given a few config tweaks.

The other consideration here is that since we're *overwriting* a path, we may run into S3 eventual-consistency issues. My proposal is to write a script that polls the main_summary path at an interval until the files in the backfill bucket match the files in the production bucket; once we have a clean run of this script, we can commence running p2h (a sketch of such a poller follows the to-do list below). The hope is to get this all done within a weekend to minimize disruption to end users.

So, to-dos to complete this bug:

- Document the threaded copy script
- Write the bucket comparison script
- Write a step-by-step checklist for the actual transfer
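A minimal sketch of such a comparison poller, assuming boto3; the bucket names are taken from the checklist later in this bug (not from any final script), and the 60-second interval is arbitrary:

```python
#!/usr/bin/env python
"""Bucket comparison poller: a minimal sketch, not the script that was
eventually written. Assumes boto3; bucket names come from the checklist
later in this bug, and the polling interval is arbitrary."""
import time

import boto3

s3 = boto3.client("s3")


def list_keys(bucket, prefix):
    """Map each object's key (relative to prefix) to its size."""
    keys = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys[obj["Key"][len(prefix):]] = obj["Size"]
    return keys


def wait_until_consistent(src_bucket, dst_bucket, prefix, interval=60):
    """Poll until the destination listing matches the source listing."""
    while True:
        src = list_keys(src_bucket, prefix)
        dst = list_keys(dst_bucket, prefix)
        if src == dst:
            print("clean run: %d objects match" % len(src))
            return
        print("mismatch: %d vs %d objects, retrying" % (len(src), len(dst)))
        time.sleep(interval)


if __name__ == "__main__":
    wait_until_consistent("telemetry-backfill", "telemetry-parquet",
                          "main_summary/v4/")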
Updates:

- It's probably ~50 TB.
- Check how long a transfer would take via AWS Lambda (note: Lambda functions time out after 5 minutes, so we can't do 1 day at a time, and there's also an account-level limit of 1000 concurrent function executions at any given time, so this may or may not work).
The work in Bug 1434349 might be worth folding into this backfill; otherwise it will be difficult to switch from partitioning to sorting by sample_id.
Comment 2 was discussed off-bug, and it was agreed that sorting by sample_id is too risky a change to include in this backfill.
Priority: -- → P1
This is going to be much easier than I thought on a number of fronts:

- Because we merged the backfill code into production main_summary on Thursday 3/29, Athena has already picked up all the schema changes, and robotblake added all the new columns to the Presto metastore today. We won't need to make any metastore changes after copying the data over.
- I don't actually need to write a comparison script: new S3 objects have read-after-write consistency, and it's only the deleted parquet files we need to worry about re: eventual consistency, so we just need to make sure the number of objects in the directories of interest matches the number expected.

So, basically, I just need to spin up an enhanced-networking EC2 node, run the copy script, and then poll S3 with --summarize and the correct "include"/"exclude" parameters until the numbers match up. The threaded copy script is here, for lack of a better place: https://gist.github.com/sunahsuh/3ff14afd55d6e45afbf48c0c0032010b (a stripped-down sketch of the approach follows this comment).

Note: after this backfill is over, I'll revisit some of the things we can improve on for the next backfill, including collecting the experiential knowledge/code from past backfills into a single accessible place, a "lessons learned" doc.
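For the record, a stripped-down sketch of the threaded-copy idea; this is my illustration of the approach, not the gist itself (which also shards by date partition and implements --delete/--dryrun), and it assumes boto3 server-side copies with bucket names and a thread count mirroring the checklist below:

```python
#!/usr/bin/env python
"""Threaded server-side S3 copy: an illustrative sketch of the approach,
not the gist linked above. Assumes boto3; bucket names and the thread
count mirror the checklist in the next comment."""
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads


def copy_key(src_bucket, dst_bucket, key):
    # client.copy is a managed transfer: large parquet files are copied
    # multipart, server-side within S3.
    s3.copy({"Bucket": src_bucket, "Key": key}, dst_bucket, key)
    return key


def copy_prefix(src_bucket, dst_bucket, prefix, threads=40):
    paginator = s3.get_paginator("list_objects_v2")
    keys = [obj["Key"]
            for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix)
            for obj in page.get("Contents", [])]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for done in pool.map(
                lambda k: copy_key(src_bucket, dst_bucket, k), keys):
            print("copied", done)


if __name__ == "__main__":
    copy_prefix("telemetry-backfill", "telemetry-parquet", "main_summary/v4/")
```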
And here's the checklist I used. I have a few things I wish I'd added that I'll put in the retro doc:

# Backfill Checklist

- Spin up m4.16xlarge instance
- Test to make sure the instance has access to telemetry-backfill and telemetry-parquet
- Figure out optimal number of threads to max out connection
- Determine number of objects expected after copy
- Send out email announcing start of copy
- Start copy with the optimal number of threads
  - `time ./threaded_sisyphus.py --table main_summary --version v4 --start 20170301 --end 20180401 --delete --threads 40 --dryrun`
- Poll s3 cli until `--summarize` returns the expected number of objects (see the boto3 sketch after this checklist)
  - `./consistency_test.py --table main_summary --version v4 --start 20170301 --end 20180401 --threads 3`
- Send out email announcing completion
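The `--summarize` poll can also be approximated in boto3 rather than the CLI; a sketch, where the expected count comes from the "determine number of objects" step above, and the helper names are mine, not consistency_test.py's:

```python
"""Object-count poll: a boto3 approximation of running
`aws s3 ls --recursive --summarize` in a loop. A sketch; helper names
and the 120s interval are assumptions."""
import time

import boto3

s3 = boto3.client("s3")


def count_objects(bucket, prefix):
    """Total number of objects under prefix (what --summarize reports)."""
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total += page["KeyCount"]
    return total


def poll_until_expected(bucket, prefix, expected, interval=120):
    n = count_objects(bucket, prefix)
    while n != expected:
        print("saw %d of %d objects, sleeping %ds" % (n, expected, interval))
        time.sleep(interval)
        n = count_objects(bucket, prefix)
    print("object counts match: %d" % n)
```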
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Main Summary → General