Closed Bug 1309688 Opened 9 years ago Closed 9 years ago

ATMO v2: Ensure that a deploy does not impact running clusters or scheduled jobs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Unassigned)

Attachments

(1 file)

A problem with ATMO v1 was that a deploy of the service during a job run could cause the job to be interrupted before it completed. Since users can specify that jobs run at any time, we should ensure that deploying new code does not impact running clusters or jobs.
Mark, can you elaborate on how the jobs were interrupted when ATMO v1 was deployed? Did the deploy somehow reset the jobs?
Flags: needinfo?(mreid)
Since the job was actually launched from the webserver node (via cron), a shutdown would stop monitoring any running jobs, so any detection of job success / failure wouldn't work. I believe it would also force-stop any old-style non-spark jobs, but that shouldn't be a concern anymore. Also, it was possible for the scheduler to "miss" jobs if their execution time happened after the previous instance was torn down, but before the new instance was fully spun up. That meant whoever was doing the deploy had to take care not to do it right around the time when jobs were scheduled to launch.
Flags: needinfo?(mreid)
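To make the "missed job" window concrete, here is a minimal sketch of how a freshly started scheduler could detect jobs whose launch time fell between teardown of the old instance and startup of the new one. This is not how ATMO is implemented; the function name, the `schedule` mapping, and the timestamps are all hypothetical and only illustrate the gap described above.

```python
from datetime import datetime, timedelta


def missed_runs(schedule, last_seen, now):
    """Return names of jobs whose scheduled time fell inside the downtime gap.

    schedule:  dict mapping a job name to its next scheduled datetime
    last_seen: last time the old scheduler instance was known to be running
    now:       time at which the new instance finished starting up
    """
    return sorted(
        name for name, run_at in schedule.items()
        if last_seen < run_at <= now
    )


# Hypothetical example: a deploy that leaves a five-minute gap.
torn_down = datetime(2016, 10, 12, 11, 58)
spun_up = torn_down + timedelta(minutes=5)
schedule = {
    "daily-report": datetime(2016, 10, 12, 12, 0),  # inside the gap
    "hourly-sync": datetime(2016, 10, 12, 13, 0),   # after the gap
}
print(missed_runs(schedule, torn_down, spun_up))
```

A scheduler that persists its schedule outside the web node could run a check like this on startup and launch anything it returns, instead of relying on whoever deploys to avoid scheduled launch times.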
Points: --- → 2
Priority: -- → P2
As long as the processes receive a SIGTERM for termination everything should be fine:
gunicorn: http://docs.gunicorn.org/en/stable/signals.html#master-process
rq worker: http://python-rq.org/docs/workers/
rq scheduler: https://github.com/ui/rq-scheduler/blob/396efadda8610548b474e680507b278676fc2262/rq_scheduler/scheduler.py#L52-L67
:robotblake do you know if that's the case in the dockerflow environment?
Flags: needinfo?(bimsland)
I'll do some testing but I believe that this is doable (and may work already?).
Flags: needinfo?(bimsland)
It appears that currently the process will receive a SIGTERM followed approximately 30 seconds later (assuming it's still alive) by a SIGKILL.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard