Closed
Bug 1309688
Opened 9 years ago
Closed 9 years ago
ATMO v2: Ensure that a deploy does not impact running clusters or scheduled jobs
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Unassigned)
Attachments
(1 file)
A problem with ATMO v1 was that a deploy of the service during a job run could cause the job to be interrupted before it completed.
Since users can specify that jobs run at any time, we should ensure that deploying new code does not impact running clusters or jobs.
Comment 1•9 years ago
Mark, can you elaborate on how the jobs were interrupted when ATMO v1 was deployed? Did the deploy somehow reset the jobs?
Flags: needinfo?(mreid)
Reporter
Comment 2•9 years ago
Since jobs were actually launched from the webserver node (via cron), shutting that node down would stop the monitoring of any running jobs, so detection of job success or failure wouldn't work. I believe it would also force-stop any old-style non-Spark jobs, but that shouldn't be a concern anymore.
Also, it was possible for the scheduler to "miss" jobs if their execution time happened after the previous instance was torn down, but before the new instance was fully spun up. That meant whoever was doing the deploy had to take care not to do it right around the time when jobs were scheduled to launch.
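To make the "missed job" window concrete, here is a minimal Python sketch (not ATMO's actual scheduler code) of how a freshly started instance could work out which fixed-interval runs fell between the old instance's last check and now. The function name `missed_runs` and its parameters are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def missed_runs(last_check, now, first_run, interval):
    """Return scheduled run times in the half-open window (last_check, now].

    Hypothetical helper: assumes a job runs at first_run and then every
    `interval` after that.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")

    if last_check < first_run:
        # The job has never been due before; its first run is the next one.
        next_run = first_run
    else:
        # Whole intervals already elapsed at last_check; skip past them.
        elapsed = (last_check - first_run) // interval
        next_run = first_run + (elapsed + 1) * interval

    runs = []
    while next_run <= now:
        runs.append(next_run)
        next_run += interval
    return runs
```

With a daily job due at midnight, an old instance torn down at 23:55 and a new one up at 00:05, the sketch reports the single midnight run that neither instance would have launched.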
Flags: needinfo?(mreid)
Updated•9 years ago
Points: --- → 2
Priority: -- → P2
Comment 3•9 years ago
As long as the processes receive a SIGTERM for termination everything should be fine:
gunicorn: http://docs.gunicorn.org/en/stable/signals.html#master-process
rq worker: http://python-rq.org/docs/workers/
rq scheduler: https://github.com/ui/rq-scheduler/blob/396efadda8610548b474e680507b278676fc2262/rq_scheduler/scheduler.py#L52-L67
:robotblake do you know if that's the case in the dockerflow environment?
Flags: needinfo?(bimsland)
Comment 4•9 years ago
Comment 5•9 years ago
I'll do some testing but I believe that this is doable (and may work already?).
Flags: needinfo?(bimsland)
Comment 6•9 years ago
It appears that currently the process will receive a SIGTERM followed approximately 30 seconds later (assuming it's still alive) by a SIGKILL.
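For illustration, a minimal Python sketch of the pattern that a SIGTERM-then-SIGKILL sequence relies on: the process notes the SIGTERM, finishes the unit of work already in progress, and starts nothing new. This is not ATMO, gunicorn, or rq code; the `GracefulWorker` class and its method names are hypothetical, and the sketch assumes each unit of work finishes well inside the roughly 30-second window.

```python
import signal
import threading


class GracefulWorker:
    """Hypothetical worker that stops picking up new tasks after SIGTERM."""

    def __init__(self):
        self.stop_requested = threading.Event()
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag, so the running task is not interrupted.
        self.stop_requested.set()

    def install(self):
        # Must be called from the main thread, as required by signal.signal().
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, tasks):
        for task in tasks:
            if self.stop_requested.is_set():
                # SIGTERM already received: leave remaining tasks unstarted.
                break
            self.completed.append(task())
        return self.completed
```

A task that is still running when the SIGKILL arrives is lost regardless, so this only helps when individual units of work are short relative to the grace period.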
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard