Closed Bug 1309688 Opened 9 years ago Closed 9 years ago

ATMO v2: Ensure that a deploy does not impact running clusters or scheduled jobs

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: mreid, Unassigned)

Attachments

(1 file)

[telemetry-analysis-service] mozilla:bug1309688 > mozilla:master, 9 years ago, GitHub Autolander Bot, 61 bytes, text/x-github-pull-request

Mark Reid [:mreid]

Reporter

Description

9 years ago

A problem with ATMO v1 was that a deploy of the service during a job run could cause the job to be interrupted before it completed. Since users can specify that jobs run at any time, we should ensure that deploying new code does not impact running clusters or jobs.

Roberto Agostino Vitillo (:rvitillo)

Updated

9 years ago

Blocks: 1248688

Jannis Leidel [:jezdez]

Comment 1

9 years ago

Mark, can you elaborate on how the jobs were interrupted when ATMO v1 was deployed? Did the deploy somehow reset the jobs?

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Reporter

Comment 2

9 years ago

Since the job was actually launched from the webserver node (via cron), a shutdown would stop monitoring any running jobs, so any detection of job success / failure wouldn't work. I believe it would also force-stop any old-style non-spark jobs, but that shouldn't be a concern anymore. Also, it was possible for the scheduler to "miss" jobs if their execution time happened after the previous instance was torn down, but before the new instance was fully spun up. That meant whoever was doing the deploy had to take care not to do it right around the time when jobs were scheduled to launch.
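
The scheduling gap described above can be illustrated with a small sketch. This is not ATMO's scheduler: `missed_runs` and its parameters are hypothetical names, and the sketch assumes the service knows when the old instance was torn down and when the new one became ready.

```python
from datetime import datetime


def missed_runs(scheduled, torn_down_at, ready_at):
    """Return the scheduled launch times that fell inside the deploy gap.

    A launch time counts as missed if it is at or after the moment the old
    instance stopped and strictly before the new instance was ready.
    """
    return [t for t in scheduled if torn_down_at <= t < ready_at]


# Hypothetical example: three launch times, deploy gap from 12:00 to 12:05.
scheduled = [
    datetime(2016, 10, 12, 11, 55),
    datetime(2016, 10, 12, 12, 2),
    datetime(2016, 10, 12, 12, 10),
]
gap = missed_runs(
    scheduled,
    torn_down_at=datetime(2016, 10, 12, 12, 0),
    ready_at=datetime(2016, 10, 12, 12, 5),
)
```

A scheduler that records when it last ran could perform a check like this at startup and launch the affected jobs late instead of skipping them.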

Flags: needinfo?(mreid)

Thomas Huelbert

Updated

9 years ago

Points: --- → 2

Priority: -- → P2

Mauro Doglio [:mdoglio]

Comment 3

9 years ago

As long as the processes receive a SIGTERM for termination everything should be fine:
gunicorn: http://docs.gunicorn.org/en/stable/signals.html#master-process
rq worker: http://python-rq.org/docs/workers/
rq scheduler: https://github.com/ui/rq-scheduler/blob/396efadda8610548b474e680507b278676fc2262/rq_scheduler/scheduler.py#L52-L67

:robotblake do you know if that's the case in the dockerflow environment?
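
To illustrate why a SIGTERM is enough: a process that handles the signal can finish the work in progress before exiting. The sketch below is a generic illustration using only Python's standard `signal` module. It is not the gunicorn or rq implementation, and `GracefulWorker` is a hypothetical name.

```python
import signal


class GracefulWorker:
    """Runs jobs one at a time and stops between jobs once SIGTERM arrives."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def install(self):
        # Register the handler; must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum, frame):
        # Only set a flag here; the running job is allowed to finish.
        self.stop_requested = True

    def run(self, jobs):
        for job in jobs:
            if self.stop_requested:
                break  # do not start new work after SIGTERM
            self.completed.append(job())
        return self.completed
```

The key design choice is that the handler never interrupts a job: it only prevents the next one from starting, so a deploy that sends SIGTERM lets the current job complete.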

Flags: needinfo?(bimsland)

GitHub Autolander Bot

Comment 4

9 years ago

Attached file: [telemetry-analysis-service] mozilla:bug1309688 > mozilla:master

Blake Imsland [:robotblake]

Comment 5

9 years ago

I'll do some testing but I believe that this is doable (and may work already?).

Flags: needinfo?(bimsland)

Blake Imsland [:robotblake]

Comment 6

9 years ago

It appears that currently the process will receive a SIGTERM followed approximately 30 seconds later (assuming it's still alive) by a SIGKILL.
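
The behaviour observed here matches the common terminate-then-kill pattern. A minimal sketch of that pattern with Python's `subprocess` module follows. It is not the dockerflow deploy tooling, `stop_with_grace` is a hypothetical helper, and the 30 second default only mirrors the figure reported above.

```python
import subprocess
import sys


def stop_with_grace(proc, grace_seconds=30.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns True if the process exited within the grace period.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the process
        return False


if __name__ == "__main__":
    # Hypothetical child that would otherwise sleep for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exited within grace period:", stop_with_grace(child))
```

Under this pattern, any work that needs longer than the grace period after SIGTERM would still be cut off by the SIGKILL.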

Mauro Doglio [:mdoglio]

Updated

9 years ago

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

BMO Automation

Updated

7 years ago

Product: Cloud Services → Cloud Services Graveyard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

Security

(public)
