Closed Bug 1326068 Opened 9 years ago Closed 8 years ago

Add Datadog Docker container monitoring to Airflow ECS cluster

Categories

(Data Platform and Tools :: Monitoring & Alerting, defect, P2)

Points: 1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Assigned: whd)

Details

(Whiteboard: [SvcOps])

I've heard we have a Datadog account connected to the cloud services dev AWS account. That would be very useful for getting container-level metrics on the Airflow ECS instance: our zombie job issues were probably caused by OOM on the worker container, the scheduler container, or both, but the built-in CloudWatch metrics only show task-level metrics. If someone with access could add me to the account, I'd be happy to follow the steps listed here to add monitoring and alerting: https://www.datadoghq.com/blog/monitor-docker-on-aws-ecs/
Flags: needinfo?(whd)
Whiteboard: [SvcOps]
Points: --- → 1
Priority: -- → P2
I've invited :sunasuh to Datadog per https://www.datadoghq.com/blog/monitor-docker-on-aws-ecs/ and sent the dev API key by GPG-encrypted email. This API key should not be stored anywhere unencrypted except on the ECS host itself running the Datadog agent. I assume this is a one-off and doesn't need to be automated; if that is not the case, there is considerably more work to do here to set up proper instance provisioning logic (most importantly, the use of SOPS so that we can pull the API key from KMS).
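For reference, if this ever does need automating, the SOPS step mentioned above might look roughly like the sketch below. It is only an illustration under assumptions, not the provisioning logic actually used: the file name `secrets.json`, the field name `datadog_api_key`, and the environment variable name `DD_API_KEY` are all hypothetical (the variable the agent reads depends on the agent version), and it assumes the `sops` CLI is installed and the instance role is allowed to use the KMS key.

```python
import json
import subprocess
from typing import Dict, List


def sops_decrypt_command(path: str) -> List[str]:
    """Build the sops invocation that decrypts `path` to stdout."""
    return ["sops", "--decrypt", path]


def parse_api_key(decrypted_json: str, field: str = "datadog_api_key") -> str:
    """Extract the API key from decrypted JSON.

    The field name is a hypothetical choice for this sketch.
    """
    data = json.loads(decrypted_json)
    key = data.get(field, "")
    if not isinstance(key, str) or not key.strip():
        raise ValueError(f"field {field!r} is missing or empty")
    return key.strip()


def load_api_key(path: str = "secrets.json") -> str:
    """Decrypt the SOPS file in memory so the plaintext never touches disk."""
    result = subprocess.run(
        sops_decrypt_command(path),
        check=True,           # raise if sops exits non-zero (e.g. no KMS access)
        capture_output=True,  # keep the plaintext out of the terminal and logs
        text=True,
    )
    return parse_api_key(result.stdout)


def agent_environment(api_key: str, env_var: str = "DD_API_KEY") -> Dict[str, str]:
    """Environment for the agent container; the variable name is an assumption."""
    return {env_var: api_key}
```

On the ECS host, `agent_environment(load_api_key())` would then give the environment to hand to the agent container at start-up, so the key stays encrypted everywhere else.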
Flags: needinfo?(whd)
Component: Metrics: Pipeline → Monitoring & Alerting
Product: Cloud Services → Data Platform and Tools
Airflow hosts now report stats to Datadog as part of bug #1336975.
Assignee: nobody → whd
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED