Closed Bug 1329779 Opened 9 years ago Closed 6 years ago

Store spark application and history logs on EMR

Categories

(Data Platform and Tools Graveyard :: Operations, defect, P2)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

(Whiteboard: [DataOps])

Attachments

(2 files)

The application logs for Spark jobs scheduled on Airflow should be stored for profiling. These application logs can be replayed by the Spark history server to obtain detailed metrics about a specific run. Currently these logs are not being archived. The application logs are stored in `file:/tmp/spark-events`, determined by the property `spark.history.fs.logDirectory`. The EMR operator [1] is a good place to look for uploading this file.

Another way to store these logs is to use the history server REST API at localhost:18080/api/v1 [2]. We can upload these zip files to a location in S3. An example of using this in shell is as follows:

> #!/bin/bash
> # Fetch the id of the first (most recent) application, then download its logs
> app_id=$(curl -s 'localhost:18080/api/v1/applications' | \
>   python -c "import sys, json; print(json.load(sys.stdin)[0]['id'])")
> curl -o my_filename.zip "localhost:18080/api/v1/applications/$app_id/logs"

The Spark API gives useful metrics for understanding the operation of a Spark job, so it would be useful to keep these logs around.

[1] https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py#L114
[2] http://spark.apache.org/docs/latest/monitoring.html#rest-api
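A minimal Python sketch of what that archival step could look like from the Airflow side, assuming the history server is reachable on localhost:18080; the bucket and key prefix here are placeholders, not a real destination:

> import boto3
> import requests
>
> HISTORY_API = "http://localhost:18080/api/v1"
> # Placeholder bucket/prefix; the real destination would be decided in
> # the telemetry-airflow configuration.
> BUCKET = "example-telemetry-bucket"
> PREFIX = "spark-history"
>
> def archive_latest_application_logs(s3=None):
>     """Download the event logs for the most recent application from the
>     history server REST API and upload the zip to S3."""
>     s3 = s3 or boto3.client("s3")
>     apps = requests.get(HISTORY_API + "/applications").json()
>     app_id = apps[0]["id"]
>     resp = requests.get(HISTORY_API + "/applications/%s/logs" % app_id)
>     resp.raise_for_status()
>     s3.put_object(Bucket=BUCKET, Key="%s/%s.zip" % (PREFIX, app_id),
>                   Body=resp.content)
>     return app_id

This would slot naturally into the EMR operator's cleanup path, running after the cluster's steps complete but before it terminates.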
Blocks: 1284522
Whiteboard: [SvcOps]
Points: --- → 2
Priority: -- → P3
We don't collect any history logs for jobs launched via EMR because we don't have access to the logging directory on the cluster. This should probably be set via `spark.history.fs.logDirectory`. These logs would be very useful diagnostic material for failing scheduled jobs.
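A sketch of one way to configure this: pass a `spark-defaults` configuration classification when launching the cluster, so applications write event logs straight to S3 and the history server reads from the same prefix. The `spark-defaults` classification is part of the EMR API; the bucket name below is a placeholder:

> # Sketch of an EMR configuration block; bucket name is a placeholder.
> SPARK_LOG_CONFIG = [
>     {
>         "Classification": "spark-defaults",
>         "Properties": {
>             # Applications write their event logs here...
>             "spark.eventLog.enabled": "true",
>             "spark.eventLog.dir": "s3://example-telemetry-bucket/spark-events/",
>             # ...and the history server replays them from the same prefix.
>             "spark.history.fs.logDirectory": "s3://example-telemetry-bucket/spark-events/",
>         },
>     }
> ]

This could be passed through the `Configurations` argument of the EMR `run_job_flow` API call, or merged into the cluster configuration the EMR operator already builds.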
Component: Metrics: Pipeline → Operations
Product: Cloud Services → Data Platform and Tools
QA Contact: jthomas
Summary: Store spark application logs for jobs scheduled on airflow in s3 → Store spark application and history logs on EMR
I've encountered this bug a few times, so I'll go ahead and assign it to myself.
Assignee: nobody → amiyaguchi
Priority: P3 → P1
See Also: → 1335228
Priority: P1 → P3
Priority: P3 → P2
Priority: P2 → P1
Depends on: 1290140
Priority: P1 → P2
Whiteboard: [SvcOps] → [DataOps]

This is not a worthwhile effort now that we're migrating away from EMR. Logs are currently stored on Databricks.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Product: Data Platform and Tools → Data Platform and Tools Graveyard