Closed
Bug 1329779
Opened 9 years ago
Closed 6 years ago
Store spark application and history logs on EMR
Categories
(Data Platform and Tools Graveyard :: Operations, defect, P2)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: amiyaguchi, Assigned: amiyaguchi)
Details
(Whiteboard: [DataOps])
Attachments
(2 files)
Description
The application logs for Spark jobs scheduled on Airflow should be stored for profiling. These application logs can be replayed by the Spark history server to obtain detailed metrics about a specific run.
Currently these logs are not archived. The application logs are written to `file:/tmp/spark-events`, the location determined by the property `spark.history.fs.logDirectory`. The EMR operator [1] is a good place to hook in uploading these files.
Another way to store these logs is to use the history server REST API at localhost:18080/api/v1 [2]. We can upload the resulting zip files to a location in S3. An example of using this from a shell, with the S3 upload sketched after it:
> #!/bin/bash
> # Grab the id of the first application listed by the history server
> app_id=$(curl -s 'localhost:18080/api/v1/applications' | \
>   python -c "import sys, json; print(json.load(sys.stdin)[0]['id'])")
> # Download that application's event logs as a zip archive
> curl -o my_filename.zip "localhost:18080/api/v1/applications/$app_id/logs"
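From there the archive can be pushed to S3 with the AWS CLI. A minimal sketch, assuming the CLI is installed on the node; the bucket and key prefix below are hypothetical:
> # Upload the archive to S3 (bucket and key prefix are hypothetical)
> aws s3 cp my_filename.zip "s3://example-telemetry-logs/spark-history/$app_id.zip"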
The Spark API provides useful metrics for understanding how a Spark job behaves, so it would be worthwhile to keep this data around.
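As a note on replaying, a local history server can be pointed at a directory of archived event logs. A sketch, assuming SPARK_HOME is set and the downloaded archives have been unzipped into ./spark-events:
> # Tell the history server where the archived event logs live, then start it
> export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:$PWD/spark-events"
> $SPARK_HOME/sbin/start-history-server.sh
> # The UI is then available at localhost:18080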
[1] https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py#L114
[2] http://spark.apache.org/docs/latest/monitoring.html#rest-api
Updated•9 years ago
Points: --- → 2
Priority: -- → P3
Assignee
Comment 1•8 years ago
We don't collect any history logs from jobs launched via EMR because those clusters don't have access to the logging directory. This should probably be set via `spark.history.fs.logDirectory`. These logs would be very useful diagnostic material for failing scheduled jobs.
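For context, jobs write event logs to the location given by `spark.eventLog.dir`, while `spark.history.fs.logDirectory` tells the history server where to read them from. A sketch of wiring a job to log straight to S3 (the bucket name is hypothetical):
> # Submit a job with event logging pointed at S3 (bucket is hypothetical)
> spark-submit \
>   --conf spark.eventLog.enabled=true \
>   --conf spark.eventLog.dir=s3://example-telemetry-logs/spark-events \
>   my_job.py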
Assignee
Updated•8 years ago
Component: Metrics: Pipeline → Operations
Product: Cloud Services → Data Platform and Tools
QA Contact: jthomas
Assignee
Updated•8 years ago
Summary: Store spark application logs for jobs scheduled on airflow in s3 → Store spark application and history logs on EMR
Assignee
Comment 2•8 years ago
I've encountered this bug a few times, so I'll go ahead and assign it to myself.
Assignee: nobody → amiyaguchi
Priority: P3 → P1
Updated•8 years ago
Priority: P1 → P3
Assignee
Comment 3•8 years ago
This needs to be tested.
Updated•8 years ago
Priority: P3 → P2
Assignee
Updated•8 years ago
Priority: P2 → P1
Updated•8 years ago
Whiteboard: [SvcOps] → [DataOps]
Assignee
Comment 4•6 years ago
This is not a worthwhile effort now that we're migrating away from EMR. Logs are currently stored on Databricks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Updated•3 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard