Closed Bug 1311664 - Opened 9 years ago, Closed 9 years ago

Scheduled jobs on atmo v1 are failing silently

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marco, Assigned: frank)

References

Details

I have a scheduled job on atmo v1 which hasn't run since 2016-10-17. The logs don't contain any error:

Beginning job top-signatures-correlations ...
Finished job top-signatures-correlations
'' exited with code 0

The notebook name doesn't contain whitespace.
> I have a scheduled job on atmo v1 which hasn't run since 2016-10-17.

The last successful run was 20161017001108; the first time it didn't do anything was 20161018001253.
Assignee: nobody → fbertsch
I'm not really sure what the exact *cause* of the error is, but I know where it's happening. Here's the error:

>> 16/10/20 01:09:17 INFO DAGScheduler: ShuffleMapStage 47 (reduceByKey at crashcorrelations/crash_deviations.py:356) failed in 17.932 s
>> 16/10/20 01:09:17 INFO DAGScheduler: Job 26 failed: collect at crashcorrelations/crash_deviations.py:356, took 18.112538 s

So it's the reduceByKey at crashcorrelations/crash_deviations.py:356 that's causing the issue.

Now, what the error actually is is this:

>> java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[6, ArrayType(StringType,true)])

Which is weird, because UDFs are used for DataFrames, not RDDs. There may be some coercing going on. I might also recommend using Spark 2.0, which you can spin up here: https://atmo-prod.herokuapp.com/
Flags: needinfo?(mcastelluccio)
(In reply to Frank Bertsch [:frank] from comment #2)
> I'm not really sure what the exact *cause* of the error is, but I know where
> it's happening. Here's the error:
>
> >> 16/10/20 01:09:17 INFO DAGScheduler: ShuffleMapStage 47 (reduceByKey at crashcorrelations/crash_deviations.py:356) failed in 17.932 s
> >> 16/10/20 01:09:17 INFO DAGScheduler: Job 26 failed: collect at crashcorrelations/crash_deviations.py:356, took 18.112538 s
>
> So it's the reduceByKey at crashcorrelations/crash_deviations.py:356 that's
> causing the issue.
>
> Now, what the error actually is is this:
>
> >> java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[6, ArrayType(StringType,true)])
>
> Which is weird, because UDFs are used for DataFrames, not RDDs. There may be
> some coercing going on. I might also recommend using Spark 2.0, which you
> can spin up here: https://atmo-prod.herokuapp.com/

I'm using UDFs on a DataFrame and later using the underlying RDD, so perhaps that's why. I did change something in my code on 2016-10-18, which may be when it started to fail.

Is it possible to make the error explicit in the logs? The job used not to return '0' on failure.

I will try to switch to atmo v2 today.
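The pattern in my code looks roughly like this (a minimal sketch rather than the actual crashcorrelations code; the dataset path and the addon_ids/signature column names are made up, and it assumes Spark 1.x APIs):

# Sketch of the DataFrame-UDF-then-RDD pattern described above.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

sc = SparkContext()
sqlContext = SQLContext(sc)

# Hypothetical dataset; in the real job this would be the crash ping data.
crashes = sqlContext.read.parquet("s3://example-bucket/crash_pings/")

# Python UDF over an array<string> column, matching the
# ArrayType(StringType,true) input mentioned in the error message.
has_addons = udf(lambda addons: addons is not None and len(addons) > 0,
                 BooleanType())

filtered = crashes.filter(has_addons(crashes["addon_ids"]))

# Dropping to the underlying RDD after filtering with a Python UDF is the
# step that seems to trigger
# "Cannot evaluate expression: PythonUDF#<lambda>(...)" on Spark 1.x.
top_signatures = (filtered.rdd
                  .map(lambda row: (row.signature, 1))
                  .reduceByKey(lambda a, b: a + b)
                  .collect())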
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #3)
> Is it possible to make the error explicit in the logs? The job used not to
> return '0' on failure.

IIRC, the IPython notebook was uploaded even in case of errors, so you could also see what happened from there.
Summary: Scheduled jobs on atmo v1 are not running anymore → Scheduled jobs on atmo v1 are failing silently
We've confirmed that this problem was caused by there not being enough room in the $HOME directory. Once we move ~/analyses to /mnt, this shouldn't happen again. The other issue (the Spark task failing) was solved by moving to Spark 2.0 and is workload-specific.
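As a sanity check while that migration lands, something like the following could be dropped into a job to log free space on the volumes involved (a minimal sketch, assuming the standard $HOME and /mnt mount points; not part of the actual fix):

# Sketch: log free space so a disk-full condition is visible in the job output.
import os

def free_gib(path):
    # Space available to unprivileged processes, in GiB.
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(2 ** 30)

for path in (os.path.expanduser("~"), "/mnt"):
    print("%s: %.1f GiB free" % (path, free_gib(path)))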
Depends on: 1311708
We've fixed both issues that were causing jobs to fail.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard