Bug 1311664 (Closed)
Opened 9 years ago
Closed 9 years ago
Scheduled jobs on atmo v1 are failing silently
Categories: Cloud Services Graveyard :: Metrics: Pipeline (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: marco; Assigned: frank
I have a scheduled job on atmo v1 which has not run since 2016-10-17.
The logs don't contain any errors:
Beginning job top-signatures-correlations ...
Finished job top-signatures-correlations
'' exited with code 0
The notebook name doesn't contain whitespace.
Reporter
Comment 1 • 9 years ago
> I have a scheduled job on atmo v1 which is not running anymore since 2016-10-17.
The last successful run was 20161017001108, the first time it didn't do anything was 20161018001253.
Assignee
Updated • 9 years ago
Assignee: nobody → fbertsch
Assignee
Comment 2 • 9 years ago
I'm not really sure what the exact *cause* of the error is, but I know where it's happening. Here's the error:
>> 16/10/20 01:09:17 INFO DAGScheduler: ShuffleMapStage 47 (reduceByKey at crashcorrelations/crash_deviations.py:356) failed in 17.932 s
>> 16/10/20 01:09:17 INFO DAGScheduler: Job 26 failed: collect at crashcorrelations/crash_deviations.py:356, took 18.112538 s
So it's the reduceByKey at crash_deviations.py:356 that's causing the issue.
Now, the actual error is this:
>> java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[6, ArrayType(StringType,true)])
Which is weird, because UDFs are used for DataFrames, not RDDs. There may be some coercion going on. I'd also recommend using Spark 2.0, which you can spin up here: https://atmo-prod.herokuapp.com/
Flags: needinfo?(mcastelluccio)
Reporter
Comment 3 • 9 years ago
(In reply to Frank Bertsch [:frank] from comment #2)
> I'm not really sure what the exact *cause* of the error is, but I know where
> it's happening. Here's the error:
>
> >> 16/10/20 01:09:17 INFO DAGScheduler: ShuffleMapStage 47 (reduceByKey at crashcorrelations/crash_deviations.py:356) failed in 17.932 s
> >> 16/10/20 01:09:17 INFO DAGScheduler: Job 26 failed: collect at crashcorrelations/crash_deviations.py:356, took 18.112538 s
>
> so it's the reduceByKey in the crash_deviations.py:356 that's causing the
> issue.
>
> Now what the error actually is is this:
> >> java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[6, ArrayType(StringType,true)])
>
> Which is weird because UDFs are used for dataframes, not RDDs. There may be
> some coercing going on. I might also recommend using spark 2.0, which you
> can spin up here: https://atmo-prod.herokuapp.com/
I'm using UDFs on a DataFrame, and I later use the underlying RDD; perhaps that's why.
I did change something in my code on 2016-10-18, which may be why it started to fail.
Is it possible to make the error explicit in the logs? The job used to exit with a non-zero code on failure.
I will try to switch to atmo v2 today.
Flags: needinfo?(mcastelluccio)
Reporter
Comment 4 • 9 years ago
(In reply to Marco Castelluccio [:marco] from comment #3)
> Is it possible to make the error explicit in the logs? The job used not to
> return '0' on failure.
IIRC, the IPython notebook was uploaded even when the job errored, so you could also see what happened from there.
Reporter
Updated • 9 years ago
Summary: Scheduled jobs on atmo v1 are not running anymore → Scheduled jobs on atmo v1 are failing silently
Assignee
Comment 5 • 9 years ago
We've confirmed that this problem was caused by insufficient space in the $HOME directory. Moving ~/analyses to /mnt should prevent it from recurring.
The other issue (the failing Spark task) was solved by moving to Spark 2.0, and is workload-specific.
Depends on: 1311708
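The directory move described in comment 5 might look like the following sketch: relocate the analyses directory onto the roomier /mnt volume and leave a symlink behind so existing paths keep working. The function name and default paths are illustrative, not the actual atmo change.

```shell
#!/bin/sh
# Hypothetical sketch of the fix: move the analyses directory to /mnt
# and symlink the old $HOME location to it.
relocate_analyses() {
    analyses="${1:-$HOME/analyses}"   # old location in $HOME
    target="${2:-/mnt/analyses}"      # new location on the large volume
    mkdir -p "$target"
    # Move existing contents only if the old path is a real directory,
    # not already a symlink from a previous run.
    if [ -d "$analyses" ] && [ ! -L "$analyses" ]; then
        mv "$analyses"/* "$target"/ 2>/dev/null || true
        rmdir "$analyses"
    fi
    ln -sfn "$target" "$analyses"     # old path now points at /mnt
}
```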
Assignee
Comment 6 • 9 years ago
We've fixed both issues that were causing jobs to fail.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Product: Cloud Services → Cloud Services Graveyard