Closed Bug 1311664 - Opened 9 years ago, Closed 9 years ago

Scheduled jobs on atmo v1 are failing silently

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marco, Assigned: frank)

References

Details

I have a scheduled job on atmo v1 which hasn't run since 2016-10-17. The logs don't contain any error:

Beginning job top-signatures-correlations ...
Finished job top-signatures-correlations
'' exited with code 0

The notebook name doesn't contain whitespace.
> I have a scheduled job on atmo v1 which hasn't run since 2016-10-17.

The last successful run was 20161017001108; the first time it didn't do anything was 20161018001253.
Assignee: nobody → fbertsch
I'm not really sure what the exact *cause* of the error is, but I know where it's happening. Here's the error:

>> 16/10/20 01:09:17 INFO DAGScheduler: ShuffleMapStage 47 (reduceByKey at crashcorrelations/crash_deviations.py:356) failed in 17.932 s
>> 16/10/20 01:09:17 INFO DAGScheduler: Job 26 failed: collect at crashcorrelations/crash_deviations.py:356, took 18.112538 s

So it's the reduceByKey at crashcorrelations/crash_deviations.py:356 that's causing the issue.

Now, what the error actually is is this:

>> java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[6, ArrayType(StringType,true)])

Which is weird, because UDFs are used for DataFrames, not RDDs. There may be some coercing going on. I might also recommend using Spark 2.0, which you can spin up here: https://atmo-prod.herokuapp.com/
Flags: needinfo?(mcastelluccio)
(In reply to Frank Bertsch [:frank] from comment #2)
> I'm not really sure what the exact *cause* of the error is, but I know where
> it's happening. Here's the error:
>
> >> 16/10/20 01:09:17 INFO DAGScheduler: ShuffleMapStage 47 (reduceByKey at crashcorrelations/crash_deviations.py:356) failed in 17.932 s
> >> 16/10/20 01:09:17 INFO DAGScheduler: Job 26 failed: collect at crashcorrelations/crash_deviations.py:356, took 18.112538 s
>
> So it's the reduceByKey at crashcorrelations/crash_deviations.py:356 that's
> causing the issue.
>
> Now, what the error actually is is this:
>
> >> java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[6, ArrayType(StringType,true)])
>
> Which is weird, because UDFs are used for DataFrames, not RDDs. There may be
> some coercing going on. I might also recommend using Spark 2.0, which you
> can spin up here: https://atmo-prod.herokuapp.com/

I'm using UDFs on a DataFrame and later using the underlying RDD, so perhaps that's why. I did change something in my code on 2016-10-18, which may be when it started to fail.

Is it possible to make the error explicit in the logs? The job used not to return '0' on failure.

I will try to switch to atmo v2 today.
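The pattern in my code looks roughly like this (a minimal sketch rather than the actual crashcorrelations code; the dataset path and the addon_ids/signature column names are made up, and it assumes Spark 1.x APIs):

# Sketch of the DataFrame-UDF-then-RDD pattern described above.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

sc = SparkContext()
sqlContext = SQLContext(sc)

# Hypothetical dataset; in the real job this would be the crash ping data.
crashes = sqlContext.read.parquet("s3://example-bucket/crash_pings/")

# Python UDF over an array<string> column, matching the
# ArrayType(StringType,true) input mentioned in the error message.
has_addons = udf(lambda addons: addons is not None and len(addons) > 0,
                 BooleanType())

filtered = crashes.filter(has_addons(crashes["addon_ids"]))

# Dropping to the underlying RDD after filtering with a Python UDF is the
# step that seems to trigger
# "Cannot evaluate expression: PythonUDF#<lambda>(...)" on Spark 1.x.
top_signatures = (filtered.rdd
                  .map(lambda row: (row.signature, 1))
                  .reduceByKey(lambda a, b: a + b)
                  .collect())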
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #3)
> Is it possible to make the error explicit in the logs? The job used not to
> return '0' on failure.

IIRC, the IPython notebook was uploaded even in case of errors, so you could also see what happened from there.
Summary: Scheduled jobs on atmo v1 are not running anymore → Scheduled jobs on atmo v1 are failing silently
We've confirmed that this problem was caused by there not being enough room in the $HOME directory. Once we move ~/analyses to /mnt, this shouldn't happen again. The other issue (the Spark task failing) was solved by moving to Spark 2.0 and is workload-specific.
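As a sanity check while that migration lands, something like the following could be dropped into a job to log free space on the volumes involved (a minimal sketch, assuming the standard $HOME and /mnt mount points; not part of the actual fix):

# Sketch: log free space so a disk-full condition is visible in the job output.
import os

def free_gib(path):
    # Space available to unprivileged processes, in GiB.
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(2 ** 30)

for path in (os.path.expanduser("~"), "/mnt"):
    print("%s: %.1f GiB free" % (path, free_gib(path)))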
Depends on: 1311708
We've fixed both issues that were causing jobs to fail.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard