Closed Bug 1295359 Opened 9 years ago Closed 9 years ago

Spark clusters very slow

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Unassigned)

References

Details

(Whiteboard: [SvcOps])

spenrose and ilana reported problems using Spark all day today. This is blocking Test Pilot work (it also affected today's interview). I know robotblake is looking at the logs; creating a bug to discuss/track.
Blocks: 1294564
Could someone please post the logs here with the steps to reproduce this error?
Flags: needinfo?(bimsland)
I did some manual testing and could not reproduce the slowness so far. According to Airflow's logs, the run-time of our scheduled jobs is not affected either.
We just pushed a tentative fix for a Spark issue we have seen, but we don't know if that will solve the issues spenrose and Ilana experienced. Ilana, which notebook were you using yesterday for the interview?
Flags: needinfo?(isegall)
Severity: blocker → normal
Priority: -- → P1
Yesterday the issue appeared fixed, but today we're having issues again. As far as I know the slowdown is in get_pings alone (get_pings calls taking many hours when they normally take seconds to minutes), and robotblake indicated independently that it looks like an AWS issue. He commented "The hive logs have a ton of 404s in them," but we weren't sure where to go from there. Both Kamyar (cc'd) and I are still having these issues.
Flags: needinfo?(isegall)
PS: Today I'm on ec2-54-149-75-43.us-west-2.compute.amazonaws.com
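For context, the slow calls are moztelemetry get_pings calls along these lines. This is a minimal sketch only: the filter values are illustrative rather than the ones in the affected notebooks, and the exact keyword names should be checked against the moztelemetry docs.

from moztelemetry import get_pings

# Illustrative filters only; the affected notebooks use their own values.
# A call like this normally completes in seconds/minutes but was taking hours.
pings = get_pings(sc,
                  app="Firefox",
                  channel="beta",
                  doc_type="main",
                  submission_date=("20160801", "20160820"),
                  fraction=0.1)
pings.count()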
Severity: normal → blocker
Travis, given comment 5 this sounds like an AWS issue? My understanding is that the only code change that landed in the last week was reverted and the problems are persisting. Any thoughts for how to proceed with investigating?
Flags: needinfo?(tblow)
Hey :RyanVM, I still haven't been able to repro the issue, but I'm going to look into turning on some increased metrics / logging on the AWS services that the Spark clusters are using. Once that's done I'll touch base with Ilana and see if we can force the issue to happen again and then look through said metrics / logs.
Flags: needinfo?(bimsland)
If anyone sees this again, please reopen and do the following:
- don't kill your cluster, leave it up for debugging
- let us know the instance in this bug, as in comment #6 https://bugzilla.mozilla.org/show_bug.cgi?id=1295359#c6
- notify one of the devops or data engineers in #datapipeline, so they can tag the instance to keep it around and/or get log files off of it for debugging
Blake or Roberto, let us know if there are other steps people should follow that would be helpful.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
jgaunt has reported slowness in Spark again. It seems related to the original issue Ilana reported. The cluster is: 6b78239f-b512-43cf-a829-26176506d472
I've put termination protection ON on the cluster, but I know from experience that I'll still get an email every hour (after 24h) saying it's terminated, so if someone knows how to disable that it would be great.
https://gist.github.com/ilanasegall/1dea80ff88647d8d98001bc46ee5f354 is the notebook that was being run.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
gregglind is also reporting slowness. The notebook, which took minutes before, now takes hours. Not sure which cluster it is though. The script is below; it looks like it was also using get_pings. https://gist.github.com/gregglind/497af94f4de9df6aaeca9010a148d289
Blake, could you make sure jgaunt's and gregglind's clusters are left untouched (by the reaper and what not) until the European crew has time to have a look at them?
Flags: needinfo?(bimsland)
Thanks everybody for all the details, we found the issue and fixed it in bug 1304693. This query

Dataset.from_source('telemetry') \
    .where(docType = 'heartbeat') \
    .where(submissionDate = lambda x: x >= "20160801" and x <= "20160920") \
    .where(appName = 'Firefox') \
    .where(appUpdateChannel = "beta") \
    .records(sc).count()

went down from 4h to 4-5 minutes. If you are curious, here is the fix: https://github.com/mozilla/python_moztelemetry/pull/81
No longer depends on: 1304693
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
ec2-54-244-99-10.us-west-2.compute.amazonaws.com with script at https://gist.github.com/ilanasegall/ea911ae8fbfe2d708d7cb21f454d0cc1
Even with 20 cores, over an hour wasn't enough to print the first line of

bucket = "telemetry-test-bucket"
prefix = "addons/v1"
%time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")

which generally takes a few seconds.
:ilana this is new. I tried to run that cell on a 1-node cluster and it completed in 20 seconds. I'll try to get access to your cluster to see what's going on.
Blake, could you please check the S3 logs for throttling/rate errors as well?
Flags: needinfo?(bimsland)
The issue must have been temporary: I tried to run the same cell on the same notebook/cluster and it took 6 seconds. It smells like a request-throttling problem, as suggested by rvitillo in the comment above. We really need a way to monitor the S3 request rate against the limits imposed by Amazon.
Currently there isn't a good way to aggregate the S3 request logs and Amazon doesn't expose any metrics about when they're "throttling" us but I'll look around and see if I can find any good solutions to do so.
Flags: needinfo?(bimsland)
(In reply to Blake Imsland [:robotblake] from comment #18)
> Currently there isn't a good way to aggregate the S3 request logs and Amazon
> doesn't expose any metrics about when they're "throttling" us but I'll look
> around and see if I can find any good solutions to do so.

Are the logs dumped on S3? If so you could run a Spark job to aggregate/analyze the logs.
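A minimal sketch of what such a Spark job could look like, assuming S3 server access logging is enabled and delivered to some bucket. The bucket/prefix names below are hypothetical, and filtering on "SlowDown" is just one heuristic for spotting throttled requests, not a confirmed detail of our setup.

# Sketch only: assumes S3 server access logs land in the (hypothetical)
# bucket/prefix below. Throttled requests show up as 503 "SlowDown" errors.
logs = sc.textFile("s3://example-s3-access-logs/telemetry/2016-09-*")

throttled = logs.filter(lambda line: "SlowDown" in line)

# split(" ")[7] is the S3 operation field (e.g. REST.GET.OBJECT) when the
# bracketed timestamp splits into two tokens; adjust the index if the log
# format differs. A naive split is good enough for a rough count per operation.
by_operation = (throttled
                .map(lambda line: (line.split(" ")[7], 1))
                .reduceByKey(lambda a, b: a + b))

print(by_operation.collect())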
Severity: blocker → normal
Product: Cloud Services → Cloud Services Graveyard