Closed Bug 1295359 Opened 9 years ago Closed 9 years ago

Spark clusters very slow

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Unassigned)

References

Details

(Whiteboard: [SvcOps])

spenrose and ilana reported problems using Spark all day today. This is blocking Test Pilot work (it also affected today's interview). I know robotblake is looking at the logs; creating a bug to discuss/track.
Blocks: 1294564
Could someone please post the logs here with the steps to reproduce this error?
Flags: needinfo?(bimsland)
I did some manual testing and could not reproduce the slowness so far. According to Airflow's logs, the run-time of our scheduled jobs is not affected either.
We just pushed a tentative fix for a Spark issue we have seen, but we don't know if that will solve the issues spenrose and Ilana experienced. Ilana, which notebook were you using yesterday for the interview?
Flags: needinfo?(isegall)
Severity: blocker → normal
Priority: -- → P1
Yesterday the issue appeared fixed, but today we're having issues again. As far as I know the slowdown is in get_pings alone (get_pings calls taking many hours when they normally take seconds to minutes), and robotblake indicated independently that it looks like an AWS issue. He commented "The hive logs have a ton of 404s in them," but we weren't sure where to go from there. Both Kamyar (cc'd) and I are still having these issues.
Flags: needinfo?(isegall)
PS: Today I'm on ec2-54-149-75-43.us-west-2.compute.amazonaws.com
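For context, the slow calls are moztelemetry get_pings calls along these lines. This is a minimal sketch only: the filter values are illustrative rather than the ones in the affected notebooks, and the exact keyword names should be checked against the moztelemetry docs.

from moztelemetry import get_pings

# Illustrative filters only; the affected notebooks use their own values.
# A call like this normally completes in seconds/minutes but was taking hours.
pings = get_pings(sc,
                  app="Firefox",
                  channel="beta",
                  doc_type="main",
                  submission_date=("20160801", "20160820"),
                  fraction=0.1)
pings.count()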
Severity: normal → blocker
Travis, given comment 5 this sounds like an AWS issue? My understanding is that the only code change that landed in the last week was reverted and the problems are persisting. Any thoughts for how to proceed with investigating?
Flags: needinfo?(tblow)
Hey :RyanVM, I still haven't been able to repro the issue, but I'm going to look into turning on some increased metrics / logging on the AWS services that the Spark clusters are using. Once that's done I'll touch base with Ilana and see if we can force the issue to happen again and then look through said metrics / logs.
Flags: needinfo?(bimsland)
If anyone sees this again, please reopen and do the following:
- don't kill your cluster, leave it up for debugging
- let us know the instance in this bug, as in comment #6 https://bugzilla.mozilla.org/show_bug.cgi?id=1295359#c6
- notify one of the devops or data engineers in #datapipeline, so they can tag the instance to keep it around and/or get log files off of it for debugging
Blake or Roberto, let us know if there are other steps people should follow that would be helpful.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
jgaunt has reported slowness in Spark again. It seems related to the original issue Ilana reported. The cluster is: 6b78239f-b512-43cf-a829-26176506d472
I've put termination protection ON on the cluster, but I know from experience that I'll still get an email every hour (after 24h) saying it's terminated, so if someone knows how to disable that it would be great.
https://gist.github.com/ilanasegall/1dea80ff88647d8d98001bc46ee5f354 is the notebook that was being run.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
gregglind is also reporting slowness. The notebook, which took minutes before, now takes hours. Not sure which cluster it is though. The script is below; it looks like it was also using get_pings. https://gist.github.com/gregglind/497af94f4de9df6aaeca9010a148d289
Blake, could you make sure jgaunt's and gregglind's clusters are left untouched (by the reaper and what not) until the European crew has time to have a look at them?
Flags: needinfo?(bimsland)
Thanks everybody for all the details, we found the issue and fixed it in bug 1304693. This query

Dataset.from_source('telemetry') \
    .where(docType = 'heartbeat') \
    .where(submissionDate = lambda x: x >= "20160801" and x <= "20160920") \
    .where(appName = 'Firefox') \
    .where(appUpdateChannel = "beta") \
    .records(sc).count()

went down from 4h to 4-5 minutes. If you are curious, here is the fix: https://github.com/mozilla/python_moztelemetry/pull/81
No longer depends on: 1304693
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
ec2-54-244-99-10.us-west-2.compute.amazonaws.com with script at https://gist.github.com/ilanasegall/ea911ae8fbfe2d708d7cb21f454d0cc1
Even with 20 cores, over an hour wasn't enough to print the first line of

bucket = "telemetry-test-bucket"
prefix = "addons/v1"
%time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")

which generally takes a few seconds.
:ilana this is new. I tried to run that cell on a 1-node cluster and it completed in 20 seconds. I'll try to get access to your cluster to see what's going on.
Blake, could you please check the S3 logs for throttling/rate errors as well?
Flags: needinfo?(bimsland)
The issue must have been temporary: I tried to run the same cell on the same notebook/cluster and it took 6 seconds. It smells like a request-throttling problem, as suggested by rvitillo in the comment above. We really need a way to monitor the S3 request rate against the limits imposed by Amazon.
Currently there isn't a good way to aggregate the S3 request logs and Amazon doesn't expose any metrics about when they're "throttling" us but I'll look around and see if I can find any good solutions to do so.
Flags: needinfo?(bimsland)
(In reply to Blake Imsland [:robotblake] from comment #18)
> Currently there isn't a good way to aggregate the S3 request logs and Amazon
> doesn't expose any metrics about when they're "throttling" us but I'll look
> around and see if I can find any good solutions to do so.

Are the logs dumped on S3? If so you could run a Spark job to aggregate/analyze the logs.
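A minimal sketch of what such a Spark job could look like, assuming S3 server access logging is enabled and delivered to some bucket. The bucket/prefix names below are hypothetical, and filtering on "SlowDown" is just one heuristic for spotting throttled requests, not a confirmed detail of our setup.

# Sketch only: assumes S3 server access logs land in the (hypothetical)
# bucket/prefix below. Throttled requests show up as 503 "SlowDown" errors.
logs = sc.textFile("s3://example-s3-access-logs/telemetry/2016-09-*")

throttled = logs.filter(lambda line: "SlowDown" in line)

# split(" ")[7] is the S3 operation field (e.g. REST.GET.OBJECT) when the
# bracketed timestamp splits into two tokens; adjust the index if the log
# format differs. A naive split is good enough for a rough count per operation.
by_operation = (throttled
                .map(lambda line: (line.split(" ")[7], 1))
                .reduceByKey(lambda a, b: a + b))

print(by_operation.collect())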
Severity: blocker → normal
Product: Cloud Services → Cloud Services Graveyard