Closed Bug 1295359 Opened 9 years ago Closed 9 years ago
Spark clusters very slow
Categories: Cloud Services Graveyard :: Metrics: Pipeline, defect, P1
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: kparlante, Unassigned)
Whiteboard: [SvcOps]
spenrose and ilana reported problems using Spark all day today. This is blocking Test Pilot work. (It also affected today's interview.)
I know robotblake is looking at the logs; creating this bug to discuss/track.
Comment 1•9 years ago
Could someone please post the logs here with the steps to reproduce this error?
Flags: needinfo?(bimsland)
Comment 2•9 years ago
I did some manual testing and could not reproduce the slowness so far. According to Airflow's logs, the run-time of our scheduled jobs is not affected either.
Comment 3•9 years ago
We just pushed a tentative fix for a Spark issue we have seen, but we don't know if it will solve the issues spenrose and Ilana experienced. Ilana, which notebook were you using yesterday for the interview?
Flags: needinfo?(isegall)
Reporter
Comment 4•9 years ago
This is the interview notebook: https://gist.github.com/ilanasegall/f57a972e677811648cf106faadb9557b
Updated•9 years ago
Severity: blocker → normal
Updated•9 years ago
Priority: -- → P1
Comment 5•9 years ago
Yesterday the issue appeared fixed, but today we're having issues again.
The slowdown is in get_pings alone as far as I know (get_pings calls taking many hours when they normally take seconds or minutes), and robotblake indicated independently that it looks like an AWS issue. He commented "The hive logs have a ton of 404s in them," but we weren't sure where to go from there.
Both Kamyar (cc'd) and I are still having these issues.
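For reference, here's a minimal sketch of the kind of get_pings call that hangs. The filter values are placeholders rather than the exact query from our notebooks, and it assumes the moztelemetry get_pings keyword arguments (app/channel/submission_date/doc_type/fraction):
# Hypothetical sketch of a get_pings call of the kind that is slow right now.
# Filter values are placeholders; assumes the moztelemetry get_pings keyword API.
from moztelemetry import get_pings

pings = get_pings(sc,                      # SparkContext provided by the notebook
                  app="Firefox",
                  channel="beta",
                  submission_date=("20160801", "20160820"),
                  doc_type="main",
                  fraction=0.1)            # sample 10% of pings
print(pings.count())                       # normally finishes in seconds/minutes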
Flags: needinfo?(isegall)
Updated•9 years ago
Severity: normal → blocker
Comment 7•9 years ago
Travis, given comment 5, does this sound like an AWS issue? My understanding is that the only code change that landed in the last week was reverted and the problems are persisting. Any thoughts on how to proceed with the investigation?
Flags: needinfo?(tblow)
Updated•9 years ago
Flags: needinfo?(tblow)
Comment 8•9 years ago
Hey :RyanVM, I still haven't been able to repro the issue, but I'm going to look into turning on some increased metrics / logging on the AWS services that the Spark clusters are using. Once that's done I'll touch base with Ilana and see if we can force the issue to happen again and then look through said metrics / logs.
Flags: needinfo?(bimsland)
Reporter
Comment 9•9 years ago
If anyone sees this again, please reopen and do the following:
- don't kill your cluster, leave it up for debugging
- let us know the instance in this bug, as in comment #6 https://bugzilla.mozilla.org/show_bug.cgi?id=1295359#c6
- notify one of the devops or data engineers in #datapipeline, so they can tag the instance to keep it around and/or get log files off of it for debugging (a sketch of the tagging step is below)
Blake or Roberto, let us know if there are other steps people should follow that would be helpful.
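For the tagging step, roughly what a devops person might run; the tag key/value are made up for illustration (not an agreed convention), and it assumes boto3 with credentials already configured:
# Hypothetical sketch: tag an instance so it isn't reaped while we debug it.
# Instance id and tag key/value are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],   # the instance id reported in the bug
    Tags=[{"Key": "do-not-terminate", "Value": "bug-1295359"}],
)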
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Comment 10•9 years ago
jgaunt has reported slowness in spark again. It seems related to the original issue Ilana reported.
The cluster is: 6b78239f-b512-43cf-a829-26176506d472
I've turned termination protection ON for the cluster, but I know from experience that it will still email every hour to say the cluster has been terminated (after 24h), so if someone knows how to disable that, it would be great.
https://gist.github.com/ilanasegall/1dea80ff88647d8d98001bc46ee5f354 is the notebook that was being run.
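For reference, roughly how termination protection can be set on an EMR cluster with boto3; this is a sketch rather than the exact steps that were taken, the job flow id is a placeholder (the id above is the cluster's internal UUID), and the region is assumed from later comments:
# Sketch: enable termination protection on the EMR cluster under investigation.
# Assumes boto3 with credentials configured; job flow id is a placeholder.
import boto3

emr = boto3.client("emr", region_name="us-west-2")
emr.set_termination_protection(
    JobFlowIds=["j-XXXXXXXXXXXXX"],   # placeholder EMR job flow id
    TerminationProtected=True,
)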
Updated•9 years ago
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 11•9 years ago
gregglind is also reporting slowness. The notebook, which took minutes before, now takes hours. Not sure which cluster it is, though. The script is below; it looks like it was also using get_pings.
https://gist.github.com/gregglind/497af94f4de9df6aaeca9010a148d289
Comment 12•9 years ago
Blake, could you make sure jgaunt's and gregglind's clusters are left untouched (by the reaper and whatnot) until the European crew has time to have a look at them?
Flags: needinfo?(bimsland)
Updated•9 years ago
Flags: needinfo?(bimsland)
Comment 13•9 years ago
Thanks everybody for all the details; we found the issue and fixed it in bug 1304693. The following query
# Dataset API from python_moztelemetry; sc is the notebook's SparkContext.
# (Import path assumed from the library layout at the time.)
from moztelemetry.dataset import Dataset

Dataset.from_source('telemetry') \
       .where(docType='heartbeat') \
       .where(submissionDate=lambda x: x >= "20160801" and x <= "20160920") \
       .where(appName='Firefox') \
       .where(appUpdateChannel="beta") \
       .records(sc).count()
went down from 4 hours to 4-5 minutes. If you are curious, here is the fix: https://github.com/mozilla/python_moztelemetry/pull/81
No longer depends on: 1304693
Updated•9 years ago
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Comment 14•9 years ago
|
ec2-54-244-99-10.us-west-2.compute.amazonaws.com
with script at https://gist.github.com/ilanasegall/ea911ae8fbfe2d708d7cb21f454d0cc1
Even with 20 cores, it couldn't get through the following cell in over an hour:
bucket = "telemetry-test-bucket"
prefix = "addons/v1"
%time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")
which generally takes a few seconds.
Comment 15•9 years ago
:ilana this is new. I tried to run that cell on a 1-node cluster and it completed in 20 seconds. I'll try to get access to your cluster to see what's going on.
Comment 16•9 years ago
Blake, could you please check the S3 logs for throttling/rate errors as well?
Flags: needinfo?(bimsland)
Comment 17•9 years ago
The issue must have been temporary. I tried to run the same cell on the same notebook/cluster and it took 6 seconds.
It smells like a request throttling problem, as suggested by rvitillo in the comment above.
We really need a way to monitor the S3 request rate against the limits imposed by Amazon.
Comment 18•9 years ago
Currently there isn't a good way to aggregate the S3 request logs, and Amazon doesn't expose any metrics about when they're "throttling" us, but I'll look around and see if I can find a good solution.
Flags: needinfo?(bimsland)
Comment 19•9 years ago
(In reply to Blake Imsland [:robotblake] from comment #18)
> Currently there isn't a good way to aggregate the S3 request logs, and Amazon
> doesn't expose any metrics about when they're "throttling" us, but I'll look
> around and see if I can find a good solution.
Are the logs dumped on S3? If so, you could run a Spark job to aggregate/analyze them.
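If the S3 server access logs do end up in a bucket, here's a rough PySpark sketch of the kind of aggregation that would surface throttling; the bucket/prefix path is a placeholder and the log-format assumptions (plain-text standard S3 access logs, bracketed timestamps, "503"/"SlowDown" marking throttled requests) are mine, not confirmed details of our setup:
# Crude sketch: scan S3 server access logs for throttling responses.
# Bucket and prefix are placeholders; assumes plain-text standard S3 access logs.
logs = sc.textFile("s3://some-log-bucket/s3-access-logs/")   # placeholder path

throttled = logs.filter(lambda line: " 503 " in line or "SlowDown" in line)
print("throttled requests:", throttled.count())

# Rough breakdown by day (timestamps look like [20/Sep/2016:10:15:00 +0000]).
by_day = (throttled
          .map(lambda line: (line.split("[")[1].split(":")[0], 1))
          .reduceByKey(lambda a, b: a + b))
print(by_day.collect())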
Updated•9 years ago
Severity: blocker → normal
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard