Closed Bug 1309574 Opened 9 years ago Closed 8 years ago

Port executive report to use main_summary dataset

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: amiyaguchi)

Attachments

(2 files)

Port code at [1] to use the main_summary dataset. We may not need "inactives" or "five_of_seven" columns initially, as they are expensive to compute and not used for anything at the moment.
This would (I believe) let us get rid of the Redshift cluster as well as the scheduled job that populates it.
This job will also need to consume the crash data, but it can do that using get_pings + a join.
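To make that shape concrete, here is a minimal PySpark sketch: read main_summary from Parquet and join in per-client crash counts derived from crash pings. The S3 path, the get_pings parameters, and the column names are assumptions for illustration, not the actual job.

```python
# Sketch only: path, get_pings parameters, and column names are assumptions.
from moztelemetry import get_pings
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("topline-sketch").getOrCreate()
sc = spark.sparkContext

# main_summary is already derived and stored as Parquet; read it directly
# instead of re-deriving the same fields from raw main pings.
main_summary = spark.read.parquet("s3://telemetry-parquet/main_summary/v3")

# Crash data still comes from raw pings via get_pings, then a join on client_id.
crash_pings = get_pings(sc, doc_type="crash", submission_date="20161012")
crash_counts = (
    crash_pings.map(lambda ping: Row(client_id=ping.get("clientId")))
    .toDF()
    .groupBy("client_id")
    .agg(F.count("*").alias("crash_count"))
)

joined = main_summary.join(crash_counts, on="client_id", how="left")
```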
Points: --- → 3
Priority: -- → P3
Blocks: 1309633
Assignee: nobody → amiyaguchi
I talked to Mark about this bug (and a few others). I'll be picking this up.
Priority: P3 → P1
Attached file topline_summary.ipynb
Attached is a pyspark notebook that outlines the general approach to porting the executive/topline summary. This notebook takes ~20 minutes with a day of data and ~2:20 with a week of data on a 5-machine cluster. It is prohibitively slow with any more data and would probably take over 10 hours to complete. For reference, the original script takes about 4 hours to run on the Redshift cluster.

:mreid and I suspect that most of the time is being spent in user-defined functions. A benchmark comparing regexes in Python and Java [1] shows a performance difference that would be very significant on a large dataset (say, 30 days' worth of main_summary data). Mark has also mentioned poor performance with Python's date string conversions in the past. I will be rewriting this notebook in Scala, which will hopefully improve performance.

Most of the notebook has been ported, aside from collecting search count numbers and some tests. Once this is done I can start validating that the numbers look right.

[1] https://benchmarksgame.alioth.debian.org/u64q/python.html
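To make the UDF overhead concrete: a Python UDF ships every value out to a Python worker process, while the equivalent built-in expression stays in the JVM. A hypothetical comparison (the dataframe `df` and the date format are assumptions, not the notebook's actual code):

```python
from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Python UDF: every row is serialized out to a Python worker, parsed with
# datetime.strptime, and serialized back -- the per-row cost suspected above.
parse_date = F.udf(
    lambda s: datetime.strptime(s, "%Y%m%d").strftime("%Y-%m-%d") if s else None,
    StringType(),
)
slow = df.withColumn("activity_date", parse_date("submission_date"))

# Built-in equivalent: the whole expression runs in the JVM with no Python
# round trip, which is the effect rewriting the job in Scala should have.
fast = df.withColumn(
    "activity_date",
    F.from_unixtime(F.unix_timestamp("submission_date", "yyyyMMdd"), "yyyy-MM-dd"),
)
```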
Priority: P1 → P2
I've hit a roadblock trying to run my implementation of the ToplineSummary. I've narrowed it down to a single function that is hard to unit test [1]. I know it's failing here because I forced the dataframe to collect through a call to .count() and watched it fail in the Spark UI. The relevant stack trace suggests that it is probably failing on a null attribute somewhere.

> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

I think this might be caused by the mapping of `messageToRow` over the `Dataset('telemetry')` RDD. My implementation is similar to the one in MainSummaryView [2]. Any ideas on how to prod at this problem?

[1] https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/ToplineSummary.scala#L161-L199
[2] https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L208
Flags: needinfo?(mreid)
It looks like this issue has something to do with accessing the Spark session [1]; my bug follows the same pattern as the issue reproduced there. I've implemented a proper singleton, which fixes the failure above.

[1] https://issues.apache.org/jira/browse/SPARK-16599
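For the record, the shape of the fix is a lazily-initialized singleton: every code path goes through one accessor instead of holding a session reference that gets captured in closures and serialized out to executors. The actual change was in the Scala ToplineSummary; a Python sketch of the same pattern (names are mine, not from the patch):

```python
from pyspark.sql import SparkSession

_session = None

def get_spark_session():
    # Create the session once, lazily, and hand out the same instance
    # everywhere, rather than capturing a session reference inside closures
    # that Spark serializes to executors (the failure mode in SPARK-16599).
    global _session
    if _session is None:
        _session = SparkSession.builder.getOrCreate()
    return _session
```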
Flags: needinfo?(mreid)
Priority: P2 → P1
Depends on: 1329842
Depends on: 1329844
Blocks: 1320702
Blocks: 1352443
This bug is getting bumped up in light of a day-long outage this past week. Work will be tracked in bug 1329844.

The ToplineSummary is currently in the review process, but it won't be deployable as-is: it generates data that is too granular for the dashboard. A new job will be added to python_etl that performs the same role as reformat_v4.py: it computes 'ALL' rows, aggregates the remaining countries in geo into 'Rest of World', and uploads the resulting dataframe to the dashboard buckets. These two jobs will be scheduled on Airflow.

I'd like the new jobs to run in parallel with the existing job for at least 2 weekly cycles, to make sure everything is performing correctly before swapping over.
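Roughly, the rollup the new job performs looks like this (a sketch; the country whitelist, column names, and metric are assumptions rather than the real reformat_v4.py logic):

```python
from pyspark.sql import functions as F

TOP_COUNTRIES = ["US", "DE", "GB"]  # hypothetical whitelist

# Collapse every geo outside the whitelist into a single 'Rest of World' bucket.
by_geo = (
    df.withColumn(
        "geo",
        F.when(F.col("geo").isin(TOP_COUNTRIES), F.col("geo"))
         .otherwise("Rest of World"),
    )
    .groupBy("geo")
    .agg(F.sum("hours").alias("hours"))
)

# Add the 'ALL' row: the same metric aggregated over every geo.
all_row = by_geo.agg(F.sum("hours").alias("hours")).withColumn("geo", F.lit("ALL"))
rollup = by_geo.union(all_row.select("geo", "hours"))
```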
Depends on: 1357875
No longer depends on: 1357875
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
This work has been completed in the context of python_mozetl at https://github.com/mozilla/python_mozetl/tree/master/mozetl/topline
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Datasets: General → General