Closed Bug 1309574 Opened 9 years ago Closed 8 years ago

Port executive report to use main_summary dataset

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: amiyaguchi)

Attachments

(2 files)

Port code at [1] to use the main_summary dataset. We may not need "inactives" or "five_of_seven" columns initially, as they are expensive to compute and not used for anything at the moment.
This would (I believe) let us get rid of the Redshift cluster as well as the scheduled job that populates it.
This job will also need to consume the crash data, but it can do that using get_pings + a join.
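To make that shape concrete, here is a minimal PySpark sketch: read main_summary from Parquet and join in per-client crash counts derived from crash pings. The S3 path, the get_pings parameters, and the column names are assumptions for illustration, not the actual job.

```python
# Sketch only: path, get_pings parameters, and column names are assumptions.
from moztelemetry import get_pings
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("topline-sketch").getOrCreate()
sc = spark.sparkContext

# main_summary is already derived and stored as Parquet; read it directly
# instead of re-deriving the same fields from raw main pings.
main_summary = spark.read.parquet("s3://telemetry-parquet/main_summary/v3")

# Crash data still comes from raw pings via get_pings, then a join on client_id.
crash_pings = get_pings(sc, doc_type="crash", submission_date="20161012")
crash_counts = (
    crash_pings.map(lambda ping: Row(client_id=ping.get("clientId")))
    .toDF()
    .groupBy("client_id")
    .agg(F.count("*").alias("crash_count"))
)

joined = main_summary.join(crash_counts, on="client_id", how="left")
```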
Points: --- → 3
Priority: -- → P3
Blocks: 1309633
Assignee: nobody → amiyaguchi
I talked to Mark about this bug (and a few others). I'll be picking this up.
Priority: P3 → P1
Attached file topline_summary.ipynb
Attached is a pyspark notebook that outlines the general approach to porting the executive/topline summary. This notebook takes ~20 minutes with a day of data and ~2:20 with a week of data on a 5-machine cluster. It is prohibitively slow with any more data and would probably take over 10 hours to complete. For reference, the original script takes about 4 hours to run on the Redshift cluster.

:mreid and I suspect that most of the time is being spent in user-defined functions. A benchmark comparing regexes in Python and Java [1] shows a performance difference that would be very significant on a large dataset (say, 30 days' worth of main_summary data). Mark has also mentioned poor performance with Python's date string conversions in the past. I will be rewriting this notebook in Scala, which will hopefully improve performance.

Most of the notebook has been ported, aside from collecting search count numbers and some tests. Once this is done I can start validating that the numbers look right.

[1] https://benchmarksgame.alioth.debian.org/u64q/python.html
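To make the UDF overhead concrete: a Python UDF ships every value out to a Python worker process, while the equivalent built-in expression stays in the JVM. A hypothetical comparison (the dataframe `df` and the date format are assumptions, not the notebook's actual code):

```python
from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Python UDF: every row is serialized out to a Python worker, parsed with
# datetime.strptime, and serialized back -- the per-row cost suspected above.
parse_date = F.udf(
    lambda s: datetime.strptime(s, "%Y%m%d").strftime("%Y-%m-%d") if s else None,
    StringType(),
)
slow = df.withColumn("activity_date", parse_date("submission_date"))

# Built-in equivalent: the whole expression runs in the JVM with no Python
# round trip, which is the effect rewriting the job in Scala should have.
fast = df.withColumn(
    "activity_date",
    F.from_unixtime(F.unix_timestamp("submission_date", "yyyyMMdd"), "yyyy-MM-dd"),
)
```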
Priority: P1 → P2
I've hit a roadblock trying to run my implementation of the ToplineSummary. I've narrowed it down to a single function that is hard to unit test [1]. I know it's failing here because I forced the dataframe to collect through a call to .count() and watched it fail in the Spark UI. The relevant stack trace suggests that it is probably failing on a null attribute somewhere.

> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

I think this might be caused by the mapping of `messageToRow` over the `Dataset('telemetry')` RDD. My implementation is similar to the one in MainSummaryView [2]. Any ideas on how to prod at this problem?

[1] https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/ToplineSummary.scala#L161-L199
[2] https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L208
Flags: needinfo?(mreid)
It looks like this issue has something to do with accessing the Spark session [1]; my bug follows the same pattern as the issue reproduced there. I've implemented a proper singleton, which fixes the failure above.

[1] https://issues.apache.org/jira/browse/SPARK-16599
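For the record, the shape of the fix is a lazily-initialized singleton: every code path goes through one accessor instead of holding a session reference that gets captured in closures and serialized out to executors. The actual change was in the Scala ToplineSummary; a Python sketch of the same pattern (names are mine, not from the patch):

```python
from pyspark.sql import SparkSession

_session = None

def get_spark_session():
    # Create the session once, lazily, and hand out the same instance
    # everywhere, rather than capturing a session reference inside closures
    # that Spark serializes to executors (the failure mode in SPARK-16599).
    global _session
    if _session is None:
        _session = SparkSession.builder.getOrCreate()
    return _session
```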
Flags: needinfo?(mreid)
Priority: P2 → P1
Depends on: 1329842
Depends on: 1329844
Blocks: 1320702
Blocks: 1352443
This bug is getting bumped up in light of a day-long outage this past week. Work will be tracked in bug 1329844.

The ToplineSummary is currently in the review process, but it won't be deployable as-is: it generates data that is too granular for the dashboard. A new job will be added to python_etl that performs the same role as reformat_v4.py: it computes 'ALL' rows, aggregates the remaining countries in geo into 'Rest of World', and uploads the resulting dataframe to the dashboard buckets. These two jobs will be scheduled on Airflow.

I'd like the new jobs to run in parallel with the existing job for at least 2 weekly cycles, to make sure everything is performing correctly before swapping over.
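Roughly, the rollup the new job performs looks like this (a sketch; the country whitelist, column names, and metric are assumptions rather than the real reformat_v4.py logic):

```python
from pyspark.sql import functions as F

TOP_COUNTRIES = ["US", "DE", "GB"]  # hypothetical whitelist

# Collapse every geo outside the whitelist into a single 'Rest of World' bucket.
by_geo = (
    df.withColumn(
        "geo",
        F.when(F.col("geo").isin(TOP_COUNTRIES), F.col("geo"))
         .otherwise("Rest of World"),
    )
    .groupBy("geo")
    .agg(F.sum("hours").alias("hours"))
)

# Add the 'ALL' row: the same metric aggregated over every geo.
all_row = by_geo.agg(F.sum("hours").alias("hours")).withColumn("geo", F.lit("ALL"))
rollup = by_geo.union(all_row.select("geo", "hours"))
```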
Depends on: 1357875
No longer depends on: 1357875
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
This work has been completed in the context of python_mozetl at https://github.com/mozilla/python_mozetl/tree/master/mozetl/topline
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Datasets: General → General