Closed Bug 1300224 Opened 9 years ago Closed 9 years ago

New data source proposal: installed add-ons

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jorgev, Assigned: mreid)

References

Details

Attachments

(3 files)

[telemetry-batch-view] mreid-moz:addons_view > mozilla:master 9 years ago GitHub Autolander Bot 56 bytes, text/x-github-pull-request		Details \| Review
[telemetry-airflow] mreid-moz:schedule_addons_bug1300224 > mozilla:master 9 years ago GitHub Autolander Bot 52 bytes, text/x-github-pull-request		Details \| Review
[telemetry-airflow] mreid-moz:addons_from_main_summary_bug1300224 > mozilla:master 9 years ago GitHub Autolander Bot 52 bytes, text/x-github-pull-request		Details \| Review

Jorge Villalobos [:jorgev] (he/him)

Reporter

Description

•

9 years ago

In the add-ons team we have a frequent need to get data about installed add-ons. The `longitudinal` table on Presto has this data, but it can be difficult to get to it because of all the nesting. I'm proposing a view that reflects the current state of add-on installations, where each row corresponds to an add-on. This query reflects what I'm thinking of (it probably needs tweaking): https://sql.telemetry.mozilla.org/queries/1128/source Each row would hold all the add-on metadata we get from the Add-ons Manager. Note there is at least one bit of add-on metadata (isSystem) that isn't on `longitudinal` at the moment. I added a few columns that I thought would be relevant: client id, release channel, os, and settings related to search engines, compatibility and blocklisting. There may be other bits that could be useful. Some possible queries we could run on this set: * Add-ons by installs (https://sql.telemetry.mozilla.org/queries/1131/source) * Add-ons by foreign install rate (https://sql.telemetry.mozilla.org/queries/1130/source) * Add-ons associated with certain non-default search engines * Add-ons installed with e10s on * Add-ons on Nightly. * Unsigned add-ons. * Add-ons installed per client.

Thomas Huelbert

Updated

•

9 years ago

Points: --- → 3

Priority: -- → P2

Thomas Huelbert

Comment 1

•

9 years ago

tagetting Q5 2016

Thomas Huelbert

Comment 2

•

9 years ago

uh, that would be "targeting Q4"

Mark Reid [:mreid]

Assignee

Comment 3

•

9 years ago

Jorge, are any of these fields functions of the addon itself, rather than specific to the user's configuration? Specifically, I'm thinking of: - description - scope - type - blocklisted - hasBinaryComponents - isSystem - signedState Can any/all of those be inferred from a (addon_id + version) or are they specific to a user's configuration?

Flags: needinfo?(jorge)

Jorge Villalobos [:jorgev] (he/him)

Reporter

Comment 4

•

9 years ago

(In reply to Mark Reid [:mreid] from comment #3) > - description Assuming an add-on is signed, this could be inferred from AMO data. > - scope This depends on how the add-on was installed, so it can't be inferred. > - type Assuming an add-on is signed, this could be inferred from AMO data. > - blocklisted We control the blocklist, but users or malicious add-ons can disable it. > - hasBinaryComponents Assuming an add-on is signed, this could be inferred from AMO data. > - isSystem There must be a way to infer this since we control the list of system add-ons, but I don't know what it involves. > - signedState Can't be inferred, since unsigned add-ons might not be on AMO.

Flags: needinfo?(jorge)

Mark Reid [:mreid]

Assignee

Updated

•

9 years ago

Assignee: nobody → mreid

Priority: P2 → P1

GitHub Autolander Bot

Comment 5

•

9 years ago

Attached file [telemetry-batch-view] mreid-moz:addons_view > mozilla:master — Details

Dave Zeber [:dzeber]

Comment 6

•

9 years ago

Nice work! I tried out some sample queries at [1]. This will let us answer the queries in Comment 0, although those that require other profile information (eg. channel) will require joining in data from another source. I tried this in the gist by joining to main_summary for a small subset. The data is structured as 1 row per active add-on per client/subsession. One class of queries we'd want to do involves viewing the data as a snapshot in time, in which each client has a fixed set of add-ons (eg. # clients with certain types of add-ons currently installed). We usually do this by looking at a client's current set of add-ons as reported in their most recent ping. I'd recommend adding `payload.info.profileSubsessionCounter` or something similar so that we can order a client's subsessions chronologically. This would also let us track enabling/disabling longitudinally for a given client. We may also want the client-side subsession datestamp `payload.info.subsessionStartDate` (although there is a discussion ongoing about this). That said, is there any reason why we wouldn't just want to add this add-on info into main_summary, as discussed in bug 1290181? Each row of the dataset is a client/subsession, and add-ons could be added as an array of maps, similar to search counts. This does still have a nested structure, but it would answer all the same queries, and make it easier to segment clients and sessions on other profile characteristics. [1] https://gist.github.com/dzeber/796c122ca6c9778277b4a2872bc9b82b

Comment 7

•

9 years ago

(In reply to Jorge Villalobos [:jorgev] from comment #0) > In the add-ons team we have a frequent need to get data about installed > add-ons. The `longitudinal` table on Presto has this data, but it can be > difficult to get to it because of all the nesting. What about using a view [1] so you can move the data in whatever form you like without the need to build a new dataset? [1] https://prestodb.io/docs/current/sql/create-view.html

Roberto Agostino Vitillo (:rvitillo)

Comment 8

•

9 years ago

(In reply to Dave Zeber [:dzeber] from comment #6) > That said, is there any reason why we wouldn't just want to add this add-on > info into main_summary, as discussed in bug 1290181? Each row of the dataset > is a client/subsession, and add-ons could be added as an array of maps, > similar to search counts. This does still have a nested structure, but it > would answer all the same queries, and make it easier to segment clients and > sessions on other profile characteristics. Agreed, see Bug 1290181. There are some scenarios for which we want 100% of add-on data and the longitudinal dataset isn't good enough, like when training the add-on recommender. For those use-cases I would rather have the data in main_summary than in a separate dataset though.

Jorge Villalobos [:jorgev] (he/him)

Reporter

Comment 9

•

9 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7) > What about using a view [1] so you can move the data in whatever form you > like without the need to build a new dataset? > > [1] https://prestodb.io/docs/current/sql/create-view.html That would require adding the view generation to each query, right? And adding the waiting time for the view to be generated for each query?

Roberto Agostino Vitillo (:rvitillo)

Comment 10

•

9 years ago

(In reply to Jorge Villalobos [:jorgev] from comment #9) > (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7) > > What about using a view [1] so you can move the data in whatever form you > > like without the need to build a new dataset? > > > > [1] https://prestodb.io/docs/current/sql/create-view.html > > That would require adding the view generation to each query, right? And > adding the waiting time for the view to be generated for each query? Yes, but Presto might cache the view I think. During development you can use a subset of the data to iterate quickly so it shouldn't really matter if running the query on the whole dataset is a bit slower.

Mark Reid [:mreid]

Assignee

Comment 11

•

9 years ago

My intention is to eventually put this structured addon data into the main_summary dataset. This is an interim dataset specifically designed to answer the above questions. Ideally, the addons information would be added to the main_summary dataset, then (if it's still needed) this addon-based dataset could be handled as a view or as a dependent dataset derived from main_summary (rather than by processing the raw data). I wanted to use this data to investigate performance of joining the addons data against main_summary. If that had worked Well Enough, we could have worked in the other direction (joining the addons data back to the environment info from main_summary). In my testing, however, it only worked well for very tiny datasets. Joining data for a few addons within a single day (something like 150,000 rows) with their matching documents in main_summary took only a couple of minutes on a 10-node cluster. Joining a full day's data, on the other hand, caused the 10-node cluster to run out of memory. To me, that means that a join-based approach is not workable for any real query. There was also a question of whether addon data from Main pings was even reliable at all. Ilana ran some Spark analyses using `get_pings`, which appeared to show about an order of magnitude too few clients using an example addon. I wanted to rule this out as a problem in the raw data before spending too much effort on integrating with the main_summary data. Quick tests using this dataset show that data from Main pings matches relatively well with the counts available from AMO, so the underlying data *does* appear to be reliable. That justifies spending more effort in building up datasets around Addons using Main pings as the source. So I still very much plan to address bug 1290181, but that will take a bit more time to ensure that we don't regress performance too much or encourage folks to blow up stmo with huge queries :)

Dave Zeber [:dzeber]

Comment 12

•

9 years ago

Ah, thanks for the clarifications - this approach makes a lot of sense :) One comment I have is that to answer questions about the "current state of add-ons", we generally want a single add-on record (ie. from a single ping) per client, usually the most recent one. That way we'd count add-on characteristcs by the number of clients that have it installed. At the moment I don't see a way to select the most recent ping per client from this dataset. If we add a column that indexes client pings chronologically (eg. payload.info.profileSubsessionCounter), that would let us not only select the most recent ping, but also collate a per-client longitudinal add-on history. (Granted, this may run into computation/resource issues.) This could give us some insight into enabling/disabling add-ons by seeing when an add-on drops off or appears in a client's longitudinal add-on history, something that we don't really have a view into currently.

GitHub Autolander Bot

Comment 13

•

9 years ago

Attached file [telemetry-airflow] mreid-moz:schedule_addons_bug1300224 > mozilla:master — Details

Mark Reid [:mreid]

Assignee

Updated

•

9 years ago

Blocks: 1305184

Mark Reid [:mreid]

Assignee

Updated

•

9 years ago

Depends on: 1306225

Mark Reid [:mreid]

Assignee

Comment 14

•

9 years ago

Dave, I submitted a PR against bug 1290181 that adds addon data to the main_summary dataset, and I'm hoping also addresses your concern in comment 12 about being able to order pings chronologically. Since payload.info.profileSubsessionCounter was not available in the main_summary dataset, I added the subsesionStartDate instead, hopefully that will suffice. Will that work for ordering the addon records?

Flags: needinfo?(dzeber)

GitHub Autolander Bot

Comment 15

•

9 years ago

Attached file [telemetry-airflow] mreid-moz:addons_from_main_summary_bug1300224 > mozilla:master — Details

Dave Zeber [:dzeber]

Comment 16

•

9 years ago

(In reply to Mark Reid [:mreid] from comment #14) > Will that work for ordering the addon records? Yes, that will be fine. Thanks!

Flags: needinfo?(dzeber)

Mark Reid [:mreid]

Assignee

Updated

•

9 years ago

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

7 years ago

Product: Cloud Services → Cloud Services Graveyard

You need to log in before you can comment on or make changes to this bug.