Closed
Bug 1300224
Opened 9 years ago
Closed 9 years ago
New data source proposal: installed add-ons
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jorgev, Assigned: mreid)
References
Details
Attachments
(3 files)
In the add-ons team we have a frequent need to get data about installed add-ons. The `longitudinal` table on Presto has this data, but it can be difficult to get to it because of all the nesting.
I'm proposing a view that reflects the current state of add-on installations, where each row corresponds to an add-on. This query reflects what I'm thinking of (it probably needs tweaking): https://sql.telemetry.mozilla.org/queries/1128/source
Each row would hold all the add-on metadata we get from the Add-ons Manager. Note there is at least one bit of add-on metadata (isSystem) that isn't on `longitudinal` at the moment. I added a few columns that I thought would be relevant: client id, release channel, os, and settings related to search engines, compatibility and blocklisting. There may be other bits that could be useful.
Some possible queries we could run on this set:
* Add-ons by installs (https://sql.telemetry.mozilla.org/queries/1131/source)
* Add-ons by foreign install rate (https://sql.telemetry.mozilla.org/queries/1130/source)
* Add-ons associated with certain non-default search engines
* Add-ons installed with e10s on
* Add-ons on Nightly.
* Unsigned add-ons.
* Add-ons installed per client.
Updated•9 years ago
|
Points: --- → 3
Priority: -- → P2
Comment 1•9 years ago
|
||
tagetting Q5 2016
Comment 2•9 years ago
|
||
uh, that would be "targeting Q4"
Assignee | ||
Comment 3•9 years ago
|
||
Jorge, are any of these fields functions of the addon itself, rather than specific to the user's configuration? Specifically, I'm thinking of:
- description
- scope
- type
- blocklisted
- hasBinaryComponents
- isSystem
- signedState
Can any/all of those be inferred from a (addon_id + version) or are they specific to a user's configuration?
Flags: needinfo?(jorge)
Reporter | ||
Comment 4•9 years ago
|
||
(In reply to Mark Reid [:mreid] from comment #3)
> - description
Assuming an add-on is signed, this could be inferred from AMO data.
> - scope
This depends on how the add-on was installed, so it can't be inferred.
> - type
Assuming an add-on is signed, this could be inferred from AMO data.
> - blocklisted
We control the blocklist, but users or malicious add-ons can disable it.
> - hasBinaryComponents
Assuming an add-on is signed, this could be inferred from AMO data.
> - isSystem
There must be a way to infer this since we control the list of system add-ons, but I don't know what it involves.
> - signedState
Can't be inferred, since unsigned add-ons might not be on AMO.
Flags: needinfo?(jorge)
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → mreid
Priority: P2 → P1
Comment 5•9 years ago
|
||
Comment 6•9 years ago
|
||
Nice work! I tried out some sample queries at [1]. This will let us answer the queries in Comment 0, although those that require other profile information (eg. channel) will require joining in data from another source. I tried this in the gist by joining to main_summary for a small subset.
The data is structured as 1 row per active add-on per client/subsession. One class of queries we'd want to do involves viewing the data as a snapshot in time, in which each client has a fixed set of add-ons (eg. # clients with certain types of add-ons currently installed). We usually do this by looking at a client's current set of add-ons as reported in their most recent ping. I'd recommend adding `payload.info.profileSubsessionCounter` or something similar so that we can order a client's subsessions chronologically. This would also let us track enabling/disabling longitudinally for a given client. We may also want the client-side subsession datestamp `payload.info.subsessionStartDate` (although there is a discussion ongoing about this).
That said, is there any reason why we wouldn't just want to add this add-on info into main_summary, as discussed in bug 1290181? Each row of the dataset is a client/subsession, and add-ons could be added as an array of maps, similar to search counts. This does still have a nested structure, but it would answer all the same queries, and make it easier to segment clients and sessions on other profile characteristics.
[1] https://gist.github.com/dzeber/796c122ca6c9778277b4a2872bc9b82b
See Also: → 1290181
Comment 7•9 years ago
|
||
(In reply to Jorge Villalobos [:jorgev] from comment #0)
> In the add-ons team we have a frequent need to get data about installed
> add-ons. The `longitudinal` table on Presto has this data, but it can be
> difficult to get to it because of all the nesting.
What about using a view [1] so you can move the data in whatever form you like without the need to build a new dataset?
[1] https://prestodb.io/docs/current/sql/create-view.html
Comment 8•9 years ago
|
||
(In reply to Dave Zeber [:dzeber] from comment #6)
> That said, is there any reason why we wouldn't just want to add this add-on
> info into main_summary, as discussed in bug 1290181? Each row of the dataset
> is a client/subsession, and add-ons could be added as an array of maps,
> similar to search counts. This does still have a nested structure, but it
> would answer all the same queries, and make it easier to segment clients and
> sessions on other profile characteristics.
Agreed, see Bug 1290181. There are some scenarios for which we want 100% of add-on data and the longitudinal dataset isn't good enough, like when training the add-on recommender. For those use-cases I would rather have the data in main_summary than in a separate dataset though.
Reporter | ||
Comment 9•9 years ago
|
||
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7)
> What about using a view [1] so you can move the data in whatever form you
> like without the need to build a new dataset?
>
> [1] https://prestodb.io/docs/current/sql/create-view.html
That would require adding the view generation to each query, right? And adding the waiting time for the view to be generated for each query?
Comment 10•9 years ago
|
||
(In reply to Jorge Villalobos [:jorgev] from comment #9)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7)
> > What about using a view [1] so you can move the data in whatever form you
> > like without the need to build a new dataset?
> >
> > [1] https://prestodb.io/docs/current/sql/create-view.html
>
> That would require adding the view generation to each query, right? And
> adding the waiting time for the view to be generated for each query?
Yes, but Presto might cache the view I think. During development you can use a subset of the data to iterate quickly so it shouldn't really matter if running the query on the whole dataset is a bit slower.
Assignee | ||
Comment 11•9 years ago
|
||
My intention is to eventually put this structured addon data into the main_summary dataset.
This is an interim dataset specifically designed to answer the above questions. Ideally, the addons information would be added to the main_summary dataset, then (if it's still needed) this addon-based dataset could be handled as a view or as a dependent dataset derived from main_summary (rather than by processing the raw data).
I wanted to use this data to investigate performance of joining the addons data against main_summary. If that had worked Well Enough, we could have worked in the other direction (joining the addons data back to the environment info from main_summary).
In my testing, however, it only worked well for very tiny datasets. Joining data for a few addons within a single day (something like 150,000 rows) with their matching documents in main_summary took only a couple of minutes on a 10-node cluster. Joining a full day's data, on the other hand, caused the 10-node cluster to run out of memory. To me, that means that a join-based approach is not workable for any real query.
There was also a question of whether addon data from Main pings was even reliable at all. Ilana ran some Spark analyses using `get_pings`, which appeared to show about an order of magnitude too few clients using an example addon. I wanted to rule this out as a problem in the raw data before spending too much effort on integrating with the main_summary data.
Quick tests using this dataset show that data from Main pings matches relatively well with the counts available from AMO, so the underlying data *does* appear to be reliable. That justifies spending more effort in building up datasets around Addons using Main pings as the source.
So I still very much plan to address bug 1290181, but that will take a bit more time to ensure that we don't regress performance too much or encourage folks to blow up stmo with huge queries :)
Comment 12•9 years ago
|
||
Ah, thanks for the clarifications - this approach makes a lot of sense :)
One comment I have is that to answer questions about the "current state of add-ons", we generally want a single add-on record (ie. from a single ping) per client, usually the most recent one. That way we'd count add-on characteristcs by the number of clients that have it installed. At the moment I don't see a way to select the most recent ping per client from this dataset.
If we add a column that indexes client pings chronologically (eg. payload.info.profileSubsessionCounter), that would let us not only select the most recent ping, but also collate a per-client longitudinal add-on history. (Granted, this may run into computation/resource issues.) This could give us some insight into enabling/disabling add-ons by seeing when an add-on drops off or appears in a client's longitudinal add-on history, something that we don't really have a view into currently.
Comment 13•9 years ago
|
||
Assignee | ||
Comment 14•9 years ago
|
||
Dave, I submitted a PR against bug 1290181 that adds addon data to the main_summary dataset, and I'm hoping also addresses your concern in comment 12 about being able to order pings chronologically.
Since payload.info.profileSubsessionCounter was not available in the main_summary dataset, I added the subsesionStartDate instead, hopefully that will suffice.
Will that work for ordering the addon records?
Flags: needinfo?(dzeber)
Comment 15•9 years ago
|
||
Comment 16•9 years ago
|
||
(In reply to Mark Reid [:mreid] from comment #14)
> Will that work for ordering the addon records?
Yes, that will be fine. Thanks!
Flags: needinfo?(dzeber)
Assignee | ||
Updated•9 years ago
|
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Product: Cloud Services → Cloud Services Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•