Open
Bug 1195160
Opened 10 years ago
Updated 3 years ago
Investigate TELEMETRY_PENDING_LOAD_FAILURE_PARSE failures per-client
Categories
(Toolkit :: Telemetry, defect, P3)
NEW
People
(Reporter: vladan, Unassigned)
References
Details
(Whiteboard: [measurement:client])
Attachments
(2 files)
Saw these errors in my Browser Console, and when I checked the ping file, it was all binary zeros except for a CR+LF at the end. I changed the ping filename for privacy.
1439777412123 Toolkit.Telemetry ERROR TelemetrySend::sendPersistedPings - failed to send ping xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb: PingParseError: JSON.parse: unexpected character at line 1 column 1 of the JSON data JS Stack trace: PingParseError@TelemetryStorage.jsm:94:15 < TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1458:13 < TaskImpl_run@Task.jsm:314:40 < Handler.prototype.process@Promise-backend.js:934:23 < this.PromiseWalker.walkerLoop@Promise-backend.js:813:7 < Promise*this.PromiseWalker.scheduleWalkerLoop@Promise-backend.js:744:11 < this.PromiseWalker.schedulePromise@Promise-backend.js:776:7 < this.PromiseWalker.completePromise@Promise-backend.js:711:7 < onSuccess@osfile_native.jsm:63:7 Log.jsm:749:0
1439777472113 Toolkit.Telemetry ERROR TelemetrySend::sendPersistedPings - failed to send ping xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb: PingParseError: JSON.parse: unexpected character at line 1 column 1 of the JSON data JS Stack trace: PingParseError@TelemetryStorage.jsm:94:15 < TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1458:13 < TaskImpl_run@Task.jsm:314:40 < Handler.prototype.process@Promise-backend.js:934:23 < this.PromiseWalker.walkerLoop@Promise-backend.js:813:7 < Promise*this.PromiseWalker.scheduleWalkerLoop@Promise-backend.js:744:11 < this.PromiseWalker.schedulePromise@Promise-backend.js:776:7 < this.PromiseWalker.completePromise@Promise-backend.js:711:7 < onSuccess@osfile_native.jsm:63:7 Log.jsm:749:0
1439777532114 Toolkit.Telemetry ERROR TelemetrySend::sendPersistedPings - failed to send ping xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb: PingParseError: JSON.parse: unexpected character at line 1 column 1 of the JSON data JS Stack trace: PingParseError@TelemetryStorage.jsm:94:15 < TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1458:13 < TaskImpl_run@Task.jsm:314:40 < Handler.prototype.process@Promise-backend.js:934:23 < this.PromiseWalker.walkerLoop@Promise-backend.js:813:7 < Promise*this.PromiseWalker.scheduleWalkerLoop@Promise-backend.js:744:11 < this.PromiseWalker.schedulePromise@Promise-backend.js:776:7 < this.PromiseWalker.completePromise@Promise-backend.js:711:7 < onSuccess@osfile_native.jsm:63:7 Log.jsm:749:0
1439777592115 Toolkit.Telemetry ERROR TelemetrySend::sendPersistedPings - failed to send ping xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb: PingParseError: JSON.parse: unexpected character at line 1 column 1 of the JSON data JS Stack trace: PingParseError@TelemetryStorage.jsm:94:15 < TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1458:13 < TaskImpl_run@Task.jsm:314:40 < Handler.prototype.process@Promise-backend.js:934:23 < this.PromiseWalker.walkerLoop@Promise-backend.js:813:7 < Promise*this.PromiseWalker.scheduleWalkerLoop@Promise-backend.js:744:11 < this.PromiseWalker.schedulePromise@Promise-backend.js:776:7 < this.PromiseWalker.completePromise@Promise-backend.js:711:7 < onSuccess@osfile_native.jsm:63:7 Log.jsm:749:0
Flags: needinfo?(gfritzsche)
Reporter
Comment 1•10 years ago
This was with Aurora 42 on Windows 7 with the Aug 14th build.
The ping was created on Friday, August 07, 2015, at 2:00:07 AM local Toronto time.
Based on my Firefox update history, the ping was created from an Aurora 41 build from August 6th.
I only looked in the Browser Console because the Firefox app menu wasn't opening in one of my windows, which is when I saw this exception. I had noticed that menu issue before; could it somehow be caused by this exception?
Comment 2•10 years ago
This just logs the exception; that part is fine.
The problem is that we do get bad pings written out; I don't know why yet.
This shows up in our Telemetry with significant numbers: http://mzl.la/1fk5Kjy
Note that this is biased until bug 1189425 lands (affected pings are counted multiple times).
Blocks: 1122482
Flags: needinfo?(gfritzsche)
Updated•10 years ago
Summary: Telemetry writes bad data in persisted ping, throws exception when trying to read it → Telemetry writes bad data in persisted ping
Comment 3•10 years ago
Comment 4•10 years ago
Yoric, do you have ideas what could cause this?
My understanding was that OS.File has blockers in place, preventing this from happening?
Even crashing mid-write should be fine because of the use of a temporary file?
Flags: needinfo?(dteller)
Reporter
Comment 5•10 years ago
The dashboard link in your fhr-dev post indicates that only 3 sessions reported any unreadable pings, although 2 of those sessions had hundreds of unreadable pings (or read attempts?).
The situation with unparseable pings in Aurora/Nightly 42 looks a lot worse, with many more sessions reporting unreadable pings: http://mzl.la/1K1HU90
Note that the above histogram increments the count for every failed parse attempt, even if it's always for the same ping file.
37.02k Nightly 42 sessions reporting at least one unparseable ping
4.92M Nightly 42 sessions
This is roughly 0.75% of sessions, which could explain some of the broken subsession-chain problems.
Comment 6•10 years ago
Looking at this as a % of sessions we also overcount a lot; I think we need to break it down by the number of affected clients for now.
Comment 7•10 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> Yoric, do you have ideas what could cause this?
>
> My understanding was that OS.File has blockers in place, preventing
> this from happening?
> Even crashing mid-write should be fine because of the use of a temporary
> file?
David says:
> OS.File has AsyncShutdown blockers that should prevent this from happening.
> Also, `writeAtomic` makes half-writes very unlikely, unless we have an OS/hardware-level crash.
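For background, AsyncShutdown blockers work roughly like this (a minimal sketch of the general mechanism, not OS.File's actual blocker; the phase and flush function below are illustrative):

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm");

    // A client registers a blocker on a shutdown phase; shutdown does not
    // proceed past that phase until the blocker's promise resolves.
    AsyncShutdown.profileBeforeChange.addBlocker(
      "Example: flush pending data before profile teardown",
      () => flushPendingDataSketch()  // hypothetical async flush function
    );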
Comment 8•10 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #5)
> The dashboard link in your fhr-dev post indicates that only 3 sessions
> reported any unreadable pings, although 2 of those sessions had hundreds of
> unreadable pings (or read attempts?).
Note that those report 2 different errors:
* loading pings from disk - TELEMETRY_PENDING_LOAD_FAILURE_READ, http://mzl.la/1JcLMDj
* parsing pings to JSON after loading them from disk - TELEMETRY_PENDING_LOAD_FAILURE_PARSE, http://mzl.la/1fk5Kjy
Comment 9•10 years ago
If `writeAtomic` is called with a `tmpPath` option, it first writes to a temporary file, which should make sure that we cannot overwrite a "good" file with a bad file in case of Firefox crash or disk full. This does not protect against OS-level, battery or hardware crash.
Also, OS.File also has blockers to make sure that data is written timely. If, for some reason, web-workers-shutdown is not respected by some shutdown path, these could be completely skipped, though.
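For reference, the write pattern being described looks roughly like this (a minimal sketch; the file name and helper are illustrative, not the exact TelemetryStorage.jsm code):

    Components.utils.import("resource://gre/modules/osfile.jsm");  // defines OS

    // writeAtomic writes to tmpPath first and then renames it over the
    // destination, so a Firefox crash or a full disk mid-write should not
    // replace an existing good file with a truncated one. As noted above,
    // this does not help against OS-level, battery or hardware failure.
    function savePingSketch(pingJsonString) {
      return OS.File.writeAtomic("pending-ping.json", pingJsonString, {
        tmpPath: "pending-ping.json.tmp",
        encoding: "utf-8",  // the string is UTF-8 encoded before being written
      });
    }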
Flags: needinfo?(dteller)
Comment 10•10 years ago
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #9)
> If `writeAtomic` is called with a `tmpPath` option, it first writes to a
> temporary file, which should make sure that we cannot overwrite a "good"
> file with a bad file in case of Firefox crash or disk full. This does not
> protect against OS-level, battery or hardware crash.
Right, we use that:
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/components/telemetry/TelemetryStorage.jsm#1157
> Also, OS.File also has blockers to make sure that data is written timely.
> If, for some reason, web-workers-shutdown is not respected by some shutdown
> path, these could be completely skipped, though.
What kind of scenario are you thinking about here?
Comment 11•10 years ago
This came up on IRC, checking tomorrow:
> so JSON.stringify could fail, or TextEncoder.encode(), or Lz4.compressFileContent (unlikely in this path though)
Flags: needinfo?(gfritzsche)
Comment 12•10 years ago
Tracing the code path here (note that we don't currently compress pending pings):
(1) cloneInto()
(2) JSON.stringify()
- https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/js/src/json.cpp#864
- might return `undefined`, but passing `undefined` to writeAtomic throws: "TypeError: Expecting a Typed Array"
- should throw on failures
(3) OS.File.writeAtomic()
- https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_async_front.jsm#1163
- throws if the buffer argument is `undefined`, `0`, `{}`, `[]`, or `[0]`
- converts the buffer via Type.void_t.in_ptr.toMsg and passes the message to the worker
(4) AbstractFile.writeAtomic()
- https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_shared_front.jsm#396
(5) TextEncoder("utf-8").encode()
- https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_shared_front.jsm#411
- returns Uint8Array
(6) writing out data to the file:
- https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_shared_front.jsm#445
I don't see an obvious problem here. My best guess so far is that TextEncoder().encode() ends up returning an array of 0s, or that we have some bug around the buffers that get passed around.
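Put together, a condensed sketch of steps (2)-(6) looks like this (hypothetical function name; the real code is split across TelemetryStorage.jsm and the osfile_*.jsm modules linked above):

    Components.utils.import("resource://gre/modules/osfile.jsm");  // defines OS

    async function writePendingPingSketch(pingData, filePath) {
      // (2) Serialize. If this returned undefined, writeAtomic would throw
      //     ("Expecting a Typed Array"), so we would fail loudly, not write zeros.
      let json = JSON.stringify(pingData);

      // (5) Encode to UTF-8 bytes. A zero-filled result here would produce
      //     exactly the kind of all-NUL file from comment 0.
      let bytes = new TextEncoder().encode(json);

      // (3)/(4)/(6) Hand the buffer to the OS.File worker, which writes it to
      //     a temporary file and renames it into place.
      await OS.File.writeAtomic(filePath, bytes, { tmpPath: filePath + ".tmp" });
    }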
Flags: needinfo?(gfritzsche)
Comment 13•10 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #10)
> > Also, OS.File also has blockers to make sure that data is written timely.
> > If, for some reason, web-workers-shutdown is not respected by some shutdown
> > path, these could be completely skipped, though.
>
> What kind of scenario are you thinking about here?
I don't think that there is any shutdown path that doesn't respect web-workers-shutdown, but I have never checked how we shut down when we update Firefox, for instance.
Updated•10 years ago
Priority: -- → P2
Comment 14•10 years ago
Georg, could it be that for some reason the data either cannot be converted to JSON, or results in JSON with invalid UTF data somehow?
Flags: needinfo?(gfritzsche)
Comment 15•10 years ago
Alessio has been rechecking my reading and stress testing a bit with UTF data; I'll leave that for him to comment on.
Vladan, I was thinking that you should have that ping archived - can you check and either let us access it or note some basics here (especially whether it was a shutdown ping, totalTime/sessionLength/..., subsessionCount, aborts, ...)?
Flags: needinfo?(vdjeric)
Comment 16•10 years ago
I've been playing with this a bit, feeding writeAtomic and JSON.stringify with weird UTF-8 data from [0]. "Unfortunately", they seem to cope well with the data from that page.
Since passing a Uint8Array filled with nulls to writeAtomic correctly writes a file full of NULs, I guess we're either stumbling on a TextEncoder edge case (not covered by the data in [0]) or hitting some shutdown madness.
[0] - http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
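For context, a minimal sketch of that kind of stress test (directory, file names and sample strings are made up for illustration):

    Components.utils.import("resource://gre/modules/osfile.jsm");  // defines OS

    async function stressWriteSketch(dir) {
      let samples = [
        "plain ascii",
        "\u00e9\u4e2d\ud83d\ude00",         // accented, CJK and emoji characters
        "\ufeff\u200b\u0000 embedded NUL",  // BOM, zero-width space, NUL
      ];
      for (let [i, text] of samples.entries()) {
        let path = OS.Path.join(dir, "utf8-sample-" + i + ".json");
        await OS.File.writeAtomic(path, JSON.stringify({ payload: text }),
                                  { tmpPath: path + ".tmp", encoding: "utf-8" });
        let bytes = await OS.File.read(path);
        console.log(path, "bytes:", bytes.length,
                    "all zero:", bytes.every(b => b === 0));
      }

      // Passing a zero-filled Uint8Array really does produce a file of NULs,
      // matching the corrupted pings seen in the wild.
      let nulPath = OS.Path.join(dir, "nul-sample.bin");
      await OS.File.writeAtomic(nulPath, new Uint8Array(64),
                                { tmpPath: nulPath + ".tmp" });
    }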
Flags: needinfo?(gfritzsche)
Comment 17•10 years ago
One thing I was wondering:
Is the string buffer here kept referenced via the closure argument to Scheduler.post()?
https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_async_front.jsm#1178
Comment 18•10 years ago
Yes, that's what the closure argument is for.
I realize that we are not keeping alive the `path`, though.
Reporter
Comment 19•10 years ago
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #15)
> Vladan, i was thinking that you should have that ping archived - can you
> check and let us access it or note some basics here (especially whether it
> was shutdown, totalTime/sessionLength/..., subsessionCount, aborts, ...).
I do have the ping archived, but I don't know which one it is anymore :/ I think your patch deleted the bad ping :( I'll check at home whether I wrote down the real ping ID anywhere.
Flags: needinfo?(vdjeric)
Reporter
Comment 20•10 years ago
Created attachment 8657914 [details]
archived problem ping.json

This is the ping
Reporter
Comment 21•10 years ago
It looks like session uptime was only 4 minutes.
Let me know if you need more fields from the ping.
Comment 22•10 years ago
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #18)
> Yes, that's what the closure argument is for.
> I realize that we are not keeping alive the `path`, though.
Yeah, but that couldn't really cause the issues we see here, right?
Comment 23•10 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #20)
> Created attachment 8657914 [details]
> archived problem ping.json
>
> This is the ping
So at least it confirms that this was at shutdown; that seems most likely to cause races or other failure conditions.
Comment 24•10 years ago
Per the evolution view this now looks much saner after bug 1189425: http://mzl.la/1ObZp8X
Comparing this to a once-per-session measurement like BROWSER_IS_USER_DEFAULT, this doesn't seem so bad now (0.1%?):
* http://mzl.la/1ObZucZ
* http://mzl.la/1ObZOs0
I'm not sure if looking at those two measurements for proportions makes sense though.
Updated•10 years ago
Points: --- → 1
Whiteboard: [measurement:client]
Updated•10 years ago
Priority: P2 → P3
Comment 25•10 years ago
We have the following error histograms that cover this:
* TELEMETRY_PENDING_LOAD_FAILURE_READ
* TELEMETRY_PENDING_LOAD_FAILURE_PARSE
We should check whether the error rates there are sane and decide on further actions from there (investigating or closing).
Comment 26•9 years ago
Both measures are relatively low:
- https://mzl.la/2kbf68T
- https://mzl.la/2kbdGeG
The _PARSE one has ~60k submissions though. We should run the numbers per-client to see whether this is just noise, a few heavily affected clients, ...
Updated•9 years ago
Summary: Telemetry writes bad data in persisted ping → Investigate TELEMETRY_PENDING_LOAD_FAILURE_PARSE failures per-client
Updated•3 years ago
Severity: normal → S3