Bug 1195160 (Open): Opened 10 years ago, Updated 3 years ago

Investigate TELEMETRY_PENDING_LOAD_FAILURE_PARSE failures per-client

Categories

(Toolkit :: Telemetry, defect, P3)

Version: 42 Branch
Type: defect
Points: 1


People

(Reporter: vladan, Unassigned)

References

Details

(Whiteboard: [measurement:client])

Attachments

(2 files)

Saw these errors in my browser console, and when I checked the ping file it was all binary 0s except for a CR+LF at the end. I changed the ping filename for privacy.

1439777412123 Toolkit.Telemetry ERROR TelemetrySend::sendPersistedPings - failed to send ping xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb: PingParseError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
JS Stack trace: PingParseError@TelemetryStorage.jsm:94:15 < TelemetryStorageImpl.loadPingFile<@TelemetryStorage.jsm:1458:13 < TaskImpl_run@Task.jsm:314:40 < Handler.prototype.process@Promise-backend.js:934:23 < this.PromiseWalker.walkerLoop@Promise-backend.js:813:7 < Promise*this.PromiseWalker.scheduleWalkerLoop@Promise-backend.js:744:11 < this.PromiseWalker.schedulePromise@Promise-backend.js:776:7 < this.PromiseWalker.completePromise@Promise-backend.js:711:7 < onSuccess@osfile_native.jsm:63:7 Log.jsm:749:0

The same error and stack trace were logged three more times, at timestamps 1439777472113, 1439777532114, and 1439777592115, i.e. once per minute for the same ping.
Flags: needinfo?(gfritzsche)
This was with Aurora 42 on Windows 7, the Aug 14th build. The ping was created on Friday, August 07, 2015 at 2:00:07 AM local Toronto time. Based on my Firefox update history, the ping was created by an Aurora 41 build from August 6th.

I only looked in the Browser Console (and saw this exception) because the Firefox app menu wasn't opening in one of my windows. I had noticed that issue before; could it be caused by this exception somehow?
This just logs the exception; that part is fine. The problem is that we do get bad pings written out, and I don't know yet why.

This shows up in our Telemetry in significant numbers: http://mzl.la/1fk5Kjy
Note that this is biased until we have bug 1189425 (it counts affected pings multiple times).
Blocks: 1122482
Flags: needinfo?(gfritzsche)
Summary: Telemetry writes bad data in persisted ping, throws exception when trying to read it → Telemetry writes bad data in persisted ping
Yoric, do you have ideas what could cause this? My understanding was that OS.File has blockers in place, preventing this from happening? Even crashing mid-write should be fine because of the use of a temporary file?
Flags: needinfo?(dteller)
The dashboard link in your fhr-dev post indicates that only 3 sessions reported any unreadable pings, although 2 of those sessions had hundreds of unreadable pings (or read attempts?).

The situation with unparseable pings in Aurora/Nightly 42 looks a lot worse, with many more sessions reporting unreadable pings: http://mzl.la/1K1HU90
Note that the above histogram increments the count for every failed parse attempt, even if it's always for the same ping file.

* 37.02k Nightly 42 sessions reporting at least one unparseable ping
* 4.92M Nightly 42 sessions total

This is roughly 0.75% of sessions. This could explain some of the broken subsession-chain problems.
Looking at this as a % of sessions we also overcount a lot; I think we need to break it down by the number of affected clients for now.
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> Yoric, do you have ideas what could cause this?
>
> My understanding was that OS.File has blockers in place, preventing
> this from happening?
> Even crashing mid-write should be fine because of the use of a temporary
> file?

David says:
> OS.File has AsyncShutdown blockers that should prevent this from happening.
> Also, `writeAtomic` makes half-writes very unlikely, unless we have an
> OS/hardware-level crash.
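For context, the AsyncShutdown blockers David mentions are registered roughly like the sketch below; the blocker name and the `pendingWrites` promise are illustrative stand-ins, not the actual registrations in OS.File or TelemetryStorage.

```js
// Hypothetical sketch of an AsyncShutdown blocker: profile shutdown is held
// back until the given promise settles, so in-flight writes can finish.
const { AsyncShutdown } = Cu.import("resource://gre/modules/AsyncShutdown.jsm", {});

let pendingWrites = Promise.resolve(); // would be updated to track outstanding ping writes

AsyncShutdown.profileBeforeChange.addBlocker(
  "TelemetryStorage (sketch): waiting for pending ping writes",
  () => pendingWrites
);
```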
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #5)
> The dashboard link in your fhr-dev post indicates that only 3 sessions
> reported any unreadable pings, although 2 of those sessions had hundreds of
> unreadable pings (or read attempts?).

Note that those report 2 different errors:
* loading pings from disk - TELEMETRY_PENDING_LOAD_FAILURE_READ, http://mzl.la/1JcLMDj
* parsing pings to JSON after loading them from disk - TELEMETRY_PENDING_LOAD_FAILURE_PARSE, http://mzl.la/1fk5Kjy
If `writeAtomic` is called with a `tmpPath` option, it first writes to a temporary file, which should make sure that we cannot overwrite a "good" file with a bad file in case of a Firefox crash or a full disk. This does not protect against an OS-level, battery or hardware crash.

OS.File also has blockers to make sure that data is written in a timely manner. If, for some reason, web-workers-shutdown is not respected by some shutdown path, these could be completely skipped, though.
Flags: needinfo?(dteller)
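A minimal sketch of the write pattern being discussed, assuming a simplified `savePing` helper rather than the actual TelemetryStorage code:

```js
// Sketch only: with the tmpPath option, OS.File.writeAtomic writes the data to
// a temporary file first and then moves it into place, so an interrupted write
// should not leave a truncated ping behind (barring OS/hardware-level failures).
const { OS } = Cu.import("resource://gre/modules/osfile.jsm", {});

async function savePing(ping, filePath) {
  const data = JSON.stringify(ping);
  await OS.File.writeAtomic(filePath, data, { tmpPath: filePath + ".tmp" });
}
```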
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #9)
> If `writeAtomic` is called with a `tmpPath` option, it first writes to a
> temporary file, which should make sure that we cannot overwrite a "good"
> file with a bad file in case of a Firefox crash or a full disk. This does
> not protect against an OS-level, battery or hardware crash.

Right, we use that:
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/components/telemetry/TelemetryStorage.jsm#1157

> OS.File also has blockers to make sure that data is written in a timely
> manner. If, for some reason, web-workers-shutdown is not respected by some
> shutdown path, these could be completely skipped, though.

What kind of scenario are you thinking about here?
See Also: → 1195426
See Also: → 1189425
This came up on IRC, checking tomorrow:
> so JSON.stringify could fail, or TextEncoder.encode(), or Lz4.compressFileContent (unlikely in this path though)
Flags: needinfo?(gfritzsche)
Tracing the code path here (note that we don't currently compress pending pings):

(1) cloneInto()
(2) JSON.stringify() - https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/js/src/json.cpp#864
    - might return `undefined`, but passing `undefined` to writeAtomic throws: "TypeError: Expecting a Typed Array"
    - should throw on failures
(3) OS.File.writeAtomic() - https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_async_front.jsm#1163
    - throws if the buffer argument is `undefined`, `0`, `{}`, `[]` or `[0]`
    - posts the message to the worker, converting the buffer with Type.void_t.in_ptr.toMsg
(4) AbstractFile.writeAtomic() - https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_shared_front.jsm#396
(5) TextEncoder("utf-8").encode() - https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_shared_front.jsm#411
    - returns a Uint8Array
(6) writing out the data to the file - https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_shared_front.jsm#445

I don't see an obvious problem here. My best guess so far is that TextEncoder().encode() ends up returning an array of 0s, or that we have some bug around the buffers that get passed around. A condensed sketch of this path follows below.
Flags: needinfo?(gfritzsche)
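To make the traced path easier to follow, here is a condensed, illustrative reconstruction of steps (2), (5) and (6); `pathOnDisk` and `rawWrite` are placeholders, not the real OS.File internals:

```js
// Stringify the ping, encode it to UTF-8, then hand the raw bytes to the
// low-level write. This is where an all-zero buffer would have to originate.
function encodeAndWrite(ping, pathOnDisk, rawWrite) {
  const json = JSON.stringify(ping);            // step (2): can return undefined for some inputs
  if (json === undefined) {
    throw new TypeError("Expecting a Typed Array"); // mirrors what writeAtomic would throw
  }
  const bytes = new TextEncoder().encode(json); // step (5): Uint8Array of UTF-8 bytes
  // For the file on disk to be full of 0x00, `bytes` would have to be all zeros
  // here, or the buffer would have to be lost/replaced on its way to step (6).
  return rawWrite(pathOnDisk, bytes);           // step (6): write the bytes out
}
```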
(In reply to Georg Fritzsche [:gfritzsche] from comment #10)
> > OS.File also has blockers to make sure that data is written in a timely
> > manner. If, for some reason, web-workers-shutdown is not respected by some
> > shutdown path, these could be completely skipped, though.
>
> What kind of scenario are you thinking about here?

I don't think that there is any shutdown path that doesn't respect web-workers-shutdown, but I have never checked how we shut down when we update Firefox, for instance.
Priority: -- → P2
Georg, could it be that, for some reason, the data either cannot be converted to JSON or results in JSON containing invalid UTF data somehow?
Flags: needinfo?(gfritzsche)
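A quick, illustrative experiment related to this question (not something from the bug itself): even ill-formed UTF-16 in the payload should not yield NUL bytes, because unpaired surrogates either get escaped by JSON.stringify (depending on the engine version) or replaced with U+FFFD by TextEncoder.

```js
// A lone surrogate in the payload: the encoded output contains either an ASCII
// escape or the replacement-character bytes (0xEF 0xBF 0xBD), never a run of zeros.
const json = JSON.stringify({ s: "abc\uD800def" });
const bytes = new TextEncoder().encode(json);
console.log(bytes.some(b => b === 0));   // false - no NUL bytes are produced
```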
Alessio has been rechecking my reading and stress-testing a bit with UTF data; leaving that for him to comment on.

Vladan, I was thinking that you should have that ping archived - can you check and let us access it or note some basics here (especially whether it was shutdown, totalTime/sessionLength/..., subsessionCount, aborts, ...)?
Flags: needinfo?(vdjeric)
I've been playing with this a bit, feeding writeAtomic and JSON.stringify weird UTF-8 data from [0]. "Unfortunately", they seem to cope well with the data from that page.

Since passing a Uint8Array filled with nulls to writeAtomic correctly writes a file full of NULs (see the sketch below), I guess we're either stumbling on a TextEncoder edge case (not covered by the data in [0]) or on some shutdown madness.

[0] - http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Flags: needinfo?(gfritzsche)
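A rough version of the NUL-buffer check described above, assuming `pingPath` points at a scratch file (the path and buffer size here are arbitrary):

```js
// Writing a zero-filled buffer succeeds and produces a file full of NULs,
// matching what was observed on disk - so writeAtomic faithfully writes
// whatever buffer it is handed.
async function checkNulWrite(pingPath) {
  const nulls = new Uint8Array(1024);                      // all zeros by default
  await OS.File.writeAtomic(pingPath, nulls, { tmpPath: pingPath + ".tmp" });

  const raw = await OS.File.read(pingPath);                // Uint8Array of the file contents
  return raw.every(byte => byte === 0);                    // true: the file is all NULs
}
```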
One thing I was wondering here: is the string buffer kept referenced via the closure argument to Scheduler.post()?
https://dxr.mozilla.org/mozilla-central/rev/29b2df16e961fbe9a379362ecba6f888d1754bc3/toolkit/components/osfile/modules/osfile_async_front.jsm#1178
Yes, that's what the closure argument is for. I realize that we are not keeping alive the `path`, though.
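As a generic illustration of the keep-alive pattern being discussed (this is not the real Scheduler.post signature), holding a reference across the asynchronous hop prevents the buffer from being garbage collected while the worker still needs it:

```js
// Hypothetical helper, not OS.File internals: `keepAlive` stays referenced
// until the worker replies, so the underlying buffer cannot be collected
// mid-operation.
function postKeepingAlive(worker, message, keepAlive) {
  return new Promise(resolve => {
    worker.addEventListener("message", function onReply(event) {
      worker.removeEventListener("message", onReply);
      resolve({ reply: event.data, keepAlive });
    });
    worker.postMessage(message);
  });
}
```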
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #15)
> Vladan, I was thinking that you should have that ping archived - can you
> check and let us access it or note some basics here (especially whether it
> was shutdown, totalTime/sessionLength/..., subsessionCount, aborts, ...)?

I do have the ping archived, but I don't know which one it is anymore :/ I think your patch deleted the bad ping :(

I'll check at home whether I wrote down the real ping ID anywhere.
Flags: needinfo?(vdjeric)
This is the ping
It looks like the session uptime was only 4 minutes. Let me know if you need more fields from the ping.
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #18)
> Yes, that's what the closure argument is for.
> I realize that we are not keeping alive the `path`, though.

Yeah, but that couldn't really cause the issues we see here, right?
Depends on: 1204834
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #20)
> Created attachment 8657914 [details]
> archived problem ping.json
>
> This is the ping

So at least this confirms that it happened at shutdown, which seems the most likely time for races or other failure conditions.
Per the evolution view this now looks much saner after bug 1189425: http://mzl.la/1ObZp8X

Comparing this to a once-per-session measurement like BROWSER_IS_USER_DEFAULT, this doesn't seem so bad now (0.1%?):
* http://mzl.la/1ObZucZ
* http://mzl.la/1ObZOs0

I'm not sure whether looking at those two measurements for proportions makes sense, though.
Points: --- → 1
Whiteboard: [measurement:client]
Priority: P2 → P3
We have the following error histograms that cover this:
* TELEMETRY_PENDING_LOAD_FAILURE_READ
* TELEMETRY_PENDING_LOAD_FAILURE_PARSE

We should check whether the error rates there are sane and decide on further actions from there (investigating or closing).
Both measures are relatively low:
- https://mzl.la/2kbf68T
- https://mzl.la/2kbdGeG

The _PARSE one has ~60k submissions though. We should run the numbers per-client to see if this is just noise, heavily affected clients, ...
Summary: Telemetry writes bad data in persisted ping → Investigate TELEMETRY_PENDING_LOAD_FAILURE_PARSE failures per-client
Severity: normal → S3