Closed
Bug 1347283
Opened 9 years ago
Closed 9 years ago
Dataset API fails on improper utf8 string
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: frank, Unassigned)
References
Details
Attachments
(2 files)
> Dataset.from_source('telemetry').where(docType='OTHER').records(sc, sample=.01).count()
Errors with:
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 74109: invalid continuation byte
Example is attached.
Comment 1•9 years ago
mdoglio, mreid, thoughts on fixing this?
Flags: needinfo?(mreid)
Flags: needinfo?(mdoglio)
Comment 2•9 years ago
It looks like we have some bogus data that the heka message parser can't read. I'll needinfo whd to see if he can find where it is. We could catch the exception and move on, but I'd prefer to know what the root cause is first.
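A minimal sketch of the "catch the exception and move on" approach mentioned above (the decode step stands in for the real heka message parser, which is an assumption for illustration):

```python
def parse_records(raw_messages):
    """Yield decoded records, skipping any whose payload is not valid UTF-8."""
    for raw in raw_messages:
        try:
            yield raw.decode("utf-8")
        except UnicodeDecodeError:
            # Skip records containing bogus bytes instead of failing the whole job.
            continue

messages = [b"good record", b"bad \xe1 byte", b"another good record"]
print(list(parse_records(messages)))  # the corrupt record is dropped
```

The downside, as noted, is that whole records are silently discarded without surfacing the root cause.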
Flags: needinfo?(mdoglio) → needinfo?(whd)
Comment 3•9 years ago
Actually, let me write a quick fix while Wes investigates.
Comment 4•9 years ago
Attachment #8847665 - Flags: review?(whd)
Comment 5•9 years ago
Comment on attachment 8847665 [details] [review]
PR 116 on python_moztelemetry
I'll dig through the data to find the bad UTF8. Until https://github.com/mozilla-services/lua_sandbox_extensions/pull/84/commits/3f8c5bf040ec9ff8dce10430c86536b398e77a03 was deployed, the new pipeline passed through client-generated binary strings that could contain a specific kind of UTF8 corruption (these are treated as errors now). As long as the corruption occurred within a string in the JSON (not in the structure of the JSON document itself) and not in a string used for schema validation of the particular docType, the new pipeline would consider the document valid and produce it as-is. If the corruption was part of the structure of the document, it would simply be sent to the errors stream.
We "fixed" this in the Scala bindings by removing the valid-UTF8 check and observing that the UTF8 decode can be done lossily on bad data. Unless we have a lossy decode option in Python, :mdoglio's "try except" may be the best we can do here.
Another thing we might want to do is take whatever data is going to OTHER and reprocess it into its own doc type, as OTHER pings are probably the ones most likely to have issues like this.
Leaving the NI until I've found the data.
Attachment #8847665 - Flags: review?(whd) → review+
Comment 6•9 years ago
https://docs.python.org/2/library/codecs.html#codec-base-classes makes it look like we can do lossy decoding, so I'll file a PR that does that.
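Python's built-in UTF-8 codec does support lossy decoding via the `errors` argument, which is the mechanism the eventual fix relies on. A small illustration (the byte string is made up to mirror the error in this bug):

```python
# \xe1 opens a three-byte UTF-8 sequence, but a space is not a valid
# continuation byte, so a strict decode raises UnicodeDecodeError.
data = b"payload with an \xe1 invalid continuation byte"

# Lossy decode: the bad byte is replaced with U+FFFD,
# the Unicode replacement character, and decoding continues.
text = data.decode("utf-8", errors="replace")
print(text)  # payload with an \ufffd invalid continuation byte
```

This keeps the rest of the record intact rather than discarding it wholesale.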
Comment 7•9 years ago
I've filed https://github.com/mozilla/python_moztelemetry/pull/117 with a fix and a test case including some of the data from the OTHER bucket that failed. It now replaces the bad UTF8 bytes with the replacement character per the Unicode spec.
Flags: needinfo?(whd)
Comment 9•9 years ago
I've merged the above PR. Once :mreid or :rvitillo pushes an updated version to pypi we should be able to mark this as fixed.
Comment 10•9 years ago
I deployed the latest version to pypi. New clusters will get it automatically.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 11•9 years ago
This was fixed by https://github.com/mozilla/python_moztelemetry/pull/117
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard