Closed
Bug 1347283
Opened 9 years ago
Closed 9 years ago
Dataset API fails on improper utf8 string
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: frank, Unassigned)
References
Details
Attachments
(2 files)
> Dataset.from_source('telemetry').where(docType='OTHER').records(sc, sample=.01).count()
Errors with:
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 74109: invalid continuation byte
Example is attached.
Comment 1•9 years ago
mdoglio, mreid, thoughts on fixing this?
Flags: needinfo?(mreid)
Flags: needinfo?(mdoglio)
Comment 2•9 years ago
It looks like we have some bogus data that the heka message parser can't read. I'll needinfo whd to see if he can find where it is. We could catch the exception and move on, but I'd prefer to know what the root cause is first.
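A minimal sketch of the "catch the exception and move on" approach mentioned above (the decode step stands in for the real heka message parser, which is an assumption for illustration):

```python
def parse_records(raw_messages):
    """Yield decoded records, skipping any whose payload is not valid UTF-8."""
    for raw in raw_messages:
        try:
            yield raw.decode("utf-8")
        except UnicodeDecodeError:
            # Skip records containing bogus bytes instead of failing the whole job.
            continue

messages = [b"good record", b"bad \xe1 byte", b"another good record"]
print(list(parse_records(messages)))  # the corrupt record is dropped
```

The downside, as noted, is that whole records are silently discarded without surfacing the root cause.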
Flags: needinfo?(mdoglio) → needinfo?(whd)
Comment 3•9 years ago
Actually, let me write a quick fix while Wes investigates.
Comment 4•9 years ago
Attachment #8847665 - Flags: review?(whd)
Comment 5•9 years ago
Comment on attachment 8847665 [details] [review]
PR 116 on python_moztelemetry
I'll dig through the data to find the bad UTF8. Until https://github.com/mozilla-services/lua_sandbox_extensions/pull/84/commits/3f8c5bf040ec9ff8dce10430c86536b398e77a03 was deployed, the new pipeline passed through client-generated binary strings that could contain a specific kind of UTF8 corruption (these are treated as errors now). As long as the corruption occurred within a string in the JSON (not in the structure of the JSON document itself) and not in a string used for schema validation of the particular docType, the new pipeline would consider the document valid and produce it as-is. If the corruption was part of the structure of the document, it would simply be sent to the errors stream.
We "fixed" this in the Scala bindings by removing the valid-UTF8 check and observing that the UTF8 decode can be done lossily on bad data. Unless we have a lossy decode option in Python, :mdoglio's "try except" may be the best we can do here.
Another thing we might want to do is take whatever data is going to OTHER and reprocess it into its own doc type, as OTHER pings are probably the ones most likely to have issues like this.
Leaving the NI until I've found the data.
Attachment #8847665 - Flags: review?(whd) → review+
Comment 6•9 years ago
https://docs.python.org/2/library/codecs.html#codec-base-classes makes it look like we can do lossy decoding, so I'll file a PR that does that.
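Python's built-in UTF-8 codec does support lossy decoding via the `errors` argument, which is the mechanism the eventual fix relies on. A small illustration (the byte string is made up to mirror the error in this bug):

```python
# \xe1 opens a three-byte UTF-8 sequence, but a space is not a valid
# continuation byte, so a strict decode raises UnicodeDecodeError.
data = b"payload with an \xe1 invalid continuation byte"

# Lossy decode: the bad byte is replaced with U+FFFD,
# the Unicode replacement character, and decoding continues.
text = data.decode("utf-8", errors="replace")
print(text)  # payload with an \ufffd invalid continuation byte
```

This keeps the rest of the record intact rather than discarding it wholesale.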
Comment 7•9 years ago
I've filed https://github.com/mozilla/python_moztelemetry/pull/117 with a fix and a test case including some of the data from the OTHER bucket that failed. It now replaces the bad UTF8 bytes with the replacement character per the Unicode spec.
Flags: needinfo?(whd)
Comment 9•9 years ago
I've merged the above PR. Once :mreid or :rvitillo pushes an updated version to pypi we should be able to mark this as fixed.
Comment 10•9 years ago
I deployed the latest version to pypi. New clusters will get it automatically.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 11•9 years ago
This was fixed by https://github.com/mozilla/python_moztelemetry/pull/117
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard