Closed Bug 1355154 Opened 8 years ago Closed 8 years ago

Lazy Json means missing fields for ujson.dumps

Categories

(Data Platform and Tools :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Unassigned)

I ran into a problem with a bunch of records retrieved via the Dataset API. If I run:

>> pings = Dataset.from_source('telemetry').where(submission_date = '20170301').records(sc, sample=.0001)
>> pings.map(lambda x: ujson.dumps(x))

The dumped ping ends up missing a bunch of fields (for example, all histograms).
OK, I had a lot of fun trying to figure out what was going on. In the end, the problem is kind of obvious.

Python uses an optimized set of functions, written in C, for serializing JSON. They are defined here:

https://github.com/python/cpython/blob/2.7/Modules/_json.c

And called from here:

https://github.com/python/cpython/blob/2.7/Lib/json/encoder.py#L10

The optimized functions cannot use duck typing to handle what's inside and will just default to treating these objects as type 'dict'.

There seems to be a bit of tension here between making things easy to use and making them fast. The simplest solution that comes to mind is adding a class method to dump the contents of a heka message to JSON. Would that be acceptable?
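To make the failure mode above concrete, here is a minimal sketch. `LazyRecord` is a hypothetical stand-in, not the actual moztelemetry class: a `dict` subclass that keeps its payload as unparsed text and only fills in its fields when accessed through its overridden methods. Calling `dict.items` on it directly stands in for a C serializer that reads the underlying dict storage without going through the Python-level overrides. The `to_json` method is one possible shape for the helper suggested above, and its name is an assumption.

```python
import json


class LazyRecord(dict):
    """Hypothetical lazy record: parses its raw JSON payload on first access."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw        # unparsed JSON text
        self._loaded = False

    def _load(self):
        # Fill the underlying dict storage the first time data is needed.
        if not self._loaded:
            self._loaded = True
            dict.update(self, json.loads(self._raw))

    def __getitem__(self, key):
        self._load()
        return dict.__getitem__(self, key)

    def items(self):
        self._load()
        return dict.items(self)

    def to_json(self):
        # Explicit helper: force parsing, then serialize a plain dict.
        self._load()
        return json.dumps(dict(dict.items(self)), sort_keys=True)


payload = '{"clientId": "abc", "histograms": {"GC_MS": [1, 2]}}'

record = LazyRecord(payload)
# Bypassing the overrides (as a C fast path may do) sees no fields yet.
raw_view = list(dict.items(record))
# Going through the Python-level API triggers parsing.
lazy_view = record["histograms"]
# The explicit helper always serializes the full record.
dumped = LazyRecord(payload).to_json()
```

The trade-off is the one raised above: an explicit method is reliable, but users have to know to call it instead of their usual `dumps`.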
Assignee: nobody → wlachance
(In reply to William Lachance (:wlach) (use needinfo!) from comment #1)
> Ok, I had a lot of fun trying to figure out what was going on, in the end
> the problem is kind of obvious.
>
> Python uses an optimized set of functions for serializing json written in C,
> which are defined here:
>
> https://github.com/python/cpython/blob/2.7/Modules/_json.c
>
> And called from here:
>
> https://github.com/python/cpython/blob/2.7/Lib/json/encoder.py#L10
>
> The optimized functions will not be able to use duck typing to handle what's
> inside and will just default to treating these objects as type 'dict'.

See also https://github.com/mozilla/python_moztelemetry/issues/8

> There seems to be a bit of tension here between making things easy-to-use
> vs. fast. The simplest solution that comes to mind is adding a class method
> to dump the contents of a heka message to JSON. Would that be acceptable?

This is just one of the ways this bug manifests itself (see the issue above). It would be great to figure out a way to make this work seamlessly for the user.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #2)
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #1)
> > There seems to be a bit of tension here between making things easy-to-use
> > vs. fast. The simplest solution that comes to mind is adding a class method
> > to dump the contents of a heka message to JSON. Would that be acceptable?
>
> This is just one of the ways this bug manifests itself (see issue above). It
> would great to figure out a way to make this work seemingly for the user.

I suspect that there isn't really any easy solution here, short of modifying the Python interpreter. Yesterday Frank found a hack: running `copy.deepcopy` on the object before passing it to json.dumps worked. That seems almost as good as any other option. Maybe we could just add some kind of shortcut method that does exactly that.
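A rough sketch of what such a shortcut could look like follows. Both `LazyRecord` and `dumps_via_deepcopy` are hypothetical names, and the lazy class is a simplified stand-in rather than the real moztelemetry implementation. In this sketch the copy ends up fully populated because `copy.deepcopy` rebuilds a `dict` subclass by iterating over its `items()` method, which triggers the lazy parsing. Whether the real class behaves the same way is an assumption based on the workaround reported above.

```python
import copy
import json


class LazyRecord(dict):
    """Hypothetical lazy record, not the real moztelemetry class."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw        # unparsed JSON text
        self._loaded = False

    def _load(self):
        # Fill the underlying dict storage on first use.
        if not self._loaded:
            self._loaded = True
            dict.update(self, json.loads(self._raw))

    def items(self):
        self._load()
        return dict.items(self)


def dumps_via_deepcopy(record):
    """Hypothetical shortcut: deep-copy to force parsing, then serialize."""
    eager = copy.deepcopy(record)  # the copy's dict storage is populated
    return json.dumps(eager, sort_keys=True)


record = LazyRecord('{"clientId": "abc", "histograms": {"GC_MS": [1, 2]}}')
# Number of fields visible in the raw dict storage before any access.
before = len(dict.items(record))
dumped = dumps_via_deepcopy(record)
```

The cost of this approach is a full copy of every record, so it trades speed for convenience.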
Component: Metrics: Pipeline → Telemetry APIs for Analysis
Priority: -- → P2
Product: Cloud Services → Data Platform and Tools
I'm not working on this right now.
Assignee: wlachance → nobody
See Also: → 1376028
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Telemetry APIs for Analysis → General