ThreadSanitizer: data race [@ moz_task::TaskRunnable::Release] vs. [@ nsThreadPool::Run]
Categories
(Core :: XPCOM, defect)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox84 | --- | fixed |
People
(Reporter: Gankra, Assigned: Gankra)
References
(Blocks 1 open bug)
Details
Attachments
(4 files, 1 obsolete file)
I am seeing this in my WIP branch to properly instrument the rust stdlib.
The race involves this task and this runnable. It's possible this is just a non-atomic refcounting issue somewhere, but I'm not confident. Things are a bit too virtual/macro-y to have a clear picture.
Marked as security since this is ostensibly a potential Use-After-Free involving a lot of virtual methods.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Updated•4 years ago
|
Comment 1•4 years ago
|
||
I remember us discussing a while ago that tsan doesn't handle atomic fences super well. I wonder if the cause of this issue is actually the Acquire fence in the xpcom atomic refcounting code here: https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/xpcom/rust/xpcom/src/refptr.rs#301-302
Our C++ implementation of atomic release actually has config options to perform a less efficient operation because tsan doesn't understand the fence operation: https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/xpcom/base/nsISupportsImpl.h#363-374. As does the rust standard library: https://github.com/rust-lang/rust/blob/338f939a8d77061896cd0a1ca87a2c6d1f4ec359/library/alloc/src/sync.rs#L41-L56.
I think the fix to this issue might just be to perform a similar check within refptr.rs.
Comment 2•4 years ago
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #1)
I think the fix to this issue might just be to perform a similar check within refptr.rs.
Thanks a lot Nika, I believe this is exactly what we have been looking for!
Alexis, do you feel comfortable for attempting a fix here?
| Assignee | ||
Comment 3•4 years ago
|
||
Yeah should be fine,
Comment 4•4 years ago
|
||
This is TSan not understanding our synchronization, apparently, so I'm unhiding.
| Assignee | ||
Comment 5•4 years ago
|
||
| Assignee | ||
Comment 6•4 years ago
|
||
Oh boy, the cfg(sanitize="thread") check is behind an unstable feature(cfg_sanitize), time to do some messy nonsense.
| Assignee | ||
Comment 7•4 years ago
|
||
This creates a more principled way for us to set all the things our rust
code needs for tsan. In the future, this will also be used to set -Zbuild-std
and the infra it needs. In this patch set, it's needed to set a feature
flag on the xpcom crate.
| Assignee | ||
Comment 8•4 years ago
|
||
Two green (for this issue) try builds, one with just these changes, and one on top of my rust-vendor branch:
just these changes: https://treeherder.mozilla.org/jobs?repo=try&revision=df2ceb83144c273a50f532aa8ab978b1a173a5e4&selectedTaskRun=flIdSdthQ5C3GjrwH2SjXA.0
with instrumented std: https://treeherder.mozilla.org/jobs?repo=try&revision=aa595b5b0e89ce0c6ab3c6049aa58268264e88ac&selectedTaskRun=TD3h7w79RcqqVniM-C6zFA.0
| Assignee | ||
Comment 9•4 years ago
|
||
This makes --enable-thread-sanitizer turn on Rust tsan (-Zsanitizer=thread).
This requires changing SpiderMonkey tsan to use the tsan rust nightly.
In future changes, more Rust tsan integration will key off of MOZ_TSAN.
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Comment 11•4 years ago
|
||
Comment 12•4 years ago
|
||
Failure log: https://treeherder.mozilla.org/logviewer?job_id=321277563&repo=autoland
Backout link: https://hg.mozilla.org/integration/autoland/rev/65bcbd8c456a4b7e8ca78bd06cf22a6023c8cd23
Comment 13•4 years ago
|
||
| Assignee | ||
Comment 14•4 years ago
|
||
These already-flakey tests seem to get pushed over the edge to "almost always
busted" with more instrumented Rust code. Disabling seems like the right approach
to get more coverage elsewhere (and the other similarly-flakey tests suggest
we're probably still hitting the root cause).
Depends on D95950
Comment 15•4 years ago
|
||
| Assignee | ||
Updated•4 years ago
|
Comment 16•4 years ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/378d2e32026b
https://hg.mozilla.org/mozilla-central/rev/342247a735ac
https://hg.mozilla.org/mozilla-central/rev/0644208f1414
Description
•