Closed Bug 1796753 Opened 3 years ago Closed 3 years ago

Linux webrender tsan opt xpcshell frequent retries that end up as exception

Categories

(Core :: Sanitizers, defect)

Tracking

RESOLVED FIXED
109 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox107 --- wontfix
firefox108 --- wontfix
firefox109 --- fixed

People

(Reporter: CosminS, Assigned: jstutte)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

There are frequent retries of TSan xpcshell test tasks, and these often end up as an exception.

These test groups are run when the retries occur:

    browser/components/tests/unit/xpcshell.ini
    browser/extensions/formautofill/test/unit/heuristics/third_party/xpcshell.ini
    chrome/test/unit/xpcshell.ini
    devtools/client/application/test/xpcshell/xpcshell.ini
    devtools/client/shared/remote-debugging/test/xpcshell/xpcshell.ini
    devtools/client/webconsole/test/xpcshell/xpcshell.ini
    dom/encoding/test/unit/xpcshell.ini
    dom/indexedDB/test/unit/xpcshell-child-process.ini
    dom/media/webvtt/test/xpcshell/xpcshell.ini
    dom/promise/tests/unit/xpcshell.ini
    intl/strres/tests/unit/xpcshell.ini
    modules/libmar/tests/unit/xpcshell.ini
    modules/libpref/test/unit_ipc/xpcshell.ini
    netwerk/dns/tests/unit/xpcshell.ini
    remote/shared/messagehandler/test/xpcshell/xpcshell.ini
    services/sync/tests/unit/xpcshell.ini
    storage/test/unit/xpcshell.ini
    toolkit/components/autocomplete/tests/unit/xpcshell.ini
    toolkit/components/backgroundtasks/tests/xpcshell/xpcshell.ini
    toolkit/components/contentprefs/tests/unit_cps2/xpcshell.ini
    toolkit/components/contextualidentity/tests/unit/xpcshell.ini
    toolkit/components/crashes/tests/xpcshell/xpcshell.ini
    toolkit/components/crashmonitor/test/unit/xpcshell.ini
    toolkit/components/ctypes/tests/unit/xpcshell.ini
    toolkit/components/messaging-system/targeting/test/unit/xpcshell.ini
    toolkit/components/places/tests/queries/xpcshell.ini
    toolkit/components/places/tests/sync/xpcshell.ini
    toolkit/components/thumbnails/test/xpcshell.ini
    toolkit/components/url-classifier/tests/unit/xpcshell.ini
    toolkit/components/windowcreator/tests/unit/xpcshell.ini
    toolkit/content/tests/unit/xpcshell.ini
    toolkit/crashreporter/test/unit_ipc/xpcshell-phc.ini
    toolkit/mozapps/extensions/test/xpcshell/rs-blocklist/xpcshell.ini
    toolkit/mozapps/update/tests/unit_background_update/xpcshell.ini
    widget/tests/unit/xpcshell.ini

decoder, is this something you could have a look over? Thank you.

Flags: needinfo?(choller)
Summary: linux webrender tsan opt xpcshell frequent retries that end up as exception → Linux webrender tsan opt xpcshell frequent retries that end up as exception
No longer blocks: asan-maintenance
Blocks: tsan
Component: Security → Sanitizers

When did this start? Since we still don't upload log artifacts for jobs that fail with an exception, this is difficult to debug. The most likely cause is that the machine now swaps or runs out of memory for some reason, which it didn't before. Did we change anything about machine configurations? Are the failing test groups new? What changed around the time when this started?

Flags: needinfo?(choller) → needinfo?(csabou)

Started with this push on October 14th. The test groups are the same for the X6 task on the previous push for which it succeeded on the first attempt.

Flags: needinfo?(csabou)

That push has a lot of AWS -> GCP commits in it. Did this job move as well? Did the machine configuration change in any way?

(In reply to Christian Holler (:decoder) from comment #7)

That push has a lot of AWS -> GCP commits in it. Did this job move as well? Did the machine configuration change in any way?

These jobs were not part of the changes. I will bisect on Try.

Backfills point to bug 1774462 as the first push affected with frequent automatic retries of Linux TSan xpcshell failures because the tasks encounter issues. The affected task runs the tests in dom/indexedDB/test/unit/xpcshell-child-process.ini, among others.

Flags: needinfo?(jstutte)
Regressed by: 1774462

Hmm, it is not easy to look at anything here, as all the tasks I looked at did not finish the log parsing and trying to access the log through the task itself gives me a network error like this. In comment 0 I see dom/indexedDB/test/unit/xpcshell-child-process.ini, too, and that runs dom/indexedDB/test/test_keys.html, IIUC.
I wonder if we just end up with an OOM here given that test_keys.html allocates a huge array which bites us for some reason only in this constellation in tsan. We could try to avoid running this test in tsan (and other memory sensitive constellations).

Flags: needinfo?(jstutte)

(In reply to Jens Stutte [:jstutte] from comment #12)

I wonder if we just end up with an OOM here given that test_keys.html allocates a huge array which bites us for some reason only in this constellation in tsan. We could try to avoid running this test in tsan (and other memory sensitive constellations).

I think this is exactly what happens and I agree, we should not run this test in configurations that require more memory (e.g. sanitizers).

Set release status flags based on info from the regressing bug 1774462

(In reply to Christian Holler (:decoder) from comment #13)

I think this is exactly what happens and I agree, we should not run this test in configurations that require more memory (e.g. sanitizers).

I assume we do some different OOM handling in those builds? As the test wants to account for a fallible allocation of that array, mostly to exclude 32Bit systems, but that seems not to help here.

Hmm, could that specific test just check for AppConstants.TSAN || AppConstants.ASAN ? We could make us just skip the one key with large allocation, somehow.
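A minimal sketch of the idea above, skipping only the memory-heavy key on sanitizer builds. The `flags` object is a plain stand-in for AppConstants, and `selectKeys`, `isMemoryConstrainedBuild`, and the `huge` marker are hypothetical names used for illustration; none of this is taken from test_keys.js.

```javascript
// Hypothetical stand-in: `flags` plays the role of AppConstants here.
function isMemoryConstrainedBuild(flags) {
  return Boolean(flags.TSAN || flags.ASAN);
}

// Hypothetical key descriptors: `huge` marks entries that need a very
// large allocation and should be skipped on sanitizer builds.
function selectKeys(keys, flags) {
  if (!isMemoryConstrainedBuild(flags)) {
    return keys;
  }
  return keys.filter(key => !key.huge);
}

const keys = [
  { name: "small-number", huge: false },
  { name: "large-array", huge: true },
];

// On a TSan build only the cheap key remains.
const keysForTsan = selectKeys(keys, { TSAN: true, ASAN: false });
console.log(keysForTsan.map(key => key.name));
```

Filtering the key list keeps the rest of the test's coverage intact, at the cost of needing the build flags to be reachable from wherever the test runs.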

(In reply to Jens Stutte [:jstutte] from comment #15)

(In reply to Christian Holler (:decoder) from comment #13)

I think this is exactly what happens and I agree, we should not run this test in configurations that require more memory (e.g. sanitizers).

I assume we do some different OOM handling in those builds? As the test wants to account for a fallible allocation of that array, mostly to exclude 32Bit systems, but that seems not to help here.

Our sanitizers are configured to allow for fallible allocations (for TSan in particular here). But it is possible that this fails in edge cases. We've seen this many times in fuzzing with ASan: if we hit exactly the right spot to OOM, we might hit an infallible allocation in the ASan internals.
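To illustrate what a fallible allocation guard looks like at the script level, here is a small sketch. `tryAllocate` is a hypothetical helper and is not the code in test_keys.js. It only catches the case where the engine reports the failure as a RangeError; it cannot help if the process runs out of memory elsewhere, for example inside sanitizer internals.

```javascript
// Hypothetical helper: attempt a typed-array allocation and report failure
// as `null` instead of letting the exception propagate.
function tryAllocate(byteLength) {
  try {
    return new Uint8Array(byteLength);
  } catch (e) {
    // Engines signal an unsatisfiable typed-array length with a RangeError.
    if (e instanceof RangeError) {
      return null;
    }
    throw e;
  }
}

const small = tryAllocate(16);
const tooLarge = tryAllocate(Number.MAX_SAFE_INTEGER);
console.log(small !== null, tooLarge === null);
```

A test can then skip the affected key when `tryAllocate` returns `null`, which covers systems with a small address space but not the edge case described above.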

(In reply to Jens Stutte [:jstutte] from comment #16)

Hmm, could that specific test just check for AppConstants.TSAN || AppConstants.ASAN ? We could make us just skip the one key with large allocation, somehow.

I don't know if this is available inside this particular test, but if it is, then I'd also prefer that option.

Let me try that then.

Assignee: nobody → jstutte
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/470093aeb138 Exlude keys known to consume a huge amount of memory for TSAN and ASAN. r=decoder

Ah, it also runs as a mochitest, and there we do not have AppConstants?

Flags: needinfo?(jstutte)

I think the problem is that ChromeUtils is not available in mochitests.
Maybe you can get AppConstants using SpecialPowers in that case.
https://searchfox.org/mozilla-central/rev/2fc2ccf960c2f7c419262ac7215715c5235948db/dom/animation/test/document-timeline/test_document-timeline.html#30

OK, so the same test_keys.js is run in three different contexts: twice in the mochitest test_keys.html (once normally, once as a worker) and once standalone as an xpcshell test. Unfortunately, all three environments need different ways of accessing AppConstants. But as we have that triple coverage, and after talking with :janv and :asuth, we think we can just disable the entire test for xpcshell with ASAN/TSAN, unless we know that we should also expect problems from similar OOMs in mochitests. :decoder?
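As a rough sketch of why one shared script is awkward here, the following distinguishes the three contexts by which globals are present. `detectTestContext` is a hypothetical helper, and the mapping is an assumption drawn from this thread (ChromeUtils missing in mochitests, SpecialPowers suggested there); treating "neither global present" as the worker case is also an assumption.

```javascript
// Hypothetical helper: guess the test context from the globals available.
// `globalObj` is passed in explicitly so the sketch can run anywhere.
function detectTestContext(globalObj) {
  if (typeof globalObj.ChromeUtils !== "undefined") {
    return "xpcshell";
  }
  if (typeof globalObj.SpecialPowers !== "undefined") {
    return "mochitest";
  }
  // Assumption: neither global present means the worker run.
  return "worker";
}

console.log(detectTestContext({ ChromeUtils: {} }));
console.log(detectTestContext({ SpecialPowers: {} }));
console.log(detectTestContext({}));
```

Each branch would need its own way of reaching the build flags, which is the extra complexity that disabling the xpcshell run on ASAN/TSAN avoids.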

Flags: needinfo?(choller)

In general it would be favorable to somewhat separate OOM-like tests from regular tests, but I am not aware of other tests right now causing problems and if the quickest way to move forward is to disable this test, we should just do that :)

Flags: needinfo?(choller)
Attachment #9306215 - Attachment description: Bug 1796753 - Exlude keys known to consume a huge amount of memory for TSAN and ASAN. r?decoder → Bug 1796753 - Disable test_keys.js for xpcshell tests under TSAN and ASAN. r?decoder
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/499c02c864ff Disable test_keys.js for xpcshell tests under TSAN and ASAN. r=decoder
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox108 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Do we run TSAN tests frequently on beta/release versions?

Flags: needinfo?(jstutte) → needinfo?(choller)

I believe we run them as often as other tests, there is nothing special about sanitizers there afaik.

Flags: needinfo?(choller)

TSan XPCshell doesn't run on beta and release.
