1796753 - Linux webrender tsan opt xpcshell frequent retries that end up as exception

Cosmin Sabou [:CosminS]

Reporter

Description

•

3 years ago

•

Edited

There are frequent retries on TSan xpcshell tests that often end up as an exception.

These test groups are run when this occurs:

    browser/components/tests/unit/xpcshell.ini
    browser/extensions/formautofill/test/unit/heuristics/third_party/xpcshell.ini
    chrome/test/unit/xpcshell.ini
    devtools/client/application/test/xpcshell/xpcshell.ini
    devtools/client/shared/remote-debugging/test/xpcshell/xpcshell.ini
    devtools/client/webconsole/test/xpcshell/xpcshell.ini
    dom/encoding/test/unit/xpcshell.ini
    dom/indexedDB/test/unit/xpcshell-child-process.ini
    dom/media/webvtt/test/xpcshell/xpcshell.ini
    dom/promise/tests/unit/xpcshell.ini
    intl/strres/tests/unit/xpcshell.ini
    modules/libmar/tests/unit/xpcshell.ini
    modules/libpref/test/unit_ipc/xpcshell.ini
    netwerk/dns/tests/unit/xpcshell.ini
    remote/shared/messagehandler/test/xpcshell/xpcshell.ini
    services/sync/tests/unit/xpcshell.ini
    storage/test/unit/xpcshell.ini
    toolkit/components/autocomplete/tests/unit/xpcshell.ini
    toolkit/components/backgroundtasks/tests/xpcshell/xpcshell.ini
    toolkit/components/contentprefs/tests/unit_cps2/xpcshell.ini
    toolkit/components/contextualidentity/tests/unit/xpcshell.ini
    toolkit/components/crashes/tests/xpcshell/xpcshell.ini
    toolkit/components/crashmonitor/test/unit/xpcshell.ini
    toolkit/components/ctypes/tests/unit/xpcshell.ini
    toolkit/components/messaging-system/targeting/test/unit/xpcshell.ini
    toolkit/components/places/tests/queries/xpcshell.ini
    toolkit/components/places/tests/sync/xpcshell.ini
    toolkit/components/thumbnails/test/xpcshell.ini
    toolkit/components/url-classifier/tests/unit/xpcshell.ini
    toolkit/components/windowcreator/tests/unit/xpcshell.ini
    toolkit/content/tests/unit/xpcshell.ini
    toolkit/crashreporter/test/unit_ipc/xpcshell-phc.ini
    toolkit/mozapps/extensions/test/xpcshell/rs-blocklist/xpcshell.ini
    toolkit/mozapps/update/tests/unit_background_update/xpcshell.ini
    widget/tests/unit/xpcshell.ini

decoder, is this something you could have a look over? Thank you.

Flags: needinfo?(choller)

Cosmin Sabou [:CosminS]

Reporter

Updated

•

3 years ago

Blocks: asan-maintenance

Summary: linux webrender tsan opt xpcshell frequent retries that end up as exception → Linux webrender tsan opt xpcshell frequent retries that end up as exception

Cosmin Sabou [:CosminS]

Reporter

Updated

•

3 years ago

No longer blocks: asan-maintenance

Cosmin Sabou [:CosminS]

Reporter

Updated

•

3 years ago

Blocks: tsan

Component: Security → Sanitizers

Christian Holler (:decoder)

Comment 5

•

3 years ago

When did this start? Since we still don't upload log artifacts for jobs that fail with an exception, this is difficult to debug. The most likely cause is that the machine now swaps or runs out of memory for some reason, and it didn't before. Did we change anything about machine configurations? Are the failing test groups new? What changed around the time this started?

Flags: needinfo?(choller) → needinfo?(csabou)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 6

•

3 years ago

Started with this push on October 14th. The test groups are the same for the X6 task on the previous push for which it succeeded on the first attempt.

Flags: needinfo?(csabou)

Christian Holler (:decoder)

Comment 7

•

3 years ago

That push has a lot of AWS -> GCP commits in it. Did this job move as well? Did the machine configuration change in any way?

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 10

•

3 years ago

(In reply to Christian Holler (:decoder) from comment #7)

That push has a lot of AWS -> GCP commits in it. Did this job move as well? Did the machine configuration change in any way?

These jobs were not part of the changes. I will bisect on Try.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 11

•

3 years ago

Backfills point to bug 1774462 as the first push affected with frequent automatic retries of Linux TSan xpcshell failures because the tasks encounter issues. The affected task runs the tests in dom/indexedDB/test/unit/xpcshell-child-process.ini, among others.

Flags: needinfo?(jstutte)

Regressed by: 1774462

Jens Stutte [:jstutte]

Assignee

Comment 12

•

3 years ago

•

Edited

Hmm, it is not easy to look at anything here: all the tasks I looked at did not finish the log parsing, and trying to access the log through the task itself gives me a network error. In comment 0 I see dom/indexedDB/test/unit/xpcshell-child-process.ini, too, and that runs dom/indexedDB/test/test_keys.html, IIUC.
I wonder if we just end up with an OOM here, given that test_keys.html allocates a huge array which bites us for some reason only in this constellation in TSan. We could try to avoid running this test in TSan (and other memory-sensitive constellations).

Flags: needinfo?(jstutte)

Christian Holler (:decoder)

Comment 13

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #12)

I wonder if we just end up with an OOM here given that test_keys.html allocates a huge array which bites us for some reason only in this constellation in tsan. We could try to avoid running this test in tsan (and other memory sensitive constellations).

I think this is exactly what happens and I agree, we should not run this test in configurations that require more memory (e.g. sanitizers).

BugBot [:suhaib / :marco/ :calixte]

Comment 14

•

3 years ago

Set release status flags based on info from the regressing bug 1774462

status-firefox107: --- → affected

status-firefox108: --- → affected

status-firefox109: --- → affected

status-firefox-esr102: --- → unaffected

Keywords: regression

Jens Stutte [:jstutte]

Assignee

Comment 15

•

3 years ago

(In reply to Christian Holler (:decoder) from comment #13)

I think this is exactly what happens and I agree, we should not run this test in configurations that require more memory (e.g. sanitizers).

I assume we do some different OOM handling in those builds? The test wants to account for a fallible allocation of that array, mostly to exclude 32-bit systems, but that does not seem to help here.

Jens Stutte [:jstutte]

Assignee

Comment 16

•

3 years ago

Hmm, could that specific test just check for AppConstants.TSAN || AppConstants.ASAN? We could then skip just the one key with the large allocation, somehow.
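
A minimal sketch of that idea, not the patch that eventually landed. It assumes the test could read the two flags named above; `buildFlags`, `keys`, `needsHugeAllocation` and `keysToTest` are hypothetical names used only for illustration, and `buildFlags` stands in for the real AppConstants object so the snippet runs on its own.

```javascript
// Hypothetical stand-in for AppConstants; only the two fields named in the
// comment above are modelled here.
const buildFlags = { TSAN: true, ASAN: false };

// Hypothetical key list: each entry records whether creating the key
// requires a very large allocation.
const keys = [
  { name: "small string", needsHugeAllocation: false },
  { name: "huge array", needsHugeAllocation: true },
  { name: "date", needsHugeAllocation: false },
];

// Return the keys that are safe to exercise for the given build flags.
// Under TSan or ASan the memory-hungry keys are dropped; otherwise all
// keys are kept.
function keysToTest(allKeys, flags) {
  const memoryConstrained = Boolean(flags.TSAN || flags.ASAN);
  return allKeys.filter(
    key => !(memoryConstrained && key.needsHugeAllocation)
  );
}

const selected = keysToTest(keys, buildFlags);
console.log(selected.map(key => key.name));
```

The rest of the test would then iterate over `selected` instead of the full list, so only the one expensive key is skipped on sanitizer builds.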

Christian Holler (:decoder)

Comment 17

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #15)

(In reply to Christian Holler (:decoder) from comment #13)

I think this is exactly what happens and I agree, we should not run this test in configurations that require more memory (e.g. sanitizers).

I assume we do some different OOM handling in those builds? The test wants to account for a fallible allocation of that array, mostly to exclude 32-bit systems, but that does not seem to help here.

Our sanitizers are configured to allow for fallible allocations (for TSan in particular here). But it is possible that this fails in edge cases. We've seen this many times in fuzzing with ASan: if we hit exactly the right spot to OOM, we might hit an infallible allocation in the ASan internals.

(In reply to Jens Stutte [:jstutte] from comment #16)

Hmm, could that specific test just check for AppConstants.TSAN || AppConstants.ASAN? We could then skip just the one key with the large allocation, somehow.

I don't know if this is available inside this particular test, but if it is, then I'd also prefer that option.

Jens Stutte [:jstutte]

Assignee

Comment 18

•

3 years ago

Let me try that then.

Assignee: nobody → jstutte

Jens Stutte [:jstutte]

Assignee

Comment 19

•

3 years ago

Attached file Bug 1796753 - Disable test_keys.js for xpcshell tests under TSAN and ASAN. r?decoder — Details

Pulsebot

Comment 20

•

3 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/470093aeb138 Exlude keys known to consume a huge amount of memory for TSAN and ASAN. r=decoder

Atila Butkovits

Comment 21

•

3 years ago

Backed out for causing failures at test_keys.html.

Backout link: https://hg.mozilla.org/integration/autoland/rev/42a01ff3077c18e976e57d973e20f91ab412ae3a

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=YUQhURHJS3O_d56EDApWTQ.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=470093aeb138bc2065b0e8d7fb1a95f4353d64ad

Failure log: https://treeherder.mozilla.org/logviewer?job_id=398331914&repo=autoland&lineNumber=2067

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 22

•

3 years ago

Ah, it also runs as a mochitest, and there we do not have AppConstants?

Flags: needinfo?(jstutte)

Jan Varga [:janv]

Comment 23

•

3 years ago

I think the problem is that ChromeUtils is not available in mochitests.
Maybe you can get AppConstants using SpecialPowers in that case.
https://searchfox.org/mozilla-central/rev/2fc2ccf960c2f7c419262ac7215715c5235948db/dom/animation/test/document-timeline/test_document-timeline.html#30

Jens Stutte [:jstutte]

Assignee

Comment 24

•

3 years ago

OK, so the same test_keys.js is run in three different contexts: twice in the mochitest test_keys.html (once normally, once as a worker) and once standalone as an xpcshell test. Unfortunately, all three environments need different ways of accessing AppConstants. But as we have that triple coverage, and after talking with :janv and :asuth, we think we can just disable the entire test for xpcshell with ASAN/TSAN, unless we know that we should also expect similar OOM problems from the mochitests. :decoder?
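
To illustrate why a single in-test check is awkward, here is a rough sketch of what a shared helper would have to do. It is an assumption-laden illustration, not code from the patch: `resolveAppConstants` and `scope` are hypothetical names, the module URL is assumed and may differ between Firefox versions, and the objects at the bottom are mocks so the snippet can run outside Firefox.

```javascript
// Assumed module URL; it may differ between Firefox versions.
const APP_CONSTANTS_URL = "resource://gre/modules/AppConstants.sys.mjs";

// `scope` is a hypothetical stand-in for the global object of the
// environment the test is running in.
function resolveAppConstants(scope) {
  // xpcshell: ChromeUtils is assumed to be reachable directly.
  if (scope.ChromeUtils?.importESModule) {
    return scope.ChromeUtils.importESModule(APP_CONSTANTS_URL).AppConstants;
  }
  // Mochitest window: ChromeUtils is assumed to be reachable only
  // through SpecialPowers.
  if (scope.SpecialPowers?.ChromeUtils?.importESModule) {
    return scope.SpecialPowers.ChromeUtils.importESModule(APP_CONSTANTS_URL)
      .AppConstants;
  }
  // Worker: neither is assumed to exist, so the flags would have to be
  // passed in from the parent page some other way.
  return null;
}

// Mock environments, for demonstration only.
const fakeChromeUtils = {
  importESModule: () => ({ AppConstants: { TSAN: true, ASAN: false } }),
};
const fromXpcshell = resolveAppConstants({ ChromeUtils: fakeChromeUtils });
const fromMochitest = resolveAppConstants({
  SpecialPowers: { ChromeUtils: fakeChromeUtils },
});
const fromWorker = resolveAppConstants({});
console.log(fromXpcshell.TSAN, fromMochitest.TSAN, fromWorker);
```

The worker branch is the one with no straightforward answer, which is part of why disabling the xpcshell run under sanitizers is the simpler route.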

Flags: needinfo?(choller)

Christian Holler (:decoder)

Comment 25

•

3 years ago

In general it would be favorable to somewhat separate OOM-like tests from regular tests, but I am not aware of other tests right now causing problems and if the quickest way to move forward is to disable this test, we should just do that :)

Flags: needinfo?(choller)

Phabricator Automation

Updated

•

3 years ago

Attachment #9306215 - Attachment description: Bug 1796753 - Exlude keys known to consume a huge amount of memory for TSAN and ASAN. r?decoder → Bug 1796753 - Disable test_keys.js for xpcshell tests under TSAN and ASAN. r?decoder

Donal Meehan [:dmeehan]

Updated

•

3 years ago

status-firefox107: affected → wontfix

Pulsebot

Comment 26

•

3 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/499c02c864ff Disable test_keys.js for xpcshell tests under TSAN and ASAN. r=decoder

Cosmin Sabou [:CosminS]

Reporter

Comment 27

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/499c02c864ff

Status: NEW → RESOLVED

Closed: 3 years ago

status-firefox109: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 109 Branch

BugBot [:suhaib / :marco/ :calixte]

Comment 29

•

3 years ago

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval.
If no, please set status-firefox108 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 30

•

3 years ago

Do we run TSAN tests frequently on beta/release versions?

Flags: needinfo?(jstutte) → needinfo?(choller)

Christian Holler (:decoder)

Comment 31

•

3 years ago

I believe we run them as often as other tests, there is nothing special about sanitizers there afaik.

Flags: needinfo?(choller)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 32

•

3 years ago

TSan XPCshell doesn't run on beta and release.

Jens Stutte [:jstutte]

Assignee

Updated

•

3 years ago

status-firefox108: affected → wontfix