Closed Bug 683734 Opened 14 years ago Closed 13 years ago

Repurpose Rev3 10.6 machines

Categories

(Release Engineering :: General, defect, P3)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: armenzg)

References

Details

Attachments

(11 files)

Once we have replaced our 10.6 test machines with Rev4 hardware, we should use the freed rev3 hardware for other platforms.
Depends on: 690236
Blocks: 681748
Summary: Repurpose Rev3 machines when Rev4 machines are in production → Repurpose Rev3 machines when Rev4 machines are declared authoritative
Depends on: 693918
Depends on: 694251
Depends on: 695976
Depends on: 695979
Depends on: 696417
Depends on: 696453
This patch directs leopard slaves to the new talos_osx class implemented for the 10.6 rev4 and 10.7 rev4 machines. This shouldn't be landed until the rev3 10.6 machines are turned off.

I have tested this on talos-r3-leopard-001 and the only change on boot was the com.apple.dock.plist going from 600 to 644 and the file being replaced. This exact same change happens on every single reboot, as evidenced by:

talos-r3-leopard-001:~ cltbld$ grep e6bbe59dfd61a20cd007c0608729fac5 /var/puppet/log/puppet.out | wc -l
7119

PUPPET OUTPUT FROM BOOT
notice: Starting catalog run
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/buildslave::cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns: executed successfully
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/Exec[remove-index]/returns: executed successfully
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/Exec[disable-indexing]/returns: executed successfully
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/File[/Users/cltbld/Library/Preferences/com.apple.dock.plist]/checksum: checksum changed '{md5}e6bbe59dfd61a20cd007c0608729fac5' to '{md5}8c117cfb1046e4fd8b2cb872cd8a84da'
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/File[/Users/cltbld/Library/Preferences/com.apple.dock.plist]/source: replacing from source puppet://staging-puppet.build.mozilla.org/staging/darwin9-i386/test/Users/cltbld/Library/Preferences/com.apple.dock.plist with contents {md5}e6bbe59dfd61a20cd007c0608729fac5
notice: //Node[talos-r3-leopard-001]/talos_osx_rev4/File[/Users/cltbld/Library/Preferences/com.apple.dock.plist]/mode: mode changed '600' to '644'
notice: Finished catalog run in 6.85 seconds
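For anyone who wants to repeat the grep-and-count check from a script, here is a minimal Python sketch of the same idea. The helper name count_matching and the shortened sample_log are illustrative assumptions; only the checksum and the message shapes come from the output above.

```python
# Count log lines that mention a given checksum (the equivalent of `grep ... | wc -l`).
# count_matching and sample_log are hypothetical names used for illustration only.

OLD_MD5 = "e6bbe59dfd61a20cd007c0608729fac5"
NEW_MD5 = "8c117cfb1046e4fd8b2cb872cd8a84da"


def count_matching(lines, needle):
    """Return how many lines contain the substring `needle`."""
    return sum(1 for line in lines if needle in line)


# Shortened lines modelled on the puppet boot output quoted in this comment.
sample_log = [
    "notice: Starting catalog run",
    "notice: checksum changed '{md5}%s' to '{md5}%s'" % (OLD_MD5, NEW_MD5),
    "notice: replacing from source ... with contents {md5}%s" % OLD_MD5,
    "notice: mode changed '600' to '644'",
    "notice: Finished catalog run in 6.85 seconds",
]

matches = count_matching(sample_log, OLD_MD5)
print(matches)
```

Against the real log, the same function could be fed an open file handle for /var/puppet/log/puppet.out instead of the sample list. Note that one boot produces two matching lines in the output above, so the raw count is not the same as the number of reboots.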
Attachment #569211 - Flags: review?(coop)
I generated the patch by doing:

cd puppet-manifests/os
cp talos_osx_rev4.pp talos_osx.pp
hg rm talos_osx_rev4.pp

then using sed to replace 'talos_osx_rev4' with 'talosslave' for the rev4 machines.
I killed the buildslave on this machine and ran puppet with --test and --debug. This is the output. Most of what it does is just checking that what is installed is correct.
Depends on: 696959
Comment on attachment 569211 [details] [diff] [review]
puppet-manifests v1

Review of attachment 569211 [details] [diff] [review]:
-----------------------------------------------------------------

r+, assuming class name is correct (or gets corrected).

::: os/talos_osx.pp
@@ +2,2 @@
> 
> +class talos_osx_rev4 {

You're not changing talosslave.pp AFAICT, so doesn't this need to stay as talos_osx (same for header above)?
Attachment #569211 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #4)
> Comment on attachment 569211 [details] [diff] [review]
> puppet-manifests v1
>
> Review of attachment 569211 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> r+, assuming class name is correct (or gets corrected).
>
> ::: os/talos_osx.pp
> @@ +2,2 @@
> >
> > +class talos_osx_rev4 {
>
> You're not changing talosslave.pp AFAICT, so doesn't this need to stay as
> talos_osx (same for header above)?

I'll correct the class name; it should remain 'talos_osx'.
This turns off the rev3 10.6 test machines. It also removes snowleopard-r4 and moves the r4 snowleopard platforms to the 'snowleopard' slave platform. Not sure if this is reconfig safe on the test-masters.
Attachment #571205 - Flags: review?(coop)
Comment on attachment 571205 [details] [diff] [review]
buildbot-configs v1

Review of attachment 571205 [details] [diff] [review]:
-----------------------------------------------------------------

Can I also ask that we clean out the unused rev4 10.6 slaves (81-160) from slavealloc once this lands?
Attachment #571205 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #7)
> Can I also ask that we clean out the unused rev4 10.6 slaves (81-160) from
> slavealloc once this lands?

Yep, I cleaned them out a couple of weeks ago.
When I am done turning rev3-10.6 off, I'll kick this over to Armen.
Assignee: nobody → jhford
Depends on: 700503
No longer depends on: 700503
Here is some data to add to the discussion! 0.04% of tests are failing because of unresolved intermittent errors related to rev4 machines. Given that, I think the trade off of a slight amount of random-failure is worth accepting for significant improvements to wait times on other platforms. We aren't giving up on the random failure, either. Bug 700672 is tracking some possible fixes to the resolution issues like a new dongle design and boxes that simulate DVI monitors. I am going to change the dependencies for these intermittent failure issues to 700672 so work in this bug can proceed.
No longer depends on: 693918
No longer depends on: 696417
No longer depends on: 696453
This is an interim patch so that we stop scheduling 10.6 rev3 jobs. My understanding is that because I am removing them from the scheduler master, running builds won't be interrupted but new jobs won't queue up and make self serve useless. This patch won't be around for long. Part of landing the patch to rename snowleopard-r4 to snowleopard will be to revert this patch.
Attachment #573947 - Flags: review?(catlee)
This landed in this morning's reconfig.
Comment on attachment 573947 [details] [diff] [review]
stop scheduling 10.6 rev3 jobs

Review of attachment 573947 [details] [diff] [review]:
-----------------------------------------------------------------

Please look into doing this in config.py or related files on the test scheduler master.
Attachment #573947 - Flags: review?(catlee)
Shifting bug to Armen as we discussed. I am going to hold off on the puppet portion of this bug until we actually turn off the rev3 10.6 machines. I think we should let these machines exist in their current state (but disabled in slavealloc) until Thursday Nov 24.
Assignee: jhford → armenzg
Summary: Repurpose Rev3 machines when Rev4 machines are declared authoritative → Repurpose Rev3 10.6 machines
Rev3 MacOS Snow jobs are not running anymore as per:
https://tbpl.mozilla.org/?jobname=Rev3%20MacOSX%20Snow&rev=e7d5dd9efeca

The change landed with:
http://hg.mozilla.org/build/buildbot-configs/rev/1420fec41822

I have removed these 3 pending jobs:
try 39857d1faeb7 Rev3 MacOSX Snow Leopard 10.6.2 try debug test xpcshell
try 39857d1faeb7 Rev3 MacOSX Snow Leopard 10.6.2 try debug test mochitest-other
try 39857d1faeb7 Rev3 MacOSX Snow Leopard 10.6.2 try debug test reftest

As jhford mentions, we'll hold off until the 24th "just in case" and move from there.

TODO:
* file bugs for IT to re-purpose machines (not to be done before the 24th)
** remove from DNS/nagios
* determine how to distribute these slaves for the other OSes
* disable from slave-alloc
* remove from puppet
* remove from buildapi (util.py)
* anything else?
Priority: P4 → P3
There are 59 r3-snow machines, not including the ref machine. There are 5 OSes to distribute these slaves to.

I grabbed data from mozilla-inbound, which includes the 4 sets of PGO builds that get triggered for mozilla-central and mozilla-inbound. I've got some data but I'm still not sure this is the right way of breaking it down.

          # of suites   Total SUM (secs)   Percentage   59 * percentage
Fedora    44            41817              20.36%       12.01
Fedora64  44            40178              19.57%       11.54
Leopard   29            22955              11.18%       6.60
Win7      44            53350              25.98%       15.33
Xp        44            47051              22.91%       13.52
(SUM)                   205351                          59

I will think this through a little more and have an answer tomorrow.
https://docs.google.com/spreadsheet/ccc?key=0ApOCAHvaMQSFdGlkYmpfOUpHcHBxTFNUbi1ILTY1bVE#gid=10
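The "59 * percentage" column can be reproduced with a few lines of Python, which may help when re-running the split with different inputs. This is only a sketch: the per-OS totals are copied from the table above, while the function names and the largest-remainder rounding step are assumptions added for illustration and are not the method used to reach the final numbers in this bug.

```python
# Split 59 machines across OSes in proportion to total suite run time.
# Totals come from the comment above; the rounding rule is an assumption.

totals_secs = {
    "Fedora": 41817,
    "Fedora64": 40178,
    "Leopard": 22955,
    "Win7": 53350,
    "Xp": 47051,
}
MACHINES = 59


def proportional_shares(totals, machines):
    """Fractional number of machines per OS, proportional to its total."""
    grand_total = sum(totals.values())
    return {name: machines * secs / grand_total for name, secs in totals.items()}


def largest_remainder(shares):
    """Round shares to whole machines while keeping the overall total fixed."""
    whole = {name: int(value) for name, value in shares.items()}
    leftover = round(sum(shares.values())) - sum(whole.values())
    # Hand the leftover machines to the OSes with the largest fractional part.
    by_remainder = sorted(shares, key=lambda n: shares[n] - whole[n], reverse=True)
    for name in by_remainder[:leftover]:
        whole[name] += 1
    return whole


shares = proportional_shares(totals_secs, MACHINES)
allocation = largest_remainder(shares)

for name in totals_secs:
    print(f"{name:9s} {shares[name]:6.2f} -> {allocation[name]}")
```

Rounding the fractional shares independently would not necessarily add up to 59, which is why the sketch distributes the leftover machines explicitly. Other data sources (wait times, try load) would give a different split.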
Priority: P3 → P2
I have filed a bug to have a tool to help us do distributions like this at a later time. I have enough data to make a somewhat informed decision, though not the best data I could have gathered.

OS\Data sources   No PGO   Both    Wait times   Armen
Fedora            11.27    11.91   13.12        13
Fedora64          10.66    11.95   11.88        12
Leopard           9.55     6.54    10.87        7
Win7              14.51    15.20   11.93        14
Xp                13.00    13.40   11.19        13

PGO only happens in mozilla-central and mozilla-integration 4 times a day.
Using the source from the wait times report is a little better since on the try server we don't always trigger all builds and all tests. The try server represents 52% of the load.

Adding the number I saw for each silo we would have this many production slaves:
fed - 76-3=74 -> 21.08%
fed64 - 71-3=68 -> 19.37%
leopard - 66-3=63 -> 17.95%
w7 - 79-4=75 -> 21.37%
xp - 75-4=71 -> 20.23%
TOTAL - 351 rev3 prod slaves

This distribution is very similar to the distribution from the wait times report of a week's worth of data.
Fedora    8946   22.24%
Fedora64  8101   20.14%
Leopard   7409   18.42%
Win7      8134   20.23%
Xp        7626   18.96%

All that said, here is what I think is the list of slaves we want:
01- talos-r3-fed-064
02- talos-r3-fed-065
03- talos-r3-fed-066
04- talos-r3-fed-067
05- talos-r3-fed-068
06- talos-r3-fed-069
07- talos-r3-fed-070
08- talos-r3-fed-071
09- talos-r3-fed-072
10- talos-r3-fed-073
11- talos-r3-fed-074
12- talos-r3-fed-075
13- talos-r3-fed-076

01- talos-r3-fed64-060
02- talos-r3-fed64-061
03- talos-r3-fed64-062
04- talos-r3-fed64-063
05- talos-r3-fed64-064
06- talos-r3-fed64-065
07- talos-r3-fed64-066
08- talos-r3-fed64-067
09- talos-r3-fed64-068
10- talos-r3-fed64-069
11- talos-r3-fed64-070
12- talos-r3-fed64-071

01- talos-r3-leopard-060
02- talos-r3-leopard-061
03- talos-r3-leopard-062
04- talos-r3-leopard-063
05- talos-r3-leopard-064
06- talos-r3-leopard-065
07- talos-r3-leopard-066

01- talos-r3-w7-066
02- talos-r3-w7-067
03- talos-r3-w7-068
04- talos-r3-w7-069
05- talos-r3-w7-070
06- talos-r3-w7-071
07- talos-r3-w7-072
08- talos-r3-w7-073
09- talos-r3-w7-074
10- talos-r3-w7-075
11- talos-r3-w7-076
12- talos-r3-w7-077
13- talos-r3-w7-078
14- talos-r3-w7-079

01- talos-r3-xp-063
02- talos-r3-xp-064
03- talos-r3-xp-065
04- talos-r3-xp-066
05- talos-r3-xp-067
06- talos-r3-xp-068
07- talos-r3-xp-069
08- talos-r3-xp-070
09- talos-r3-xp-071
10- talos-r3-xp-072
11- talos-r3-xp-073
12- talos-r3-xp-074
13- talos-r3-xp-075
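Since each group in the list above is a contiguous numeric range, the hostnames can be generated rather than typed, for example when preparing slavealloc or config entries. The sketch below assumes the talos-r3-<platform>-NNN naming seen in this comment; the helper name slave_names and the dictionary layout are illustrative, not taken from any existing tool.

```python
# Generate the proposed slave hostnames from their numeric ranges.
# slave_names and `planned` are hypothetical names used for illustration.


def slave_names(platform, first, last):
    """Hostnames talos-r3-<platform>-NNN for NNN in first..last inclusive."""
    return [f"talos-r3-{platform}-{n:03d}" for n in range(first, last + 1)]


# Ranges as listed in the comment above.
planned = {
    "fed": slave_names("fed", 64, 76),
    "fed64": slave_names("fed64", 60, 71),
    "leopard": slave_names("leopard", 60, 66),
    "w7": slave_names("w7", 66, 79),
    "xp": slave_names("xp", 63, 75),
}

for platform, names in planned.items():
    print(platform, len(names), names[0], "..", names[-1])
```

The per-platform counts come out to 13, 12, 7, 14 and 13, which matches the "Armen" column and adds up to the 59 freed machines.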
Depends on: 705352
Work left in here:
- buildbot-configs
- slavealloc
- puppet work
- OPSI work
Status: NEW → ASSIGNED
Priority: P2 → P4
Whiteboard: waiting on IT's re-imaging work on bug 705352
Also, all of the Windows slaves need additional work done to them that the ref machine got only after these machines were imaged. This comment details exactly what needs to be done: https://bugzilla.mozilla.org/show_bug.cgi?id=704578#c17 Please let me know if there's any confusion.
per RelEng/IT mtg, bug#705352 now fixed.
Whiteboard: waiting on IT's re-imaging work on bug 705352
In case anyone pings me again about it: I already know I can get started. I have spoken with arr about it on IRC.
Attached patch: config changes
Attachment #580215 - Flags: review?(coop)
Attachment #580217 - Flags: review?(coop)
I added this to the production DB and locked the slaves to my staging master.
Attachment #580215 - Flags: review?(coop) → review+
Attachment #580216 - Flags: review?(coop) → review+
Attachment #580217 - Flags: review?(coop) → review+
Attachment #580215 - Flags: checked-in+
Attachment #580216 - Flags: checked-in+
Attachment #580217 - Flags: checked-in+
Armen --

When you update puppet manifests, you *must*
(a) hg pull -u on all masters (including master-puppet1)
(b) watch /var/log/messages for a while to make sure the changes work
Bug 709591 is from a simple typo in 415eae655ad4.

While I was landing it, I also unintentionally added a number of other changesets (8 on mpt-production-puppet, for example) that had not been updated on the other masters -- any of which could have unexpected effects at an unexpected time.
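One way to make step (a) harder to forget is to generate the update command for every master from a single list. The Python sketch below only builds and prints the commands; it does not run them. The master names are the ones that happen to be mentioned in this bug and may not be the complete set, and MANIFEST_DIR is a hypothetical path, since the bug does not say where the manifests are checked out.

```python
import shlex

# ASSUMPTION: hosts named somewhere in this bug; the real list of masters may differ.
MASTERS = [
    "master-puppet1",
    "mpt-production-puppet",
    "scl-production-puppet",
    "staging-puppet",
]

# ASSUMPTION: hypothetical checkout location of puppet-manifests on each master.
MANIFEST_DIR = "/etc/puppet/manifests"


def deploy_commands(masters, manifest_dir=MANIFEST_DIR):
    """Return one ssh argument list per master that updates its checkout."""
    remote = f"cd {shlex.quote(manifest_dir)} && hg pull -u"
    return [["ssh", host, remote] for host in masters]


commands = deploy_commands(MASTERS)
for argv in commands:
    # Print for review; a real deploy would still need step (b), watching
    # /var/log/messages on each master after the update.
    print(" ".join(shlex.quote(part) for part in argv))
```

Keeping the host list in one place means a newly added master only has to be registered once. It does not replace watching the logs afterwards.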
(In reply to Dustin J. Mitchell [:dustin] from comment #27)
> Armen --
>
> When you update puppet manifests, you *must*
> (a) hg pull -u on all masters (including master-puppet1)
> (b) watch /var/log/messages for a while to make sure the changes work
> Bug 709591 is from a simple typo in 415eae655ad4.
>
> While I was landing it, I also unintentionally added a number of other
> changesets (8 on mpt-production-puppet, for example) that had not been
> updated on the other masters -- any of which could have unexpected effects
> at an unexpected time.

I'm very sorry about that.
I created this section for future reference:
https://wiki.mozilla.org/ReleaseEngineering/Puppet/Usage#Deploy_changes

What is master-puppet1?

Do you have an idea on how to prevent pushing things like this live?
I reconfig-ed with the landed changes from this bug today.
Followed instructions in:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave

For leopard slaves I had to:
* run the following: scutil --set HostName XXX
* switch to root and run twice:
> puppetd --test --server scl-production-puppet.build.scl1.mozilla.com

For XP slaves I had to:
NOTE: VNC in fullscreen on Lion does not allow you to type
* change the computer name
* add the DNS suffix and reboot
* I had to do the extra steps below because these machines got re-imaged before they got deployed to the ref image

Bug found: autoit is not installed on 64, 70, 71, 72, 73, 74 & 75
http://grab.by/bpXX

For Fedora and Fedora64 slaves:
* Some slaves were not reachable by ssh and I made mention of it on bug 705352.
* I fixed the hostname (/etc/sysconfig/network) and rebooted
* run, wait & run until I got a signed certificate:
> puppetd --test --server scl-production-puppet.build.scl1.mozilla.com

The Windows 7 slaves need to be activated in bug 705352.

== EXTRA STEPS ==
cd c:\
wget -O installservice.bat --no-check-certificate http://people.mozilla.com/~bhearsum/installservice.bat
runas /user:administrator "schtasks /create /tn mozillamaintenance /tr \"c:\\windows\\system32\\cmd.exe /c \\\"c:\\installservice.bat\\\"\" /sc ONSTART /ru SYSTEM"
wget -O keys.reg --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=577617
regedit /s keys.reg
wget -O MozRoot.crt --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=577619
* Browse to download location
* Right click cert, choose "Install certificate"
* Choose "Trusted Root Certificate Authorities" as the install location
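The "run, wait & run until signed" step for the Fedora slaves is a plain retry loop, sketched below in Python. Everything here is an assumption for illustration: run_until_signed is a hypothetical helper, and the caller must supply a run_puppet callable that returns True once the client has its signed certificate. How to detect that from a puppetd run is deliberately left out, because the bug does not describe it.

```python
import time


def run_until_signed(run_puppet, attempts=5, wait_secs=60, sleep=time.sleep):
    """Call run_puppet() until it reports success or attempts run out.

    run_puppet is a caller-supplied callable (hypothetical) that performs one
    puppet run and returns True once the certificate has been signed.
    Returns the number of the successful attempt, or None if none succeeded.
    """
    for attempt in range(1, attempts + 1):
        if run_puppet():
            return attempt
        if attempt < attempts:
            # Give the master time to sign the certificate request.
            sleep(wait_secs)
    return None


# Demonstration with a fake runner that succeeds on the third call,
# and a fake sleep that records the waits instead of blocking.
outcomes = iter([False, False, True])
waits = []
result = run_until_signed(lambda: next(outcomes), wait_secs=60, sleep=waits.append)
print(result, waits)
```

Passing the sleep function in as a parameter keeps the loop testable without real delays. In real use the default time.sleep would apply, and run_puppet would wrap the puppetd invocation shown in the steps above.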
Depends on: 712004
slave w7-065 is added on bug 676155.
Attachment #582885 - Flags: review?(coop)
Attached file sql statement for IT
rhelmer told me not to worry about writing a DELETE statement for the talos-r3-snow slaves. I can remove it from the previous patch as well if you want me to.
Attachment #582890 - Flags: review?(coop)
Attachment #582885 - Flags: review?(coop) → review+
Depends on: 712131
Attachment #582890 - Flags: review?(coop) → review+
Priority: P4 → P2
I have put the Win7 slaves on staging.
I will be putting the slaves from bug 712004 into staging as well.

Meanwhile I have moved these slaves to production and announced on dev.tree-management:
talos-r3-fed-064
talos-r3-fed-065
talos-r3-fed-066
talos-r3-fed-067
talos-r3-fed-068
talos-r3-fed-069
talos-r3-fed-071
talos-r3-fed-075
talos-r3-fed-076
talos-r3-fed64-060
talos-r3-fed64-061
talos-r3-fed64-062
talos-r3-fed64-063
talos-r3-fed64-065
talos-r3-fed64-066
talos-r3-fed64-067
talos-r3-fed64-068
talos-r3-fed64-069
talos-r3-fed64-070
talos-r3-leopard-060
talos-r3-leopard-061
talos-r3-leopard-062
talos-r3-leopard-063
talos-r3-leopard-065
talos-r3-leopard-066
talos-r3-leopard-074
talos-r3-xp-063
talos-r3-xp-064
talos-r3-xp-065
talos-r3-xp-066
talos-r3-xp-067
talos-r3-xp-068
talos-r3-xp-069
talos-r3-xp-070
talos-r3-xp-071
talos-r3-xp-072
talos-r3-xp-073
talos-r3-xp-074
bhearsum asked me to double check the XP machines as he wasn't sure I picked up the latest changes.

I had to run these missing steps:
wget -O installservice.bat --no-check-certificate https://bug704578.bugzilla.mozilla.org/attachment.cgi?id=579099
wget -O add_cert.msc --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=579191
start add_cert.msc
* From menu: Action -> All Tasks -> Import... launches Certificate Import Wizard
* Click Next
* Browse and use C:\MozRoot.crt
* Next, Next, Finish
* Close the MMC window
Status update:
* all slaves are in production except:
talos-r3-fed-072 - bug 712004
talos-r3-w7-066
talos-r3-w7-067
talos-r3-w7-068
talos-r3-w7-069
talos-r3-w7-070
talos-r3-w7-071
talos-r3-w7-072
talos-r3-w7-073
talos-r3-w7-074
talos-r3-w7-075
talos-r3-w7-076
talos-r3-w7-077
We had a couple of slaves that had trouble:
bug 713326 - Please get talos-r3-xp-067 and talos-r3-xp-066 out of production

I have to have a look at what the win7 slaves' runs look like.
Depends on: 713326
More than a couple, because there was also bug 714392 and bug 714561, but they got the reimage treatment rather than just awaiting your return.
bhearsum I would like to add the Windows 7 slaves into the pool but I would like to know if the steps I have to run are these: https://wiki.mozilla.org/ReferencePlatforms/Test/Win7#Mozilla_maintenance_service.2C_associated_registry_keys.2C_Mozilla_test_CA_root
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #38)
> bhearsum I would like to add the Windows 7 slaves into the pool but I would
> like to know if the steps I have to run are these:
> https://wiki.mozilla.org/ReferencePlatforms/Test/Win7#Mozilla_maintenance_service.2C_associated_registry_keys.2C_Mozilla_test_CA_root

Depends on which image they were cloned from. If they were cloned after the latest images in https://bugzilla.mozilla.org/show_bug.cgi?id=706344, no. You can find this out by looking for "Mozilla Maintenance Service" in services.msc.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #30)
> Bug found: autoit is not installed on 64, 70, 71, 72, 73, 74 & 75
> http://grab.by/bpXX

Filed as bug 717955.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #17)
> 01- talos-r3-w7-066
> 02- talos-r3-w7-067
> 03- talos-r3-w7-068
> 04- talos-r3-w7-069
> 05- talos-r3-w7-070
> 06- talos-r3-w7-071
> 07- talos-r3-w7-072
> 08- talos-r3-w7-073
> 09- talos-r3-w7-074
> 10- talos-r3-w7-075
> 11- talos-r3-w7-076
> 12- talos-r3-w7-077
> 13- talos-r3-w7-078
> 14- talos-r3-w7-079

All of these slaves got the "Mozilla Maintenance Service" installed.
They're now taking jobs on my development master to verify one last time.
I put the following slaves into the production pool:
* talos-r3-xp-035
* talos-r3-xp-066
* talos-r3-xp-067
* talos-r3-xp-070
* talos-r3-xp-075

I'm waiting on bug 705352 for talos-r3-w7-072 to be activated.
Priority: P2 → P4
talos-r3-fed64-065 was waiting for a reboot from IT but no one had enabled it in slave alloc (bug 715786).

talos-r3-xp-066 and talos-r3-xp-067 were synced with OPSI but I completely missed adding the maintenance service manually. They're now on staging again.

I'm waiting on bug 705352 for talos-r3-w7-072 to be activated.
Priority: P4 → P2
Priority: P2 → P3
Depends on: 718922
talos-r3-xp-070 - back to the pool after re-installing the certificate
talos-r3-xp-075 - back to the pool after re-installing the certificate
talos-r3-xp-066 - IT debugging file permission issues - bug 719892
talos-r3-xp-067 - IT debugging file permission issues - bug 719892
talos-r3-xp-068 - on preproduction; I will put it back on Monday
talos-r3-w7-072 - back to the pool
Depends on: 719892
No longer depends on: 718922
No longer depends on: 719892
IT will fix talos-r3-xp-066 and talos-r3-xp-067 in bug 719892. Nothing left to be done.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering