Hardware Bulletin
Current State of Affairs
Nodes 086-103 and node 142 are used as DGrid components
Log (starting 25.03.2006)
07.02.2010
node004 (12:10)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f667200209080813 TSC 4b9ed2d49e7c ADDR 1b8b87e78 This is not a software problem! Kernel panic - not syncing: Machine check
Jobs affected: 1924949, 1924956, 1924959, and 1925721
xx.09.2009
node043 (xx:xx)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f662a00190080813 TSC 19ac353bd73a30 ADDR 5b7773a0 This is not a software problem!
Jobs affected: 1839019, 1837862, 1839038, and 1837890
27.08.2009
node019 (12:16)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f664200214080813 TSC 7ff5af125d7c64 ADDR 125e5eb80 This is not a software problem!
Jobs affected: 1826453, 1827277, 1827279, and 1827376
(07.09.2009) Test jobs and stability tests revealed no errors, and the log files show no traces of a hardware error -> node reintegrated into the cluster
09.08.2009
node043 (05:05)
HARDWARE ERROR CPU 0: Machine Check Exception: 7 Bank 4: f400200058080a13 RIP 33:<0000000000406d82> TSC 1d3faf175c6824 ADDR b4f02530 This is not a software problem!
Jobs affected: 1818400, 1818401, 1818526, and 1819431
memory module replaced
28.07.2009
node057 (09:36)
HARDWARE ERROR CPU 2: Machine Check Exception: 4 Bank 4: b63aa00127080813 TSC 3a9a15522f22d ADDR 166024d70 This is not a software problem!
Jobs affected: 1807549, 1807550, 1807551, and 1807603
Several tests conducted: a combination of memory- and CPU-intensive jobs could crash the node. The flaw was narrowed down to a single memory module, which was replaced.
23.07.2009
node071 (15:06) Spontaneous Reboot
Jobs affected: 1802006, 1803381, 1802054, and 1802985
Crash reproducible, memory module replaced
17.07.2009
node057 (12:24)
HARDWARE ERROR CPU 2: Machine Check Exception: 4 Bank 4: f660a001f1080813 TSC 66db883bc283f6 ADDR 169926268 Kernel panic - not syncing: Machine check
Jobs affected: 1785578, 1799333, 1799351, and 1797191
Error was not reproducible; node survived the stress test and rejoined the cluster
16.06.2009
node043 (13:19)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f658a00104080813 TSC 4580b1409998e ADDR 5c078b78 Kernel panic - not syncing: Machine check
Jobs affected: 1777425, 1777441, 1777471, and 1767074.61382
-> memory module replaced
28.02.2009
node048 (00:32)
HARDWARE ERROR CPU 2: Machine Check Exception: 4 Bank 4: f640200180080813 TSC 117673b774dab7 ADDR 1066a4780 Kernel panic - not syncing: Machine check
Jobs affected: 1722242, 1711243, and 1711280
-> memory module replaced
18.02.2009
node026 (21:20)
HARDWARE ERROR CPU 2: Machine Check Exception: 4 Bank 4: f61520024f080813 TSC 4b4524bd3be5f7 ADDR 137530ed0 Kernel panic - not syncing: Machine check
Jobs affected: 1710830, 1709865, 1706950, and 1710751
node072 (22:05)
HARDWARE ERROR CPU 0: Machine Check Exception: 7 Bank 4: f443a0008a080a13 RIP 33:<000000000087946c> TSC 986746eb53f43 ADDR 44682b2bc0 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Uncorrected machine check
Jobs affected: 1707408, 1707416, 1706923, and 1706987
18.09.2008
node075 (07:32)
CPU 2: Machine Check Exception: 4 Bank 4: f600200258080813 TSC 4fc31778596294 ADDR 1b8f42c10 Kernel panic - not syncing: Machine check
Jobs affected: 1605171, 1605598, 1589782, and 1589789
-> memory module replaced
20.03.2008
node068 (12:37)
CPU 0: Machine Check Exception: 4 Bank 4: f6782001fc080813 TSC b484ae09713a2a ADDR c5fe5c40 Kernel panic - not syncing: Machine check
Jobs affected: 1499093, 1499096, 1499106, and 1499889
-> memory module replaced
11.03.2008
node068 (12:23)
CPU 0: Machine Check Exception: 4 Bank 4: f62920014c080813 TSC 161fc9c2a285a2 ADDR c5fe5b60 Kernel panic - not syncing: Machine check
Jobs affected: 1493233, 1492272, 1492783, and 1492934
(EDAC MC0: CE page 0xc5fe5, offset 0xec0, grain 8, syndrome 0x20e8, row 3, channel 0, label "": k8_edac)
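The EDAC line above names a page and an offset, while the panic messages give a full physical address. As a minimal sketch, assuming 4 KiB pages (the log itself does not state the page size) and using an illustrative function name, an ADDR value can be split into the same two parts for comparison:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, not stated anywhere in this log


def split_addr(addr_hex):
    """Split a hex ADDR value from a panic line into (page, offset)."""
    addr = int(addr_hex, 16)
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset


# ADDR values from the node068 panics on 11.03.2008 and 20.03.2008
for addr in ("c5fe5b60", "c5fe5c40"):
    page, offset = split_addr(addr)
    print(f"ADDR {addr}: page {page:#x}, offset {offset:#x}")
```

Under that assumption both node068 panic addresses fall into page 0xc5fe5, the page named in the EDAC line, which fits the later decision to replace a memory module.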
29.02.2008
node064 (00:44)
CPU 2: Machine Check Exception: 4 Bank 4: f677a00222080813 TSC 10ea332e9d560c ADDR d7f0b780 Kernel panic - not syncing: Machine check
Jobs affected: 1481503, 1481682, 1481758, and 1481874
11.12.2007
node010 (12:20)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f65820029b080813 TSC 45173041c1d585 ADDR 2ddfa8040 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor
Jobs affected: 1414900, 1415080, 1415070, and 1414570
10.12.2007
node107 (21:13)
crashed, blank screen, no
Jobs affected: 1407881, 1414558, 1414566, and 1414544
24.11.2007
node007 (16:32)
Machine Check Exception: 4 Bank 4: f67c200114080813 TSC 5b6386fbbe051b ADDR 1a5fec840 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1407900, 1407815, 1407817, and 1407285
30.10.2007
node036 (20:12)
HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 4: f4152000c3080a13
RIP 10:<ffffffff80107b0d> {default_idle+0x2d/0x60}
TSC 1628dd9d31ae2f ADDR c0185c90
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check
Jobs affected: 1391079, 1390843, and 1390844
21.10.2007
node130 (15:27)
HARDWARE ERROR CPU 3: Machine Check Exception: 4 Bank 0: f606200000000833 TSC 3abed3455dac8d ADDR 43d629e80 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1388131 and 1387620(?)
Seems to be a CPU-internal data cache problem.
10.09.2007
node018 (09:11)
HARDWARE ERROR CPU 1: Machine Check Exception: 4 Bank 4: f662a00290080813 TSC 8aa5e210fb5138 ADDR 47de28a0 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1370463, 1371549, 1371903, and 1371713
Evidently a memory module is flawed.
08.07.2007
node010 (05:07)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f6712002e3080813 TSC 5093c6efa43d75 ADDR 2e2f98d40 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1299009, 1299410, 1300723, and 1300713
27.06.2007
node007 (22:46)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f653a002d4080813 TSC 5544b34464e89b ADDR 132fbed00 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1296236, 1296267, 1296401, and 1296399
25.04.2007
node063 (01:42)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f60aa00205080813 TSC 35efd62bc3569f ADDR 11285eb00 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1202243, 1202401, 1202132, and 1201456
23.04.2007
node127
Hard disk failure / Update (25.04.07): replacement disk also fails; the source of the problem seems to be elsewhere
Jobs affected: 1201513, 1201514, and 1201594
22.04.2007
node091
Memory module defective, node crashed
Job affected: 1198690
18.04.2007
node019 (Feb 25 03:01:19)
HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 4: f467200050080a13
RIP 10:<ffffffff80107b0d> {default_idle+0x2d/0x60}
TSC 31b98f8bc1358d ADDR 203d50b50
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check
Jobs affected: 1192928 and 1197365
25.02.2007
node010 (Feb 25 12:56:11)
HARDWARE ERROR CPU 0: Machine Check Exception: 4 Bank 4: f6722001c4080813 TSC 1256ecb928a4bf ADDR 19db6ed40 This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Kernel panic - not syncing: Machine check
Jobs affected: 1167521, 1167145, 1167469, 1166717
08.02.2007
node007 (02:13)
frozen
Jobs affected: 1148519, 1148523, 1148548, and 1148593
22.12.2006
node019 (00:02)
frozen
Jobs affected: 1124891, 1125525, 1125415, and 1125399
08.11.2006
node064 (01:04)
CPU 2: Machine Check Exception: 4 Bank 4: f623a001a6080813 TSC 48a9e411b943c ADDR 15fb92600 Kernel panic - not syncing: Machine check
Jobs affected: 1110845, 1110558, 1110809, and 1110560
25.10.2006
node064 (11:53)
CPU 2: Machine Check Exception: 4 Bank 4: f659a001c5080813 TSC 307950a9d0e40 ADDR 19773f970 Kernel panic - not syncing: Machine check
Jobs affected: 1100289, 1100319, 1100320, and 1100453
16.10.2006
node063 (23:43)
CPU 0: Machine Check Exception: 4 Bank 4: f631a00103080813 TSC 7a441b0696bf0 ADDR 1032a5bc0 Kernel panic - not syncing: Machine check
Jobs affected: 1094883, 1094886, 1094956, and 1094880
node038 (18:44)
CPU 2: Machine Check Exception: 4 Bank 4: b60820022c080813 TSC 3cb7c4f84909a ADDR 1a01923e0 Kernel panic - not syncing: Machine check
Jobs affected: 1093446, 1094908, 1094946, and 1094843
10.10.2006
node010 (15:32)
CPU 0: Machine Check Exception: 4 Bank 4: f602a0020d080813 TSC cd0ac4ab4137e ADDR 1f09b5b00 Kernel panic - not syncing: Machine check
Jobs affected: 1087036, 1086506, 1087038, and 1087037
21.09.2006
node063 (15:14)
CPU 0: Machine Check Exception: 4 Bank 4: f634a001b8080813 TSC 13117a06283a8 ADDR 5b29fb00 Kernel panic - not syncing: Machine check
Jobs affected: 1073797, 1074089, 1074098, and 1074175
21.09.2006
node064 (10:39)
CPU 2: Machine Check Exception: 4 Bank 4: f617a0020d080813 TSC 1138760759f01 ADDR 1d263d340 Kernel panic - not syncing: Machine check
Jobs affected: 1074045, 1074058, 1074087, and 1074097
20.09.2006
node019 (20:53)
CPU 0: Machine Check Exception: 4 Bank 4: f64ca00289080813 TSC b996c182c8ad ADDR 101ba9bc0 Kernel panic - not syncing: Machine check
Jobs affected: 1073564 and 1073826
17.09.2006
node038 (20:13)
CPU 2: Machine Check Exception: 4 Bank 4: f60820022c080813 TSC 85fd743b80cbe1 ADDR 189791dc0 Kernel panic - not syncing: Machine check
Jobs affected: 1065102 and 1073481
01.09.2006
node010 (00:32)
Spontaneous reboot
Jobs affected: crashme jobs solely
24.08.2006
node001 (09:03)
CPU 0: Machine Check Exception: 4 Bank 4: f62ca0016c080813 TSC 77366d7911cf36 ADDR 1e9bc2700 Kernel panic - not syncing: Machine check
Jobs affected: 1002001, 1002071, 1002736, and 1003623
06.08.2006
node064 (02:58)
CPU 2: Machine Check Exception: 4 Bank 4: b618200130080813 TSC 2c38fbd2768d2 ADDR 1c4187f70 Kernel panic - not syncing: Machine check
no user jobs affected
01.08.2006
node064 (11:35)
CPU 2: Machine Check Exception: 4 Bank 4: b640200180080813 TSC 2fa034a4a751a6 ADDR 1c4185f70 Kernel panic - not syncing: Machine check
Jobs affected: 996581, 996606, 996932, and 996922
30.07.2006
Fileserver (12:00)
Several kernel-level services (kernel, NFS server, ...) froze
Several thousand (corrected) ECC errors logged
29.07.2006
node010 (08:35)
CPU 0: Machine Check Exception: 4 Bank 4: f621200294080813 TSC 63fd82dcabab ADDR 1fdff2c80 Kernel panic - not syncing: Machine check
Jobs affected: 993048, 993051, 993054, and 993055
node033 (02:05)
froze, no error message at all
Jobs affected: 992993, 987539, 992677, and 987536
23.07.2006
node063 (07:30)
CPU 2: Machine Check Exception: 7 Bank 4: f41da00012080a13
RIP 10:<ffffffff8020d422> {copy_user_generic_c+0x8/0x26}
TSC 5d74dbaa4c31e7 ADDR 11f13fbd8
Kernel panic - not syncing: Uncorrected machine check
Jobs affected: 982642, 982645, 982645, and 982492
19.07.2006
node007 (07:55)
CPU 0: Machine Check Exception: 4 Bank 4: f621200203080813 TSC 2ef0832e7c4c95 ADDR 10ab93e80 Kernel panic - not syncing: Machine check
Jobs affected: 951979, 951990, 951980, and 951978
16.07.2006
node010 (11:40)
CPU 0: Machine Check Exception: 7 Bank 4: f467200033080a13
RIP 10:<ffffffff8010bb1a> {default_idle+0x3a/0x90}
TSC 2d309c33893f24 ADDR 1af198fa0
Kernel panic - not syncing: Uncorrected machine check
Jobs affected: 948107 and 939152
15.07.2006
node030 (11:55)
Spontaneous reboot, reason unknown
Jobs affected: 9939215, 948090, 949417, and 949414
12.07.2006
node029 (19:45)
CPU 0: Machine Check Exception: 4 Bank 4: b67e200204080813 TSC 1531b02295e ADDR e9eda00 Kernel panic - not syncing: Machine check
Jobs affected: 933745, 933746, 933631, and 933764
10.07.2006
node069 (16:55)
Crashed, cause to be investigated
Jobs affected: 933745, 933746, 933631, and 933764
node065 (11:50)
Spontaneous reboot after a couple of ECC errors
Jobs affected: 930651, 930162, 930085, and 930202
09.07.2006
node019 (14:45)
Spontaneous reboot
Jobs affected: "crashme" jobs only
07.07.2006
node019 (13:30)
Spontaneous reboot after uncorrected ECC error
MCE 0
CPU 0 4 northbridge TSC 1fd38de714ea88
RIP 33:446940 ADDR 7ba0b140
Northbridge Chipkill ECC error
Chipkill ECC syndrome = d44c
bit45 = uncorrected ecc error
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS f4262000d4080a13 MCGSTATUS 7
Jobs affected: 886578, 858858, 897244, 897226
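The bit annotations in the mcelog output above can be reproduced from the STATUS word. The following Python sketch is not part of mcelog. The names for bits 45, 46, 61, and 62 follow the mcelog text quoted in this log, bit 63 (valid) is taken from the general x86 machine-check status layout, and the syndrome field positions (bits 24-31 and 47-54) are an assumption that was checked only against the two syndromes printed in this log.

```python
# Bit names as quoted by mcelog in this log, plus bit 63 ("valid").
FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "error uncorrected",
    46: "corrected ecc error",
    45: "uncorrected ecc error",
}


def decode_status(status_hex):
    """Return (list of set flag names, 16-bit Chipkill syndrome)."""
    status = int(status_hex, 16)
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    # Assumed layout: high syndrome byte in bits 24-31, low byte in bits 47-54.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return flags, syndrome


# STATUS words from node019 (07.07.2006) and node046 (03.04.2006)
for word in ("f4262000d4080a13", "d400c001be080813"):
    flags, syndrome = decode_status(word)
    print(word, f"syndrome={syndrome:04x}", "; ".join(flags))
```

For the two STATUS words quoted in this log the sketch yields the syndromes d44c and be01, matching the mcelog output, and it separates the uncorrected error on node019 from the corrected one on node046.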
03.07.2006
Head Node (15:10)
CPU 0: Machine Check Exception: 7 Bank 4: f411200040080a13
RIP 10:<ffffffff8010bb1a> {default_idle+0x3a/0x90}
TSC 1fbbd0799fe1a ADDR 7a6a0000
Kernel panic - not syncing: Uncorrected machine check
30.06.2006
Head Node (06:25)
CPU 0: Machine Check Exception: 4 Bank 4: f61c200179080813 TSC 5566b29c92b9c ADDR 7a8a2000 Kernel panic - not syncing: Machine check
24.06.2006
node042 (07:05)
froze (mcelog indicates faulty memory)
Jobs affected: 792320, 800378, 800431, and 802236
04.06.2006
node012 (11:05)
CPU 0: Machine Check Exception: 4 Bank 4: f623a0024f080813 TSC e4651eedea4bc ADDR 1ee555000 Kernel panic - not syncing: Machine check
Jobs affected: 710106, 706216, 706215, and 706237
15.05.2006
node064 (14:40)
CPU 2: Machine Check Exception
Jobs affected: 610242, 623889, 623184, 623608
13.05.2006
node012 (21:05)
CPU 0: Machine Check Exception
Jobs affected: 610009, 610165, 610166, 610167
10.05.2006
node052 (19:5)
CPU 2: Machine Check Exception, Node doesn't boot
Jobs affected: 599675, 599266, 599351, 599479
25.04.2006
node042 (08:20)
CPU 0: Machine Check Exception
Jobs affected:
21.04.2006
node019 (21:25)
CPU 0: Machine Check Exception
Jobs affected:
19.04.2006
node042 (18:55)
CPU 0: Machine Check Exception
Jobs affected: 506599, 4506600
18.04.2006
node042 (14:30)
CPU 0: Machine Check Exception
Jobs affected: 434023, 434024
11.04.2006
- CPUs exchanged on nodes 007, 010, 012, and 019
- All memory modules exchanged on nodes 046 and 064
08.04.2006
node007 (12:57)
Spontaneous reboot
Jobs affected: crashme jobs solely
node010 (06:45)
CPU 0: Machine Check Exception: 4 Bank 4: f60ca00206080813 TSC 2fddc62050021 ADDR 1e8bf44c0 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
06.04.2006
Longest error message so far:
node007 (06:05)
CPU 0: Machine Check Exception: 4 Bank 4: f6042001fe080813
TSC fe0cb9b07691 ADDR 7ed65d0
Kernel panic - not syncing: Machine check
NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: nfs lockd nfs_acl sunrpc af_packet dm_mod w83627hf eeprom lm85 hwmon_vid
i2c_isa i2c_amd756 i2c_core genrtc
Pid: 17369, comm: newcode25.x Tainted: G M 2.6.15.3 #1
RIP: 0010:[<ffffffff80118fe8>] <ffffffff80118fe8>{__smp_call_function+104}
RSP: 0000:ffffffff80491b68 EFLAGS: 00000097
RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000000000000004
RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: ffffffff804d8830
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff801190c0
R13: 0000000000000000 R14: 0000fe0cb9b069cc R15: ffffffff8039c8a5
FS: 00002aaaab5634a0(0000) GS:ffffffff804d3800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaaab738000 CR3: 00000001f8625000 CR4: 00000000000006e0
Process newcode25.x (pid: 17369, threadinfo ffff8101f099c000, task ffff8101f8081120)
Stack: ffffffff801190c0 0000000000000000 0000000000000002 0000000000000000
0000000000000096 0000000000000000 0000000000000000 0000000000000000
ffffffff804070c0 ffffffff80119120
Call Trace: <#MC> <ffffffff801190c0>{smp_really_stop_cpu+0} <ffffffff80119120>{smp_send_stop+64}
<ffffffff8013445b>{panic+203} <ffffffff8010f5b6>{oops_begin+102}
<ffffffff8011523f>{print_mce+159} <ffffffff80115316>{mce_panic+166}
<ffffffff80115718>{do_machine_check+968} <ffffffff8010eeb3>{machine_check+127}
<ffffffff8015d678>{bad_range+40} <EOE> <ffffffff8015dd6c>{__rmqueue+188}
<ffffffff8015de1b>{rmqueue_bulk+75} <ffffffff8015e36e>{buffered_rmqueue+94}
<ffffffff8015e653>{get_page_from_freelist+163} <ffffffff8015e6e0>{__alloc_pages+80}
<ffffffff8016cd77>{do_anonymous_page+71} <ffffffff8016d54e>{__handle_mm_fault+414}
<ffffffff8011ee0c>{do_page_fault+540} <ffffffff8016effb>{vma_merge+331}
<ffffffff8016f8c4>{do_mmap_pgoff+1636} <ffffffff803754ed>{schedule+365}
<ffffffff8010e8a5>{error_exit+0}
Code: f3 90 8b 44 24 10 39 d8 75 f6 85 ed 74 1a 8b 44 24 14 39 d8
console shuts up ...
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: nfs lockd nfs_acl sunrpc af_packet dm_mod w83627hf eeprom lm85 hwmon_vid
i2c_isa i2c_amd756 i2c_core genrtc
Pid: 17325, comm: newcode25.x Tainted: G M 2.6.15.3 #1
RIP: 0010:[<ffffffff803771fb>] <ffffffff803771fb>{.text.lock.spinlock+20}
RSP: 0018:ffff8101f8a25d18 EFLAGS: 00000086
RAX: ffff810000019680 RBX: ffff810007e73338 RCX: 0000000000000000
RDX: ffff81020002e3d0 RSI: 000000000000001f RDI: ffff810000019680
RBP: ffff81020002e3c0 R08: 0000000000000001 R09: 00000001f5ee5067
R10: ffff8101fe587188 R11: ffff8101fe4cdda8 R12: ffff810000019600
R13: ffff810207061a58 R14: ffff810000019600 R15: ffff8101f8a25de8
FS: 00002aaaab5634a0(0000) GS:ffffffff804d3880(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaaab73f000 CR3: 00000001f0804000 CR4: 00000000000006e0
Process newcode25.x (pid: 17325, threadinfo ffff8101f8a24000, task ffff8101fe540a30)
Stack: 0000000000000096 ffffffff8015d87a ffff81000001ac10 ffff810000019680
00000000000280d2 ffff8101fe540a30 00000000fa357ad8 ffff81020002e3d0
0000001f00019600 ffff810007e73338
Call Trace:<ffffffff8015d87a>{free_pages_bulk+58} <ffffffff8015e2db>{free_hot_cold_page+235}
<ffffffff8015ea40>{__pagevec_free+32} <ffffffff80165c92>{release_pages+322}
<ffffffff80174c64>{free_pages_and_swap_cache+116} <ffffffff801701af>{unmap_region+287}
<ffffffff80170538>{do_munmap+408} <ffffffff801705ad>{sys_munmap+77}
<ffffffff8010db6a>{system_call+126}
Code: f3 90 83 3f 00 7e f9 e9 92 fd ff ff f3 90 83 3f 00 7e f9 e9
console shuts up ...
Jobs affected: crashme jobs solely
05.04.2006
Megware announces that it will replace the CPUs on the unstable nodes (probably on Tue, 11.04.2006)
04.04.2006
node007 (15:00)
CPU 0: Machine Check Exception: 4 Bank 4: f646a0021e080813 TSC 22fbfcb51e6d ADDR 1eededcc0 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
Cluster sysadmin sent a somewhat more explicit email to Megware
node007 (05:35)
CPU 0: Machine Check Exception: 4 Bank 4: f658a002ec080813 TSC 83cbde955a64 ADDR 1f7eee840 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
03.04.2006
node046 (16:15)
CPU 2: Machine Check Exception: 4 Bank 4: b604a00112080813 TSC 1fe06d6bde5bf8 ADDR 15ccfc000 Kernel panic - not syncing: Machine check
Jobs affected: 355638, 355676, 355677
This is probably an error caused by bad memory, as indicated by many mcelog entries of the following type:
MCE 0
CPU 2 4 northbridge TSC 1fdf53e3730d58
ADDR 15ccfc4a0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = be01
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d400c001be080813 MCGSTATUS 0
These errors had previously (27.03.06) been reported to Megware. No response yet.
02.04.2006
node007 (19:10)
CPU 0: Machine Check Exception: 4 Bank 4: f64da001e3080813 TSC 637b2ce888c32 ADDR 1f4dbaab0 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
01.04.2006
node010 (08:35)
CPU 0: Machine Check Exception: 4 Bank 4: f645a002d5080813 TSC 9890070606e2 ADDR 1eedbd900 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
31.03.2006
10:05: hard disk in the head node replaced, RAID mirror is rebuilding
node010 (02:35)
CPU 0: Machine Check Exception: 4 Bank 4: f60520024f080813 TSC 23bb7236ab260 ADDR 1fdbe5c40 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
30.03.2006
FUJITSU drive reverted to "FAILED" again. Megware ships replacement disk.
29.03.2006
AMD is able to reproduce a system freeze in a test environment. That test system is equipped with the Opteron CPUs that were previously removed from our node011. Investigation continues ...
Cluster sysadmin graciously offers to support the investigation by agreeing to have the CPUs of all flawed nodes exchanged.
28.03.2006
Megware suggested rebuilding the RAID mirror; successfully done [03/28/2006 (17:14:25)]:
Adapter 1 Channel 1 Target 2:
Physical Drive[FUJITSU MAT3073NC 0109] is Changed to ONLINE.
Hard disk error (Head node) [03/28/2006 (12:21:04)]:
Adapter 1 Channel 1 Target 2:
Physical Drive[FUJITSU MAT3073NC 0109] is Changed to FAILED.
Running in degraded mode (the disk is part of a RAID mirror)
27.03.2006
node010 (09:55)
CPU 0: Machine Check Exception: 4 Bank 4: f67da0010e080813 TSC 619e5a532a87c ADDR 1ecbc1e40 Kernel panic - not syncing: Machine check
Jobs affected: crashme jobs solely
25.03.2006
node064 (21:45)
CPU 2: Machine Check Exception: 4 Bank 4: f6502002f3080813 TSC 1a8d1675e00788 ADDR 19fdbf2c0 Kernel panic - not syncing: Machine check
Jobs affected: 239264, 241015, 241066, 243328
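The one-line panic messages recorded throughout this log share a fixed field order: CPU number, a first hex value, bank number, 64-bit status word, and usually TSC and ADDR. A minimal Python sketch for pulling these fields out of an entry follows. The function and key names are illustrative, and reading the first hex value as the global machine-check status is an assumption based on the "STATUS ... MCGSTATUS 7" line in the mcelog output of 07.07.2006.

```python
import re

# Field order as it appears in the panic lines of this log.
MCE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcg>[0-9a-f]+) "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]{16})"
    r"(?:.*?TSC (?P<tsc>[0-9a-f]+))?"    # optional: may sit on a later line
    r"(?:.*?ADDR (?P<addr>[0-9a-f]+))?"  # optional for the same reason
)


def parse_mce(line):
    """Return the numeric fields of one panic line, or None if no match."""
    m = MCE_RE.search(line)
    if m is None:
        return None

    def hex_or_none(name):
        value = m.group(name)
        return int(value, 16) if value is not None else None

    return {
        "cpu": int(m.group("cpu")),
        "mcg_status": int(m.group("mcg"), 16),  # assumed meaning, see lead-in
        "bank": int(m.group("bank")),
        "status": int(m.group("status"), 16),
        "tsc": hex_or_none("tsc"),
        "addr": hex_or_none("addr"),
    }


# Entry of 25.03.2006, node064
line = ("CPU 2: Machine Check Exception: 4 Bank 4: f6502002f3080813 "
        "TSC 1a8d1675e00788 ADDR 19fdbf2c0 "
        "Kernel panic - not syncing: Machine check")
print(parse_mce(line))
```

Entries that spread the message over several lines need to be parsed line by line; the TSC and ADDR fields then come back as None for the first line.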
