Hardware Bulletin

From Clust-doc


Current State of Affairs

Nodes 086-103 and 142 are used as DGrid components

Log (starting 25.03.2006)

07.02.2010

node004 (12:10)

   HARDWARE ERROR
   CPU 0: Machine Check Exception:                4 Bank 4: f667200209080813
   TSC 4b9ed2d49e7c ADDR 1b8b87e78
   This is not a software problem!
   Kernel panic - not syncing: Machine check
   Jobs affected: 1924949, 1924956, 1924959, and 1925721

xx.09.2009

node043 (xx:xx)

   HARDWARE ERROR
   CPU 0: Machine Check Exception:                4 Bank 4: f662a00190080813
   TSC 19ac353bd73a30 ADDR 5b7773a0
   This is not a software problem!
   Jobs affected: 1839019, 1837862, 1839038, and 1837890

27.08.2009

node019 (12:16)

   HARDWARE ERROR
   CPU 0: Machine Check Exception:                4 Bank 4: f664200214080813
   TSC 7ff5af125d7c64 ADDR 125e5eb80
   This is not a software problem!
   Jobs affected: 1826453, 1827277, 1827279, and 1827376

(07.09.2009) Test jobs and stability tests did not reveal any errors, and the logfiles show no traces pointing to a hardware error -> node reintegrated into the cluster


09.08.2009

node043 (05:05)

   HARDWARE ERROR
   CPU 0: Machine Check Exception:                7 Bank 4: f400200058080a13
   RIP 33:<0000000000406d82>
   TSC 1d3faf175c6824 ADDR b4f02530
   This is not a software problem!
   Jobs affected: 1818400, 1818401, 1818526, and 1819431
   memory module replaced
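
The console lines in this log can be picked apart mechanically. The following is a minimal sketch, assuming that the number printed between "Machine Check Exception:" and "Bank" is the MCG_STATUS register (the 07.07.2006 mcelog dump reports "MCGSTATUS 7" for the same kind of event) and that its low three bits are the architectural RIPV, EIPV and MCIP flags. The names `MCE_LINE`, `MCG_FLAGS` and `parse_mce_line` are made up for this illustration.

```python
import re

# Console line as it appears in this log, e.g.
# "CPU 0: Machine Check Exception:                7 Bank 4: f400200058080a13"
MCE_LINE = re.compile(
    r"CPU\s+(?P<cpu>\d+):\s+Machine Check Exception:\s+"
    r"(?P<mcg>[0-9a-fA-F]+)\s+Bank\s+(?P<bank>\d+):\s+"
    r"(?P<status>[0-9a-fA-F]{16})"
)

# Assumption: low bits of MCG_STATUS as defined for x86 machine checks.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def parse_mce_line(line):
    """Return the fields of one console MCE line, or None if it does not match."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    mcg = int(m.group("mcg"), 16)
    return {
        "cpu": int(m.group("cpu")),
        "bank": int(m.group("bank")),
        "mcgstatus": mcg,
        "mcg_flags": [name for bit, name in MCG_FLAGS.items() if mcg & (1 << bit)],
        "status": int(m.group("status"), 16),
    }


if __name__ == "__main__":
    line = "CPU 0: Machine Check Exception:                7 Bank 4: f400200058080a13"
    print(parse_mce_line(line))
```

Under these assumptions a value of 4 means only MCIP is set, while 7 adds RIPV and EIPV, which fits the observation that the entries showing 7 also print a RIP line.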

28.07.2009

node057 (09:36)

   HARDWARE ERROR
   CPU 2: Machine Check Exception:                4 Bank 4: b63aa00127080813
   TSC 3a9a15522f22d ADDR 166024d70
   This is not a software problem!
   Jobs affected: 1807549, 1807550, 1807551, and 1807603
   Several tests conducted: a combination of memory- and CPU-intensive jobs could
   crash the node. The flaw was confined to a single memory module, which was
   replaced.

23.07.2009

node071 (15:06) Spontaneous Reboot

   Jobs affected: 1802006, 1803381, 1802054, and 1802985
   
   crash reproducible, memory module replaced

17.07.2009

node057 (12:24)

  HARDWARE ERROR
  CPU 2: Machine Check Exception:                4 Bank 4: f660a001f1080813
  TSC 66db883bc283f6 ADDR 169926268
  Kernel panic - not syncing: Machine check
  Jobs affected: 1785578, 1799333, 1799351, and 1797191
  error was not reproducible, node survived stress test, rejoined the cluster

16.06.2009

node043 (13:19)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                4 Bank 4: f658a00104080813
  TSC 4580b1409998e ADDR 5c078b78
  Kernel panic - not syncing: Machine check

Jobs affected: 1777425, 1777441, 1777471, and 1767074.61382

-> memory module replaced


28.02.2009

node048 (00:32)

  HARDWARE ERROR
  CPU 2: Machine Check Exception:                4 Bank 4: f640200180080813
  TSC 117673b774dab7 ADDR 1066a4780
  Kernel panic - not syncing: Machine check

Jobs affected: 1722242, 1711243, and 1711280

-> memory module replaced


18.02.2009

node026 (21:20)

  HARDWARE ERROR
  CPU 2: Machine Check Exception:                4 Bank 4: f61520024f080813
  TSC 4b4524bd3be5f7 ADDR 137530ed0
  Kernel panic - not syncing: Machine check

Jobs affected: 1710830, 1709865, 1706950, and 1710751

node072 (22:05)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                7 Bank 4: f443a0008a080a13
  RIP 33:<000000000087946c>
  TSC 986746eb53f43 ADDR 44682b2bc0
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Uncorrected machine check

Jobs affected: 1707408, 1707416, 1706923, and 1706987

18.09.2008

node075 (07:32)

 CPU 2: Machine Check Exception:                4 Bank 4: f600200258080813
 TSC 4fc31778596294 ADDR 1b8f42c10
 Kernel panic - not syncing: Machine check

Jobs affected: 1605171, 1605598, 1589782, and 1589789

-> memory module replaced


20.03.2008

node068 (12:37)

 CPU 0: Machine Check Exception:                4 Bank 4: f6782001fc080813
 TSC b484ae09713a2a ADDR c5fe5c40
 Kernel panic - not syncing: Machine check

Jobs affected: 1499093, 1499096, 1499106, and 1499889

-> memory module replaced


11.03.2008

node068 (12:23)

 CPU 0: Machine Check Exception:                4 Bank 4: f62920014c080813
 TSC 161fc9c2a285a2 ADDR c5fe5b60
 Kernel panic - not syncing: Machine check

Jobs affected: 1493233, 1492272, 1492783, and 1492934

(EDAC MC0: CE page 0xc5fe5, offset 0xec0, grain 8, syndrome 0x20e8, row 3, channel 0, label "": k8_edac)
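
The EDAC page number can be compared with the ADDR values of the two node068 machine checks (11.03.2008 and 20.03.2008). The following is a small sketch, assuming 4 KiB pages so that the page number is the physical address shifted right by 12 bits; `PAGE_SHIFT` and `page_of` are names chosen for this example only.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def page_of(addr):
    """Return the page frame number of a physical address."""
    return addr >> PAGE_SHIFT


# Values copied from the node068 entries in this log.
edac_page = 0xC5FE5
edac_offset = 0xEC0
mce_addrs = {
    "11.03.2008": 0xC5FE5B60,
    "20.03.2008": 0xC5FE5C40,
}

# Physical address of the corrected error reported by EDAC.
edac_addr = (edac_page << PAGE_SHIFT) | edac_offset

# True where the MCE address falls into the page reported by EDAC.
same_page = {date: page_of(addr) == edac_page for date, addr in mce_addrs.items()}

if __name__ == "__main__":
    print(hex(edac_addr), same_page)
```

Both machine check addresses fall into the page reported by EDAC, which is consistent with a single faulty memory location on node068.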


29.02.2008

node064 (00:44)

 CPU 2: Machine Check Exception:                4 Bank 4: f677a00222080813
 TSC 10ea332e9d560c ADDR d7f0b780
 Kernel panic - not syncing: Machine check

Jobs affected: 1481503, 1481682, 1481758, and 1481874


11.12.2007

node010 (12:20)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                4 Bank 4: f65820029b080813
  TSC 45173041c1d585 ADDR 2ddfa8040
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor

Jobs affected: 1414900, 1415080, 1415070, and 1414570


10.12.2007

node107 (21:13)

crashed, blank screen, no

Jobs affected: 1407881, 1414558, 1414566, and 1414544


24.11.2007

node007 (16:32)

  Machine Check Exception:                4 Bank 4: f67c200114080813
  TSC 5b6386fbbe051b ADDR 1a5fec840
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1407900, 1407815, 1407817, and 1407285


30.10.2007

node036 (20:12)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                7 Bank 4: f4152000c3080a13
  RIP 10:<ffffffff80107b0d> {default_idle+0x2d/0x60}
  TSC 1628dd9d31ae2f ADDR c0185c90
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Uncorrected machine check

Jobs affected: 1391079, 1390843, and 1390844


21.10.2007

node130 (15:27)

  HARDWARE ERROR
  CPU 3: Machine Check Exception:                4 Bank 0: f606200000000833
  TSC 3abed3455dac8d ADDR 43d629e80
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1388131 and 1387620(?). Seems to be a CPU-internal data cache problem.


10.09.2007

node018 (09:11)

  HARDWARE ERROR
  CPU 1: Machine Check Exception:                4 Bank 4: f662a00290080813
  TSC 8aa5e210fb5138 ADDR 47de28a0
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1370463, 1371549, 1371903, and 1371713. Obviously a memory module is flawed.


08.07.2007

node010 (05:07)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                4 Bank 4: f6712002e3080813
  TSC 5093c6efa43d75 ADDR 2e2f98d40
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1299009, 1299410, 1300723, and 1300713


27.06.2007

node007 (22:46)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                4 Bank 4: f653a002d4080813
  TSC 5544b34464e89b ADDR 132fbed00
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1296236, 1296267, 1296401, and 1296399


25.04.2007

node063 (01:42)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                4 Bank 4: f60aa00205080813
  TSC 35efd62bc3569f ADDR 11285eb00
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1202243, 1202401, 1202132, and 1201456


23.04.2007

node127

  Hard disk failure / Update (25.04.07): replacement disk also fails; the source of the problem seems to be elsewhere  

Jobs affected: 1201513, 1201514, and 1201594


22.04.2007

node091

  Memory module out of order, node crashed

Job affected: 1198690


18.04.2007

node019 (Feb 25 03:01:19)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                7 Bank 4: f467200050080a13
  RIP 10:<ffffffff80107b0d> {default_idle+0x2d/0x60}
  TSC 31b98f8bc1358d ADDR 203d50b50
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Uncorrected machine check

Jobs affected: 1192928 and 1197365


25.02.2007

node010 (Feb 25 12:56:11)

  HARDWARE ERROR
  CPU 0: Machine Check Exception:                4 Bank 4: f6722001c4080813
  TSC 1256ecb928a4bf ADDR 19db6ed40
  This is not a software problem!
  Run through mcelog --ascii to decode and contact your hardware vendor
  Kernel panic - not syncing: Machine check

Jobs affected: 1167521, 1167145, 1167469, 1166717


08.02.2007

node007 (02:13)

 frozen

Jobs affected: 1148519, 1148523, 1148548, and 1148593


22.12.2006

node019 (00:02)

 frozen

Jobs affected: 1124891, 1125525, 1125415, and 1125399


08.11.2006

node064 (01:04)

 CPU 2: Machine Check Exception:                4 Bank 4: f623a001a6080813
 TSC 48a9e411b943c ADDR 15fb92600
 Kernel panic - not syncing: Machine check

Jobs affected: 1110845, 1110558, 1110809, and 1110560


25.10.2006

node064 (11:53)

 CPU 2: Machine Check Exception:                4 Bank 4: f659a001c5080813
 TSC 307950a9d0e40 ADDR 19773f970
 Kernel panic - not syncing: Machine check

Jobs affected: 1100289, 1100319, 1100320, and 1100453


16.10.2006

node063 (23:43)

 CPU 0: Machine Check Exception:                4 Bank 4: f631a00103080813
 TSC 7a441b0696bf0 ADDR 1032a5bc0
 Kernel panic - not syncing: Machine check

Jobs affected: 1094883, 1094886, 1094956, and 1094880


node038 (18:44)

 CPU 2: Machine Check Exception:                4 Bank 4: b60820022c080813
 TSC 3cb7c4f84909a ADDR 1a01923e0
 Kernel panic - not syncing: Machine check

Jobs affected: 1093446, 1094908, 1094946, and 1094843


10.10.2006

node010 (15:32)

 CPU 0: Machine Check Exception:                4 Bank 4: f602a0020d080813
 TSC cd0ac4ab4137e ADDR 1f09b5b00
 Kernel panic - not syncing: Machine check

Jobs affected: 1087036, 1086506, 1087038, and 1087037


21.09.2006

node063 (15:14)

 CPU 0: Machine Check Exception:                4 Bank 4: f634a001b8080813
 TSC 13117a06283a8 ADDR 5b29fb00
 Kernel panic - not syncing: Machine check

Jobs affected: 1073797, 1074089, 1074098, and 1074175


21.09.2006

node064 (10:39)

 CPU 2: Machine Check Exception:                4 Bank 4: f617a0020d080813
 TSC 1138760759f01 ADDR 1d263d340
 Kernel panic - not syncing: Machine check

Jobs affected: 1074045, 1074058, 1074087, and 1074097


20.09.2006

node019 (20:53)

 CPU 0: Machine Check Exception:                4 Bank 4: f64ca00289080813
 TSC b996c182c8ad ADDR 101ba9bc0
 Kernel panic - not syncing: Machine check

Jobs affected: 1073564 and 1073826


17.09.2006

node038 (20:13)

 CPU 2: Machine Check Exception:                4 Bank 4: f60820022c080813
 TSC 85fd743b80cbe1 ADDR 189791dc0
 Kernel panic - not syncing: Machine check

Jobs affected: 1065102 and 1073481


01.09.2006

node010 (00:32)

Spontaneous reboot

Jobs affected: crashme jobs solely


24.08.2006

node001 (09:03)

 CPU 0: Machine Check Exception:                4 Bank 4: f62ca0016c080813
 TSC 77366d7911cf36 ADDR 1e9bc2700
 Kernel panic - not syncing: Machine check

Jobs affected: 1002001, 1002071, 1002736, and 1003623


06.08.2006

node064 (02:58)

 CPU 2: Machine Check Exception:                4 Bank 4: b618200130080813
 TSC 2c38fbd2768d2 ADDR 1c4187f70
 Kernel panic - not syncing: Machine check

no user jobs affected


01.08.2006

node064 (11:35)

 CPU 2: Machine Check Exception:                4 Bank 4: b640200180080813
 TSC 2fa034a4a751a6 ADDR 1c4185f70
 Kernel panic - not syncing: Machine check

Jobs affected: 996581, 996606, 996932, and 996922


30.07.2006

Fileserver (12:00)

 Several kernel-level services (kernel, NFS server, ...) froze
 Several thousand (corrected) ECC errors logged

29.07.2006

node010 (08:35)

 CPU 0: Machine Check Exception:                4 Bank 4: f621200294080813
 TSC 63fd82dcabab ADDR 1fdff2c80
 Kernel panic - not syncing: Machine check

Jobs affected: 993048, 993051, 993054, and 993055


node033 (02:05)

froze, no error message at all

Jobs affected: 992993, 987539, 992677, and 987536


23.07.2006

node063 (07:30)

 CPU 2: Machine Check Exception:                7 Bank 4: f41da00012080a13
 RIP 10:<ffffffff8020d422> {copy_user_generic_c+0x8/0x26}
 TSC 5d74dbaa4c31e7   ADDR 11f13fbd8
 Kernel panic - not syncing: Uncorrected machine check 

Jobs affected: 982642, 982645, 982645, and 982492


19.07.2006

node007 (07:55)

 CPU 0: Machine Check Exception:                4 Bank 4: f621200203080813
 TSC 2ef0832e7c4c95 ADDR 10ab93e80
 Kernel panic - not syncing: Machine check

Jobs affected: 951979, 951990, 951980, and 951978


16.07.2006

node010 (11:40)

 CPU 0: Machine Check Exception:                7 Bank 4: f467200033080a13
 RIP 10:<ffffffff8010bb1a> {default_idle+0x3a/0x90}
 TSC 2d309c33893f24  ADDR 1af198fa0
 Kernel panic - not syncing: Uncorrected machine check 

Jobs affected: 948107 and 939152


15.07.2006

node030 (11:55)

Spontaneous reboot, reason unknown

Jobs affected: 9939215, 948090, 949417, and 949414


12.07.2006

node029 (19:45)

 CPU 0: Machine Check Exception:                4 Bank 4: b67e200204080813
 TSC 1531b02295e ADDR e9eda00
 Kernel panic - not syncing: Machine check

Jobs affected: 933745, 933746, 933631, and 933764


10.07.2006

node069 (16:55)

Crashed, cause to be investigated

Jobs affected: 933745, 933746, 933631, and 933764


node065 (11:50)

Spontaneous reboot after a couple of ECC errors

Jobs affected: 930651, 930162, 930085, and 930202


09.07.2006

node019 (14:45)

Spontaneous reboot

Jobs affected: "crashme" jobs only


07.07.2006

node019 (13:30)

Spontaneous reboot after uncorrected ECC error

 MCE 0
 CPU 0 4 northbridge TSC 1fd38de714ea88
 RIP 33:446940 ADDR 7ba0b140
   Northbridge Chipkill ECC error
   Chipkill ECC syndrome = d44c
        bit45 = uncorrected ecc error
        bit61 = error uncorrected
        bit62 = error overflow (multiple errors)
   bus error 'local node response, request didn't time out
       generic read mem transaction
       memory access, level generic'
 STATUS f4262000d4080a13 MCGSTATUS 7
 

Jobs affected: 886578, 858858, 897244, 897226


03.07.2006

Head Node (15:10)

 CPU 0: Machine Check Exception:                7 Bank 4: f411200040080a13
 RIP 10:<ffffffff8010bb1a> {default_idle+0x3a/0x90}
 TSC 1fbbd0799fe1a   ADDR 7a6a0000
 Kernel panic - not syncing: Uncorrected machine check 

30.06.2006

Head Node (06:25)

 CPU 0: Machine Check Exception:                4 Bank 4: f61c200179080813
 TSC 5566b29c92b9c ADDR 7a8a2000
 Kernel panic - not syncing: Machine check

24.06.2006

node042 (07:05)

 froze (mcelog indicates faulty memory)

Jobs affected: 792320, 800378, 800431, and 802236


04.06.2006

node012 (11:05)

 CPU 0: Machine Check Exception:                4 Bank 4: f623a0024f080813
 TSC e4651eedea4bc ADDR 1ee555000
 Kernel panic - not syncing: Machine check

Jobs affected: 710106, 706216, 706215, and 706237


15.05.2006

node064 (14:40)

 CPU 2: Machine Check Exception

Jobs affected: 610242, 623889, 623184, 623608


13.05.2006

node012 (21:05)

 CPU 0: Machine Check Exception

Jobs affected: 610009, 610165, 610166, 610167


10.05.2006

node052 (19:5)

 CPU 2: Machine Check Exception, Node doesn't boot

Jobs affected: 599675, 599266, 599351, 599479


25.04.2006

node042 (08:20)

 CPU 0: Machine Check Exception

Jobs affected:


21.04.2006

node019 (21:25)

 CPU 0: Machine Check Exception

Jobs affected:


19.04.2006

node042 (18:55)

 CPU 0: Machine Check Exception

Jobs affected: 506599, 4506600


18.04.2006

node042 (14:30)

 CPU 0: Machine Check Exception

Jobs affected: 434023, 434024


11.04.2006

  • CPUs exchanged on nodes 007, 010, 012, and 019
  • All memory modules exchanged on nodes 046 and 064

08.04.2006


node007 (12:57)

Spontaneous reboot

Jobs affected: crashme jobs solely


node010 (06:45)

 CPU 0: Machine Check Exception:                4 Bank 4: f60ca00206080813
 TSC 2fddc62050021 ADDR 1e8bf44c0
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


06.04.2006

Longest error message so far:

node007 (06:05)

 CPU 0: Machine Check Exception:                4 Bank 4: f6042001fe080813
 TSC fe0cb9b07691 ADDR 7ed65d0
 Kernel panic - not syncing: Machine check
  NMI Watchdog detected LOCKUP on CPU 0
 CPU 0
 Modules linked in: nfs lockd nfs_acl sunrpc af_packet dm_mod w83627hf eeprom lm85 hwmon_vid
 i2c_isa i2c_amd756 i2c_core genrtc
 Pid: 17369, comm: newcode25.x Tainted: G   M  2.6.15.3 #1
 RIP: 0010:[<ffffffff80118fe8>] <ffffffff80118fe8>{__smp_call_function+104}
 RSP: 0000:ffffffff80491b68  EFLAGS: 00000097
 RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000000000000004
 RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: ffffffff804d8830
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff801190c0
 R13: 0000000000000000 R14: 0000fe0cb9b069cc R15: ffffffff8039c8a5
 FS:  00002aaaab5634a0(0000) GS:ffffffff804d3800(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00002aaaab738000 CR3: 00000001f8625000 CR4: 00000000000006e0
 Process newcode25.x (pid: 17369, threadinfo ffff8101f099c000, task ffff8101f8081120)
 Stack: ffffffff801190c0 0000000000000000 0000000000000002 0000000000000000
        0000000000000096 0000000000000000 0000000000000000 0000000000000000
        ffffffff804070c0 ffffffff80119120
 Call Trace: <#MC> <ffffffff801190c0>{smp_really_stop_cpu+0} <ffffffff80119120>{smp_send_stop+64}
        <ffffffff8013445b>{panic+203} <ffffffff8010f5b6>{oops_begin+102}
        <ffffffff8011523f>{print_mce+159} <ffffffff80115316>{mce_panic+166}
        <ffffffff80115718>{do_machine_check+968} <ffffffff8010eeb3>{machine_check+127}
        <ffffffff8015d678>{bad_range+40}  <EOE> <ffffffff8015dd6c>{__rmqueue+188}
        <ffffffff8015de1b>{rmqueue_bulk+75} <ffffffff8015e36e>{buffered_rmqueue+94}
        <ffffffff8015e653>{get_page_from_freelist+163} <ffffffff8015e6e0>{__alloc_pages+80}
        <ffffffff8016cd77>{do_anonymous_page+71} <ffffffff8016d54e>{__handle_mm_fault+414}
        <ffffffff8011ee0c>{do_page_fault+540} <ffffffff8016effb>{vma_merge+331}
        <ffffffff8016f8c4>{do_mmap_pgoff+1636} <ffffffff803754ed>{schedule+365}
        <ffffffff8010e8a5>{error_exit+0}
 
 Code: f3 90 8b 44 24 10 39 d8 75 f6 85 ed 74 1a 8b 44 24 14 39 d8
 console shuts up ...
  <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
  NMI Watchdog detected LOCKUP on CPU 1
 CPU 1
 Modules linked in: nfs lockd nfs_acl sunrpc af_packet dm_mod w83627hf eeprom lm85 hwmon_vid
 i2c_isa i2c_amd756 i2c_core genrtc
 Pid: 17325, comm: newcode25.x Tainted: G   M  2.6.15.3 #1
 RIP: 0010:[<ffffffff803771fb>] <ffffffff803771fb>{.text.lock.spinlock+20}
 RSP: 0018:ffff8101f8a25d18  EFLAGS: 00000086
 RAX: ffff810000019680 RBX: ffff810007e73338 RCX: 0000000000000000
 RDX: ffff81020002e3d0 RSI: 000000000000001f RDI: ffff810000019680
 RBP: ffff81020002e3c0 R08: 0000000000000001 R09: 00000001f5ee5067
 R10: ffff8101fe587188 R11: ffff8101fe4cdda8 R12: ffff810000019600
 R13: ffff810207061a58 R14: ffff810000019600 R15: ffff8101f8a25de8
 FS:  00002aaaab5634a0(0000) GS:ffffffff804d3880(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00002aaaab73f000 CR3: 00000001f0804000 CR4: 00000000000006e0
 Process newcode25.x (pid: 17325, threadinfo ffff8101f8a24000, task ffff8101fe540a30)
 Stack: 0000000000000096 ffffffff8015d87a ffff81000001ac10 ffff810000019680
        00000000000280d2 ffff8101fe540a30 00000000fa357ad8 ffff81020002e3d0
        0000001f00019600 ffff810007e73338
 Call Trace:<ffffffff8015d87a>{free_pages_bulk+58} <ffffffff8015e2db>{free_hot_cold_page+235}
        <ffffffff8015ea40>{__pagevec_free+32} <ffffffff80165c92>{release_pages+322}
        <ffffffff80174c64>{free_pages_and_swap_cache+116} <ffffffff801701af>{unmap_region+287}
        <ffffffff80170538>{do_munmap+408} <ffffffff801705ad>{sys_munmap+77}
        <ffffffff8010db6a>{system_call+126}

 Code: f3 90 83 3f 00 7e f9 e9 92 fd ff ff f3 90 83 3f 00 7e f9 e9
 console shuts up ...

Jobs affected: crashme jobs solely

05.04.2006

Megware announces that it will replace the CPUs on the unstable nodes (probably on Tue, 11.04.2006)


04.04.2006

node007 (15:00)

 CPU 0: Machine Check Exception:                4 Bank 4: f646a0021e080813
 TSC 22fbfcb51e6d ADDR 1eededcc0
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


Cluster sysadmin sent a somewhat more emphatic email to Megware


node007 (05:35)

 CPU 0: Machine Check Exception:              4 Bank 4: f658a002ec080813
 TSC 83cbde955a64 ADDR 1f7eee840
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


03.04.2006

node046 (16:15)

 CPU 2: Machine Check Exception:                4 Bank 4: b604a00112080813
 TSC 1fe06d6bde5bf8 ADDR 15ccfc000
 Kernel panic - not syncing: Machine check

Jobs affected: 355638, 355676, 355677

This is probably an error caused by bad memory, as indicated by many mcelog entries of the type

 MCE 0
 CPU 2 4 northbridge TSC 1fdf53e3730d58
 ADDR 15ccfc4a0
   Northbridge Chipkill ECC error
   Chipkill ECC syndrome = be01
        bit32 = err cpu0
        bit46 = corrected ecc error
        bit62 = error overflow (multiple errors)
   bus error 'local node origin, request didn't time out
       generic read mem transaction
       memory access, level generic'
 STATUS d400c001be080813 MCGSTATUS 0

These errors had previously (27.03.06) been reported to Megware. No response yet.
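
The bit annotations in the mcelog output can be reproduced from the raw STATUS word. The following is a minimal sketch, not a replacement for mcelog. It assumes the usual x86 machine check status layout for the top flag bits, and for the ECC bits and the syndrome it assumes the layout that the two mcelog dumps in this log imply (bit 46 corrected ECC, bit 45 uncorrected ECC, syndrome high byte in bits 31:24, low byte in bits 54:47). The function name `decode_status` is hypothetical.

```python
def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_status(status):
    """Decode selected fields of a 64-bit bank 4 machine check STATUS word."""
    syndrome_high = (status >> 24) & 0xFF  # assumed: syndrome bits 15:8
    syndrome_low = (status >> 47) & 0xFF   # assumed: syndrome bits 7:0
    return {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),         # "error overflow (multiple errors)"
        "uncorrected": bit(status, 61),      # "error uncorrected"
        "enabled": bit(status, 60),
        "addr_valid": bit(status, 58),
        "corrected_ecc": bit(status, 46),    # "corrected ecc error"
        "uncorrected_ecc": bit(status, 45),  # "uncorrected ecc error"
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # STATUS values copied from the 03.04.2006 and 07.07.2006 mcelog dumps.
    for raw in (0xD400C001BE080813, 0xF4262000D4080A13):
        fields = decode_status(raw)
        print(f"{raw:016x}", fields, f"syndrome={fields['syndrome']:04x}")
```

For the two dumps in this log the sketch yields the syndromes be01 and d44c that mcelog prints, which is the only check the assumed syndrome layout has been given here.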


02.04.2006

node007 (19:10)

 CPU 0: Machine Check Exception:                4 Bank 4: f64da001e3080813
 TSC 637b2ce888c32 ADDR 1f4dbaab0
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


01.04.2006

node010 (08:35)

 CPU 0: Machine Check Exception:                4 Bank 4: f645a002d5080813
 TSC 9890070606e2 ADDR 1eedbd900
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


31.03.2006

10:05: hard disk in the head node replaced, RAID mirror is rebuilding


node010 (02:35)

 CPU 0: Machine Check Exception:                4 Bank 4: f60520024f080813
 TSC 23bb7236ab260 ADDR 1fdbe5c40
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


30.03.2006

FUJITSU drive reverted to "FAILED" again. Megware ships replacement disk.


29.03.2006

AMD is able to reproduce a system freeze in a test environment. That test system is equipped with the Opteron CPUs that were previously removed from our node011. Investigation continues ...

Cluster sysadmin graciously offers to support the investigation by agreeing to have the CPUs of all the flawed nodes exchanged.


28.03.2006

Megware suggested trying to rebuild the RAID mirror; successfully done [03/28/2006 (17:14:25)]:

       Adapter 1 Channel 1 Target 2:
       Physical Drive[FUJITSU MAT3073NC       0109] is Changed to ONLINE.

Hard disk error (Head node) [03/28/2006 (12:21:04)]:

       Adapter 1 Channel 1 Target 2:
       Physical Drive[FUJITSU MAT3073NC       0109] is Changed to FAILED.

Running in degraded mode (the disk is part of a RAID mirror)


27.03.2006

node010 (09:55)

 CPU 0: Machine Check Exception:                4 Bank 4: f67da0010e080813
 TSC 619e5a532a87c ADDR 1ecbc1e40
 Kernel panic - not syncing: Machine check

Jobs affected: crashme jobs solely


25.03.2006

node064 (21:45)

 CPU 2: Machine Check Exception:                4 Bank 4: f6502002f3080813
 TSC 1a8d1675e00788 ADDR 19fdbf2c0
 Kernel panic - not syncing: Machine check

Jobs affected: 239264, 241015, 241066, 243328
