sporadic crashing on Centos/6.4/64b

For developers writing C++, Fortran, Java, code who have questions or comments to make.


hjmangalam
Posts: 11
Joined: Wed Jun 01, 2011 3:28 pm
Location: UC Irvine


Post by hjmangalam »

I am a sysadmin, not an OpenSees user, trying to help a user discover why her OpenSees jobs are failing on our cluster.

She is submitting 105 OpenSees jobs (version 2.4.1_r5363, also 2.4.3_r5695) to our Grid Engine scheduler, which sends them to several 64-bit CentOS 6.4 hosts, each with 512 GB RAM.

Her qsub script is here: <http://pastie.org/9062325>


A few to several of the 105 jobs fail very early in the run.

Here is the start of the output from one failed job:
==========================================================
Start time is: Wed Apr 9 12:32:30 PDT 2014
hostname is: compute-7-9.local


OpenSees -- Open System For Earthquake Engineering Simulation
Pacific Earthquake Engineering Research Center -- 2.4.3 (rev 5634)

(c) Copyright 1999-2013 The Regents of the University of California
All Rights Reserved
(Copyright and Disclaimer @ http://www.berkeley.edu/OpenSees/copyright.html)


Five Longitudinal Stiffness from acute (max) corner to obtuse (min) corner:
1328.3368505926462
1328.3368505926462
1328.3368505926462
1328.3368505926462
1328.3368505926462
Finished creating gravity load superstructure...
CTestNormDispIncr::test() - iteration: 1 current Norm: 5.12118 (max: 1e-08, Norm deltaR: 44586.4)
CTestNormDispIncr::test() - iteration: 2 current Norm: 0.0981256 (max: 1e-08, Norm deltaR: 989.217)
CTestNormDispIncr::test() - iteration: 3 current Norm: 0.00235124 (max: 1e-08, Norm deltaR: 1.78051)
CTestNormDispIncr::test() - iteration: 4 current Norm: 6.60416e-06 (max: 1e-08, Norm deltaR: 5.25307e-05)
CTestNormDispIncr::test() - iteration: 5 current Norm: 1.03438e-09 (max: 1e-08, Norm deltaR: 4.44241e-05)

Ground Motion: dt= 0.005000, NumPts= 10064, TmaxAnalysis= 50.32
*** glibc detected *** OpenSees: free(): invalid next size (normal): 0x00000000025c6620 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x2b224f456166]
/lib64/libc.so.6(+0x78c93)[0x2b224f458c93]
OpenSees(sp_coletree+0x1a1)[0xf802c1]
==========================================================

(for the complete dump, see <http://pastie.org/9061984>)

All the failed jobs abort with the same kind of error:
*** glibc detected *** OpenSees: free(): invalid next size (normal): 0x00000000033d6000 ***

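which implies heap corruption rather than a bad pointer passed to free(): glibc found an invalid size field in the chunk after the one being freed, usually because an earlier write ran past the end of a heap buffer.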
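
To check that every failed job really dies with the same glibc message, the abort banner can be pulled out of each job log. This is only a rough sketch: the function name find_glibc_aborts is mine, and it assumes the banner always has the same shape as the lines quoted above.

```python
import re

# Matches glibc's heap-check abort banner, for example:
# *** glibc detected *** OpenSees: free(): invalid next size (normal): 0x... ***
GLIBC_ABORT = re.compile(
    r"\*\*\* glibc detected \*\*\* (?P<program>\S+): (?P<message>.+?): "
    r"(?P<address>0x[0-9a-fA-F]+) \*\*\*"
)


def find_glibc_aborts(log_text):
    """Return one dict (program, message, address) per glibc abort banner found."""
    return [match.groupdict() for match in GLIBC_ABORT.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "Ground Motion: dt= 0.005000, NumPts= 10064, TmaxAnalysis= 50.32\n"
        "*** glibc detected *** OpenSees: free(): invalid next size (normal): "
        "0x00000000025c6620 ***\n"
    )
    for abort in find_glibc_aborts(sample):
        print(abort["program"], "|", abort["message"], "|", abort["address"])
```

Grouping the jobs by the message field would show quickly whether any of them fail in a different way.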

The input file (B22_28_0_45.tcl) contains this:
===========================================
996 $ cat B22_28_0_45.tcl
wipe
set GMskew 0
set iM 28
set GMinter 0
set skew 45
source BF1U_Analyzer_22.tcl
============================================
and the file referenced above (BF1U_Analyzer_22.tcl) can be found here:
<http://pastie.org/9062108>


The runs do not fail on a single host - the failures are spread among the 3 hosts that the jobs are running on.
# fails: hosts
7 : compute-7-2.local
23 : compute-7-3.local
1 : compute-7-9.local

That is, 7 jobs failed on compute-7-2, and so on. The number of aborted runs also changes between submissions: on a second run, only 7 jobs aborted, and of those only 3 had also failed in the first run.


So the failure is not exactly reproducible (the same inputs do not always cause a crash), but it is reproducible across machines and across submissions (a few runs always fail).
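
The per-host tally and the overlap between two submissions can be computed from the job logs along these lines. It is a sketch under assumptions: each log is assumed to contain the "hostname is:" line and the glibc banner shown earlier, and the function names and the dict of job name to log text are made up for illustration.

```python
from collections import Counter


def failing_host(log_text):
    """Return the host name if the log records a glibc abort, else None."""
    if "*** glibc detected ***" not in log_text:
        return None
    for line in log_text.splitlines():
        if line.startswith("hostname is:"):
            return line.split(":", 1)[1].strip()
    return "unknown"  # aborted, but no hostname line was found


def summarize_run(logs):
    """logs maps job name -> log text; return (failures per host, failed job names)."""
    per_host = Counter()
    failed = set()
    for job, text in logs.items():
        host = failing_host(text)
        if host is not None:
            per_host[host] += 1
            failed.add(job)
    return per_host, failed


def repeat_failures(failed_first, failed_second):
    """Jobs that aborted in both submissions."""
    return failed_first & failed_second
```

If the set returned by repeat_failures stays small from one submission to the next, that points toward a timing- or layout-dependent memory error rather than a problem with particular input files.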

Before I try to debug further, would it be possible for the OpenSees devs to run the code through valgrind or another memory debugger to try to find this explosive free()?

I can tar the entire dir up if it would be helpful for you to see all examples of the failures and successes.

I'll try to catch a crash inside of valgrind as well.
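
For the valgrind attempt, I would wrap each job roughly as follows. This is a sketch, not what the qsub script currently does: it assumes OpenSees is on the PATH and takes the Tcl script as its only argument, and the helper name and log prefix are made up. The valgrind options themselves are standard memcheck options.

```python
import subprocess


def valgrind_command(tcl_script, log_prefix="valgrind"):
    """Build a memcheck command line for one OpenSees run (assumed invocation)."""
    return [
        "valgrind",
        "--tool=memcheck",
        "--track-origins=yes",            # report where uninitialised values came from
        "--num-callers=30",               # deeper stack traces in the report
        f"--log-file={log_prefix}.%p.log",  # valgrind expands %p to the process ID
        "OpenSees",
        tcl_script,
    ]


if __name__ == "__main__":
    # Runs one of the input files from this thread under memcheck.
    subprocess.run(valgrind_command("B22_28_0_45.tcl"), check=False)
```

Memcheck slows a run down considerably, so it probably makes sense to start with jobs that have already failed at least once. An invalid write reported well before the abort would be the thing to look for, since the free() that crashes is likely only where the damage is noticed.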

Thanks
Harry

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
fmk
Site Admin
Posts: 5884
Joined: Fri Jun 11, 2004 2:33 pm
Location: UC Berkeley

Re: sporadic crashing on Centos/6.4/64b

Post by fmk »

Harry,

I need the scripts.

fmckenna ATTTTT berkeley DOTTTTT edu