Message: CSV overwriting/interrupting lines in multithreaded operation
I have written a Geant4 program that uses CSV output (via Analysis). A colleague is running the program on a cluster and occasionally sees corrupted output. Errors like the one below occur roughly once per million events.
One problematic CSV line looked like: 1,47,alp118,14,e-,21.2086,1.16726,1.16726,msc,0.00253289,2.6746,-0.000918663,5.66533,e-,14,-918.663,406.852,0,2,Bi212[115.183],e-,RadioactiveDaughter,24.6569
I think the "correct" line would look like: 118,14,e-,21.2086,1.16726,1.16726,msc,0.00253289,2.6746,-0.000918663,5.66533,e-,14,-918.663,406.852,0,2,Bi212[115.183],e-,RadioactiveDaughter,24.6569 where it has interrupted or overwritten a line that began 1,47,alpha,...
When the problem occurs, the interruption appears at different positions in the line. My colleague has not reported seeing any output warnings or errors, and otherwise the program appears to be working fine.
My hypothesis is that the hard drive isn't keeping up with the output. I do not know the details of the cluster, but I think there are multiple machines and multiple cores. While each thread (typically 15) is writing to its own file, there is a shared filesystem and I do not know how many of the threads are working on the same physical drive.
I've exhausted my knowledge of parallel computing, both for diagnosing and for fixing this problem. The problem is complicated by my lack of access to my colleague's computing system and her lack of knowledge of Geant4. Has anyone seen a similar problem? I would appreciate any recommendations for debugging.
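As a first diagnostic step, the damaged lines could be located by counting fields: the intact example line above has 21 comma-separated fields, while the corrupted one has 23. The following is only a minimal sketch in Python, not part of the Geant4 program. The function name `find_suspect_lines` is hypothetical, and it assumes that header lines in the CSV file start with `#` and that no field contains an embedded comma. It treats the most common field count in a file as the expected one, so it will miss an interruption that happens to leave the field count unchanged.

```python
from collections import Counter


def find_suspect_lines(path, comment_prefix="#"):
    """Return (line_number, field_count, text) for rows whose field count
    differs from the most common field count in the file.

    Assumptions (not confirmed by the original post): header lines start
    with `comment_prefix`, and no field contains an embedded comma.
    """
    rows = []
    with open(path, newline="") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            # Skip blank lines and header/comment lines.
            if not text or text.startswith(comment_prefix):
                continue
            rows.append((line_number, text.count(",") + 1, text))

    if not rows:
        return []

    # Take the most frequent field count as the "correct" row width.
    expected = Counter(count for _, count, _ in rows).most_common(1)[0][0]
    return [row for row in rows if row[1] != expected]
```

Running this over each per-thread output file and printing the returned line numbers would show whether the corruption clusters in particular files or at particular times, which may help separate a filesystem problem from a problem in the program itself.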