Question: Parallelization of very short events, help!

Forum: Run Management
Date: 25 Jan, 2006
From: Ioannis Sechopoulos

Hello, I need to parallelize my simulation because, although each event takes only 0.1-0.15 ms, I need to run billions of events, which takes days. Once I get it working in parallel, I'll have access to a 64-node cluster, but meanwhile I am trying this out on a 2-processor computer under Linux. I got my parallel program to work, but it is actually slower than the serial version. To check this, I tried various things with the included N02 and ParN02 examples, and the same thing happens:

1.- If the events are relatively long, approx. 2 ms per event in N02, then in the parallel version, ParN02, each event takes approx. 1.2 ms (with aggregated-tasks=100), which makes sense on a 2-CPU computer.

2.- But if the events are very short, e.g. 0.02-0.04 ms per event in N02, each event takes 0.6 ms in ParN02, even with aggregated-tasks=100. I understand this is probably because the communication between master and slave still dominates over the simulation time of the 100 events. So I tried increasing aggregated-tasks to 1000, and even to 10,000 and higher. ParN02 is still slower than N02, reaching approx. 0.2 ms per event, but it seems to be slower for a different reason. When I use the trace=1 option, I see that when the master is trying to send a job to the slaves, a lot of:

master -> -1:

are produced before the job is actually accepted by one of the slaves. I believe this "rejection" (I am not sure if this -1 means the job was rejected) is what is slowing down the simulation. From what I saw in the TOP-C code, a -1 is returned if there are no slaves available.

So it seems that if the job sent to a slave is too small (e.g. aggregated-tasks=100 or so), the parallel version suffers from the communication overhead, and if the job includes many aggregated tasks, it keeps getting rejected for some reason.
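To put rough numbers on the overhead argument, here is a back-of-envelope sketch. It is only my own simplified model, not anything taken from the TOP-C or ParGeant4 code: I assume every task message costs a fixed overhead that is not overlapped with computation, that the work is split evenly over 2 workers, and that a short N02 event takes 0.03 ms (the midpoint of the 0.02-0.04 ms I measured). All function names below are made up for illustration.

```python
# Simplified cost model (an assumption, not how TOP-C actually schedules work):
#   wall time per event = (overhead + n_agg * t_event) / (n_agg * n_workers)

def parallel_ms_per_event(t_event_ms, overhead_ms, n_agg, n_workers):
    """Wall-clock ms per event predicted by the assumed model."""
    return (overhead_ms + n_agg * t_event_ms) / (n_agg * n_workers)

def implied_overhead_ms(t_event_ms, observed_ms_per_event, n_agg, n_workers):
    """Invert the model: fixed per-task overhead that would explain a measurement."""
    return n_agg * (n_workers * observed_ms_per_event - t_event_ms)

def break_even_n_agg(t_event_ms, overhead_ms, n_workers):
    """Aggregation above which the model beats the serial run (needs n_workers > 1)."""
    return overhead_ms / (t_event_ms * (n_workers - 1))

T_EVENT_MS = 0.03   # assumed midpoint of the measured 0.02-0.04 ms per event
N_WORKERS = 2       # assumption: both CPUs do simulation work

# Calibrate the overhead from the measured 0.6 ms per event at aggregated-tasks=100.
overhead = implied_overhead_ms(T_EVENT_MS, 0.6, 100, N_WORKERS)

# What the same overhead would predict at aggregated-tasks=10000.
predicted = parallel_ms_per_event(T_EVENT_MS, overhead, 10_000, N_WORKERS)

print(f"implied overhead per task: {overhead:.1f} ms")
print(f"break-even aggregated-tasks: {break_even_n_agg(T_EVENT_MS, overhead, N_WORKERS):.0f}")
print(f"predicted ms per event at 10000: {predicted:.5f}")
```

If these assumptions are roughly right, a fixed per-task overhead alone would predict about 0.02 ms per event at aggregated-tasks=10000, well below the approx. 0.2 ms I actually measure. That mismatch is why I suspect something other than plain communication overhead is responsible at large task sizes.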

Is there anybody with more experience in ParGeant4 who could tell me whether this is the cause of the problem and, more importantly, how to solve it? Will this problem go away if I use a 4-CPU workstation or a 64-node cluster?

Thank you very much!
