Strange WUs

Message boards : Current tests : Strange WUs

[B^S] thierry@home
Joined: 15 Feb 06
Posts: 20
Credit: 17,624
RAC: 0
Message 2209 - Posted: 21 Aug 2006, 20:02:03 UTC
Last modified: 21 Aug 2006, 19:50:46 UTC

I just saw something strange (from my point of view) with the work unit 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0

Windows XP Pro SP2
P4 3.0 GHz HT
RAM 1 GB

Here is what I saw:
The machine was crunching a Ralph WU and a Leiden WU.

The Ralph WU had been running for about 16 minutes at 1.043%. Then the percentage done jumped to 7.403%.
At around 32 minutes of crunching, the BOINC manager froze:
CPU activity: 2-3%
RAM: dropped from 1.22 GB to about 930 MB

This lasted about 2 minutes, with the following messages in the BOINC manager:

21/08/2006 21:31:01|ralph@home|Resuming task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 using rosetta_beta version 525
21/08/2006 21:32:36|Leiden Classical|Sending scheduler request to http://boinc.gorlaeus.net/Classical_cgi/cgi
21/08/2006 21:32:36|Leiden Classical|Reason: To fetch work
21/08/2006 21:32:36|Leiden Classical|Requesting 6403 seconds of new work, and reporting 1 completed tasks
21/08/2006 21:32:41|Leiden Classical|Scheduler request succeeded
21/08/2006 21:32:43|Leiden Classical|Started download of file cu111leps_trajtou.inp_80398718_1156170482_416
21/08/2006 21:32:44|Leiden Classical|Finished download of file cu111leps_trajtou.inp_80398718_1156170482_416
21/08/2006 21:32:44|Leiden Classical|Throughput 26983 bytes/sec
21/08/2006 21:32:45||Rescheduling CPU: files downloaded
21/08/2006 21:32:45|rosetta@home|Pausing task FRA_t368_CASPR_hom001_7_t368_7_dec32IGNORE_THE_REST_2_1179_223_0 (left in memory)
21/08/2006 21:32:45|Leiden Classical|Resuming task wu_649415041_1156170482_155_0 using trajtou-pd110paw version 533
21/08/2006 21:33:01|Leiden Classical|Sending scheduler request to http://boinc.gorlaeus.net/Classical_cgi/cgi
21/08/2006 21:33:01|Leiden Classical|Reason: To fetch work
21/08/2006 21:33:01|Leiden Classical|Requesting 1564 seconds of new work
21/08/2006 21:33:06|Leiden Classical|Scheduler request succeeded
21/08/2006 21:33:06|Leiden Classical|Message from server: No work sent
21/08/2006 21:33:06|Leiden Classical|No work from project
21/08/2006 21:47:15|boincsimap|Task 60801003.045372_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|boincsimap|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|rosetta@home|Task FRA_t368_CASPR_hom001_7_t368_7_dec32IGNORE_THE_REST_2_1179_223_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|rosetta@home|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|rosetta@home|Task FRA_t368_CASPR_hom001_7_t368_7_dec13IGNORE_THE_REST_4_1179_268_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|rosetta@home|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|Leiden Classical|Task wu_649415041_1156170482_155_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|Leiden Classical|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|ralph@home|Task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|ralph@home|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|Leiden Classical|Restarting task wu_649415041_1156170482_155_0 using trajtou-pd110paw version 533
21/08/2006 21:47:15|ralph@home|Restarting task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 using rosetta_beta version 525
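
For anyone who wants to pull the affected tasks out of a message log like the one above, here is a minimal sketch. It assumes the pipe-delimited layout seen in these messages (timestamp|project|message); the function name `tasks_without_finish_file` is made up for illustration and is not part of BOINC.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the messages above: timestamp|project|message
EXIT_RE = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")


def tasks_without_finish_file(log_text):
    """Return {project: [task names]} for tasks that exited without a 'finished' file."""
    hits = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        _timestamp, project, message = parts
        match = EXIT_RE.search(message)
        if match:
            hits[project].append(match.group(1))
    return dict(hits)


# Three lines copied from the log above
sample = """21/08/2006 21:47:15|boincsimap|Task 60801003.045372_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|ralph@home|Task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 exited with zero status but no 'finished' file"""

print(tasks_without_finish_file(sample))
```

Grouping by project makes it easy to see that every running project was hit at the same second, which points at the client or the machine rather than at one particular work unit.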


Then the WU restarted, but with the calculation time back at 18 minutes (and still at 7.043%).

Now the WU is again at around 32 minutes with 11.52% completed, CPU at 100% and RAM at 1.03 GB. It seems to be running normally now.

Hope this helps.

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 2221 - Posted: 22 Aug 2006, 22:47:04 UTC

Thierry, the "no finished file" message means the manager temporarily lost track of the daemon. The daemon (which actually controls the work) should have kept running. The message just reports that the manager (the part you see) lost contact.

tony

[B^S] thierry@home
Joined: 15 Feb 06
Posts: 20
Credit: 17,624
RAC: 0
Message 2224 - Posted: 23 Aug 2006, 6:03:44 UTC - in response to Message 2221.  

OK... and is that why the calculation time dropped from 32 minutes to 18 minutes?

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 2225 - Posted: 23 Aug 2006, 9:04:12 UTC - in response to Message 2224.  

No, I suppose that's because no checkpoint was written between 18 and 32 minutes of runtime. Checkpoints occur every 5 to 30 minutes, depending on the WU and the speed of your computer (on slow computers the interval may, in rare cases, be even longer).
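
To make the rollback concrete, here is a minimal sketch of the idea. It is only an illustration, not BOINC or Rosetta code: `resume_point` is a made-up name, and the single checkpoint at 18 minutes is assumed from the numbers reported above.

```python
def resume_point(checkpoint_times, interrupted_at):
    """Return the run time (in minutes) a task resumes from after an interruption.

    checkpoint_times: run times (in minutes) at which checkpoints were written.
    Work done after the last checkpoint is lost; with no checkpoint at all,
    the task starts over from 0.
    """
    earlier = [t for t in checkpoint_times if t <= interrupted_at]
    return max(earlier, default=0)


# Assumed numbers matching the report: one checkpoint at 18 minutes,
# task restarted at 32 minutes of run time.
print(resume_point([18], 32))  # 18
```

So the 14 minutes between the last checkpoint and the restart are simply recomputed.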

Rayburner
Joined: 17 Feb 06
Posts: 1
Credit: 1,005,965
RAC: 0
Message 2227 - Posted: 23 Aug 2006, 12:08:12 UTC - in response to Message 2225.  

I think a checkpoint is created after a model or decoy is finished. According to my observations, that is every time your progress bar jumps to a new value.



Regards

Rayburner

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 2228 - Posted: 23 Aug 2006, 12:35:07 UTC - in response to Message 2227.  

Actually they put in checkpoints within a model run. They checkpoint after the ab initio search is done and do some checkpoints in the final relax stage. It differs from WU to WU though.


feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2230 - Posted: 23 Aug 2006, 15:53:15 UTC

Checkpoints are always done at the end of a model. The end of a model is also when the % completed is recalculated against your WU runtime preference. There may also be checkpoints within a model, and when those occur you will see a fractional change in the % complete. The large jumps in % complete are normal, as described in this QA.
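
As a rough illustration of "compared to your WU runtime preference", the sketch below assumes the simplest possible rule: CPU time used so far divided by the preferred run time. That rule is an assumption made for illustration only; the actual calculation in the Rosetta application may differ, and `percent_complete` is a made-up name.

```python
def percent_complete(cpu_minutes_used, target_runtime_minutes):
    """Assumed estimate: share of the preferred run time already used, capped at 100%."""
    if target_runtime_minutes <= 0:
        raise ValueError("target run time must be positive")
    return min(100.0, 100.0 * cpu_minutes_used / target_runtime_minutes)


# Hypothetical example: 30 minutes used against a 4-hour preference
print(percent_complete(30, 240))  # 12.5
```

Under a rule like this, the figure only moves in steps when it is recomputed at the end of a model, which would match the large jumps described above.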



©2024 University of Washington
http://www.bakerlab.org