Message boards : Current tests : Strange WUs
Author | Message |
---|---|
[B^S] thierry@home Joined: 15 Feb 06 Posts: 20 Credit: 17,624 RAC: 0 |
I just had something strange (from my point of view) happen with the unit 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0.
System: Windows XP Pro SP2, P4 3.0 HT, 1 GB RAM.
Here is what I saw: it was crunching a Ralph WU and a Leiden WU. The Ralph WU had been running for +/- 16 minutes at 1.043%, then the percentage done jumped to 7.403%. At around 32 minutes of crunching, the BOINC manager froze: CPU activity 2-3%, RAM dropped from 1.22 GB to +/- 930 MB. This lasted +/- 2 minutes, with the following messages in the BOINC manager:
21/08/2006 21:31:01|ralph@home|Resuming task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 using rosetta_beta version 525
21/08/2006 21:32:36|Leiden Classical|Sending scheduler request to http://boinc.gorlaeus.net/Classical_cgi/cgi
21/08/2006 21:32:36|Leiden Classical|Reason: To fetch work
21/08/2006 21:32:36|Leiden Classical|Requesting 6403 seconds of new work, and reporting 1 completed tasks
21/08/2006 21:32:41|Leiden Classical|Scheduler request succeeded
21/08/2006 21:32:43|Leiden Classical|Started download of file cu111leps_trajtou.inp_80398718_1156170482_416
21/08/2006 21:32:44|Leiden Classical|Finished download of file cu111leps_trajtou.inp_80398718_1156170482_416
21/08/2006 21:32:44|Leiden Classical|Throughput 26983 bytes/sec
21/08/2006 21:32:45||Rescheduling CPU: files downloaded
21/08/2006 21:32:45|rosetta@home|Pausing task FRA_t368_CASPR_hom001_7_t368_7_dec32IGNORE_THE_REST_2_1179_223_0 (left in memory)
21/08/2006 21:32:45|Leiden Classical|Resuming task wu_649415041_1156170482_155_0 using trajtou-pd110paw version 533
21/08/2006 21:33:01|Leiden Classical|Sending scheduler request to http://boinc.gorlaeus.net/Classical_cgi/cgi
21/08/2006 21:33:01|Leiden Classical|Reason: To fetch work
21/08/2006 21:33:01|Leiden Classical|Requesting 1564 seconds of new work
21/08/2006 21:33:06|Leiden Classical|Scheduler request succeeded
21/08/2006 21:33:06|Leiden Classical|Message from server: No work sent
21/08/2006 21:33:06|Leiden Classical|No work from project
21/08/2006 21:47:15|boincsimap|Task 60801003.045372_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|boincsimap|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|rosetta@home|Task FRA_t368_CASPR_hom001_7_t368_7_dec32IGNORE_THE_REST_2_1179_223_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|rosetta@home|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|rosetta@home|Task FRA_t368_CASPR_hom001_7_t368_7_dec13IGNORE_THE_REST_4_1179_268_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|rosetta@home|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|Leiden Classical|Task wu_649415041_1156170482_155_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|Leiden Classical|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|ralph@home|Task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 exited with zero status but no 'finished' file
21/08/2006 21:47:15|ralph@home|If this happens repeatedly you may need to reset the project.
21/08/2006 21:47:15||Rescheduling CPU: application exited
21/08/2006 21:47:15|Leiden Classical|Restarting task wu_649415041_1156170482_155_0 using trajtou-pd110paw version 533
21/08/2006 21:47:15|ralph@home|Restarting task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 using rosetta_beta version 525
Then the WU restarted, but with the calculation time back at 18 minutes (and still at 7.043%). Now the WU is again at around 32 minutes with 11.52% completed, CPU at 100% and RAM at 1.03 GB. It seems to be running normally now. Hope this helps. |
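For anyone who wants to pull the affected tasks out of a message log like the one pasted above, here is a minimal Python sketch. It assumes only the layout visible in that paste (`DD/MM/YYYY HH:MM:SS|project|message`, one entry per line, with an empty project field for client-wide messages); the names `LogEntry`, `parse_line` and `tasks_without_finished_file` are made up for this illustration and are not part of BOINC.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    when: datetime
    project: str  # empty string for client-wide messages
    text: str


def parse_line(line: str) -> LogEntry:
    # Assumed layout, taken from the pasted log: "date time|project|message".
    stamp, project, text = line.split("|", 2)
    return LogEntry(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, text)


def tasks_without_finished_file(lines):
    """Return (project, task name) pairs for the "no 'finished' file" messages."""
    marker = " exited with zero status but no 'finished' file"
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry.text.startswith("Task ") and entry.text.endswith(marker):
            task = entry.text[len("Task "):-len(marker)]
            hits.append((entry.project, task))
    return hits


sample = [
    "21/08/2006 21:47:15|ralph@home|Task 1hz6A_CONTROL_ABRELAX_SAVE_ALL_OUT__1256_1_0 "
    "exited with zero status but no 'finished' file",
    "21/08/2006 21:47:15||Rescheduling CPU: application exited",
]
print(tasks_without_finished_file(sample))
```

Run against the full log, this would list every project whose task exited at 21:47:15, which makes it easy to see that all running tasks were hit at the same moment rather than only the Ralph WU.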
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Thierry, the "no finished file" message means the manager temporarily lost track of the daemon. The daemon (which actually controls the work) should have kept running. It's just reporting that the manager (the part you see) lost contact. Tony |
[B^S] thierry@home Joined: 15 Feb 06 Posts: 20 Credit: 17,624 RAC: 0 |
Thierry, the "no finished file" message means the manager temporarily lost track of the daemon. The daemon (which actually controls the work) should have kept running. It's just reporting that the manager (the part you see) lost contact. OK... and is that why the calculation time dropped from 32 minutes to 18 minutes? |
tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Thierry, the "no finished file" message means the manager temporarily lost track of the daemon. The daemon (which actually controls the work) should have kept running. It's just reporting that the manager (the part you see) lost contact. No, I suppose that's because there was no checkpoint between 18 and 32 minutes of runtime. Checkpoints occur every 5-30 minutes depending on the WU and the speed of your computer (on slow computers it might even take longer in rare cases). |
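To make the arithmetic explicit, here is a small Python sketch of what a restart from the last checkpoint means for the displayed runtime. It assumes, as suggested above, that the only checkpoint was written at about 18 minutes and that the task stopped at about 32 minutes; `resume_point` is a hypothetical helper written for this example, not a BOINC function.

```python
def resume_point(checkpoints_min, stopped_at_min):
    """Return (resume_time, lost_time) in minutes for a task that restarts
    from its most recent checkpoint. With no usable checkpoint the task
    starts over from zero."""
    usable = [c for c in checkpoints_min if c <= stopped_at_min]
    resume = max(usable, default=0.0)
    return resume, stopped_at_min - resume


# Assumed numbers from this thread: checkpoint at ~18 min, exit at ~32 min.
resume, lost = resume_point([18], 32)
print(f"resumes at {resume} min, {lost} min of work repeated")
```

With those numbers the task resumes at 18 minutes and repeats about 14 minutes of work, which matches the drop in calculation time reported above.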
Rayburner Joined: 17 Feb 06 Posts: 1 Credit: 1,005,965 RAC: 0 |
Thierry, the "no finished file" message means the manager temporarily lost track of the daemon. The daemon (which actually controls the work) should have kept running. It's just reporting that the manager (the part you see) lost contact. I think a checkpoint is created after a model or decoy is finished. According to my observation, that is every time your progress bar jumps to a new value. Regards, Rayburner |
tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
I think a checkpoint is created after a model or decoy is finished. According to my observation, that is every time your progress bar jumps to a new value. Actually, they put in checkpoints within a model run. They checkpoint after the ab initio search is done and do some checkpoints in the final relax stage. It differs from WU to WU, though. |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Checkpoints are always done at the end of a model. The end of a model is also when the % complete is recalculated, relative to your WU runtime preference. There may also be checkpoints within a model, and when those occur you will see a fractional change in the % complete. The large jumps in % complete are normal, as described in this QA. |
©2024 University of Washington
http://www.bakerlab.org