| Author | Message |
|
|
|
Really, we think this is the last one before Rosetta@home is updated! This is mainly to fix a silly, small bug that got introduced with the latest checkpointing.
For those interested: the watchdog is still not exiting gracefully every time. If no data was created, there's still a file transfer error. We're trying to figure out why, but will likely need help from the BOINC team to fix it. Fortunately, the jobs that give these errors are rare, and they produce no data anyway. Of course, we will continue to grant credit every week for errored jobs when the app gets updated on Rosetta@home.
____________
|
|
|
|
|
|
If I try to fetch work I get a "Project is down" message.
____________
|
|
|
|
|
|
Now it's:
26/04/2006 10:48:22|ralph@home|Message from server: Server has software problem
26/04/2006 10:48:22|ralph@home|Project is down
____________
|
|
|
|
|
|
Alpha testers: abort any 5.04 WUs that may be sitting in your cache/queue,
so you can start testing 5.05 ASAP :-)
|
|
|
|
Really, we think this is the last one before Rosetta@home is updated!
Hey, take as much time as you need! You are so close now, might as well wrap it up in style with a bulletproof application! :)
____________
|
|
|
|
|
|
Both WUs I tried finished valid, but both results show a warning:
WARNING! error deleting file .\aah002.out
However, no such file is present on my computer any longer.
http://ralph.bakerlab.org/results.php?userid=1266
____________
|
|
|
|
|
|
WUs failed in under 1 minute... and I'll tell you why...
I was playing around: suspending R@H WUs, trying to prevent downloads of more, getting Ralph to fetch some new WUs, killing those of the previous version, and so on. Suffice it to say I had about 8 WUs suspended and left in memory. This caused Windows to extend its paging file, and two other WUs failed immediately. The failures attempted to bring in debug code, which further increased the memory requirements.
97682
97672
97670
Here's the msg you see in Windows:

____________
|
|
|
|
|
|
The watchdog aborted this WU although the overall runtime was only a couple of minutes. After a few minutes of runtime it was preempted for a few hours by another WU, and after resuming, the watchdog probably assumed it had run for over an hour with no progress. It seems the watchdog is only comparing two points in time without checking what happened in between.
04/26/06 18:59:19||Rescheduling CPU: application exited
04/26/06 18:59:19|ralph@home|Computation for task AB_CASP6_u272__444_4_0 finished
335.453125
stderr out
<core_client_version>5.4.6</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3882530
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 33.7964 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .\xxu272.out
WARNING! error deleting file .\xxu272.out
</stderr_txt>
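The failure mode described in this post can be sketched as follows. This is a minimal illustration, not the Rosetta watchdog: the function and class names are hypothetical, and it assumes the application can read its own accumulated CPU time. A check that compares wall-clock timestamps counts the hours spent preempted as "stuck" time, while one that accumulates CPU time does not.

```python
from typing import Optional

STUCK_LIMIT = 3600.0  # seconds; matches the "Stuck at score ... for 3600 seconds" message


def naive_is_stuck(last_change_wall: float, now_wall: float,
                   limit: float = STUCK_LIMIT) -> bool:
    """Wall-clock comparison: time spent preempted counts as 'stuck'."""
    return now_wall - last_change_wall >= limit


class CpuTimeWatchdog:
    """Hypothetical watchdog that only counts CPU time actually consumed."""

    def __init__(self, limit: float = STUCK_LIMIT) -> None:
        self.limit = limit
        self.last_score: Optional[float] = None
        self.last_cpu: Optional[float] = None
        self.stuck_cpu = 0.0

    def check(self, score: float, cpu_time: float) -> bool:
        """Record a sample; return True if the run should be killed."""
        if self.last_cpu is None or score != self.last_score:
            # First sample, or the score moved: restart the stuck counter.
            self.last_score = score
            self.stuck_cpu = 0.0
        else:
            # Same score: add only the CPU seconds used since the last sample.
            self.stuck_cpu += max(0.0, cpu_time - self.last_cpu)
        self.last_cpu = cpu_time
        return self.stuck_cpu >= self.limit
```

With numbers like the ones in this report (a few hundred seconds of CPU time, several hours preempted), the wall-clock check fires and the CPU-time check does not.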
____________
|
|
|
|
|
|
Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done !
Wed Apr 26 16:41:36 BRT 2006
crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6
2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
Wed Apr 26 16:44:41 BRT 2006
CPU usage 0.0000%
What should I do next ?
Perhaps kill app with some special signal to force a core dump
and then e-mail that core dump to ???
Thanks
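One way to see how long the task was actually scheduled is to add up the Starting/Resuming to Pausing intervals in the client log. The sketch below assumes the line format shown in this post; the function name is hypothetical, and it measures scheduled wall-clock time, not CPU time.

```python
import re
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Matches lines such as:
# 2006-04-26 12:31:54 [ralph@home] Starting result <name> using rosetta_beta version 505
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]+\] "
    r"(Starting|Resuming|Pausing) result (\S+)"
)


def scheduled_seconds(lines: Iterable[str],
                      task: str) -> Tuple[float, Optional[datetime]]:
    """Return (seconds in closed run intervals, start of a still-open interval)."""
    total = 0.0
    started: Optional[datetime] = None
    for line in lines:
        match = LINE.match(line)
        if not match or match.group(3) != task:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if match.group(2) == "Pausing":
            if started is not None:
                total += (stamp - started).total_seconds()
                started = None
        else:
            # "Starting" or "Resuming" opens a new run interval.
            started = stamp
    return total, started
```

For the five log lines in this post, the closed intervals add up to 4171 seconds, and the last interval has been open since 15:46:56, which is the period in which the task shows no CPU usage.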
____________
  |
|
|
|
|
|
This unit was aborted after less than one hour of running (my time preference is 2 hours).
http://ralph.bakerlab.org/result.php?resultid=97305
AB_CASP6_t216__438_3_0
Workunit 86138
CPU time 3180.21875
stderr out <core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3882811
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 71.0875 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .\xxt216.out
WARNING! attempt to gzip file .\xxt216.out failed: file does not exist.
</stderr_txt>
<message><file_xfer_error>
<file_name>AB_CASP6_t216__438_3_0_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
</message>
Validate state Invalid
Claimed credit 11.1502124855248
Granted credit 0
application version 5.05
____________
|
|
|
|
|
|
Just got home to see how my PC did, and I think the watchdog ralphed all over my work. This is the same host I posted about earlier, where I was short on VM swap space and lost 3 WUs. Problems all over: the watchdog kicking in (I've never had any hung WUs before, so it seems unlikely it was required), and failures after <1 min. I did not abort any v5.05 WUs. Very few successes.
Now my Ralph WUs are completed, and I'm crunching R@H again. Got a WU with FAST in the name; the thing has ripped through 607 models in under 14hrs (I have a 24hr preference, and the output file is 4.6MB now... it might come close to 10 once we're done!). I haven't rebooted or restarted BOINC since the memory issues about 13hrs ago.
____________
|
|
|
|
|
|
This almost looks like a BOINC manager problem. Go ahead and abort it; then can you post a link to the result here? Thanks. No need to send us the core dump.
____________
|
|
|
|
|
|
Feet1st, whoa, you've got a really fast client...
As for ralph, I share your concerns. I'm trying another fix in 5.06; can you attach to ralph now? If the watchdog is still too aggressive, we'll have to keep it off for rosetta@home until we avoid this error. Two questions for you:
How often do you switch between apps (if at all)?
When ralph is pre-empted, do you "Keep in Memory"?
My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.
____________
|
|
|
|
|
|
5.05 works fine, but... In every result I've checked there is "WARNING! error deleting file .\xxv272.out". The name of the file varies, but they are always .out files.
____________
|
|
|
|
|
|
WU 79004 was killed by the watchdog just after 2 hrs' runtime (Stuck at score -115.914 for 3600 seconds)
(Sorry, this was 5.05; I can't get any 5.06 WUs.)
|
|
|
|
|
I just finished one WU with several deliberate switches in between (long and short), and it seems 5.06 has solved the issue. The warning message about the file deletion error is also gone. :-)
http://ralph.bakerlab.org/result.php?resultid=98261
Now Rhiju please have a look at this and this.
____________
|
|
|
|
|
|
OK, I have been running the following RALPH work unit:
ID 6204
Name AB_CASP6_t198__438_5_0
It ran for around 57 minutes and was then preempted (the CPU time was kept in my work record), and the corresponding Rosetta work unit restarted. Once the Rosetta work unit stopped, the client switched back to RALPH and work unit 6204 restarted de novo, that is, from a CPU time of 0 and not from the 57+ minutes it had accumulated before the switch. My preferences are set so that work is kept in memory, but that did not happen in this case.
So, to make the story short: the 57+ minutes of CPU time for the work unit that should have been kept in memory disappeared into the big void in the sky. :)
____________
|
|
|
|
|
How often do you switch between apps (if at all)?
When ralph is pre-empted, do you "Keep in Memory"?
My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.
Sorry, I'm not positive. I changed my settings to try to stress Ralph a little bit, but I am not certain whether they had taken effect on THAT PC or not; it depends on when it updated. I believe it had a 360 min (6hrs) switch between projects, with leave in memory, at the time of the failures. My other Ralph host was updated to a 20min switch time with remove from memory, and it seems to be going well, but it hasn't had other projects interrupting it either.
____________
|
|
|
|
|
|
Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.
A few quick replies:
I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.
I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different CPUs as possible. I can make a note of it on the news page next time we release.
____________
|
|
|
|
|
|
A quick reply to Carlos... it seems like all your ralph jobs have been erroring out. The error message we're seeing is something about a lost heartbeat from the core client. That doesn't sound good. Have you had this issue with any workunits from rosetta@home?
Also, do you have a new version of the BOINC app? Thanks.
____________
|
|
|
|
|
Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.
So you are going to release it on Rosetta today? Good luck! ;-)
A few quick replies:
I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.
I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different CPUs as possible. I can make a note of it on the news page next time we release.
Asking is one thing; making sure jobs are distributed in the most useful manner is another. I really don't think one needs to rely on aware testers for that. Just lower the quota and shorten the deadlines and you get what you want. A one-week deadline and a quota of 10 WUs is probably a good first step and a compromise.
You can even make the WU/day quota editable by the participants. At least I saw it editable in one project; I'm not sure if this is still possible with the latest BOINC version. If you can, I'd recommend setting the quota to 3/day and making it editable for those who want to test more than three WUs per day. That will prevent ignorant users who just join the project with their usual 3-day cache from hijacking the WUs by loading 20 at once (and returning them after 10 days or so).
____________
|
|
|
|
|
|
Tralala, nice advice. I'm lowering the RALPH deadline from 14 days to 4 days. We still value results that come back after two or three days, but you're right that it's ridiculous to get back a job as ancient as two weeks old.
At least with the current BOINC system, I can't seem to set the max WUs sent to a client per day. Can you post here which project allowed you to set that as a preference?
____________
|
|
|
|
|
|
I don't think 5.06 is good for Linux; for Windows, however, 5.06 is OK.
Maybe you can trap that signal 11 to make it exit with 0 but with no finished file?
Then BOINC will restart those WUs again... and possibly finish OK.
These signal 11s are caused by a timing problem, not by an unallocated array.
No heartbeat from core client for 31 sec - exiting
*too much network traffic ! 127.0.0.1 unserviced!
2006-04-27 11:58:14 [ralph@home] Finished download of 1tul__alltopologycodes.bar
2006-04-27 11:58:14 [ralph@home] Throughput 21465 bytes/sec
2006-04-27 11:58:15 [ralph@home] Starting result FACONTACTS_NOFILTERS_1tul__381_3_1 using rosetta_beta version 506
2006-04-27 12:02:48 [ralph@home] Pausing result FACONTACTS_NOFILTERS_1tul__381_3_1 (left in memory)
2006-04-27 12:02:49 [ralph@home] Unrecoverable error for result FACONTACTS_NOFILTERS_1tul__381_3_1 (process exited with code 131 (0x83))
2006-04-27 12:02:49 [ralph@home] Computation for result FACONTACTS_NOFILTERS_1tul__381_3_1 finished
2006-04-27 12:03:50 [ralph@home] Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
2006-04-27 12:03:50 [ralph@home] Reason: To report results
2006-04-27 12:03:50 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
2006-04-27 12:04:00 [ralph@home] Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded
2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_03_05.200_v1_3.gz
2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_09_05.200_v1_3.gz
2006-04-27 12:07:47 [ralph@home] Finished download of casp6_aat216_03_05.200_v1_3.gz
2006-04-27 12:07:47 [ralph@home] Throughput 17163 bytes/sec
2006-04-27 12:07:47 [ralph@home] Started download of casp6_t216_.fasta.gz
2006-04-27 12:07:48 [ralph@home] Finished download of casp6_t216_.fasta.gz
2006-04-27 12:07:48 [ralph@home] Throughput 548 bytes/sec
2006-04-27 12:07:48 [ralph@home] Started download of casp6_t216.pdb.gz
2006-04-27 12:07:51 [ralph@home] Finished download of casp6_t216.pdb.gz
2006-04-27 12:07:51 [ralph@home] Throughput 21820 bytes/sec
2006-04-27 12:07:51 [ralph@home] Started download of casp6_t216_.psipred_ss2.gz
2006-04-27 12:07:52 [ralph@home] Finished download of casp6_t216_.psipred_ss2.gz
2006-04-27 12:07:52 [ralph@home] Throughput 8188 bytes/sec
2006-04-27 12:11:04 [ralph@home] Finished download of casp6_aat216_09_05.200_v1_3.gz
2006-04-27 12:11:04 [ralph@home] Throughput 26233 bytes/sec
2006-04-27 12:11:06 [ralph@home] Starting result FA_CASP6_t216__451_30_0 using rosetta_beta version 506
2006-04-27 12:12:41 [ralph@home] Pausing result FA_CASP6_t216__451_30_0 (left in memory)
2006-04-27 12:12:42 [ralph@home] Unrecoverable error for result FA_CASP6_t216__451_30_0 (process exited with code 131 (0x83))
2006-04-27 12:12:42 [ralph@home] Computation for result FA_CASP6_t216__451_30_0 finished
2006-04-27 12:13:42 [ralph@home] Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
2006-04-27 12:13:42 [ralph@home] Reason: To report results
2006-04-27 12:13:42 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
2006-04-27 12:13:48 [ralph@home] Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded
2006-04-27 12:13:48 [ralph@home] Message from server: No work sent
2006-04-27 12:13:48 [ralph@home] Message from server: (reached daily quota of 6 results)
http://ralph.bakerlab.org/result.php?resultid=98808
http://ralph.bakerlab.org/result.php?resultid=98790
http://ralph.bakerlab.org/result.php?resultid=98787
http://ralph.bakerlab.org/result.php?resultid=98747
http://ralph.bakerlab.org/result.php?resultid=98658
http://ralph.bakerlab.org/result.php?resultid=98658
http://ralph.bakerlab.org/result.php?resultid=98613
And there is still the problem of WUs freezing at 100% done
(and at other percentages too) without using any CPU. I asked here what to do to help fix the problem
but got no answer, so I aborted these WUs.
____________
  |
|
|
|
|
At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference?
I think I saw that over at CPDN, but it's no longer settable there either. Perhaps I remembered it wrong; perhaps it has been disabled in more recent BOINC releases.
I'd still think about 10 WU/day is sufficient, and this will further prevent people from building up big caches.
____________
|
|
|
|
|
I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches.
I usually abort all WUs of the previous version when I notice a new version.
I know not everyone does this...
However, what is wrong is the BOINC concept of limiting WUs per day.
What should be limited is the "cache" of unreturned WUs... maybe to 2.
Once a client reaches the quota of 2 it does not get more WUs;
however, if it returns 1 it can download 1 more, even after the quota was exceeded.
*Oops, I forgot that a project reset does not return any WUs...
So whoever does a project reset without aborting WUs first will have to wait until the next day.
____________
  |
|
|
|
|
|
4 days? Ouch, I'd hoped for 6 or 7, and definitely with smaller quotas.
SETI beta has a painfully small return rate due to huge quotas. Shorter deadlines aren't as direct a fix. Well, I guess they are here, but if you're running a quorum of more than 1, short deadlines drag things out because work has to be resent after earlier results time out...
Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects. With my current mix I can probably just make 4 days. Of course, when you want really fast returns, those hit-and-quit WUs you've been sending do the job.
So I'm wondering: is there little value in testing longer time settings here? It's easy enough to drop back to 2 or 4 hour runtimes.
p.s.
If this discussion continues, maybe it's better moved out of the bug-report thread?
|
|
|
|
Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects....
So I\'m wondering is there little value for testing longer time settings here? Easy enough to drop back to 2 or 4 hour runtimes.
I wonder this, too. Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc.
In this way we can more directly assist you in achieving your testing objectives.
Thanks!
____________
|
|
|
|
|
|
Maximum disk usage exceeded on Linux
http://ralph.bakerlab.org/result.php?resultid=98187
Would it be difficult to wipe the files of the previous version from disk
before sending out a new version to test?
Thanks
____________
  |
|
|
|
|
|
Back to possible bugs:
Rosetta 5.06
using 161 MB of memory, 542 MB of virtual memory
The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %
I guess, it will never finish :-(
Oh, my setting for RALPH Target CPU time is 4 hours ...
This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911
This is the result: http://ralph.bakerlab.org/result.php?resultid=98748
Abort or stay a little bit longer ?
____________

Supporting BOINC, a great concept ! |
|
|
|
|
|
Just my 2 cents worth here. I've said this on other project boards as well.
There should be two cache settings in each project.
1) Max Wu/cpu/day (current cache)
2) Max outstanding WU/CPU.
I'd love to see #2 added. Personally I find it silly that some people and systems download hundreds of WUs only to return a portion of them. Just look at host 3755 on SETI beta. Yes, its daily quota is down to 1 per day, but there were over 1000 outstanding WUs on the machine.
My thought for #2 would be this. The project defines how many outstanding WUs per CPU are acceptable. You can download up to this amount over any number of days, subject to #1. Once you hit the limit in #2, the server refuses to send you more work until you return work.
Why keep sending work to hosts that are not returning work.
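The two limits proposed in this post can be sketched as a small bookkeeping class. This is only an illustration of the idea, not BOINC scheduler code: the class and method names are hypothetical, and it tracks limits per host rather than per CPU to keep the example short.

```python
from collections import defaultdict


class WorkQuota:
    """Hypothetical server-side bookkeeping for the two proposed limits."""

    def __init__(self, max_per_day: int, max_outstanding: int) -> None:
        self.max_per_day = max_per_day            # limit #1: WUs sent per day
        self.max_outstanding = max_outstanding    # limit #2: unreturned WUs
        self.sent_today = defaultdict(int)
        self.outstanding = defaultdict(int)

    def can_send(self, host: int) -> bool:
        """A host gets work only while it is under both limits."""
        return (self.sent_today[host] < self.max_per_day
                and self.outstanding[host] < self.max_outstanding)

    def send(self, host: int) -> bool:
        """Try to hand out one WU; return False if a limit blocks it."""
        if not self.can_send(host):
            return False
        self.sent_today[host] += 1
        self.outstanding[host] += 1
        return True

    def report(self, host: int) -> None:
        """A returned result frees one slot under limit #2."""
        if self.outstanding[host] > 0:
            self.outstanding[host] -= 1

    def new_day(self) -> None:
        """Reset the daily counter; outstanding work carries over."""
        self.sent_today.clear()
```

With `WorkQuota(max_per_day=10, max_outstanding=2)`, a host that has two unreturned WUs is refused a third until it reports one back, regardless of how much of its daily quota is left.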
____________
|
|
|
|
|
|
28/04/2006 10:47:13 AM|ralph@home|Resuming task FA_CASP6_t198__435_26_1 using rosetta_beta version 505
http://ralph.bakerlab.org/result.php?resultid=97816
Last night I quit BOINC while this workunit was at about 1.0427% after 2h45m15s (model=1, step=340905, full atom relax). I restarted BOINC today, and it's now at 1.0424% after 2h10m10s (model=1, step=340558, full atom relax) and still running. So it has started redoing the same work it had already done last night.
It seems that the new checkpointing didn't work (since the work was redone today). Or did it just not reach a "checkpointable stage" last night (since this seems like a rather large structure)?
____________
|
|
|
|
|
|
I think there is little value in crunching a Ralph WU for more than 8 hours. I would suggest deactivating this feature in Ralph and sending out WUs with fixed runtimes, in whatever mix is most appropriate for the tested app/WU. Nevertheless, if one can only crunch one WU in 4 days due to the resource share of Ralph and the runtime preference, that is okay. I think the goal of Ralph is not throughput but diversity. It is better to have 10 hosts trying 1 WU than 1 host trying 10. But perhaps Rhiju can give his opinion on that and post some advice in the news section (at least not to download 20 WUs at once).
"Max outstanding WU/CPU"
This would be a cool feature, but it is something BOINC has to implement. It would certainly enable much better distribution of WUs without restricting hosts on the maximum WUs per day.
____________
|
|
|
|
|
Back to possible bugs:
Rosetta 5.06
using 161 MB of memory, 542 MB of virtual memory
The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %
I guess, it will never finish :-(
Oh, my setting for RALPH Target CPU time is 4 hours ...
This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911
This is the result: http://ralph.bakerlab.org/result.php?resultid=98748
Abort or stay a little bit longer ?
This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.
____________
|
|
|
|
|
|
Rosetta beta 5.06 Linux Success
http://ralph.bakerlab.org/result.php?resultid=98212
I had success completing above job on a Linux PC with 256 MB ram.
*All other jobs on above PC get some sort of error !
What I did ...
1) suspended all other projects running on this pc, left only ralph
2) opened some disk space by deleting some old stuff
3) shutdown one of my 10 mbps Internet links, and the load balancing stuff
4) crunched after midnight, while the majority of my users are asleep
So, 5.06 must be OK for Linux too.
However, it is fragile... any disturbance, such as heavy network traffic
or running multiple projects (even keeping them in RAM), causes the WU to fail.
Suggestion:
*Signal 11 needs to be trapped to exit with 0 instead of with 183
So the job will exit with 0 but with no finished file,
and then BOINC restarts it until it finishes...
____________
  |
|
|
|
|
Back to possible bugs:
Rosetta 5.06
using 161 MB of memory, 542 MB of virtual memory
The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %
I guess, it will never finish :-(
Oh, my setting for RALPH Target CPU time is 4 hours ...
This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911
This is the result: http://ralph.bakerlab.org/result.php?resultid=98748
Abort or stay a little bit longer ?
This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.
okay, it seems, as if it finished without error :-)
____________

Supporting BOINC, a great concept ! |
|
|
|
|
|
Workunits also complete fine on Windows XP x64 Edition, Pentium D 2.8 GHz and 1 GB RAM, using 129 MB and 75 MB of it.
With this version, my computer no longer experiences the error where workunits crash when graphics are shown on screen. Very nice.
But the completion time is not estimated well. For example, before a workunit started, its time to completion was "01:51:20", but it was "01:57:00" even after 35% of the work was already done. I did not notice such a large difference with the former version.
____________
|
|
|
|
|
|
4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 (<file_xfer_error> <file_name>WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)
result: http://ralph.bakerlab.org/result.php?resultid=97709
Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM
There is an additional message in the result about a non-existent file:
GZIP SILENT FILE: .\xx1enh.out
WARNING! attempt to gzip file .\xx1enh.out failed: file does not exist.
____________
 |
|
|
|
|
Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc.
In this way we can more directly assist you in achieving your testing objectives.
If they are adding checkpointing and want more frequent switches between jobs, then that makes sense... Once we're over these hurdles, I think the best test would be for everyone to have their Ralph preferences match their R@H preferences. The randomness of how we all have these set is the best beta test, and the most similar to the user base of Rosetta.
I guess what I'm saying is: if necessary, instruct us on preference changes you'd like to see, but then let's test the same version for another couple of (or several) days back on our normal settings.
____________
|
|
|
|
|
Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?
I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort the 5.05 WUs.
This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU.
____________
|
|
|
|
|
|
Hello, I have a problem with this WU (5.06):
FA_CASP6_t216__444_2_2
50' for 1.02%
So I aborted it.
Bye and go on...
____________
|
|
|
|
|
Hello, i have some problem with this Wu : 5.06
FA_CASP6_t216__444_2_2
50' for 1.02%
So i aborted it
Bye and go on...
They are so big that it takes more than 1 H on a fast computer to complete 1 decoy.
Anders n
____________
|
|
|
|
|
|
I have some problems with 5.06 on Windows 98 :
<core_client_version>5.2.13</core_client_version>
<message> - exit code -164 (0xffffff5c)
</message>
<stderr_txt>
LoadLibraryA( dbghelp95.dll ): GetLastError = 1157
LoadLibraryA( dbghelp.dll ): GetLastError = 1157
</stderr_txt>
Result ID : http://ralph.bakerlab.org/result.php?resultid=100666
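The exit code in the message above is printed twice: once as a signed decimal and once as the same 32-bit value in hexadecimal. A small sketch of the conversion, useful when searching for a code that another tool reports in the other form (the helper names are hypothetical):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


# The two forms from the message above are the same number.
print(hex(to_unsigned32(-164)))   # 0xffffff5c
print(to_signed32(0xFFFFFF5C))    # -164
```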
____________
|
|
|
|
|
|
Hi Feet1st, these are great suggestions, as usual! We've come to expect them.
I'm about to post 5.08, and I'll ask that ralph users use similar preferences to their r@h preferences, as you suggest. I think the checkpointing and watchdog issues have largely been resolved, thankfully, and we've moved on to testing real science.
As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. In the future, if we can get trickle-messages implemented, we could send out a purge request. Still, I hear you... I'll keep sending out work and ask others to do the same.
____________
|
|
|
|
|
|
Hi Mike: this is a silly thing that we haven\'t quite been able to fix, but should happen rarely on rosetta@home. That ralph workunit was a test that our watchdog timer properly aborts really long running jobs. So we\'re very glad to see it worked on your computer! If you ever run into similar super-long workunits on Rosetta@home (hopefully not!), you\'ll eventually get credit granted to it, because that\'s our policy. Thanks for posting!
4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 (<file_xfer_error> <file_name>WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)
result: http://ralph.bakerlab.org/result.php?resultid=97709
Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM
There was an additional message in the result about a non-existent file:
GZIP SILENT FILE: .\\xx1enh.out
WARNING! attempt to gzip file .\\xx1enh.out failed: file does not exist.
____________
|
|
|
|
|
As for keeping work on ralph, we haven\'t quite got that figured out. We\'d like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we\'ve flooded the clients with jobs with the previous app or previous jobs, there\'s typically a wait for those clients to free up again.
That\'s easy to solve: limit the daily quota to five or less. That means clients grab new jobs instantly but can\'t pile up big caches.
At the moment it works as follows: the first 20 clients pile up 20 WUs each and no more work is available. These hosts are busy with them for several days, so you get your work returned late. With 5 WU/day the first 80 clients grab 5 WUs each and are busy with them only for a day or less. I'd even say 3 WU/day is a good quota.
Short deadlines have a similar effect but it seems you reset them to match those of Rosetta.
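To make the arithmetic above concrete, here is a minimal Python sketch. It assumes a batch of 400 WUs (implied by 20 clients x 20 WUs, and equally by 80 clients x 5 WUs) and that every host grabs its full daily quota. The function name `hosts_served` is made up for this illustration and is not part of BOINC.

```python
def hosts_served(total_wus: int, daily_quota: int) -> int:
    """Hosts that receive work if each host takes its full daily quota."""
    return total_wus // daily_quota


# Batch size implied by the post: 20 clients * 20 WUs each.
BATCH = 20 * 20

for quota in (20, 5, 3):
    print(f"quota {quota:>2} WU/day -> {hosts_served(BATCH, quota)} hosts get work")
```

A smaller quota spreads the same batch over more hosts, so each host finishes its share sooner.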
____________
|
|
|
|
|
|
Yes, a quota of 3-5 would keep most of the hosts with work, and
if you need fast answers to a test batch, set the return date to 1-3 days
and they will be crunched first.
Anders n
____________
|
|
|
|
|
|
I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.
____________
BOINC WIKI

BOINCing since 2002/12/8 |
|
|
|
|
I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.
The daily quota is per CPU. So if you have a dual-core or a Hyperthreading-enabled P4, you get 6 WU/day if the daily quota is 3 WU/day.
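As a quick illustration of the per-CPU rule, here is a tiny Python sketch. The helper name `host_daily_limit` is hypothetical; it only restates the multiplication described in this post.

```python
def host_daily_limit(quota_per_cpu: int, cpus: int) -> int:
    """Daily WU limit for a host when the project quota is counted per CPU."""
    return quota_per_cpu * cpus


# Dual-core (or Hyperthreading P4 seen as 2 CPUs) with a quota of 3 WU/day.
print(host_daily_limit(3, 2))
```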
____________
|
|
|
|
|
|
ROM,
I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?
its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3
I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours
Running on Win2000 SP4, leave in memory is set.
____________
 |
|
|
|
|
its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3
I\'ve seen other posts that this WU was specially designed to TEST the watchdog. It is INTENDED to have the watchdog step in and end it for you. So if you abort, you essentially leave the watchdog less proven. He\'ll get it!
But that SHOULD be the reason why the others \"failed\".
____________
|
|
|
|
|
ROM,
I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?
its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3
I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours
Running on Win2000 SP4, leave in memory is set.
http://ralph.bakerlab.org/workunit.php?wuid=83793
Now at 24 hours and still stuck at 1.04%.
____________
 |
|
|
|
|
|
Hi,
Got two erroneous results, but did not report them here, yet, sorry for being so late....
resultid=98902
resultid=99919
App version 5.06 (both)...
Other 2 earlier workunits completed successfully....
greetings,
William Senn...
____________
 |
|
|
|
|
ROM,
I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?
its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3
I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours
Running on Win2000 SP4, leave in memory is set.
http://ralph.bakerlab.org/workunit.php?wuid=83793
Now at 24 hours and still stuck at 1.04%.
36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?
____________
 |
|
|
|
|
36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?
Hi Mike
Have you checked the graphics to see if the steps or % have changed?
The graphics should show the % with more digits (1.04??), not just 1.04 as in BOINC Manager.
Anders n
____________
|
|
|
|
|
36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?
Hi Mike
Have you checked the graphics to see if the steps or % have changed?
The graphics should show the % with more digits (1.04??), not just 1.04 as in BOINC Manager.
Anders n
This computer is headless. Remote access only. Hence no screensaver.
____________
 |
|
|
|
|
|
Looks like your normal WUs are the 4hrs default... so we're now well past the 4x preference guideline I've seen posted elsewhere... so it is time to abort. Since we're here on Ralph, the diagnostic info should prove useful for study. Hopefully it's something they fixed in the versions after 5.06.
Ironic... given your photo that your computer is \"headless\" :):)
____________
|
|
|
|
|
This computer is headless. Remote access only. Hence no screensaver.
Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.
tony |
|
|
|
|
This computer is headless. Remote access only. Hence no screensaver.
Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.
tony
It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.
____________
 |
|
|
|
|
This computer is headless. Remote access only. Hence no screensaver.
Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.
tony
It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.
If it is a BIG protein you may have to wait for some time to see the steps advance, but you may be able to detect the slightest motion in the searching window image. If you see either the steps counting up or the movement in the searching window, it is still processing. On some of the large Work Units, it is possible for them to run very long times past your time setting. I would note however that yours is running way too long over the time setting. I have had a few lately that went 14 hours with a time setting of 2 hours.
The point is this: unless the work unit is either swapped out for project switching, or BOINC is turned off and on four times, the watchdog will never wake up and abort the work unit. Failing that, the work unit will be aborted when it hits a limit preset by the project, which SHOULD be 24 hours of CPU time.
My understanding is that it is designed to look at the work unit each time it starts to process and determine if progress has been made since the last time it started up. This presupposes that the process was stopped for some reason. It does not just sit there checking the work unit all the time. If it never stops processing the work unit, it will not check it. With luck Rhiju will chime in here and correct me if I am wrong about this, but I am going on the last explanation I had for all this.
Now let me add a caution here. If you restart BOINC before the work unit reaches a percent complete of greater than 2%, the work unit WILL START OVER FROM THE BEGINNING AND THE CPU TIME WILL RESET TO ZERO!
So if you are going to play with starting and stopping, you should have "keep in memory" set to yes, and then suspend the work unit or start another project long enough for another process to run for a while.
The watchdog is supposed to do 4 of these checks which show no progress before it will abort the work unit. That is part of how they worked out the "four times your time setting" concept for manual aborts.
So the short of this is: if the work unit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single-project setup. If you don't see movement in the graphic, try suspending the work unit and letting the system run a different one for 5 min. Then restart the first work unit again for 5 min. Repeat this process 4-5 times and it should abort the work unit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug.
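The behaviour described above can be sketched roughly as follows. This is only an illustration of the description in this post, not the actual Rosetta watchdog code. The names `WatchdogState`, `check_on_resume` and `exceeds_max_time` are hypothetical; the two thresholds (4 no-progress checks, 24 hours of CPU time) are the ones quoted in the post.

```python
from dataclasses import dataclass

MAX_STALLED_CHECKS = 4        # "4 of these checks which show no progress"
MAX_CPU_SECONDS = 24 * 3600   # project limit that "SHOULD be 24 hours of CPU time"


@dataclass
class WatchdogState:
    last_progress: float = 0.0   # percent complete seen at the previous resume
    stalled_checks: int = 0      # consecutive resumes with no progress


def check_on_resume(state: WatchdogState, progress: float) -> bool:
    """Run once each time the work unit resumes; return True if it should abort.

    Assumption: the check happens only on resume, never while the
    work unit runs uninterrupted.
    """
    if progress > state.last_progress:
        state.last_progress = progress
        state.stalled_checks = 0
    else:
        state.stalled_checks += 1
    return state.stalled_checks >= MAX_STALLED_CHECKS


def exceeds_max_time(cpu_seconds: float) -> bool:
    """Independent fallback: abort once the CPU time limit is reached."""
    return cpu_seconds >= MAX_CPU_SECONDS


# A work unit stuck at 1.04%, suspended and resumed five times.
state = WatchdogState()
decisions = [check_on_resume(state, 1.04) for _ in range(5)]
print(decisions)
```

In this sketch the first resume only records the 1.04% figure, and the abort comes on the fourth resume that shows no change, which fits the "repeat 4-5 times" advice. A work unit that is never suspended never reaches `check_on_resume` at all, so only `exceeds_max_time` can stop it.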
____________
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact |
|
|
|
|
|
Hi Mike: thanks very much for posting. This sounds weird. The job should have been killed by the watchdog. In fact we sent out these workunits to test that infinite loops are aborted by the watchdog, and they\'ve been \"successful\" in that they\'ve mostly returned without keeping computers in infinite loops. For now, please either abort or follow mod9\'s suggestion of suspending and restarting a few times. If this occurs again, please post!
____________
|
|
|
|
|
|
Version 5.09 has been released. If you have errors in Version 5.09 please report them in the 5.09 Bug thread.
____________
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact |
|
|
|
|
This computer is headless. Remote access only. Hence no screensaver.
Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.
tony
It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.
Starting and stopping did indeed reset the time to 0 (I had to reboot for other reasons). I am going to allow it to build back up... at over 24 hours I will report back. It's the Max Time Setting (24 hrs) that appears not to be working.
____________
 |
|
|
|
|
So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don\'t see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4 -5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug.
Does the \"Max time\" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.
May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.
____________
 |
|
|
|
|
So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don\'t see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4 -5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug.
Does the \"Max time\" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.
May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.
Well, it is really two separate functions that are fallbacks to one another. If the watchdog never has the opportunity to work (i.e. the work unit is never stopped and started for the check to occur), then the work unit will hit a wall for maximum time to process. The Max time function is independent of the watchdog and works on a different set of criteria and variables. The Max time is hard coded by the project before the work unit is sent out.
Right now that max time on Rosetta is 24 hours. I think it is the same for Ralph, but Rhiju would have to verify that, because it could be different for each set of work units.
In any case you are correct. If your system was in EDF mode, the watchdog would not likely have kicked in. Perhaps that is a good reason to revisit how the checking is done.
____________
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact |
|
|
|
|
May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.
Perhaps he\'s on to something there, could watchdog code be evaluated at the areas in the model where checkpoints are possible? Or is that part of the problem? We don\'t reach the checkpointable stage in the model?
I just wanted to point out that BOINC doesn\'t request checkpoints. It is up to the application to do so when appropriate. Rosetta now does checkpoints about every 20 minutes or so. So it was not that previously Rosetta was ignoring any requests from BOINC. It\'s just that the architecture of BOINC is such that the manager cannot signal the application to do a checkpoint, indeed most applications have to complete a certain phase of processing before they can do so, and in that sense Rosetta is no different. With the new changes, they have actually created new places in their crunching where checkpoints may be performed... and performed efficiently. You don\'t want to waste time doing too much checkpointing either, so it\'s a balance.
What happens every hour or so is BOINC reevaluating if the application being run should be switched (60min is the default \"switch between applications every...\" time). And if, at the point of that switch, the application is removed from memory, then the work done since last checkpoint is all lost. This is how BOINC works. This is why the more frequent checkpointing was such a great thing for productivity. And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we\'ll REALLY be cruisin\'!
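A rough Python sketch of why more frequent checkpoints matter when a task is preempted. It assumes idealized, evenly spaced checkpoints and that everything since the last checkpoint is lost when the task is removed from memory; the function name `minutes_lost` is invented for this example.

```python
from typing import Optional


def minutes_lost(run_minutes: int, checkpoint_every: Optional[int],
                 kept_in_memory: bool) -> int:
    """Work lost (in minutes) when a task is preempted after run_minutes."""
    if kept_in_memory:
        return 0                      # task is only paused, nothing is discarded
    if checkpoint_every is None:
        return run_minutes            # no checkpoint at all: everything is redone
    return run_minutes % checkpoint_every  # time since the last checkpoint


# Preempted after 70 minutes and removed from memory.
print(minutes_lost(70, None, False))  # no checkpointing
print(minutes_lost(70, 20, False))    # checkpoint about every 20 minutes
print(minutes_lost(70, 20, True))     # "leave in memory" is set
```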
____________
|
|
|
|
|
And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we\'ll REALLY be cruisin\'!
This was posted to the boinc alpha mail list yesterday by JM7 (the creator of the scheduler)
John.McLeod@xxxxxxxxxx.com to boinc_dev
More options May 4 (1 day ago)
I have been working on the CPU scheduler to see what I can do to make it
work as the doc says it should.
What I have at the moment:
The CPU scheduler checks the necessity to preempt:
1) If one of the events that could cause entry to EDF occurs.
(Checkpoint after process swap time, files downloaded, task exit, ...).
2) At least once every 10 minutes. (Just to be safe). What should this
frequency be? 10 minutes? an hour? the time between allowed checkpoints?
The CPU scheduler selects tasks to run if:
1) There are not enough runnable tasks scheduled to meet 1 per CPU
allowed. (Startup / task complete / running task suspended ...).
2) A checkpoint has been reached after the process swap time.
3) One or more results has recently entered the state of requiring EDF.
Enforcement is immediate. If a result has reached its checkpoint after
process swap time, and the CPU scheduler has scheduled it for another
process time, then it gets the full time allotted to it (default another
hour + time to checkpoint).
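The three task-selection conditions listed in this mail can be written as a single predicate. This is a sketch of the rules as stated here, not code from the BOINC client; the function and parameter names are hypothetical.

```python
def should_select_tasks(running_tasks: int, cpus_allowed: int,
                        checkpointed_after_swap_time: bool,
                        result_newly_needs_edf: bool) -> bool:
    """True if the scheduler should pick tasks to run, per the three rules above."""
    too_few_running = running_tasks < cpus_allowed   # rule 1
    return (too_few_running
            or checkpointed_after_swap_time          # rule 2
            or result_newly_needs_edf)               # rule 3


# One task running on a 2-CPU host: rule 1 applies.
print(should_select_tasks(1, 2, False, False))
# Both CPUs busy, no checkpoint reached, nothing new in EDF: leave things alone.
print(should_select_tasks(2, 2, False, False))
```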
AND
John.McLeod@xxxxxxxxx.com to elst93, boinc_dev
More options May 4 (1 day ago)
How often to check to see if pre-emption is needed may not want to be user
configurable because someone is going to set the number to way too large.
If the process doesn\'t checkpoint, it will either complete (and the system
will fall under 1 - not enough runnable results running) OR another process
will require attention in order to meet deadline in which case, that
process will start running.
One further note, if a process does actually make it to a checkpoint, it
will then be removed from memory when it suspends - this suspend will
happen within a second or two of the checkpoint.
jm7
It seems from this that it's already being looked into. |
|
|