Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher
Author | Message |
---|---|
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
Version 5.09 has been released. If you have errors in version 5.09, please report them in the 5.09 bug thread. |
Mike Gelvin Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
This computer is headless, with remote access only, hence no screensaver. Starting and stopping did indeed reset the time to 0 (I had to reboot for other reasons). I am going to let it build back up; once it passes 24 hours I will report back. It's the Max Time setting (24 hrs) that appears not to be working. |
Mike Gelvin Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the workunit and letting the system run a different one for 5 min. Then restart the first workunit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug. Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped. May I suggest that these items (flavors of the watchdog) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties. |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the workunit and letting the system run a different one for 5 min. Then restart the first workunit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug. Well, it is really two separate functions that are fallbacks to one another. If the watchdog never has the opportunity to work (i.e. the workunit is never stopped and started for the check to occur), then the workunit will hit a wall at the maximum processing time. The Max time function is independent of the watchdog and works on a different set of criteria and variables. The Max time is hard coded by the project before the workunit is sent out. Right now that max time on Rosetta is 24 hours. I think it is the same for Ralph, but Rhiju would have to verify that, because it could be different for each set of workunits. In any case you are correct. If your system was in EDF mode, the watchdog would likely not have kicked in. Perhaps that is a good reason to revisit how the checking is done. |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
May I suggest that these items (flavors of the watchdog) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties. Perhaps he's on to something there. Could watchdog code be evaluated at the areas in the model where checkpoints are possible? Or is that part of the problem: we don't reach the checkpointable stage in the model? I just wanted to point out that BOINC doesn't request checkpoints. It is up to the application to do so when appropriate. Rosetta now does checkpoints about every 20 minutes or so. So it was not that Rosetta was previously ignoring requests from BOINC. It's just that the architecture of BOINC is such that the manager cannot signal the application to do a checkpoint. Indeed, most applications have to complete a certain phase of processing before they can do so, and in that sense Rosetta is no different. With the new changes, they have actually created new places in their crunching where checkpoints may be performed... and performed efficiently. You don't want to waste time doing too much checkpointing either, so it's a balance. What happens every hour or so is BOINC reevaluating whether the application being run should be switched (60 min is the default "switch between applications every..." time). And if, at the point of that switch, the application is removed from memory, then the work done since the last checkpoint is all lost. This is how BOINC works. This is why the more frequent checkpointing was such a great thing for productivity. And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'! |
Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'! This was posted to the BOINC alpha mailing list yesterday by JM7 (the creator of the scheduler): John.McLeod@xxxxxxxxxx.com to boinc_dev, May 4: I have been working on the CPU scheduler to see what I can do to make it work as the doc says it should. What I have at the moment: The CPU scheduler checks the necessity to preempt: 1) If one of the events that could cause entry to EDF occurs (checkpoint after process swap time, files downloaded, task exit, ...). 2) At least once every 10 minutes (just to be safe). What should this frequency be? 10 minutes? An hour? The time between allowed checkpoints? The CPU scheduler selects tasks to run if: 1) There are not enough runnable tasks scheduled to meet the 1 per CPU allowed (startup / task complete / running task suspended ...). 2) A checkpoint has been reached after the process swap time. 3) One or more results have recently entered the state of requiring EDF. Enforcement is immediate. If a result has reached its checkpoint after process swap time, and the CPU scheduler has scheduled it for another process time, then it gets the full time allotted to it (default another hour + time to checkpoint). AND John.McLeod@xxxxxxxxx.com to elst93, boinc_dev, May 4: How often to check whether pre-emption is needed probably should not be user configurable, because someone is going to set the number way too large. If the process doesn't checkpoint, it will either complete (and the system will fall under 1 - not enough runnable results running) OR another process will require attention in order to meet its deadline, in which case that process will start running. One further note: if a process does actually make it to a checkpoint, it will then be removed from memory when it suspends - this suspend will happen within a second or two of the checkpoint. jm7
It seems from this that it's already being looked into. |
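
For anyone trying to follow the rules in JM7's message, here is a rough sketch of the "select tasks to run" conditions in Python. It is only an illustration of the three conditions as they are worded in the mail, not BOINC's actual client code. The `Task` class, its fields, the function names and the `SWAP_SECONDS` constant are all made up for this example; the 60 minute value is the default "switch between applications every..." time mentioned earlier in the thread.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed default: the "switch between applications every..." period (60 min).
SWAP_SECONDS = 60 * 60


@dataclass
class Task:
    """Hypothetical stand-in for a running result; not a BOINC type."""
    name: str
    slot_seconds: float                      # time run in the current slot
    last_checkpoint: Optional[float] = None  # slot time of the last checkpoint


def checkpoint_after_swap(task: Task, swap_seconds: float = SWAP_SECONDS) -> bool:
    """True if the task has checkpointed at or after the process swap time."""
    return task.last_checkpoint is not None and task.last_checkpoint >= swap_seconds


def should_reschedule(running: List[Task], cpus: int, edf_pending: bool,
                      swap_seconds: float = SWAP_SECONDS) -> bool:
    """Apply the three conditions from the mail, in order."""
    # 1) Not enough runnable tasks scheduled to meet one per CPU.
    if len(running) < cpus:
        return True
    # 3) A result has recently entered the state of requiring EDF.
    if edf_pending:
        return True
    # 2) A checkpoint has been reached after the process swap time.
    return any(checkpoint_after_swap(t, swap_seconds) for t in running)


if __name__ == "__main__":
    stuck = Task("never_checkpoints", slot_seconds=4000)
    done = Task("checkpointed", slot_seconds=4000, last_checkpoint=3700)
    print(should_reschedule([stuck], cpus=1, edf_pending=False))
    print(should_reschedule([done], cpus=1, edf_pending=False))
```

Under these assumptions, a task that has run past the swap time but has not checkpointed since is left alone unless a CPU is idle or another result needs EDF, which is the "only preempt after a checkpoint" behaviour asked for above.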
©2023 University of Washington
http://www.bakerlab.org