Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher
Author | Message |
---|---|
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
So the short of this is: if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single-project setup. If you don't see movement in the graphic, try suspending the workunit and letting the system run a different one for 5 min. Then restart the first workunit for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck, it should let it keep running. Either that or we have a watchdog bug. Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped. May I suggest that these items (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
"May I suggest that these items (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties." Perhaps he's on to something there. Could watchdog code be evaluated at the areas in the model where checkpoints are possible? Or is that part of the problem: we don't reach the checkpointable stage in the model? I just wanted to point out that BOINC doesn't request checkpoints. It is up to the application to do so when appropriate. Rosetta now checkpoints about every 20 minutes. So it was not that Rosetta was previously ignoring requests from BOINC. It's just that the architecture of BOINC is such that the manager cannot signal the application to do a checkpoint. Indeed, most applications have to complete a certain phase of processing before they can do so, and in that sense Rosetta is no different. With the new changes, they have actually created new places in their crunching where checkpoints may be performed... and performed efficiently. You don't want to waste time doing too much checkpointing either, so it's a balance. What happens every hour or so is BOINC reevaluating whether the application being run should be switched (60 min is the default "switch between applications every..." time). And if, at the point of that switch, the application is removed from memory, then the work done since the last checkpoint is all lost. This is how BOINC works. This is why the more frequent checkpointing was such a great thing for productivity. And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'! |
Astro Send message Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
"And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!" This was posted to the BOINC alpha mailing list yesterday by JM7 (the creator of the scheduler). From John.McLeod@xxxxxxxxxx.com to boinc_dev, May 4 (1 day ago): "I have been working on the CPU scheduler to see what I can do to make it work as the doc says it should. What I have at the moment: The CPU scheduler checks the necessity to preempt: 1) If one of the events that could cause entry to EDF occurs (checkpoint after process swap time, files downloaded, task exit, ...). 2) At least once every 10 minutes (just to be safe). What should this frequency be? 10 minutes? An hour? The time between allowed checkpoints? The CPU scheduler selects tasks to run if: 1) There are not enough runnable tasks scheduled to meet 1 per CPU allowed (startup / task complete / running task suspended ...). 2) A checkpoint has been reached after the process swap time. 3) One or more results has recently entered the state of requiring EDF. Enforcement is immediate. If a result has reached its checkpoint after process swap time, and the CPU scheduler has scheduled it for another process time, then it gets the full time allotted to it (default another hour + time to checkpoint)." AND, from John.McLeod@xxxxxxxxx.com to elst93, boinc_dev, May 4 (1 day ago): "How often to check whether pre-emption is needed perhaps should not be user configurable, because someone is going to set the number way too large. If the process doesn't checkpoint, it will either complete (and the system will fall under 1, not enough runnable results running) OR another process will require attention in order to meet its deadline, in which case that process will start running. One further note: if a process does actually make it to a checkpoint, it will then be removed from memory when it suspends. This suspend will happen within a second or two of the checkpoint. jm7" It seems from this that it's already being looked into. |
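
For readers trying to follow the three "select tasks to run" conditions in JM7's email, here is a rough sketch of how they might be expressed. It is not BOINC's actual scheduler code: the `Task` fields, the `reschedule_reason` function, and the returned labels are all hypothetical names made up for illustration, and the only number taken from the thread is the 60-minute default switch time.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default "switch between applications every..." time from the thread, in seconds.
SWAP_INTERVAL = 60 * 60


@dataclass
class Task:
    """Hypothetical stand-in for a BOINC result; not a real BOINC type."""
    name: str
    running: bool = False
    # Seconds into the current time slice at which the task last checkpointed.
    last_checkpoint: Optional[float] = None
    # True if the result has recently entered the "requires EDF" state.
    newly_edf: bool = False


def reschedule_reason(tasks: List[Task], n_cpus: int) -> Optional[str]:
    """Return why task selection should run again, or None if nothing changes."""
    running = [t for t in tasks if t.running]
    # 1) Not enough runnable tasks scheduled to meet one per allowed CPU.
    if len(running) < n_cpus:
        return "idle_cpu"
    # 2) A running task reached a checkpoint after the process swap time.
    if any(t.last_checkpoint is not None and t.last_checkpoint >= SWAP_INTERVAL
           for t in running):
        return "checkpoint_after_swap"
    # 3) Some result recently entered the state of requiring EDF.
    if any(t.newly_edf for t in tasks):
        return "edf"
    return None
```

Under these assumptions, a task that last checkpointed 20 minutes into its slice is left alone even once the hour is up, and only becomes a candidate for swapping when it checkpoints past the 60-minute mark, unless another result's deadline forces the issue first.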
©2025 University of Washington
http://www.bakerlab.org