Bug reports for Ralph 5.05 and higher

Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1487 - Posted: 5 May 2006, 14:52:26 UTC - in response to Message 1480.  

So the short of this is: if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single-project setup. If you don't see movement in the graphic, try suspending the workunit and letting the system run a different one for 5 minutes. Then restart the first workunit for another 5 minutes. Repeat this process 4-5 times and the watchdog should abort the workunit if it was stuck. If it is not stuck, it should keep running. Either that or we have a watchdog bug.



Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.

May I suggest that these items (the various flavors of watchdog) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process its watchdog duties at that point.



feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1495 - Posted: 5 May 2006, 22:16:17 UTC - in response to Message 1487.  

May I suggest that these items (the various flavors of watchdog) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process its watchdog duties at that point.


Perhaps he's on to something there. Could the watchdog code be evaluated at the points in the model where checkpoints are possible? Or is that part of the problem: we don't reach a checkpointable stage in the model?

I just wanted to point out that BOINC doesn't request checkpoints. It is up to the application to do so when appropriate. Rosetta now checkpoints about every 20 minutes or so. So it was not that Rosetta was previously ignoring requests from BOINC. The architecture of BOINC is such that the manager cannot signal the application to do a checkpoint. Indeed, most applications have to complete a certain phase of processing before they can checkpoint, and in that sense Rosetta is no different. With the new changes, they have actually created new places in their crunching where checkpoints may be performed, and performed efficiently. You don't want to waste time doing too much checkpointing either, so it's a balance.
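The idea raised in this exchange, running the watchdog checks at the same places where the model can safely checkpoint, can be sketched roughly as follows. This is only an illustration in Python, not Rosetta or BOINC code: the function name `at_safe_point`, its parameters, and the returned strings are all made up for the example, and the 20-minute default is just the approximate figure mentioned above.

```python
def at_safe_point(elapsed_min, since_checkpoint_min,
                  max_time_min, checkpoint_interval_min=20):
    """Decide what to do at a point where the model state can be saved.

    Hypothetical helper: the application calls this itself, because
    the manager cannot signal it to checkpoint.
    """
    # Watchdog duty first: stop a run that has exceeded its max time.
    if elapsed_min >= max_time_min:
        return "abort"
    # Checkpoint only if enough time has passed, to avoid wasting
    # time on checkpointing too often.
    if since_checkpoint_min >= checkpoint_interval_min:
        return "checkpoint"
    return "continue"


print(at_safe_point(elapsed_min=45, since_checkpoint_min=21, max_time_min=240))
print(at_safe_point(elapsed_min=50, since_checkpoint_min=5, max_time_min=240))
print(at_safe_point(elapsed_min=250, since_checkpoint_min=5, max_time_min=240))
```

The catch, as asked above, is that this only helps if the model actually reaches such a point; a run that gets stuck before one would never execute the check.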

What happens every hour or so is BOINC reevaluating whether the application being run should be switched (60 min is the default "switch between applications every..." time). If, at the point of that switch, the application is removed from memory, then the work done since the last checkpoint is all lost. This is how BOINC works, and it is why the more frequent checkpointing was such a great thing for productivity. And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!
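A rough way to put numbers on the work lost at a switch, under the simplifying assumption that checkpoints land at exact multiples of a fixed interval (in practice they are only "about every 20 minutes or so"). The function name and the sample intervals are invented for illustration; only the 60-minute switch time comes from the text above.

```python
def lost_minutes(preempt_at_min, checkpoint_interval_min=None):
    """Minutes of work discarded if the task leaves memory at preempt_at_min.

    Idealized: assumes a checkpoint at every exact multiple of the
    interval. None means the application never checkpoints.
    """
    if checkpoint_interval_min is None:
        return preempt_at_min          # everything since the start is lost
    return preempt_at_min % checkpoint_interval_min


print(lost_minutes(60))        # no checkpoints: 60
print(lost_minutes(60, 25))    # last checkpoint at 50 min: 10
print(lost_minutes(60, 20))    # checkpoint exactly at the switch: 0
```

The last case is the situation the wish above describes: if preemption only ever happened right after a checkpoint, nothing would be lost.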

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 1497 - Posted: 5 May 2006, 23:15:28 UTC - in response to Message 1495.  
Last modified: 5 May 2006, 23:15:53 UTC

And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!

This was posted to the BOINC alpha mailing list yesterday by JM7 (the creator of the scheduler):

John.McLeod@xxxxxxxxxx.com to boinc_dev
May 4 (1 day ago)

I have been working on the CPU scheduler to see what I can do to make it
work as the doc says it should.

What I have at the moment:

The CPU scheduler checks the necessity to preempt:
1) If one of the events that could cause entry to EDF occurs.
(Checkpoint after process swap time, files downloaded, task exit, ...).
2) At least once every 10 minutes. (Just to be safe). What should this
frequency be? 10 minutes? an hour? the time between allowed checkpoints?

The CPU scheduler selects tasks to run if:
1) There are not enough runnable tasks scheduled to meet 1 per CPU
allowed. (Startup / task complete / running task suspended ...).
2) A checkpoint has been reached after the process swap time.
3) One or more results have recently entered the state of requiring EDF.

Enforcement is immediate. If a result has reached its checkpoint after
process swap time, and the CPU scheduler has scheduled it for another
process time, then it gets the full time allotted to it (default another
hour + time to checkpoint).
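
Read as logic, the two lists in that email might look something like the sketch below. This is a paraphrase in Python for illustration only, not the actual BOINC scheduler code: the `SchedulerState` fields and both function names are invented here, and the 10-minute period is only the value the email floats as an open question.

```python
from dataclasses import dataclass

CHECK_PERIOD_MIN = 10  # floated in the email as an open question


@dataclass
class SchedulerState:
    edf_entry_event: bool            # checkpoint after swap time, files downloaded, task exit, ...
    minutes_since_last_check: float
    runnable_scheduled: int          # runnable tasks currently scheduled
    cpus_allowed: int
    checkpoint_after_swap_time: bool
    result_newly_requires_edf: bool


def should_check_preemption(s: SchedulerState) -> bool:
    """First list: an EDF-entry event occurred, or the periodic check is due."""
    return s.edf_entry_event or s.minutes_since_last_check >= CHECK_PERIOD_MIN


def should_select_tasks(s: SchedulerState) -> bool:
    """Second list: any one of the three conditions triggers task selection."""
    return (s.runnable_scheduled < s.cpus_allowed
            or s.checkpoint_after_swap_time
            or s.result_newly_requires_edf)


idle = SchedulerState(False, 3.0, 2, 2, False, False)
print(should_check_preemption(idle), should_select_tasks(idle))
```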

AND

John.McLeod@xxxxxxxxx.com to elst93, boinc_dev
May 4 (1 day ago)

How often to check whether pre-emption is needed probably should not be user
configurable, because someone is going to set the number way too large.

If the process doesn't checkpoint, it will either complete (and the system
will fall under 1 - not enough runnable results running) OR another process
will require attention in order to meet deadline in which case, that
process will start running.

One further note, if a process does actually make it to a checkpoint, it
will then be removed from memory when it suspends - this suspend will
happen within a second or two of the checkpoint.

jm7

It seems from this that it's already being looked into.


