Bug reports for Ralph 5.05 and higher

Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 1482 - Posted: 5 May 2006, 3:00:28 UTC
Last modified: 5 May 2006, 11:39:15 UTC

Version 5.09 has been released. If you have errors in Version 5.09 please report them in the 5.09 Bug thread.
Moderator9
ID: 1482
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1484 - Posted: 5 May 2006, 3:42:28 UTC - in response to Message 1476.  

This computer is headless. Remote access only. Hence no screensaver.

Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install, you're hosed.

tony


It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage: Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.


Starting and stopping did indeed reset the time to 0 (I had to reboot for other reasons). I am going to allow it to build back up, and once it is over 24 hours I will report back. It's the Max Time setting (24 hrs) that appears not to be working.

ID: 1484
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1487 - Posted: 5 May 2006, 14:52:26 UTC - in response to Message 1480.  

So the short of this is, if the work unit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the work unit and letting the system run a different one for 5 min. Then restart the first work unit again for 5 min. Repeat this process 4-5 times and it should abort the work unit if it was stuck. If it is not stuck, it should let it keep running. Either that or we have a watchdog bug.



Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.

May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.



ID: 1487
Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 1490 - Posted: 5 May 2006, 17:56:09 UTC - in response to Message 1487.  

So the short of this is, if the work unit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the work unit and letting the system run a different one for 5 min. Then restart the first work unit again for 5 min. Repeat this process 4-5 times and it should abort the work unit if it was stuck. If it is not stuck, it should let it keep running. Either that or we have a watchdog bug.



Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.

May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.

Well, it is really two separate functions that are fallbacks to one another. If the watchdog never has the opportunity to work (i.e. the work unit is never stopped and started for the check to occur), then the work unit will hit a wall for maximum time to process. The Max time function is independent of the watchdog and works on a different set of criteria and variables. The Max time is hard coded by the project before the work unit is sent out.

Right now that max time on Rosetta is 24 hours. I think it is the same for Ralph, but Rhiju would have to verify that, because it could be different for each set of work units.

In any case you are correct. If your system was in EDF mode, the watchdog would not likely have kicked in. Perhaps that is a good reason to revisit how the checking is done.
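
To illustrate the two fallbacks described above, here is a minimal Python sketch. It is not Rosetta or BOINC code: the names WorkUnit, watchdog_trips, stalled_restarts, and the threshold of 4 stalled restarts are assumptions made purely for illustration (the threshold is loosely based on the suggestion to repeat the suspend/resume cycle four or five times). Only the 24-hour Max time figure comes from this thread, and even that may differ per batch of work units.

```python
from dataclasses import dataclass

# 24 hours is the figure quoted in this thread for Rosetta; assumed here for Ralph.
MAX_TIME_HOURS = 24.0


@dataclass
class WorkUnit:
    cpu_hours: float        # CPU time accumulated so far
    stalled_restarts: int   # hypothetical: restarts that showed no progress


def max_time_exceeded(wu: WorkUnit, limit_hours: float = MAX_TIME_HOURS) -> bool:
    """Hard cap: depends only on accumulated CPU time."""
    return wu.cpu_hours >= limit_hours


def watchdog_trips(wu: WorkUnit, restarted: bool, max_stalls: int = 4) -> bool:
    """Hypothetical watchdog: only evaluated when the work unit is restarted."""
    if not restarted:
        # Never stopped and started (e.g. EDF mode): the check never runs.
        return False
    return wu.stalled_restarts >= max_stalls


def should_abort(wu: WorkUnit, restarted: bool) -> bool:
    """Either mechanism alone is enough to end the work unit."""
    return max_time_exceeded(wu) or watchdog_trips(wu, restarted)
```

In this sketch, a work unit that has run 40 hours without ever being restarted is skipped by the watchdog but still caught by the Max time check, which is the behavior the description above implies should happen.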

Moderator9
ID: 1490
feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1495 - Posted: 5 May 2006, 22:16:17 UTC - in response to Message 1487.  

May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.


Perhaps he's on to something there. Could the watchdog code be evaluated at the points in the model where checkpoints are possible? Or is that part of the problem: we never reach the checkpointable stage in the model?

I just wanted to point out that BOINC doesn't request checkpoints. It is up to the application to do so when appropriate. Rosetta now does checkpoints about every 20 minutes or so. So it was not that previously Rosetta was ignoring any requests from BOINC. It's just that the architecture of BOINC is such that the manager cannot signal the application to do a checkpoint, indeed most applications have to complete a certain phase of processing before they can do so, and in that sense Rosetta is no different. With the new changes, they have actually created new places in their crunching where checkpoints may be performed... and performed efficiently. You don't want to waste time doing too much checkpointing either, so it's a balance.

What happens every hour or so is BOINC reevaluating if the application being run should be switched (60min is the default "switch between applications every..." time). And if, at the point of that switch, the application is removed from memory, then the work done since last checkpoint is all lost. This is how BOINC works. This is why the more frequent checkpointing was such a great thing for productivity. And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!
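
As a rough illustration of why more frequent checkpointing helps, here is a small Python sketch. The function name and its arguments are made up for this example and are not part of BOINC or Rosetta. It assumes the application is removed from memory at the moment of preemption, so everything since the last checkpoint is lost.

```python
def minutes_lost_on_preempt(checkpoint_minutes, preempt_minute):
    """Return the minutes of work lost if the app is removed from memory.

    checkpoint_minutes: times (minutes since start) at which the app checkpointed.
    preempt_minute: time (minutes since start) at which BOINC switches apps.
    """
    # Only checkpoints taken before the preemption count.
    completed = [t for t in checkpoint_minutes if t <= preempt_minute]
    # With no checkpoint at all, work restarts from the beginning.
    last_saved = max(completed, default=0)
    return preempt_minute - last_saved


# No checkpoints before a 60-minute switch: the whole hour is lost.
print(minutes_lost_on_preempt([], 60))        # 60
# Checkpoints roughly every 20 minutes: only the tail is lost.
print(minutes_lost_on_preempt([21, 43], 60))  # 17
```

If BOINC only preempted right after a checkpoint, the loss in this model would be close to zero.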
ID: 1495
Astro

Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 1497 - Posted: 5 May 2006, 23:15:28 UTC - in response to Message 1495.  
Last modified: 5 May 2006, 23:15:53 UTC

And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!

This was posted to the BOINC alpha mailing list yesterday by JM7 (the creator of the scheduler):

John.McLeod@xxxxxxxxxx.com to boinc_dev
May 4 (1 day ago)

I have been working on the CPU scheduler to see what I can do to make it
work as the doc says it should.

What I have at the moment:

The CPU scheduler checks the necessity to preempt:
1) If one of the events that could cause entry to EDF occurs.
(Checkpoint after process swap time, files downloaded, task exit, ...).
2) At least once every 10 minutes. (Just to be safe). What should this
frequency be? 10 minutes? an hour? the time between allowed checkpoints?

The CPU scheduler selects tasks to run if:
1) There are not enough runnable tasks scheduled to meet 1 per CPU
allowed. (Startup / task complete / running task suspended ...).
2) A checkpoint has been reached after the process swap time.
3) One or more results has recently entered the state of requiring EDF.

Enforcement is immediate. If a result has reached its checkpoint after
process swap time, and the CPU scheduler has scheduled it for another
process time, then it gets the full time allotted to it (default another
hour + time to checkpoint).

AND

John.McLeod@xxxxxxxxx.com to elst93, boinc_dev
May 4 (1 day ago)

How often to check to see if pre-emption is needed may not want to be user
configurable because someone is going to set the number to way too large.

If the process doesn't checkpoint, it will either complete (and the system
will fall under 1 - not enough runnable results running) OR another process
will require attention in order to meet deadline, in which case that
process will start running.

One further note, if a process does actually make it to a checkpoint, it
will then be removed from memory when it suspends - this suspend will
happen within a second or two of the checkpoint.

jm7

It seems from this that it's already being looked into.
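
For anyone who finds the rules in jm7's first message easier to read as code, here is a rough Python sketch of the two decisions he lists. This is only a paraphrase of the email, not the actual BOINC scheduler: the function and parameter names are invented for illustration, and the 10-minute interval is the value he was still asking about.

```python
def should_check_preemption(edf_event_occurred: bool,
                            minutes_since_last_check: float,
                            max_interval_min: float = 10.0) -> bool:
    """When the scheduler checks whether preemption is necessary.

    1) An event that could cause entry to EDF occurred (checkpoint after
       process swap time, files downloaded, task exit, ...).
    2) At least once every max_interval_min minutes, just to be safe.
    """
    return edf_event_occurred or minutes_since_last_check >= max_interval_min


def should_select_tasks(runnable_scheduled: int,
                        cpus_allowed: int,
                        checkpoint_after_swap_time: bool,
                        result_entered_edf: bool) -> bool:
    """When the scheduler selects tasks to run.

    1) Fewer runnable tasks are scheduled than CPUs allowed.
    2) A checkpoint has been reached after the process swap time.
    3) One or more results recently entered the state of requiring EDF.
    """
    return (runnable_scheduled < cpus_allowed
            or checkpoint_after_swap_time
            or result_entered_edf)
```

Note that rule 2 of the second function is the part that matters here: a task is only swapped at a checkpoint, so a task that never checkpoints keeps running until it finishes or a deadline forces another task in.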
ID: 1497