Bug reports for Ralph 5.05 and higher


Profile anders n

Send message
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 1431 - Posted: 29 Apr 2006, 10:54:13 UTC - in response to Message 1430.  

Hello, I have a problem with this WU (5.06):
FA_CASP6_t216__444_2_2
50 minutes for 1.02%.
So I aborted it.
Bye and go on...



They are so big that it takes more than 1 hour on a fast computer to complete 1 decoy.

Anders n
ID: 1431 · Report as offensive    Reply Quote
Dotsch
Avatar

Send message
Joined: 4 Mar 06
Posts: 12
Credit: 13,725
RAC: 0
Message 1434 - Posted: 29 Apr 2006, 21:10:15 UTC

I have some problems with 5.06 on Windows 98 :
<core_client_version>5.2.13</core_client_version>
<message> - exit code -164 (0xffffff5c)
</message>
<stderr_txt>
LoadLibraryA( dbghelp95.dll ): GetLastError = 1157
LoadLibraryA( dbghelp.dll ): GetLastError = 1157

</stderr_txt>


Result ID : https://ralph.bakerlab.org/result.php?resultid=100666
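
For readers puzzled by the two forms of the exit code above: -164 and 0xffffff5c are the same 32-bit value, printed once as a signed integer and once as unsigned hex. A minimal Python sketch of the conversion follows; the helper names are hypothetical and are not part of BOINC.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(to_unsigned32(-164)))  # 0xffffff5c
print(to_signed32(0xFFFFFF5C))   # -164
```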
ID: 1434 · Report as offensive    Reply Quote
Rhiju
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1436 - Posted: 30 Apr 2006, 2:30:46 UTC - in response to Message 1425.  

Hi Feet1st, these are great suggestions, as usual! We've come to expect them.
I'm about to post 5.08, and I'll ask that ralph users use similar preferences to their r@h preferences, as you suggest. I think the checkpointing and watchdog issues have largely been resolved, thankfully, and we've moved on to testing real science.

As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. In the future, if we can get trickle-messages implemented, we could send out a purge request. Still, I hear you ... I'll keep sending out work and ask others to do the same.

Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?

I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort 5.05 WUs.

This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work, I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU.


ID: 1436 · Report as offensive    Reply Quote
Rhiju
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1437 - Posted: 30 Apr 2006, 2:33:32 UTC - in response to Message 1422.  

Hi Mike: this is a silly thing that we haven't quite been able to fix, but it should happen rarely on Rosetta@home. That Ralph workunit was a test that our watchdog timer properly aborts really long-running jobs. So we're very glad to see it worked on your computer! If you ever run into similar super-long workunits on Rosetta@home (hopefully not!), you'll eventually get credit granted for them, because that's our policy. Thanks for posting!

4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 ( WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0 -161)
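
The timestamps in the log above show roughly how long this watchdog test ran before it was ended. A minimal Python sketch, assuming the pipe-delimited `time|project|message` layout seen in these lines, an English locale for the AM/PM marker, and that the task started about when the files finished downloading; the function name is hypothetical.

```python
from datetime import datetime

# Three lines copied from the message log quoted above.
LOG = """\
4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
"""

TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    timestamp, project, message = line.split("|", 2)
    return datetime.strptime(timestamp, TIME_FORMAT), project, message


entries = [parse_line(line) for line in LOG.splitlines()]
elapsed = entries[-1][0] - entries[0][0]
print(elapsed)  # 2:22:01
```

So the job was ended a little under two and a half hours after download, rather than running indefinitely.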


result: https://ralph.bakerlab.org/result.php?resultid=97709

Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM


There is an additional message in the result about a non-existent file:
GZIP SILENT FILE: .xx1enh.out
WARNING! attempt to gzip file .xx1enh.out failed: file does not exist.


ID: 1437 · Report as offensive    Reply Quote
tralala

Send message
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1439 - Posted: 30 Apr 2006, 7:19:42 UTC - in response to Message 1436.  
Last modified: 30 Apr 2006, 7:26:39 UTC

As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again.


That's easy to solve: limit the daily quota to five or less. That means clients grab new jobs instantly but can't pile up big caches.
At the moment it works as follows: the first 20 clients pile up 20 WUs each and no more work is available. These hosts are busy with them for several days, so you get your work returned late. With 5 WU/day, the first 80 clients grab 5 WUs each and are busy with them for only a day or less. I'd even say 3 WU/day is a good quota.

Short deadlines have a similar effect but it seems you reset them to match those of Rosetta.
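
To make the arithmetic above concrete, here is a small Python sketch. It assumes a batch of 400 work units (implied by 20 clients holding 20 WUs each) and single-CPU hosts that each take their full daily quota; the function name is hypothetical.

```python
def hosts_served(batch_size: int, quota_per_host: int) -> int:
    """Number of hosts that can each receive a full quota from one batch."""
    return batch_size // quota_per_host


BATCH_SIZE = 20 * 20  # 400 WUs, inferred from the numbers in the post

for quota in (20, 5, 3):
    print(f"quota {quota:>2} WU/day -> {hosts_served(BATCH_SIZE, quota)} hosts")
```

Under these assumptions a quota of 20 reaches 20 hosts, a quota of 5 reaches 80, and a quota of 3 reaches 133.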
ID: 1439 · Report as offensive    Reply Quote
Profile anders n

Send message
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 1440 - Posted: 30 Apr 2006, 8:02:49 UTC

Yes, a quota of 3-5 would keep most of the hosts supplied with work, and if you need fast answers to a test batch, set the return date to 1-3 days and they will be crunched first.

Anders n
ID: 1440 · Report as offensive    Reply Quote
Profile JKeck {pirate}
Avatar

Send message
Joined: 16 Feb 06
Posts: 14
Credit: 153,095
RAC: 0
Message 1441 - Posted: 30 Apr 2006, 11:16:15 UTC

I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.
BOINC WIKI

BOINCing since 2002/12/8
ID: 1441 · Report as offensive    Reply Quote
tralala

Send message
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1442 - Posted: 30 Apr 2006, 12:57:49 UTC - in response to Message 1441.  

I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.


The daily quota is per CPU. So if you have a dual-core or a Hyperthreading-enabled P4, you get 6 WU/day if the daily quota is 3 WU/day.
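
As a quick illustration of the per-CPU rule described above, here is a Python sketch. The function name is hypothetical, and it assumes the client reports each core or Hyperthreading logical processor as one CPU.

```python
def host_daily_limit(quota_per_cpu: int, cpus: int) -> int:
    """Daily work-unit limit for a host when the quota is applied per CPU."""
    return quota_per_cpu * cpus


for cpus in (1, 2, 4):
    print(f"{cpus} CPU(s), quota 3 -> {host_daily_limit(3, cpus)} WU/day")
```

With a quota of 3, a dual-CPU host can still fetch 6 WUs per day, which is enough to keep both CPUs busy.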
ID: 1442 · Report as offensive    Reply Quote
Mike Gelvin
Avatar

Send message
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1467 - Posted: 3 May 2006, 20:11:30 UTC

ROM,
I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?

It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3

I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours

Running on Win2000 SP4, leave in memory is set.

ID: 1467 · Report as offensive    Reply Quote
Profile feet1st

Send message
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1468 - Posted: 3 May 2006, 22:35:50 UTC - in response to Message 1467.  

It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3

I've seen other posts that this WU was specially designed to TEST the watchdog. It is INTENDED to have the watchdog step in and end it for you. So if you abort, you essentially leave the watchdog less proven. He'll get it!

But that SHOULD be the reason why the others "failed".

ID: 1468 · Report as offensive    Reply Quote
Mike Gelvin
Avatar

Send message
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1469 - Posted: 4 May 2006, 5:53:44 UTC - in response to Message 1467.  
Last modified: 4 May 2006, 5:54:14 UTC


https://ralph.bakerlab.org/workunit.php?wuid=83793

Now at 24 hours and still stuck at 1.04%.
ID: 1469 · Report as offensive    Reply Quote
Profile William Senn
Avatar

Send message
Joined: 16 Feb 06
Posts: 4
Credit: 30,895
RAC: 0
Message 1470 - Posted: 4 May 2006, 10:46:03 UTC

Hi,

Got two erroneous results, but did not report them here, yet, sorry for being so late....

resultid=98902
resultid=99919

App version 5.06 (both)...

Other 2 earlier workunits completed successfully....

greetings,

William Senn...


ID: 1470 · Report as offensive    Reply Quote
Mike Gelvin
Avatar

Send message
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1471 - Posted: 4 May 2006, 18:45:25 UTC - in response to Message 1469.  


36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?

ID: 1471 · Report as offensive    Reply Quote
Profile anders n

Send message
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 1472 - Posted: 4 May 2006, 19:02:41 UTC - in response to Message 1471.  

36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?


Hi Mike

Have you checked the graphics to see if the steps or % have changed?

The % should show as 1.04?? there, not just 1.04 as in BOINC Manager.

Anders n

ID: 1472 · Report as offensive    Reply Quote
Mike Gelvin
Avatar

Send message
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1473 - Posted: 4 May 2006, 19:20:30 UTC - in response to Message 1472.  


This computer is headless. Remote access only. Hence no screensaver.



ID: 1473 · Report as offensive    Reply Quote
Profile feet1st

Send message
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1474 - Posted: 4 May 2006, 19:32:08 UTC
Last modified: 4 May 2006, 19:34:48 UTC

Looks like your normal WUs are the 4hrs default... so we're now well past the 4x preference guideline I've seen posted elsewhere... so it is time to abort. Since we're here on Ralph, the diagnostic info should prove useful for study. Hopefully it's something they fixed in the versions after 5.06.
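
The "4x preference" rule of thumb mentioned above can be written as a one-line check. This Python sketch assumes the 4-hour default run time and the factor of 4 from this post; the function and parameter names are hypothetical.

```python
def past_abort_guideline(elapsed_hours: float,
                         preferred_hours: float = 4.0,
                         factor: float = 4.0) -> bool:
    """True once a task has run longer than factor x the preferred run time."""
    return elapsed_hours > preferred_hours * factor


print(past_abort_guideline(14))  # False: still under the 16-hour mark
print(past_abort_guideline(36))  # True: well past 16 hours
```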

Ironic... given your photo that your computer is "headless" :):)
ID: 1474 · Report as offensive    Reply Quote
Profile Astro

Send message
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 1475 - Posted: 4 May 2006, 21:57:41 UTC - in response to Message 1473.  
Last modified: 4 May 2006, 21:59:17 UTC

This computer is headless. Remote access only. Hence no screensaver.

Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install, you're hosed.

tony
ID: 1475 · Report as offensive    Reply Quote
Mike Gelvin
Avatar

Send message
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1476 - Posted: 4 May 2006, 22:13:47 UTC - in response to Message 1475.  
Last modified: 4 May 2006, 22:21:10 UTC


It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.

ID: 1476 · Report as offensive    Reply Quote
Moderator9
Volunteer moderator

Send message
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 1478 - Posted: 5 May 2006, 2:17:59 UTC - in response to Message 1476.  
Last modified: 5 May 2006, 2:27:12 UTC


If it is a BIG protein you may have to wait for some time to see the steps advance, but you may be able to detect the slightest motion in the searching window image. If you see either the steps counting up or the movement in the searching window, it is still processing. On some of the large Work Units, it is possible for them to run very long times past your time setting. I would note however that yours is running way too long over the time setting. I have had a few lately that went 14 hours with a time setting of 2 hours.

The point being this: unless the Workunit is either swapped out for project switching, or BOINC is turned off and on four times, the watchdog will never wake up and abort the work unit. Failing that, the work unit will be aborted when it hits a limit preset by the project, which SHOULD be 24 hours of CPU time.

My understanding is that it is designed to look at the Work unit each time it starts to process and determine if progress has been made since the last time it started up. This presupposes that the process was stopped for some reason. It does not just sit there checking the work unit all the time. If it never stops processing the workunit, it will not check it. With luck Rhiju will chime in here and correct me if I am wrong about this, but I am going on the last explanation I had for all this.

Now let me add a caution here. If you restart BOINC before the workunit reaches a percent complete of greater than 2%, the Work unit WILL START OVER FROM THE BEGINNING AND THE CPU TIME WILL RESET TO ZERO!

So if you are going to play with starting and stopping, you should have "keep in memory" set to yes, and then suspend the Work unit or start another project long enough for another process to run for a while.

The watchdog is supposed to do 4 of these checks which show no progress before it will abort the workunit. That is part of how they worked out the "four times your time setting" concept for manual aborts.

So the short of this is: if the workunit is simply running uninterrupted, it could run forever, or until it hits the max time setting. This is the risk of running a single-project setup. If you don't see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck, it should let it keep running. Either that or we have a watchdog bug.
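
The behaviour described above can be sketched in Python. This is only an illustration of my understanding as laid out in this post (a check on each restart, four stalled checks before an abort, and a 24-hour CPU cap); it is not the actual Rosetta watchdog code, and every name in it is hypothetical.

```python
from dataclasses import dataclass

MAX_STALLED_CHECKS = 4        # "4 of these checks" from the post
MAX_CPU_SECONDS = 24 * 3600   # the cap the post says SHOULD apply


@dataclass
class RestartWatchdog:
    """Checks progress only when the work unit is resumed."""
    last_progress: float = 0.0
    stalled_checks: int = 0

    def on_restart(self, progress: float, cpu_seconds: float) -> bool:
        """Return True if the work unit should be aborted."""
        if cpu_seconds >= MAX_CPU_SECONDS:
            return True
        if progress > self.last_progress:
            # Progress was made since the last restart: reset the counter.
            self.last_progress = progress
            self.stalled_checks = 0
        else:
            self.stalled_checks += 1
        return self.stalled_checks >= MAX_STALLED_CHECKS


# A work unit stuck at 1.04% that is suspended and resumed five times.
dog = RestartWatchdog()
results = [dog.on_restart(progress=1.04, cpu_seconds=3600) for _ in range(5)]
print(results)  # [False, False, False, False, True]
```

Note that in this model `on_restart` is never called for a work unit that runs uninterrupted, which is why a stuck task on a single-project host would keep going until the CPU cap.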

Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
ID: 1478 · Report as offensive    Reply Quote
Rhiju
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1480 - Posted: 5 May 2006, 2:34:00 UTC - in response to Message 1478.  

Hi Mike: thanks very much for posting. This sounds weird. The job should have been killed by the watchdog. In fact we sent out these workunits to test that infinite loops are aborted by the watchdog, and they've been "successful" in that they've mostly returned without keeping computers in infinite loops. For now, please either abort or follow mod9's suggestion of suspending and restarting a few times. If this occurs again, please post!


ID: 1480 · Report as offensive    Reply Quote
©2024 University of Washington
http://www.bakerlab.org