Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher
Author | Message |
---|---|
anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0 |
Hello, I have a problem with the 5.06 WUs: they are so big that it takes more than 1 hour on a fast computer to complete 1 decoy. Anders n |
Dotsch Send message Joined: 4 Mar 06 Posts: 12 Credit: 13,725 RAC: 0 |
I have some problems with 5.06 on Windows 98 : <core_client_version>5.2.13</core_client_version> <message> - exit code -164 (0xffffff5c) </message> <stderr_txt> LoadLibraryA( dbghelp95.dll ): GetLastError = 1157 LoadLibraryA( dbghelp.dll ): GetLastError = 1157 </stderr_txt> Result ID : https://ralph.bakerlab.org/result.php?resultid=100666 |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Hi Feet1st, these are great suggestions, as usual! We've come to expect them. I'm about to post 5.08, and I'll ask that ralph users use similar preferences to their r@h preferences, as you suggest. I think the checkpointing and watchdog issues have largely been resolved, thankfully, and we've moved on to testing real science. As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. In the future, if we can get trickle-messages implemented, we could send out a purge request. Still, I hear you ... I'll keep sending out work and ask others to do the same. Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06? |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Hi Mike: this is a silly thing that we haven't quite been able to fix, but should happen rarely on rosetta@home. That ralph workunit was a test that our watchdog timer properly aborts really long running jobs. So we're very glad to see it worked on your computer! If you ever run into similar super-long workunits on Rosetta@home (hopefully not!), you'll eventually get credit granted to it, because that's our policy. Thanks for posting! 4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. That's easy to solve: limit the daily quota to five or less. That means clients grab new jobs instantly but can't pile up big caches. At the moment it works as follows: the first 20 clients pile up 20 WUs each and no more work is available. These hosts are busy with them for several days, so you get your work returned late. With 5 WU/day, the first 80 clients grab 5 WUs each and are busy with them for only a day or less. I'd even say 3 WU/day is a good quota. Short deadlines have a similar effect, but it seems you reset them to match those of Rosetta. |
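The arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration: the function name `hosts_served` is made up here, the batch size of 400 is inferred from "20 clients with 20 WUs each", and the sketch assumes every host requests its full quota.

```python
def hosts_served(batch_size: int, quota_per_host: int) -> int:
    """Number of hosts that get work before the batch runs out,
    assuming each host grabs its full quota (an assumption)."""
    full, remainder = divmod(batch_size, quota_per_host)
    # A final host may receive the leftover work units.
    return full + (1 if remainder else 0)


batch = 400  # 20 hosts x 20 WUs, as in the post above

for quota in (20, 5, 3):
    print(f"quota {quota:2d}/day -> {hosts_served(batch, quota)} hosts get work")
```

A lower quota spreads the same batch over more hosts, so each host finishes its share sooner and results come back earlier.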
anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0 |
Yes, a quota of 3-5 would keep most of the hosts with work, and if you need fast answers to a test batch, set the return date to 1-3 days and they will be crunched first. Anders n |
JKeck {pirate} Send message Joined: 16 Feb 06 Posts: 14 Credit: 153,095 RAC: 0 |
I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts. BOINC WIKI BOINCing since 2002/12/8 |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts. The daily quota is per CPU. So if you have a dual-core or a Hyperthreading-enabled P4, you get 6 WU/day if the daily quota is 3 WU/day. |
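As a quick check of the per-CPU rule described above, here is a minimal sketch. The function name `host_daily_limit` is hypothetical and is not part of BOINC; it only restates the multiplication from the post.

```python
def host_daily_limit(quota_per_cpu: int, logical_cpus: int) -> int:
    """Daily work-unit limit for one host when the quota is applied per CPU."""
    return quota_per_cpu * logical_cpus


# Dual-core or Hyperthreading P4 (2 logical CPUs) with a quota of 3 WU/day
print(host_daily_limit(3, 2))
```

With a per-CPU quota, even a quota of 3 lets a two-CPU host keep both CPUs busy.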
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
ROM, I currently have a rosetta_beta_5.06 that has been running 14+ hours with progress at 1.04%. I have debug capability on this computer. Any suggestions, or just abort? It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3. I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours. Running on Win2000 SP4, leave in memory is set. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3 I've seen other posts that this WU was specially designed to TEST the watchdog. It is INTENDED to have the watchdog step in and end it for you. So if you abort, you essentially leave the watchdog less proven. He'll get it! But that SHOULD be the reason why the others "failed". |
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
ROM, https://ralph.bakerlab.org/workunit.php?wuid=83793 Now at 24 hours and still stuck at 1.04%. |
William Senn Send message Joined: 16 Feb 06 Posts: 4 Credit: 30,895 RAC: 0 |
Hi, I got two erroneous results but had not reported them here yet, sorry for being so late... resultid=98902 resultid=99919 App version 5.06 (both)... The other 2 earlier workunits completed successfully... Greetings, William Senn... |
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
ROM, 36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there? |
anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0 |
36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there? Hi Mike, have you checked the graphics to see if the steps or % have changed? The graphics should show the % with more digits (1.04??), not just 1.04 as in BOINC Manager. Anders n |
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there? This computer is headless. Remote access only. Hence no screensaver. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Looks like your normal WUs are at the 4-hour default... so we're now well past the 4x preference guideline I've seen posted elsewhere... so it is time to abort. Since we're here on Ralph, the diagnostic info should prove useful for study. Hopefully it's something they fixed in the versions after 5.06. Ironic... given your photo that your computer is "headless" :):) |
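The "4x preference" rule of thumb mentioned above can be written out as a small check. This is a sketch of the guideline as quoted in this thread, not project code; the function name and the `factor` parameter are assumptions made for illustration.

```python
def past_manual_abort_guideline(elapsed_hours: float,
                                preferred_hours: float,
                                factor: float = 4.0) -> bool:
    """True once a work unit has run longer than factor x the runtime preference."""
    return elapsed_hours > factor * preferred_hours


# 4-hour default preference -> guideline threshold of 16 hours
print(past_manual_abort_guideline(14, 4))  # first report in this thread
print(past_manual_abort_guideline(36, 4))  # latest report in this thread
```

At 14 hours the work unit was still inside the guideline; at 36 hours it is well beyond it.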
Astro Send message Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
This computer is headless. Remote access only. Hence no screensaver. Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install, you're hosed. tony |
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
This computer is headless. Remote access only. Hence no screensaver. It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage: Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means. |
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
This computer is headless. Remote access only. Hence no screensaver. If it is a BIG protein you may have to wait for some time to see the steps advance, but you may be able to detect the slightest motion in the searching window image. If you see either the steps counting up or movement in the searching window, it is still processing. On some of the large Work Units, it is possible for them to run for a very long time past your time setting. I would note, however, that yours is running way too long over the time setting. I have had a few lately that went 14 hours with a time setting of 2 hours. The point is this: unless the Work Unit is either swapped out for project switching, or BOINC is turned off and on four times, the watchdog will never wake up and abort the Work Unit. Failing that, the Work Unit will be aborted when it hits a limit preset by the project, which SHOULD be 24 hours of CPU time. My understanding is that the watchdog is designed to look at the Work Unit each time it starts to process and determine if progress has been made since the last time it started up. This presupposes that the process was stopped for some reason. It does not just sit there checking the Work Unit all the time. If the Work Unit never stops processing, the watchdog will not check it. With luck Rhiju will chime in here and correct me if I am wrong about this, but I am going on the last explanation I had for all this. Now let me add a caution here. If you restart BOINC before the Work Unit reaches a percent complete of greater than 2%, the Work Unit WILL START OVER FROM THE BEGINNING AND THE CPU TIME WILL RESET TO ZERO! So if you are going to play with starting and stopping, you should have "keep in memory" set to yes, and then suspend the Work Unit or start another project long enough for another process to run for a while. The watchdog is supposed to do 4 of these checks which show no progress before it will abort the Work Unit. 
That is part of how they worked out the "four times your time setting" concept for manual aborts. So the short of this is: if the Work Unit is simply running uninterrupted, it could run forever, or until it hits the max time setting. This is the risk of running a single-project setup. If you don't see movement in the graphic, try suspending the Work Unit and letting the system run a different one for 5 min. Then restart the first Work Unit again for 5 min. Repeat this process 4-5 times and it should abort the Work Unit if it was stuck. If it is not stuck, it should let it keep running. Either that or we have a watchdog bug. Moderator9 RALPH@home FAQs RALPH@home Guidelines Moderator Contact |
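The behaviour described in the post above can be modelled roughly as follows. This is only a sketch of the moderator's description (a check on each start, an abort after 4 checks without progress, and a 24-hour CPU cap); it is not the actual Rosetta watchdog code, and the class name, field names, and the choice to reset the counter when progress is seen are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartWatchdog:
    max_stalled_checks: int = 4    # "4 of these checks" from the post
    max_cpu_hours: float = 24.0    # project limit that "SHOULD be 24 hours"
    last_progress: Optional[float] = None
    stalled_checks: int = 0

    def check_on_start(self, progress: float, cpu_hours: float) -> bool:
        """Run once each time the work unit (re)starts; True means abort."""
        if cpu_hours >= self.max_cpu_hours:
            return True
        if self.last_progress is not None and progress <= self.last_progress:
            self.stalled_checks += 1   # no progress since the previous start
        else:
            self.stalled_checks = 0    # first start, or progress was made
        self.last_progress = progress
        return self.stalled_checks >= self.max_stalled_checks


# A work unit stuck at 1.041%: the first start records a baseline,
# and the fourth restart without progress triggers the abort.
dog = RestartWatchdog()
decisions = [dog.check_on_start(1.041, cpu_hours=10 + i) for i in range(5)]
print(decisions)
```

Under this model, a work unit that is never suspended or restarted is never checked, which matches the point made above that an uninterrupted job is only stopped by the CPU-time cap.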
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Hi Mike: thanks very much for posting. This sounds weird. The job should have been killed by the watchdog. In fact we sent out these workunits to test that infinite loops are aborted by the watchdog, and they've been "successful" in that they've mostly returned without keeping computers in infinite loops. For now, please either abort or follow mod9's suggestion of suspending and restarting a few times. If this occurs again, please post! [This computer is headless. Remote access only. Hence no screensaver. |
©2024 University of Washington
http://www.bakerlab.org