Bug reports for Ralph 5.05 and higher

Author	Message
anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 1431 - Posted: 29 Apr 2006, 10:54:13 UTC - in response to Message 1430. Hello, i have some problem with this Wu : 5.06 FA_CASP6_t216__444_2_2 50' for 1.02% So i aborted it Bye and go on... They are so big that it takes more than 1 H on a fast computer to complete 1 decoy. Anders n ID: 1431 · Reply Quote

Dotsch Send message Joined: 4 Mar 06 Posts: 12 Credit: 13,725 RAC: 0	Message 1434 - Posted: 29 Apr 2006, 21:10:15 UTC I have some problems with 5.06 on Windows 98 : <core_client_version>5.2.13</core_client_version> <message> - exit code -164 (0xffffff5c) </message> <stderr_txt> LoadLibraryA( dbghelp95.dll ): GetLastError = 1157 LoadLibraryA( dbghelp.dll ): GetLastError = 1157 </stderr_txt> Result ID : https://ralph.bakerlab.org/result.php?resultid=100666 ID: 1434 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1436 - Posted: 30 Apr 2006, 2:30:46 UTC - in response to Message 1425. Hi Feet1st, these are great suggestions, as usual! We've come to expect them. I'm about to post 5.08, and I'll ask that ralph users use similar preferences to their r@h preferences, as you suggest. I think the checkpointing and watchdog issues have largely been resolved, thankfully, and we've moved on to testing real science. As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. In the future, if we can get trickle-messages implemented, we could send out a purge request. Still, I hear you ... I'll keep sending out work and ask others to do the same. Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06? I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort 5.05 WUs. This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work, I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU. ID: 1436 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1437 - Posted: 30 Apr 2006, 2:33:32 UTC - in response to Message 1422. Hi Mike: this is a silly thing that we haven't quite been able to fix, but should happen rarely on rosetta@home. That ralph workunit was a test that our watchdog timer properly aborts really long running jobs. So we're very glad to see it worked on your computer! If you ever run into similar super-long workunits on Rosetta@home (hopefully not!), you'll eventually get credit granted to it, because that's our policy. Thanks for posting!
4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 ( WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0 -161)
result: https://ralph.bakerlab.org/result.php?resultid=97709
Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM
There was an additional message in the result about a non-existent file:
GZIP SILENT FILE: .xx1enh.out
WARNING! attempt to gzip file .xx1enh.out failed: file does not exist.
ID: 1437 · Reply Quote

tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1439 - Posted: 30 Apr 2006, 7:19:42 UTC - in response to Message 1436. Last modified: 30 Apr 2006, 7:26:39 UTC As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. That's easy to solve: limit the daily quota to five or less. That means clients grab new jobs instantly but can't pile up big caches. At the moment it works as follows: the first 20 clients pile up 20 WUs each and no more work is available. These hosts are busy with them for several days, so you get your work returned late. With 5 WU/day the first 80 clients grab 5 WUs each and are busy with them for only a day or less. I'd even say 3 WU/day is a good quota. Short deadlines have a similar effect, but it seems you reset them to match those of Rosetta. ID: 1439 · Reply Quote
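The arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration: the batch size of 400 WUs is an assumption inferred from the "20 clients with 20 WUs each" figure in the post, the function name is hypothetical, and none of this is BOINC scheduler code.

```python
def hosts_served(batch_size: int, quota_per_host: int) -> int:
    """Number of hosts that receive work if every host grabs its full quota."""
    if quota_per_host <= 0:
        raise ValueError("quota_per_host must be positive")
    # Ceiling division: the last host may receive only a partial share.
    return -(-batch_size // quota_per_host)


BATCH = 400  # assumed batch size: 20 hosts x 20 WUs, as in the post above

for quota in (20, 5, 3):
    print(f"quota {quota:>2}/day: {hosts_served(BATCH, quota)} hosts get work")
```

A smaller quota spreads the same batch over more hosts, so each host finishes its share sooner.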

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 1440 - Posted: 30 Apr 2006, 8:02:49 UTC Yes, a quota of 3-5 would keep most of the hosts with work, and if you need fast answers to a test batch, set the return date to 1-3 days and they will be crunched first. Anders n ID: 1440 · Reply Quote

JKeck {pirate} Send message Joined: 16 Feb 06 Posts: 14 Credit: 153,095 RAC: 0	Message 1441 - Posted: 30 Apr 2006, 11:16:15 UTC I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts. BOINC WIKI BOINCing since 2002/12/8 ID: 1441 · Reply Quote

tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1442 - Posted: 30 Apr 2006, 12:57:49 UTC - in response to Message 1441. I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts. The daily quota is per CPU. So if you have a dual-core or a Hyperthreading-enabled P4 you get 6 WU/day if the daily quota is 3 WU/day. ID: 1442 · Reply Quote
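The per-CPU rule described here is a simple multiplication. A minimal sketch, assuming nothing beyond what the post states (the function name is hypothetical, and any other limits the server may apply are ignored):

```python
def daily_limit(quota_per_cpu: int, n_cpus: int) -> int:
    """Work units a host may fetch per day when the quota is counted per CPU."""
    return quota_per_cpu * n_cpus


# A dual-core or Hyperthreading-enabled host reports 2 CPUs.
print(daily_limit(quota_per_cpu=3, n_cpus=2))
```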

Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0	Message 1467 - Posted: 3 May 2006, 20:11:30 UTC ROM, I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort? its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3 I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours Running on Win2000 SP4, leave in memory is set. ID: 1467 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 1468 - Posted: 3 May 2006, 22:35:50 UTC - in response to Message 1467. its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3 I've seen other posts that this WU was specially designed to TEST the watchdog. It is INTENDED to have the watchdog step in and end it for you. So if you abort, you essentially leave the watchdog less proven. He'll get it! But that SHOULD be the reason why the others "failed". ID: 1468 · Reply Quote

Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0	Message 1469 - Posted: 4 May 2006, 5:53:44 UTC - in response to Message 1467. Last modified: 4 May 2006, 5:54:14 UTC ROM, I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort? its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3 I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours Running on Win2000 SP4, leave in memory is set. https://ralph.bakerlab.org/workunit.php?wuid=83793 Now at 24 hours and still stuck at 1.04%. ID: 1469 · Reply Quote

William Senn Send message Joined: 16 Feb 06 Posts: 4 Credit: 30,895 RAC: 0	Message 1470 - Posted: 4 May 2006, 10:46:03 UTC Hi, Got two erroneous results, but did not report them here yet, sorry for being so late.... resultid=98902 resultid=99919 App version 5.06 (both)... Other 2 earlier workunits completed successfully.... greetings, William Senn... ID: 1470 · Reply Quote

Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0	Message 1471 - Posted: 4 May 2006, 18:45:25 UTC - in response to Message 1469. ROM, I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort? its labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3 I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours Running on Win2000 SP4, leave in memory is set. https://ralph.bakerlab.org/workunit.php?wuid=83793 Now at 24 hours and still stuck at 1.04%. 36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there? ID: 1471 · Reply Quote

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 1472 - Posted: 4 May 2006, 19:02:41 UTC - in response to Message 1471. 36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there? Hi Mike, have you checked the graphics to see if the steps or % have changed? The % should show as 1.04?? and not as in BOINC Manager with only 1.04. Anders n ID: 1472 · Reply Quote

Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0	Message 1473 - Posted: 4 May 2006, 19:20:30 UTC - in response to Message 1472. 36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there? Hi Mike, have you checked the graphics to see if the steps or % have changed? The % should show as 1.04?? and not as in BOINC Manager with only 1.04. Anders n This computer is headless. Remote access only. Hence no screensaver. ID: 1473 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 1474 - Posted: 4 May 2006, 19:32:08 UTC Last modified: 4 May 2006, 19:34:48 UTC Looks like your normal WUs are the 4hrs default... so we're now well past the 4x preference guideline I've seen posted elsewhere... so it is time to abort. Since we're here on Ralph, the diagnostic info should prove useful for study. Hopefully it's something they fixed in the versions after 5.06. Ironic... given your photo that your computer is "headless" :):) ID: 1474 · Reply Quote

Astro Send message Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0	Message 1475 - Posted: 4 May 2006, 21:57:41 UTC - in response to Message 1473. Last modified: 4 May 2006, 21:59:17 UTC This computer is headless. Remote access only. Hence no screensaver. Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed. tony ID: 1475 · Reply Quote

Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0	Message 1476 - Posted: 4 May 2006, 22:13:47 UTC - in response to Message 1475. Last modified: 4 May 2006, 22:21:10 UTC This computer is headless. Remote access only. Hence no screensaver. Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed. tony It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means. ID: 1476 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1480 - Posted: 5 May 2006, 2:34:00 UTC - in response to Message 1478. Hi Mike: thanks very much for posting. This sounds weird. The job should have been killed by the watchdog. In fact we sent out these workunits to test that infinite loops are aborted by the watchdog, and they've been "successful" in that they've mostly returned without keeping computers in infinite loops. For now, please either abort or follow mod9's suggestion of suspending and restarting a few times. If this occurs again, please post! This computer is headless. Remote access only. Hence no screensaver. Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed. tony It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means. If it is a BIG protein you may have to wait for some time to see the steps advance, but you may be able to detect the slightest motion in the searching window image. If you see either the steps counting up or the movement in the searching window, it is still processing. On some of the large Work Units, it is possible for them to run for a very long time past your time setting. I would note, however, that yours is running way too long over the time setting. I have had a few lately that went 14 hours with a time setting of 2 hours. The point is this: unless the Workunit is either swapped out for project switching, or boinc is turned on and off four times, the watchdog will never wake up and abort the work unit. 
Failing that, the work unit will be aborted when it hits a limit preset by the project, which SHOULD be 24 hours of CPU time. My understanding is that the watchdog is designed to look at the Work unit each time it starts to process and determine if progress has been made since the last time it started up. This presupposes that the process was stopped for some reason. It does not just sit there checking the work unit all the time. If it never stops processing the workunit it will not check it. With luck Rhiju will chime in here and correct me if I am wrong about this, but I am going on the last explanation I had for all this. Now let me add a caution here. If you restart BOINC before the workunit reaches a percent complete of greater than 2%, the Work unit WILL START OVER FROM THE BEGINNING AND THE CPU TIME WILL RESET TO ZERO! So if you are going to play with starting and stopping, you should have "keep in memory" set to yes, and then suspend the Work unit or start another project long enough for another process to run for a while. The watchdog is supposed to do 4 of these checks which show no progress before it will abort the workunit. That is part of how they worked out the "four times your time setting" concept for manual aborts. So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug. ID: 1480 · Reply Quote
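The behaviour described in this post can be sketched as follows. This is a minimal illustration of the explanation as given, not the actual Rosetta watchdog: the class, method, and constant names are hypothetical, and the only values taken from the thread are the 24-hour CPU limit and the four no-progress checks.

```python
from dataclasses import dataclass

MAX_CPU_HOURS = 24.0     # project-wide limit mentioned in the post
MAX_STALLED_CHECKS = 4   # "4 of these checks which show no progress"


@dataclass
class Watchdog:
    last_progress: float = -1.0  # percent complete seen at the previous start
    stalled_checks: int = 0      # consecutive starts with no progress

    def on_start(self, progress: float, cpu_hours: float) -> bool:
        """Called each time the work unit (re)starts; True means abort."""
        if cpu_hours >= MAX_CPU_HOURS:
            return True
        if progress > self.last_progress:
            self.stalled_checks = 0   # progress was made, reset the counter
        else:
            self.stalled_checks += 1  # no progress since the last start
        self.last_progress = progress
        return self.stalled_checks >= MAX_STALLED_CHECKS


wd = Watchdog()
# A work unit stuck at 1.04% that is suspended and resumed repeatedly.
decisions = [wd.on_start(progress=1.04, cpu_hours=1.0) for _ in range(5)]
print(decisions)
```

In this sketch the check runs only inside `on_start`, which mirrors the point made above: a work unit that is never interrupted is never checked, so only the CPU-time limit could stop it.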

Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0	Message 1484 - Posted: 5 May 2006, 3:42:28 UTC - in response to Message 1476. This computer is headless. Remote access only. Hence no screensaver. Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed. tony It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means. Starting and stopping did indeed reset the time to 0 (I had to reboot for other reasons). I am going to allow it to build back up... at over 24 hours I will report back. It's the Max Time Setting (24 hrs) that appears to not be working. ID: 1484 · Reply Quote