Bug reports for Ralph 5.05 and higher

Author	Message
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1367 - Posted: 26 Apr 2006, 5:55:54 UTC Last modified: 27 Apr 2006, 3:51:57 UTC Really, we think this is the last one before Rosetta@home is updated! This is mainly to fix a silly, small bug that got introduced with the latest checkpointing. For those interested, the watchdog is still not exiting gracefully every time -- if no data was created, there's still a file transfer error. We're trying to figure out why, but will likely need help from the BOINC team to fix it. Fortunately, the jobs that give these errors are rare -- and produce no data anyway. Of course, we will continue to grant credit every week for errored jobs when the app gets updated on Rosetta@home.

tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1368 - Posted: 26 Apr 2006, 7:45:06 UTC - in response to Message 1367. If I try to fetch work I get a "Project is down" message.

tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1369 - Posted: 26 Apr 2006, 8:29:55 UTC Now it's:
26/04/2006 10:48:22|ralph@home|Message from server: Server has software problem
26/04/2006 10:48:22|ralph@home|Project is down

Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0	Message 1375 - Posted: 26 Apr 2006, 11:38:41 UTC Alpha testers: abort any 5.04 WUs that may be sitting in your cache/queue, so you can start testing 5.05 ASAP :-)

rbpeake Joined: 16 Feb 06 Posts: 19 Credit: 3,370 RAC: 0	Message 1376 - Posted: 26 Apr 2006, 11:41:15 UTC - in response to Message 1367. "Really, we think this is the last one before Rosetta@home is updated!" Hey, take as much time as you need! You are so close now, you might as well wrap it up in style with a bulletproof application! :)

tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1377 - Posted: 26 Apr 2006, 13:04:09 UTC Last modified: 26 Apr 2006, 13:05:03 UTC Both WUs I tried finished valid, but both results show a warning: "WARNING! error deleting file .aah002.out". However, no such file is present on my computer any longer. https://ralph.bakerlab.org/results.php?userid=1266

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 1379 - Posted: 26 Apr 2006, 14:11:32 UTC WUs failed in under 1 minute... and I'll tell you why... I was playing around suspending R@H WUs, trying to prevent downloads of more, getting Ralph to fetch some new WUs, killing those of the previous version, etc. Suffice it to say I had about 8 WUs suspended and left in memory. This caused Windows to extend its paging file, and two other WUs failed immediately; the failures attempted to bring in debug code, which further increased the memory requirements. 97682 97672 97670 Here's the msg you see in Windows:

tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1382 - Posted: 26 Apr 2006, 17:07:23 UTC - in response to Message 1379. Last modified: 26 Apr 2006, 17:48:19 UTC The watchdog aborted this although the overall runtime was only a couple of minutes. After a few minutes of runtime it was preempted by another WU for a few hours, and after resuming, the watchdog probably assumed it had run for over an hour with no progress. It seems the watchdog is only comparing two points in time without checking what happened in between.
04/26/06 18:59:19||Rescheduling CPU: application exited
04/26/06 18:59:19|ralph@home|Computation for task AB_CASP6_u272__444_4_0 finished
335.453125
stderr out
<core_client_version>5.4.6</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3882530
********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 33.7964 for 3600 seconds
********************************************************************
GZIP SILENT FILE: .xxu272.out
WARNING! error deleting file .xxu272.out
</stderr_txt>
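The suspicion in the post above (a stuck-score timer that measures wall-clock time, so hours spent preempted count as "no progress") can be illustrated with a minimal Python sketch. This is not Rosetta's watchdog code; `StuckScoreWatchdog`, `naive_is_stuck`, and all parameter names are hypothetical, and only the 3600-second limit is taken from the stderr output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StuckScoreWatchdog:
    """Hypothetical watchdog that counts only active compute time."""
    limit_s: float = 3600.0             # limit taken from the stderr message
    last_score: Optional[float] = None
    stuck_cpu_s: float = 0.0            # CPU seconds spent at the current score

    def observe(self, score: float, cpu_delta_s: float) -> bool:
        """Record one check. cpu_delta_s is the CPU time used since the
        previous check. Returns True if the run should be killed."""
        if self.last_score is None or score != self.last_score:
            # Score moved (or first observation): restart the stuck timer.
            self.last_score = score
            self.stuck_cpu_s = 0.0
        else:
            # Score unchanged: only time actually spent computing counts.
            self.stuck_cpu_s += cpu_delta_s
        return self.stuck_cpu_s >= self.limit_s


def naive_is_stuck(score_changed_at_wall_s: float,
                   now_wall_s: float,
                   limit_s: float = 3600.0) -> bool:
    """Assumed faulty behaviour: compare two wall-clock points in time."""
    return now_wall_s - score_changed_at_wall_s >= limit_s


wd = StuckScoreWatchdog()
wd.observe(33.7964, 0.0)              # first score seen
wd.observe(33.7964, 300.0)            # 5 minutes of real work
kill_after_preemption = wd.observe(33.7964, 0.0)   # 3 h preempted, no CPU used
naive_kill = naive_is_stuck(0.0, 3 * 3600 + 300)   # same history, wall clock
```

With this history the wall-clock check fires while the CPU-time check does not, which matches a task being killed after only a few minutes of actual runtime.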

Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0	Message 1385 - Posted: 26 Apr 2006, 19:28:16 UTC Linux, 256 MB RAM, plenty of swap space; WU frozen at 100% done!
Wed Apr 26 16:41:36 BRT 2006
crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6
2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
Wed Apr 26 16:44:41 BRT 2006
CPU usage 0.0000%
What should I do next? Perhaps kill the app with some special signal to force a core dump and then e-mail that core dump to ???
Thanks
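The pause/resume history in the client log above can be turned into run intervals, which helps when judging how long a task was actually computing. The following is a minimal Python sketch and not part of BOINC; `EVENT` and `run_intervals` are hypothetical names, and the log line layout is assumed from the excerpt in the post.

```python
import re
from datetime import datetime

# Log lines copied from the post above.
LOG = """\
2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
"""

# Assumed layout: "<date> <time> [<project>] <Starting|Pausing|Resuming> result <name> ..."
EVENT = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]+\] "
    r"(Starting|Pausing|Resuming) result (\S+)"
)


def run_intervals(log_text, task):
    """Return (completed run spans in seconds, time of a still-open resume or None)."""
    spans = []
    running_since = None
    for line in log_text.splitlines():
        match = EVENT.match(line)
        if not match or match.group(3) != task:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if match.group(2) in ("Starting", "Resuming"):
            running_since = when
        elif running_since is not None:
            # A "Pausing" event closes the currently open run span.
            spans.append((when - running_since).total_seconds())
            running_since = None
    return spans, running_since


spans, open_since = run_intervals(LOG, "FA_CASP6_v272__435_19_0")
print(spans)        # [133.0, 4038.0]
print(open_since)   # 2006-04-26 15:46:56
```

For this excerpt the task ran for 133 s and then 4038 s before the last resume at 15:46:56. If the shell timestamp (16:44:41 BRT) is in the same time zone as the log, which is an assumption, the task had been nominally running for just under an hour at 0% CPU when the post was written.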

Jose Joined: 25 Apr 06 Posts: 7 Credit: 77 RAC: 0	Message 1386 - Posted: 26 Apr 2006, 20:35:09 UTC This unit was aborted after less than one hour of running (my time preference is 2 hours). https://ralph.bakerlab.org/result.php?resultid=97305
AB_CASP6_t216__438_3_0
Workunit 86138
CPU time 3180.21875
stderr out
<core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3882811
# cpu_run_time_pref: 7200
********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 71.0875 for 3600 seconds
********************************************************************
GZIP SILENT FILE: .xxt216.out
WARNING! attempt to gzip file .xxt216.out failed: file does not exist.
</stderr_txt>
<message><file_xfer_error>
<file_name>AB_CASP6_t216__438_3_0_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
</message>
Validate state Invalid
Claimed credit 11.1502124855248
Granted credit 0
application version 5.05
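The numbers in the report above contradict each other: the watchdog claims the score was stuck for 3600 seconds, yet the task had used only about 3180 seconds of CPU time. A minimal Python sketch of that check follows; `watchdog_consistency` is a hypothetical helper, and the regular expressions assume the field layout shown in the post.

```python
import re

# Fields copied from the result report above.
REPORT = """\
CPU time 3180.21875
# cpu_run_time_pref: 7200
Stuck at score 71.0875 for 3600 seconds
"""


def watchdog_consistency(report):
    """Compare the reported CPU time with the claimed 'stuck' duration."""
    cpu_s = float(re.search(r"CPU time ([\d.]+)", report).group(1))
    pref_s = int(re.search(r"cpu_run_time_pref: (\d+)", report).group(1))
    stuck_s = int(
        re.search(r"Stuck at score -?[\d.]+ for (\d+) seconds", report).group(1)
    )
    return {
        "cpu_s": cpu_s,
        "pref_s": pref_s,
        "stuck_s": stuck_s,
        # A task cannot have been stuck for longer than it has computed.
        "stuck_exceeds_cpu": stuck_s > cpu_s,
        "shortfall_s": stuck_s - cpu_s,
    }


check = watchdog_consistency(REPORT)
print(check["stuck_exceeds_cpu"], check["shortfall_s"])   # True 419.78125
```

The claimed stuck time exceeds the total CPU time by roughly 420 seconds, which is consistent with the stuck timer counting time during which the task was not actually computing.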

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 1391 - Posted: 27 Apr 2006, 2:34:12 UTC Just got home to see how my PC did, I think the watchdog ralphed all over my work. This is on my same host as I posted about earlier where I was short on VM swap space and lost 3 WUs. Problems all over, watchdog kicking in (I've never had any hung WUs before, so seems unlikely it was required), failing after <1 min. I did not abort any v5.05 WUs. Very few successes. Now my Ralph WUs are completed, and I'm crunching R@H again. Got a WU with FAST in the name, the thing has ripped 607 models in under 14hrs (I have 24hr preference, the output file is 4.6MB now... might come close to 10 once we're done!). I didn't reboot or restart BOINC since the memory issues about 13hrs ago.

Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1392 - Posted: 27 Apr 2006, 3:47:55 UTC - in response to Message 1385. This almost looks like a BOINC manager problem. Go ahead and abort it; then can you post a link to the result here? Thanks. No need to send us the core dump.

Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1393 - Posted: 27 Apr 2006, 3:50:11 UTC - in response to Message 1391. Last modified: 27 Apr 2006, 3:53:12 UTC Feet1st, whoa you've got a really fast client.... As for ralph, I share your concerns. I'm trying another fix on 5.06, can you attach to ralph now? If the watchdog is still too aggressive, we'll have to keep it off for rosetta@home until we avoid this error. Two questions for you: How often do you switch between apps (if at all)? When ralph is pre-empted, do you "Keep in Memory"? My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.

simpe73 Joined: 20 Feb 06 Posts: 2 Credit: 36,752 RAC: 0	Message 1394 - Posted: 27 Apr 2006, 4:05:05 UTC 5.05 works fine, but.... In every result I've checked there is "WARNING! error deleting file .xxv272.out". The name of the file varies, but they are always .out files.

[B^S] sTrey Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0	Message 1397 - Posted: 27 Apr 2006, 6:41:18 UTC Last modified: 27 Apr 2006, 6:52:38 UTC wu 79004 killed by watchdog just after 2 hrs' runtime (Stuck at score -115.914 for 3600 seconds) (Sorry this was 5.05, can't get any 5.06 wus)

tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 1398 - Posted: 27 Apr 2006, 9:32:05 UTC I just finished one WU with several deliberate switches in between (long and short) and it seems 5.06 has solved the issue. The warning message about the file deletion error is also gone. :-) https://ralph.bakerlab.org/result.php?resultid=98261 Now Rhiju please have a look at this and this.

Jose Joined: 25 Apr 06 Posts: 7 Credit: 77 RAC: 0	Message 1400 - Posted: 27 Apr 2006, 12:20:19 UTC OK, I have been running the following RALPH work unit: ID 6204, name AB_CASP6_t198__438_5_0. It ran for around 57 minutes and was then preempted (keeping the record of the CPU time in my work record), and the corresponding Rosetta work unit restarted. Once the Rosetta work unit stopped, the application switched to RALPH and work unit ID 6204 restarted, but it started DE NOVO, that is, from a CPU time of 0 and not from the accumulated 57+ minutes it had when it was preempted and the application switch happened. My preferences are set so that work is kept in memory, and this did not happen in this case. So, to make the story short: the 57+ minutes of CPU time for the work unit that had been stored in memory disappeared into the big void in the sky. :)

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 1401 - Posted: 27 Apr 2006, 14:07:52 UTC - in response to Message 1393. "How often do you switch between apps (if at all)? When ralph is pre-empted, do you "Keep in Memory"? My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know." Sorry, I'm not positive. I changed my settings to try to stress Ralph a little bit, but I am not certain whether they took effect on THAT PC or not; it depends on when it updated. I believe it had a 360 min (6 hr) switch between projects and leave-in-memory at the time of the failures. My other Ralph host was updated to have a 20 min switch time and remove from memory, and it seems to be going well, but it hasn't had other projects interrupting it either.

Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1403 - Posted: 27 Apr 2006, 17:48:22 UTC - in response to Message 1401. Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop. A few quick replies: I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists. I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release.

Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 1404 - Posted: 27 Apr 2006, 17:53:19 UTC - in response to Message 1385. A quick reply to Carlos... it seems like all your ralph jobs have been erroring out. The error message we're seeing is something about a lost heartbeat from the core client. That doesn't sound good. Have you had this issue with any workunits from rosetta@home? Also, do you have a new version of the BOINC app? Thanks.