Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher
Author | Message |
---|---|
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Really, we think this is the last one before Rosetta@home is updated! This is mainly to fix a silly, small bug that got introduced with the latest checkpointing. For those interested, the watchdog is still not exiting gracefully every time -- if no data was created, there's still a file transfer error. We're trying to figure out why, but will likely need help from the BOINC team to fix it. Fortunately, the jobs that give these errors are rare -- and produce no data anyway. Of course, we will continue to grant credit every week for errored jobs when the app gets updated on Rosetta@home. |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
If I try to fetch work I get a "Project is down" message. |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Now it's: 26/04/2006 10:48:22|ralph@home|Message from server: Server has software problem 26/04/2006 10:48:22|ralph@home|Project is down |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Alpha testers: abort any 5.04 WUs that may be sitting in your cache/queue, so you can start testing 5.05 ASAP :-) |
rbpeake Send message Joined: 16 Feb 06 Posts: 19 Credit: 3,370 RAC: 0 |
"Really, we think this is the last one before Rosetta@home is updated!" Hey, take as much time as you need! You are so close now, might as well wrap it up in style with a bulletproof application! :) |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Both WUs I tried finished valid, but both results show a warning: WARNING! error deleting file .aah002.out However, no such file is present on my computer any longer. https://ralph.bakerlab.org/results.php?userid=1266 |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
WUs failed in under 1 minute... and I'll tell you why... I was playing around, suspending R@H WUs, trying to prevent downloads of more, getting Ralph to fetch some new WUs, killing those of the previous version, etc. Suffice it to say I had about 8 WUs suspended and left in memory. This caused Windows to extend its paging file, and two other WUs failed immediately. The failures attempted to bring in debug code, which further increased the memory requirements. 97682 97672 97670 Here's the msg you see in Windows: |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
The watchdog aborted this although overall runtime was only a couple of minutes. After a few minutes of runtime it was preempted for a few hours by another WU, and after resuming, the watchdog probably assumed it had run for over an hour with no progress. It seems the watchdog is only comparing two points in time without checking what happened in between. 04/26/06 18:59:19||Rescheduling CPU: application exited 04/26/06 18:59:19|ralph@home|Computation for task AB_CASP6_u272__444_4_0 finished 335.453125 stderr out <core_client_version>5.4.6</core_client_version> <stderr_txt> # cpu_run_time_pref: 10800 # random seed: 3882530 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is killing the run! Stuck at score 33.7964 for 3600 seconds ********************************************************************** GZIP SILENT FILE: .xxu272.out WARNING! error deleting file .xxu272.out </stderr_txt> |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done ! Wed Apr 26 16:41:36 BRT 2006 crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6 2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505 2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory) 2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505 2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory) 2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505 Wed Apr 26 16:44:41 BRT 2006 CPU usage 0.0000% What should I do next ? Perhaps kill app with some special signal to force a core dump and then e-mail that core dump to ??? Thanks |
Jose Send message Joined: 25 Apr 06 Posts: 7 Credit: 77 RAC: 0 |
This unit was aborted after less than one hour of running (my time preference is 2 hours) https://ralph.bakerlab.org/result.php?resultid=97305 AB_CASP6_t216__438_3_0 Workunit 86138 CPU time 3180.21875 stderr out <core_client_version>5.2.13</core_client_version> <stderr_txt> # random seed: 3882811 # cpu_run_time_pref: 7200 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is killing the run! Stuck at score 71.0875 for 3600 seconds ********************************************************************** GZIP SILENT FILE: .xxt216.out WARNING! attempt to gzip file .xxt216.out failed: file does not exist. </stderr_txt> <message><file_xfer_error> <file_name>AB_CASP6_t216__438_3_0_0</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> </message> Validate state Invalid Claimed credit 11.1502124855248 Granted credit 0 application version 5.05 |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Just got home to see how my PC did, I think the watchdog ralphed all over my work. This is on my same host as I posted about earlier where I was short on VM swap space and lost 3 WUs. Problems all over, watchdog kicking in (I've never had any hung WUs before, so seems unlikely it was required), failing after <1 min. I did not abort any v5.05 WUs. Very few successes. Now my Ralph WUs are completed, and I'm crunching R@H again. Got a WU with FAST in the name, the thing has ripped 607 models in under 14hrs (I have 24hr preference, the output file is 4.6MB now... might come close to 10 once we're done!). I didn't reboot or restart BOINC since the memory issues about 13hrs ago. |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
This almost looks like a BOINC manager problem. Go ahead and abort it; then can you post a link to the result here? Thanks. No need to send us the core dump. Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done ! |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Feet1st, whoa you've got a really fast client.... As for ralph, I share your concerns. I'm trying another fix on 5.06, can you attach to ralph now? If the watchdog is still too aggressive, we'll have to keep it off for rosetta@home until we avoid this error. Two questions for you: How often do you switch between apps (if at all)? When ralph is pre-empted, do you "Keep in Memory"? My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know. Just got home to see how my PC did, I think the watchdog ralphed all over my work. This is on my same host as I posted about earlier where I was short on VM swap space and lost 3 WUs. Problems all over, watchdog kicking in (I've never had any hung WUs before, so seems unlikely it was required), failing after <1 min. I did not abort any v5.05 WUs. Very few successes. |
simpe73 Send message Joined: 20 Feb 06 Posts: 2 Credit: 36,752 RAC: 0 |
5.05 works fine, but.... In every result I've checked there is "WARNING! error deleting file .xxv272.out". The file name varies, but they are always .out files. |
[B^S] sTrey Send message Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
WU 79004 killed by watchdog just after 2 hrs' runtime (Stuck at score -115.914 for 3600 seconds) (Sorry, this was 5.05; can't get any 5.06 WUs) |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
I just finished one WU with several deliberate switches in between (long and short), and it seems 5.06 has solved the issue. The warning message about the file deletion error is also gone. :-) https://ralph.bakerlab.org/result.php?resultid=98261 Now Rhiju please have a look at this and this. |
Jose Send message Joined: 25 Apr 06 Posts: 7 Credit: 77 RAC: 0 |
Okies, I have been running the following RALPH Work Unit: ID 6204 Name AB_CASP6_t198__438_5_0 It worked for around 57 minutes and then was preempted (keeping the record of the CPU time in my work record) and the corresponding Rosetta Work Unit restarted. Once the Rosetta Work Unit stopped, the application switched to RALPH and Work Unit ID 6204 restarted. It started DE NOVO, that is, from a CPU time of 0 and not from the accumulated 57+ minutes it had when it was preempted and the application switch happened. My preferences are set so that work is kept in memory, and this did not happen in this case. So to make the story short: the 57+ minutes of CPU time for the Work Unit that had been stored in memory disappeared into the big void in the sky. :) |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
"How often do you switch between apps (if at all)?" Sorry, I'm not positive. I changed my settings to actually try and stress Ralph a little bit, but I am not certain if they took effect on THAT PC or not; it depends on when it updated. I believe it had a 360 min (6 hrs) switch between projects, and leave in memory, at the time of the failures. My other Ralph host was updated to have a 20 min switch time, and remove from memory, and it seems to be going well, but it hasn't had other projects interrupting it either. |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop. A few quick replies: I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists. I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release. How often do you switch between apps (if at all)? |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
A quick reply to Carlos... it seems like all your ralph jobs have been erroring out. The error message we're seeing is something about a lost heartbeat from the core client. That doesn't sound good. Have you had this issue with any workunits from rosetta@home? Also, do you have a new version of the BOINC app? Thanks. Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done ! |
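
tralala's observation in the thread above, that the watchdog seems to compare only two points in time without checking what happened in between, can be illustrated with a small simulation. This is a minimal sketch, not the actual Rosetta watchdog code: the class names, the `max_gap` threshold, and the 10-second tick interval are all hypothetical, and only the 3600-second limit is taken from the "Stuck at score ... for 3600 seconds" messages posted above. It contrasts a check based on two wall-clock timestamps with one that counts only the time during which the task was actually running.

```python
from dataclasses import dataclass

LIMIT = 3600  # seconds; taken from the "Stuck at score ... for 3600 seconds" logs


@dataclass
class NaiveWatchdog:
    """Hypothetical watchdog that compares two wall-clock timestamps only."""
    limit: float
    last_change: float = 0.0

    def score_changed(self, now: float) -> None:
        self.last_change = now

    def is_stuck(self, now: float) -> bool:
        # Time spent preempted counts as "no progress" here.
        return now - self.last_change >= self.limit


@dataclass
class PauseAwareWatchdog:
    """Hypothetical watchdog that accumulates only time spent running."""
    limit: float
    max_gap: float = 60.0   # assumption: a longer gap between ticks means a pause
    stalled: float = 0.0    # running seconds since the last score change
    last_tick: float = 0.0

    def score_changed(self, now: float) -> None:
        self.stalled = 0.0
        self.last_tick = now

    def tick(self, now: float) -> None:
        gap = now - self.last_tick
        if gap <= self.max_gap:
            self.stalled += gap  # task was running; count the interval
        self.last_tick = now     # a long gap (preemption) is skipped

    def is_stuck(self) -> bool:
        return self.stalled >= self.limit


def simulate_preemption():
    """Run 5 minutes, get preempted for 2 hours, then resume."""
    naive = NaiveWatchdog(LIMIT)
    aware = PauseAwareWatchdog(LIMIT)
    naive.score_changed(0)
    aware.score_changed(0)
    for t in range(10, 301, 10):  # 5 minutes of running, one tick per 10 s
        aware.tick(t)
    resume = 300 + 7200           # preempted for two hours, no ticks
    aware.tick(resume)
    return naive.is_stuck(resume), aware.is_stuck()


def simulate_real_hang():
    """Run continuously for LIMIT seconds with no score change."""
    aware = PauseAwareWatchdog(LIMIT)
    aware.score_changed(0)
    for t in range(10, LIMIT + 1, 10):
        aware.tick(t)
    return aware.is_stuck()


if __name__ == "__main__":
    print("after preemption (naive, pause-aware):", simulate_preemption())
    print("after a real hang (pause-aware):", simulate_real_hang())
```

In the preemption scenario the naive check fires after only five minutes of real work, which matches the behaviour reported for 5.05, while the pause-aware variant still fires on a genuine hang. Whether the 5.06 fix works this way is not stated in the thread.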
©2025 University of Washington
http://www.bakerlab.org