Bug reports for Ralph 5.05 and higher

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1367 - Posted: 26 Apr 2006, 5:55:54 UTC
Last modified: 27 Apr 2006, 3:51:57 UTC

Really, we think this is the last one before Rosetta@home is updated! This is mainly to fix a silly, small bug that got introduced with the latest checkpointing.

For those interested, the watchdog is still not exiting gracefully every time -- if no data was created, there's still a file transfer error. We're trying to figure out why, but will likely need help from the BOINC team to fix it. Fortunately, the jobs that give these errors are rare -- and produce no data anyway. Of course, we will continue to grant credit every week for errored jobs when the app gets updated on Rosetta@home.

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1368 - Posted: 26 Apr 2006, 7:45:06 UTC - in response to Message 1367.  

If I try to fetch work I get a "Project is down" message.

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1369 - Posted: 26 Apr 2006, 8:29:55 UTC

Now it's:

26/04/2006 10:48:22|ralph@home|Message from server: Server has software problem
26/04/2006 10:48:22|ralph@home|Project is down

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 1375 - Posted: 26 Apr 2006, 11:38:41 UTC

Alpha testers: abort any 5.04 WUs that may be sitting in your cache/queue,
so you can start testing 5.05 ASAP -:)

rbpeake
Joined: 16 Feb 06
Posts: 19
Credit: 3,370
RAC: 0
Message 1376 - Posted: 26 Apr 2006, 11:41:15 UTC - in response to Message 1367.  

Really, we think this is the last one before Rosetta@home is updated!

Hey, take as much time as you need! You are so close now, might as well wrap it up in style with a bulletproof application! :)

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1377 - Posted: 26 Apr 2006, 13:04:09 UTC
Last modified: 26 Apr 2006, 13:05:03 UTC

Both WUs I tried finished as valid, but both results show a warning:

WARNING! error deleting file .aah002.out

However no such file is present on my computer any longer.

https://ralph.bakerlab.org/results.php?userid=1266

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1379 - Posted: 26 Apr 2006, 14:11:32 UTC

WUs failed in under 1 minute... and I'll tell you why...

I was playing around suspending R@H WUs, trying to prevent downloads of more, getting Ralph to fetch some new WUs, killing those of the previous version, etc. Suffice it to say I had about 8 WUs suspended and left in memory. This caused Windows to extend its paging file, and two other WUs failed immediately; the failures attempted to bring in debug code, which further increased the memory requirements.

97682
97672
97670

Here's the msg you see in Windows:

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1382 - Posted: 26 Apr 2006, 17:07:23 UTC - in response to Message 1379.  
Last modified: 26 Apr 2006, 17:48:19 UTC


The watchdog aborted this although overall runtime was only a couple of minutes. After a few minutes of runtime it was preempted for a few hours by another WU, and after resuming, the watchdog probably assumed it had run for over an hour with no progress. It seems the watchdog is only comparing two points in time without checking what happened in between.

04/26/06 18:59:19||Rescheduling CPU: application exited
04/26/06 18:59:19|ralph@home|Computation for task AB_CASP6_u272__444_4_0 finished


335.453125
stderr out

<core_client_version>5.4.6</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3882530
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 33.7964 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .xxu272.out
WARNING! error deleting file .xxu272.out

</stderr_txt>
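
The behaviour described in this post (a task preempted for hours and then judged "stuck") can be illustrated with a small sketch. This is not the Rosetta watchdog code; the class name `Watchdog`, its parameters, and the 60-second gap threshold are all hypothetical. It assumes the watchdog samples the score periodically and treats any unusually long gap between samples as a suspension rather than as time spent stuck.

```python
class Watchdog:
    """Hypothetical sketch: count only time during which the task was sampled."""

    def __init__(self, limit_seconds=3600.0, max_tick_gap=60.0):
        self.limit_seconds = limit_seconds  # how long a score may stay unchanged
        self.max_tick_gap = max_tick_gap    # longer gaps are assumed to be suspensions
        self.last_tick = None
        self.last_score = None
        self.stuck_seconds = 0.0

    def tick(self, now, score):
        """Record one sample; return True if the run looks stuck."""
        gap = 0.0 if self.last_tick is None else now - self.last_tick
        if score != self.last_score:
            # The score moved, so the run is making progress.
            self.stuck_seconds = 0.0
        elif gap <= self.max_tick_gap:
            # Same score and a normal sampling interval: count it as stuck time.
            self.stuck_seconds += gap
        # A gap longer than max_tick_gap is not counted at all.
        self.last_tick = now
        self.last_score = score
        return self.stuck_seconds >= self.limit_seconds
```

Comparing only the first and last timestamps would add the whole preemption to the "stuck" total. Accumulating short intervals avoids that, at the cost of needing a regular sampling loop.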

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 1385 - Posted: 26 Apr 2006, 19:28:16 UTC

Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done !

Wed Apr 26 16:41:36 BRT 2006
crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6
2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
Wed Apr 26 16:44:41 BRT 2006

CPU usage 0.0000%

What should I do next ?

Perhaps kill app with some special signal to force a core dump
and then e-mail that core dump to ???
Thanks
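
To see how long the task actually ran versus how long it sat paused, the log lines in this post can be totalled with a short script. This is a minimal sketch, assuming the `stdoutdae.txt` line format shown above; the function name `summarize` is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]+\] "
    r"(Starting|Pausing|Resuming) result (\S+)"
)


def summarize(log_text, result_name):
    """Return (running_seconds, paused_seconds) between logged events."""
    running = 0.0
    paused = 0.0
    last_time = None
    last_event = None
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group(3) != result_name:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if last_time is not None:
            elapsed = (when - last_time).total_seconds()
            if last_event == "Pausing":
                paused += elapsed
            else:
                running += elapsed
        last_time, last_event = when, match.group(2)
    # Time after the final event is not counted: the log does not say when it ended.
    return running, paused
```

Calling `summarize(open("stdoutdae.txt").read(), "FA_CASP6_v272__435_19_0")` gives wall-clock totals only. It says nothing about CPU usage while the task was nominally running.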

Jose
Joined: 25 Apr 06
Posts: 7
Credit: 77
RAC: 0
Message 1386 - Posted: 26 Apr 2006, 20:35:09 UTC

This unit was aborted after less than one hour of running (my time preference is 2 hours).

https://ralph.bakerlab.org/result.php?resultid=97305

AB_CASP6_t216__438_3_0

Workunit 86138

CPU time 3180.21875
stderr out <core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3882811
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 71.0875 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .xxt216.out
WARNING! attempt to gzip file .xxt216.out failed: file does not exist.

</stderr_txt>
<message><file_xfer_error>
<file_name>AB_CASP6_t216__438_3_0_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

</message>
Validate state Invalid
Claimed credit 11.1502124855248
Granted credit 0
application version 5.05

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1391 - Posted: 27 Apr 2006, 2:34:12 UTC

Just got home to see how my PC did, I think the watchdog ralphed all over my work. This is on my same host as I posted about earlier where I was short on VM swap space and lost 3 WUs. Problems all over, watchdog kicking in (I've never had any hung WUs before, so seems unlikely it was required), failing after <1 min. I did not abort any v5.05 WUs. Very few successes.

Now my Ralph WUs are completed, and I'm crunching R@H again. Got a WU with FAST in the name, the thing has ripped 607 models in under 14hrs (I have 24hr preference, the output file is 4.6MB now... might come close to 10 once we're done!). I didn't reboot or restart BOINC since the memory issues about 13hrs ago.

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1392 - Posted: 27 Apr 2006, 3:47:55 UTC - in response to Message 1385.  

This almost looks like a BOINC manager problem. Go ahead and abort it; then can you post a link to the result here? Thanks. No need to send us the core dump.

Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done !

Wed Apr 26 16:41:36 BRT 2006
crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6
2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
Wed Apr 26 16:44:41 BRT 2006

CPU usage 0.0000%

What should I do next ?

Perhaps kill app with some special signal to force a core dump
and then e-mail that core dump to ???
Thanks


Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1393 - Posted: 27 Apr 2006, 3:50:11 UTC - in response to Message 1391.  
Last modified: 27 Apr 2006, 3:53:12 UTC

Feet1st, whoa you've got a really fast client....

As for ralph, I share your concerns. I'm trying another fix on 5.06, can you attach to ralph now? If the watchdog is still too aggressive, we'll have to keep it off for rosetta@home until we avoid this error. Two questions for you:

How often do you switch between apps (if at all)?
When ralph is pre-empted, do you "Keep in Memory"?

My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.


Just got home to see how my PC did, I think the watchdog ralphed all over my work. This is on my same host as I posted about earlier where I was short on VM swap space and lost 3 WUs. Problems all over, watchdog kicking in (I've never had any hung WUs before, so seems unlikely it was required), failing after <1 min. I did not abort any v5.05 WUs. Very few successes.

Now my Ralph WUs are completed, and I'm crunching R@H again. Got a WU with FAST in the name, the thing has ripped 607 models in under 14hrs (I have 24hr preference, the output file is 4.6MB now... might come close to 10 once we're done!). I didn't reboot or restart BOINC since the memory issues about 13hrs ago.


simpe73
Joined: 20 Feb 06
Posts: 2
Credit: 36,752
RAC: 0
Message 1394 - Posted: 27 Apr 2006, 4:05:05 UTC

5.05 works fine, but... In every result I've checked there is "WARNING! error deleting file .xxv272.out". The name of the file varies, but they are always .out files.

[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 1397 - Posted: 27 Apr 2006, 6:41:18 UTC
Last modified: 27 Apr 2006, 6:52:38 UTC

wu 79004 killed by watchdog just after 2 hrs' runtime (Stuck at score -115.914 for 3600 seconds)
(Sorry this was 5.05, can't get any 5.06 wus)

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1398 - Posted: 27 Apr 2006, 9:32:05 UTC

I just finished one WU with several deliberate switches in between (long and short), and it seems 5.06 has solved the issue. The warning message about the file deletion error is also gone. :-)

https://ralph.bakerlab.org/result.php?resultid=98261

Now Rhiju please have a look at this and this.

Jose
Joined: 25 Apr 06
Posts: 7
Credit: 77
RAC: 0
Message 1400 - Posted: 27 Apr 2006, 12:20:19 UTC

Okies I have been running the following RALPH Work Unit:
ID 6204
Name AB_CASP6_t198__438_5_0

It worked for around 57 minutes and then was preempted (the CPU time was kept in my work record) and the corresponding Rosetta Work Unit restarted. Once the Rosetta Work Unit stopped, the application switched to RALPH and Work Unit ID 6204 restarted, but it started DE NOVO, that is, from a CPU time of 0 and not from the accumulated 57+ minutes it had when it was preempted and the application switch happened. My preferences are set so that work is kept in memory, and this did not happen in this case.

So to make the story short: the 57+ minutes of CPU time for the Work Unit that had been stored in memory disappeared into the big void in the sky. :)
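
For readers wondering how accumulated CPU time can survive a restart at all, here is a rough sketch of the general idea: write the elapsed time into the checkpoint and read it back on startup. This is not how Rosetta or BOINC actually implement it; the function names, the JSON layout, and the file name are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, cpu_seconds, state):
    """Write the checkpoint to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"cpu_seconds": cpu_seconds, "state": state}, handle)
    # os.replace swaps the file in one step, so a crash mid-write
    # leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (cpu_seconds, state), or (0.0, None) when starting fresh."""
    try:
        with open(path) as handle:
            data = json.load(handle)
    except FileNotFoundError:
        return 0.0, None
    return data["cpu_seconds"], data["state"]
```

With something like this in place, a task that is removed from memory resumes from the last saved `cpu_seconds` instead of 0. Anything done after the last checkpoint is still lost.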






feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1401 - Posted: 27 Apr 2006, 14:07:52 UTC - in response to Message 1393.  

How often do you switch between apps (if at all)?
When ralph is pre-empted, do you "Keep in Memory"?

My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.

Sorry, I'm not positive. I changed my settings to actually try and stress Ralph a little bit, but I am not certain if they took effect on THAT PC or not, depends when it updated. I believe it had a 360 min (6hrs) switch between projects, and leave in memory at the time of failures. My other Ralph host was updated to have 20min switch time, and remove from memory, and it seems to be going well, but it hasn't had other projects interrupting it either.

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1403 - Posted: 27 Apr 2006, 17:48:22 UTC - in response to Message 1401.  

Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.

A few quick replies:

I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.

I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release.


How often do you switch between apps (if at all)?
When ralph is pre-empted, do you "Keep in Memory"?

My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.

Sorry, I'm not positive. I changed my settings to actually try and stress Ralph a little bit, but I am not certain if they took effect on THAT PC or not, depends when it updated. I believe it had a 360 min (6hrs) switch between projects, and leave in memory at the time of failures. My other Ralph host was updated to have 20min switch time, and remove from memory, and it seems to be going well, but it hasn't had other projects interrupting it either.


Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1404 - Posted: 27 Apr 2006, 17:53:19 UTC - in response to Message 1385.  

A quick reply to Carlos... it seems like all your ralph jobs have been erroring out. The error message we're seeing is something about a lost heartbeat from the core client. That doesn't sound good. Have you had this issue with any workunits from rosetta@home?

Also, do you have a new version of the BOINC app? Thanks.

Linux 256 MB RAM, plenty of swap space, WU Freeze at 100% Done !

Wed Apr 26 16:41:36 BRT 2006
crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6
2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
Wed Apr 26 16:44:41 BRT 2006

CPU usage 0.0000%

What should I do next ?

Perhaps kill app with some special signal to force a core dump
and then e-mail that core dump to ???
Thanks





©2024 University of Washington
http://www.bakerlab.org