Bug reports for Ralph 5.03

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1309 - Posted: 23 Apr 2006, 3:18:39 UTC

We've tried to make the watchdog a little less aggressive about aborting, and it now reports back to us the reason for each abort. Let us know if you think these jobs are getting killed too soon or too late. Thanks!
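
For readers unfamiliar with the idea, here is a minimal sketch of a stall-detecting watchdog that returns a reason when it decides to abort. It is an illustration only, not the actual Rosetta/RALPH watchdog: the class name, the `stall_timeout_s` parameter, and the rule "abort when progress has not advanced for too long" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Watchdog:
    """Hypothetical stall detector; NOT the real RALPH watchdog."""
    stall_timeout_s: float        # assumed knob: seconds allowed without progress
    last_progress: float = 0.0    # highest fraction done reported so far
    last_change_at: float = 0.0   # time (seconds since job start) of last advance

    def report(self, fraction_done: float, now: float) -> None:
        # Only a real advance resets the stall clock; repeats are ignored.
        if fraction_done > self.last_progress:
            self.last_progress = fraction_done
            self.last_change_at = now

    def check(self, now: float) -> Optional[str]:
        # Return a human-readable abort reason, or None to keep running.
        stalled_for = now - self.last_change_at
        if stalled_for > self.stall_timeout_s:
            return (f"no progress for {stalled_for:.0f}s "
                    f"(stuck at {self.last_progress:.4%})")
        return None
```

Raising `stall_timeout_s` makes such a watchdog less aggressive, at the cost of letting a truly hung job occupy the CPU for longer before it is killed.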
ID: 1309
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1313 - Posted: 23 Apr 2006, 8:33:48 UTC - in response to Message 1309.  

Is anyone out there running with a Mac? Are your jobs from 5.02 or 5.03 running?

ID: 1313
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1319 - Posted: 23 Apr 2006, 14:01:46 UTC - in response to Message 1309.  

This WU was aborted by the watchdog on another machine but finished OK on my machine:

https://ralph.bakerlab.org/workunit.php?wuid=82603

Do you still receive the finished models if the watchdog kills a WU that gets stuck on model x?
ID: 1319
Snake Doctor
Joined: 16 Feb 06
Posts: 37
Credit: 998,880
RAC: 0
Message 1320 - Posted: 23 Apr 2006, 15:50:18 UTC - in response to Message 1313.  
Last modified: 23 Apr 2006, 15:56:34 UTC

I am running on Macs. I have a dual G4 that has a 5.03 job, and a G4 laptop that was running two 5.01 jobs. One of the jobs on the laptop hung at 1.4295% for 12 hours. I restarted BOINC and it errored out, but I can't get it to report (it is still stuck on my Tasks tab). I have a second one on that machine that I aborted, and it is stuck in the Tasks tab too.

This may be a BOINC thing. I had upgraded to BOINC 5.4.4 per instructions from Rom for error checking. It looked OK, but it is really not running right.

Anyway, the 5.03 WU on the G4 seems to be running fine. I changed the run time setting for it last night to 4 hours, and set the system to "remove apps from memory" to bang on it a while. It is at about 90% after 3:58 CPU time. I looked at the graphics last night and they seemed to be fine.

EDIT/UPDATE - The two WUs stuck in my Tasks tab finally reported.

here is the one that was stuck for 12 hours.
here is the one I aborted manually.

Regards
Phil
ID: 1320
Divide Overflow
Joined: 15 Feb 06
Posts: 12
Credit: 128,027
RAC: 0
Message 1324 - Posted: 23 Apr 2006, 17:34:06 UTC
Last modified: 23 Apr 2006, 17:36:04 UTC

The watchdog seems to be more of a junkyard dog. ;) It killed off two of my v5.02 WUs that seemed to be running just fine:

93821
93726
ID: 1324
Snake Doctor
Joined: 16 Feb 06
Posts: 37
Credit: 998,880
RAC: 0
Message 1325 - Posted: 23 Apr 2006, 17:36:09 UTC

(Sorry for the second post; darned 1-hour edit limit.)

Well, my Mac G4 reported in this result for the only 5.03 WU I have had.

It looks very normal to me.

I do know the graphics were working (in fact they seemed faster somehow). I had no problems that I am aware of.

Regards
Phil
ID: 1325
Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 1326 - Posted: 23 Apr 2006, 18:17:09 UTC
Last modified: 23 Apr 2006, 18:22:02 UTC

I had this one: https://ralph.bakerlab.org/workunit.php?wuid=83796

Result: https://ralph.bakerlab.org/result.php?resultid=94327

I looked at the graphics when I saw it running, and it was stuck at about 1% without any movement at all. So I suppose the watchdog did its job by killing it after some time.

It ran for about 80 minutes on my computer. It could have been killed a little sooner, I think, as it was totally dead when I looked after about 45 minutes. I see that the others who ran it spent less time on it before it was killed.



"I'm trying to maintain a shred of dignity in this world." - Me

ID: 1326
MatthewBChambers
Joined: 13 Mar 06
Posts: 4
Credit: 5,367
RAC: 0
Message 1328 - Posted: 23 Apr 2006, 21:37:13 UTC

Host ID:
https://ralph.bakerlab.org/show_host_detail.php?hostid=2404

Result ID:
https://ralph.bakerlab.org/result.php?resultid=94285



Here is my 5.03 bug (in Windows XP, full details to follow):

4/23/2006 12:59:18 PM|ralph@home|Unrecoverable error for result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 (<file_xfer_error> <file_name>NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2_0</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)


Here is the context:
4/23/2006 12:36:34 PM|ralph@home|Resuming result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 using rosetta_beta version 503
4/23/2006 12:36:34 PM|boincsimap|Pausing result 60420100.007375_0 (left in memory)
4/23/2006 12:59:15 PM|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
4/23/2006 12:59:15 PM|ralph@home|Reason: To fetch work
4/23/2006 12:59:15 PM|ralph@home|Requesting 43200 seconds of new work
4/23/2006 12:59:17 PM||request_reschedule_cpus: process exited
4/23/2006 12:59:17 PM|ralph@home|Computation for result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 finished
4/23/2006 12:59:17 PM|Predictor @ Home|Resuming result abeta_7_135392_2 using mfoldB125 version 428
4/23/2006 12:59:18 PM|ralph@home|Unrecoverable error for result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 (<file_xfer_error> <file_name>NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2_0</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)
4/23/2006 12:59:20 PM|ralph@home|Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
4/23/2006 12:59:21 PM|ralph@home|No work from project
4/23/2006 1:03:26 PM|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
4/23/2006 1:03:26 PM|ralph@home|Reason: To fetch work
4/23/2006 1:03:26 PM|ralph@home|Requesting 43200 seconds of new work, and reporting 1 results
4/23/2006 1:03:31 PM|ralph@home|Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
4/23/2006 1:03:31 PM|ralph@home|No work from project


Here is the startup info for the computer:
4/15/2006 8:22:55 PM||Starting BOINC client version 5.2.13 for windows_intelx86
4/15/2006 8:22:55 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
4/15/2006 8:22:55 PM||Data directory: C:\Program Files\BOINC
4/15/2006 8:22:56 PM||Processor: 1 GenuineIntel x86 Family 6 Model 8 Stepping 6 863MHz
4/15/2006 8:22:56 PM||Memory: 383.30 MB physical, 1.29 GB virtual
4/15/2006 8:22:56 PM||Disk: 24.41 GB total, 19.33 GB free
4/15/2006 8:22:56 PM|rosetta@home|Computer ID: 197494; location: home; project prefs: default
4/15/2006 8:22:56 PM|boincsimap|Computer ID: 17955; location: home; project prefs: default
4/15/2006 8:22:56 PM|Einstein@Home|Computer ID: 594228; location: home; project prefs: default
4/15/2006 8:22:56 PM|LHC@home|Computer ID: 142531; location: home; project prefs: default
4/15/2006 8:22:56 PM|Predictor @ Home|Computer ID: 237773; location: home; project prefs: default
4/15/2006 8:22:56 PM|ralph@home|Computer ID: 2404; location: home; project prefs: default
4/15/2006 8:22:56 PM|SETI@home|Computer ID: 2330542; location: home; project prefs: default
4/15/2006 8:22:56 PM|SZTAKI Desktop Grid|Computer ID: 17392; location: home; project prefs: default
4/15/2006 8:22:56 PM|World Community Grid|Computer ID: 31989; location: ; project prefs: default
4/15/2006 8:22:56 PM||General prefs: from ralph@home (last modified 2006-04-15 20:06:57)
4/15/2006 8:22:56 PM||General prefs: no separate prefs for home; using your defaults
4/15/2006 8:22:57 PM||Remote control not allowed; using loopback address
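
When sifting through a long message log for errors like the one above, a small parser can help. The sketch below assumes each line has the shape `timestamp|project|message`, which is inferred only from the log excerpts pasted in this thread, not from BOINC documentation. The function and type names are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class XferError(NamedTuple):
    timestamp: str
    project: str
    file_name: str
    error_code: int


# Tag names are taken from the log lines quoted in this thread.
_FILE = re.compile(r"<file_name>(.*?)</file_name>")
_CODE = re.compile(r"<error_code>(-?\d+)</error_code>")


def parse_xfer_error(line: str) -> Optional[XferError]:
    """Return the file transfer error found on one log line, or None."""
    # Split on the first two '|' only, so the message itself may contain '|'.
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    file_match = _FILE.search(message)
    code_match = _CODE.search(message)
    if not (file_match and code_match):
        return None
    return XferError(timestamp, project,
                     file_match.group(1), int(code_match.group(1)))
```

Calling `parse_xfer_error` on each line of a saved log and keeping the non-`None` results gives a list of failed files and their error codes; lines without a `<file_xfer_error>` payload are skipped.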
ID: 1328
Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 1330 - Posted: 24 Apr 2006, 2:30:39 UTC

My next one ran without erroring out, so I don't know whether the watchdog was active on it or whether it was simply supposed to run normally.

https://ralph.bakerlab.org/workunit.php?wuid=83916

Result: https://ralph.bakerlab.org/result.php?resultid=94405


ID: 1330
casio7131
Joined: 20 Mar 06
Posts: 15
Credit: 12,660
RAC: 0
Message 1331 - Posted: 24 Apr 2006, 2:46:00 UTC

24/04/2006 3:14:57 AM|ralph@home|Unrecoverable error for result NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0 (<file_xfer_error> <file_name>NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
https://ralph.bakerlab.org/result.php?resultid=94190
ID: 1331
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 1333 - Posted: 24 Apr 2006, 12:37:39 UTC
Last modified: 24 Apr 2006, 13:03:56 UTC

A new error for me.

https://ralph.bakerlab.org/result.php?resultid=94349

This was error no 2 on this WU.

And another one

https://ralph.bakerlab.org/result.php?resultid=94350


Anders n

Edit no 2
ID: 1333
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1334 - Posted: 24 Apr 2006, 13:28:59 UTC - in response to Message 1331.  

I thought credit was supposed to be granted on these?
ID: 1334
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 1335 - Posted: 24 Apr 2006, 14:31:01 UTC - in response to Message 1334.  

The system IS supposed to award the claimed credits for "Watchdog"-terminated Work Units. But the ones I have seen so far have always had some model information reported back. Yours seems to have a -161 error, implying that something file-related is in play. Rhiju will have to explain why it did not get awarded.

As you may recall, in RALPH the credit will not be awarded after the fact, but we still need to know why it did not get awarded in the first place before this deploys to Rosetta.

Moderator9
ID: 1335
Yeti
Joined: 19 Feb 06
Posts: 30
Credit: 49,557
RAC: 0
Message 1336 - Posted: 24 Apr 2006, 14:59:10 UTC

Hm, if you are looking for results that have been "killed" by the watchdog and did not get credit, look at:

https://ralph.bakerlab.org/results.php?userid=581

There you can see several results that have not been credited.




Supporting BOINC, a great concept !
ID: 1336
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1337 - Posted: 24 Apr 2006, 19:25:55 UTC - in response to Message 1335.  

Thanks for the post; Moderator9 is right about what happened.
Sorry about the annoying file transfer error. I've fixed it, and we will be testing the fix on Ralph 5.04 later today.

ID: 1337
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1339 - Posted: 25 Apr 2006, 1:52:53 UTC - in response to Message 1326.  

GREAT! I actually forced an infinite loop in that one. Very glad it was killed by the watchdog.
ID: 1339

©2024 University of Washington
http://www.bakerlab.org