| Author | Message |
|
|
|
We've tried to make the watchdog a little less aggressive about aborting, and are having it give us back the reason for aborting. Let us know if you think these jobs are getting killed too soon, or too late. Thanks!
____________
|
|
|
|
|
|
Is anyone out there running with a Mac? Are your jobs from 5.02 or 5.03 running?
We've tried to make the watchdog a little less aggressive about aborting, and are having it give us back the reason for aborting. Let us know if you think these jobs are getting killed too soon, or too late. Thanks!
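For anyone wondering what "abort with a reason" could look like, here is a minimal sketch of a stall watchdog. It is only an illustration under stated assumptions, not the code Rosetta actually runs: the class name `StallWatchdog`, the `stall_limit_s` threshold, and the wording of the reason string are all made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallWatchdog:
    """Flags a job whose reported progress has not advanced for too long.

    Hypothetical sketch: the threshold and reason text are assumptions,
    not settings taken from the real application.
    """
    stall_limit_s: float
    last_progress: float = -1.0
    last_change_s: float = 0.0

    def check(self, progress: float, now_s: float) -> Optional[str]:
        """Return an abort reason, or None if the job still looks alive."""
        if progress > self.last_progress:
            # Progress moved forward: remember when, and keep running.
            self.last_progress = progress
            self.last_change_s = now_s
            return None
        stalled_for = now_s - self.last_change_s
        if stalled_for >= self.stall_limit_s:
            return (f"watchdog: no progress past {self.last_progress:.4f}% "
                    f"for {stalled_for:.0f} s (limit {self.stall_limit_s:.0f} s)")
        return None
```

A larger `stall_limit_s` makes the dog less aggressive at the cost of wasting more time on a job that is truly hung, which is the "too soon, or too late" trade-off being asked about.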
____________
|
|
|
|
|
|
This WU was aborted by the watchdog on another machine but finished OK on my machine:
http://ralph.bakerlab.org/workunit.php?wuid=82603
Do you still receive the finished models if the watchdog kills a WU that gets stuck on model x?
____________
|
|
|
|
|
Is anyone out there running with a Mac? Are your jobs from 5.02 or 5.03 running?
We've tried to make the watchdog a little less aggressive about aborting, and are having it give us back the reason for aborting. Let us know if you think these jobs are getting killed too soon, or too late. Thanks!
I am running on Macs. I have a G4 Dual that has a 5.03 job, and a G4 laptop that was running two 5.01 jobs. One of the jobs on the laptop hung at 1.4295% for 12 hours. I restarted BOINC and it errored out, but I can't get it to report (it is still stuck on my Tasks tab). I have a second one on that machine that I aborted, and it is stuck in the Tasks tab too.
This may be a BOINC thing. I had upgraded to BOINC 5.4.4 per instructions from Rom for error checking. It looked OK, but it is really not running right.
Anyway, the 5.03 WU on the G4 seems to be running fine. I changed the run time setting for it last night to 4 hours, and set the system to "remove apps from memory" to bang on it a while. It is at about 90% after 3:58 CPU time. I looked at the graphics last night and they seemed to be fine.
EDIT/UPDATE - The two WUs stuck in my Tasks tab finally reported.
Here is the one that was stuck for 12 hours.
Here is the one I aborted manually.
Regards
Phil
____________
|
|
|
|
|
|
The watchdog seems to be more of a junkyard dog. ;) It killed off two of my v5.02 WUs that seemed to be running just fine:
93821
93726
____________
|
|
|
|
|
|
(Sorry for the second post; darned 1-hour edit limit.)
Well, my Mac G4 reported in this result for the only 5.03 WU I have had.
It looks very normal to me.
I do know the graphics were working (in fact, they seemed faster somehow). I had no problems that I am aware of.
Regards
Phil
____________
|
|
|
|
|
|
I had this one: http://ralph.bakerlab.org/workunit.php?wuid=83796
Result: http://ralph.bakerlab.org/result.php?resultid=94327
I looked at the graphics when I saw it running, and it was stuck at about 1% without any movement at all. So I suppose the watchdog did its job by killing it after some time.
It ran about 80 minutes on my computer. It could have been killed a little sooner, I think, as it was already totally dead when I looked after about 45 minutes. I see that the others who ran it got through less time before it was killed.
____________
"I'm trying to maintain a shred of dignity in this world." - Me
 |
|
|
|
|
|
Host ID:
http://ralph.bakerlab.org/show_host_detail.php?hostid=2404
Result ID:
http://ralph.bakerlab.org/result.php?resultid=94285
Here is my 5.03 bug (in Windows XP, full details to follow):
4/23/2006 12:59:18 PM|ralph@home|Unrecoverable error for result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 (<file_xfer_error> <file_name>NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2_0</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)
Here is the context:
4/23/2006 12:36:34 PM|ralph@home|Resuming result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 using rosetta_beta version 503
4/23/2006 12:36:34 PM|boincsimap|Pausing result 60420100.007375_0 (left in memory)
4/23/2006 12:59:15 PM|ralph@home|Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
4/23/2006 12:59:15 PM|ralph@home|Reason: To fetch work
4/23/2006 12:59:15 PM|ralph@home|Requesting 43200 seconds of new work
4/23/2006 12:59:17 PM||request_reschedule_cpus: process exited
4/23/2006 12:59:17 PM|ralph@home|Computation for result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 finished
4/23/2006 12:59:17 PM|Predictor @ Home|Resuming result abeta_7_135392_2 using mfoldB125 version 428
4/23/2006 12:59:18 PM|ralph@home|Unrecoverable error for result NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2 (<file_xfer_error> <file_name>NO_CHECK_NO_DOG_7486h002_dec123_1.pdb_408_5_2_0</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)
4/23/2006 12:59:20 PM|ralph@home|Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded
4/23/2006 12:59:21 PM|ralph@home|No work from project
4/23/2006 1:03:26 PM|ralph@home|Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
4/23/2006 1:03:26 PM|ralph@home|Reason: To fetch work
4/23/2006 1:03:26 PM|ralph@home|Requesting 43200 seconds of new work, and reporting 1 results
4/23/2006 1:03:31 PM|ralph@home|Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded
4/23/2006 1:03:31 PM|ralph@home|No work from project
Here is the startup info for the computer:
4/15/2006 8:22:55 PM||Starting BOINC client version 5.2.13 for windows_intelx86
4/15/2006 8:22:55 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
4/15/2006 8:22:55 PM||Data directory: C:\Program Files\BOINC
4/15/2006 8:22:56 PM||Processor: 1 GenuineIntel x86 Family 6 Model 8 Stepping 6 863MHz
4/15/2006 8:22:56 PM||Memory: 383.30 MB physical, 1.29 GB virtual
4/15/2006 8:22:56 PM||Disk: 24.41 GB total, 19.33 GB free
4/15/2006 8:22:56 PM|rosetta@home|Computer ID: 197494; location: home; project prefs: default
4/15/2006 8:22:56 PM|boincsimap|Computer ID: 17955; location: home; project prefs: default
4/15/2006 8:22:56 PM|Einstein@Home|Computer ID: 594228; location: home; project prefs: default
4/15/2006 8:22:56 PM|LHC@home|Computer ID: 142531; location: home; project prefs: default
4/15/2006 8:22:56 PM|Predictor @ Home|Computer ID: 237773; location: home; project prefs: default
4/15/2006 8:22:56 PM|ralph@home|Computer ID: 2404; location: home; project prefs: default
4/15/2006 8:22:56 PM|SETI@home|Computer ID: 2330542; location: home; project prefs: default
4/15/2006 8:22:56 PM|SZTAKI Desktop Grid|Computer ID: 17392; location: home; project prefs: default
4/15/2006 8:22:56 PM|World Community Grid|Computer ID: 31989; location: ; project prefs: default
4/15/2006 8:22:56 PM||General prefs: from ralph@home (last modified 2006-04-15 20:06:57)
4/15/2006 8:22:56 PM||General prefs: no separate prefs for home; using your defaults
4/15/2006 8:22:57 PM||Remote control not allowed; using loopback address |
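The message lines above all follow a `timestamp|project|message` layout, so they are easy to pick apart when comparing reports. The following is a rough sketch, not an official BOINC tool: the names `LogEntry`, `parse_line`, and `transfer_error_code` are hypothetical, and the timestamp is kept as plain text because the date format differs between posts in this thread.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str  # kept as text; date formats vary between machines
    project: str    # empty for client-wide messages
    message: str


# Matches the numeric code inside a <file_xfer_error> block.
ERROR_CODE = re.compile(r"<error_code>(-?\d+)</error_code>")


def parse_line(line: str) -> LogEntry:
    """Split one message line on its first two '|' separators."""
    timestamp, project, message = line.rstrip("\n").split("|", 2)
    return LogEntry(timestamp.strip(), project.strip(), message.strip())


def transfer_error_code(entry: LogEntry) -> Optional[int]:
    """Return the <error_code> value in the message, or None if absent."""
    match = ERROR_CODE.search(entry.message)
    return int(match.group(1)) if match else None
```

Running `transfer_error_code(parse_line(...))` over a saved message log would pick out the lines carrying the -161 file transfer error, which makes it easier to see which results were affected.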
|
|
|
|
|
My next one ran without erroring out, so I don't know if the watchdog was active or if it was simply supposed to run normally.
http://ralph.bakerlab.org/workunit.php?wuid=83916
Result: http://ralph.bakerlab.org/result.php?resultid=94405
____________
"I'm trying to maintain a shred of dignity in this world." - Me
 |
|
|
|
|
|
24/04/2006 3:14:57 AM|ralph@home|Unrecoverable error for result NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0 (<file_xfer_error> <file_name>NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
http://ralph.bakerlab.org/result.php?resultid=94190
____________
|
|
|
|
|
|
A new error for me.
http://ralph.bakerlab.org/result.php?resultid=94349
This was error no 2 on this WU.
And another one
http://ralph.bakerlab.org/result.php?resultid=94350
Anders n
Edit no 2
____________
|
|
|
|
|
24/04/2006 3:14:57 AM|ralph@home|Unrecoverable error for result NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0 (<file_xfer_error> <file_name>NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
http://ralph.bakerlab.org/result.php?resultid=94190
I thought credit was supposed to be granted on these?
____________
 |
|
|
|
|
24/04/2006 3:14:57 AM|ralph@home|Unrecoverable error for result NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0 (<file_xfer_error> <file_name>NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
http://ralph.bakerlab.org/result.php?resultid=94190
I thought credit was supposed to be granted on these?
The system IS supposed to award the claimed credits for "Watchdog"-terminated Work Units. But the ones I have seen so far have always had some model information reported back. Yours seems to have a -161 error, implying that something file-related is in play. Rhiju will have to explain why it did not get awarded.
As you may recall, in RALPH the credit will not be awarded after the fact, but we still need to know why it did not get awarded in the first place before this deploys to Rosetta.
____________
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact |
|
|
|
|
|
Hm, if you are looking for results that have been "killed" by the watchdog and did not get credits, look at:
http://ralph.bakerlab.org/results.php?userid=581
There you can see several results that have not been credited.
____________

Supporting BOINC, a great concept ! |
|
|
|
|
|
Thanks for the post; Moderator9 is right about what happened.
Sorry about the annoying file transfer error -- I've fixed it. We will be testing it on ralph 5.04 later today.
24/04/2006 3:14:57 AM|ralph@home|Unrecoverable error for result NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0 (<file_xfer_error> <file_name>NOCHECK_DEFAULT_DOG_7486h002_dec184_1.pdb_418_5_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
http://ralph.bakerlab.org/result.php?resultid=94190
I thought credit was supposed to be granted on these?
The system IS supposed to award the claimed credits for "Watchdog"-terminated Work Units. But the ones I have seen so far have always had some model information reported back. Yours seems to have a -161 error, implying that something file-related is in play. Rhiju will have to explain why it did not get awarded.
As you may recall, in RALPH the credit will not be awarded after the fact, but we still need to know why it did not get awarded in the first place before this deploys to Rosetta.
____________
|
|
|
|
|
|
GREAT! I actually forced an infinite loop in that one. Very glad it was killed by the watchdog.
I had this one: http://ralph.bakerlab.org/workunit.php?wuid=83796
Result: http://ralph.bakerlab.org/result.php?resultid=94327
I looked at the graphics when I saw it running, and it was stuck at about 1% without any movement at all. So I suppose the watchdog did its job by killing it after some time.
It ran about 80 minutes on my computer. It could have been killed a little sooner, I think, as it was already totally dead when I looked after about 45 minutes. I see that the others who ran it got through less time before it was killed.
____________
|
|
|