| Author | Message |
This version has some boinc-related fixes in the watchdog and graphics. |
Hi,
I have Ralph WU result 149188 stuck in the BOINC Manager queue at 100% with a time of 1:20:42. Its status is "Running", but in the graphics window the result is shown as completed at 67.2% with a time of 1:20:45.
CPU usage is 50%, and only the other WU is really running (I have a P4 2.6 GHz with HT, i.e. 2 logical CPUs). In other words, 2 Ralph WUs are listed as running, but one uses the CPU at 50% and the other at 0%.
PS: This result has now completed successfully and was reported with these messages:
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
{Edit} No other problems.
{Edit 2} Something similar was described in this post.
____________
 |
This version has some boinc-related fixes in the watchdog and graphics. I confirmed graphics has been fixed. It works more smoothly than before.
____________
|
My Ralph computes the WUs to 100% but doesn't send them. They still show as "active", but there is no further calculation; the program just continues with my Rosetta WUs...
Oh, it DID send the WU after some time, sorry!!!
____________
|
I ran 3 work units.
Two actually completed, but I suspect the "dormant" bug is still present: the first work unit (149120) completed in EXACTLY 1 hour with 36 min of CPU time, and the other one (149885) completed in EXACTLY 2 hours with 81 min of CPU time.
The third (149194) errored out with:
Unrecoverable error for result t296__CASP7_ABINITIO_SAVE_ALL_OUT_hom013__614_2_1 (One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003))
The upload indicated a watchdog shutdown.
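For reference, the two numbers in that error line are the same 32-bit value written two ways: 0x80000003 read as a signed integer is -2147483645 (on Windows, 0x80000003 is the breakpoint exception status). A quick Python sketch of the conversion; the helper names are made up for illustration and are not part of BOINC:

```python
def to_signed32(value):
    """Interpret an unsigned 32-bit value as a signed two's-complement int."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value):
    """Inverse: map a signed 32-bit exit code back to its unsigned form."""
    return value & 0xFFFFFFFF


print(to_signed32(0x80000003))          # -2147483645
print(hex(to_unsigned32(-2147483645)))  # 0x80000003
```

So a negative exit code in the client messages can be turned back into the hex status by masking it to 32 bits.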
Mike
____________
 |
I ran 3 work units.
Two actually completed, but I suspect the "dormant" bug is still present: the first work unit (149120) completed in EXACTLY 1 hour with 36 min of CPU time, and the other one (149885) completed in EXACTLY 2 hours with 81 min of CPU time.
...
Mike
The "dormant" bug showed up in this one as well: http://ralph.bakerlab.org/workunit.php?wuid=131456
Result: http://ralph.bakerlab.org/result.php?resultid=148875
And since nobody was watching it, my computer went into sleep mode, so the upload only started when I got back to it. This means my computer sat idle for some time when it could have been crunching something else. :-(
So I aborted the next WU and have set Ralph to No new work until you have this sorted out. I don't want a computer sitting in sleep mode until I can get to it again and let it continue crunching. In the worst case that can be a whole day. :-(
EDIT: Can't you make a watchdog that activates the WU again after it has been idle for, let's say, 3 minutes? Or 5 minutes? When it isn't crunching, my computer goes into sleep mode after 15 minutes.
____________
"I'm trying to maintain a shred of dignity in this world." - Me
 |
EDIT: Can't you make a watchdog that activates the WU again after it has been idle for, let's say, 3 minutes? Or 5 minutes? When it isn't crunching, my computer goes into sleep mode after 15 minutes.
This is a bug that was introduced after 5.16, so I hope they can spot it and fix it completely rather than adding another safety mechanism.
____________
|
My Ralph computes the WUs to 100% but doesn't send them. They still show as "active", but there is no further calculation; the program just continues with my Rosetta WUs...
Oh, it DID send the WU after some time, sorry!!!
I've noticed this too with 5.19 and 5.20. My pref is set to 2 hours and my crunching interval is 2:01. The WUs I've been getting happen to finish early, say at 1:45, and go to 100%, but then pause instead of completing. Nothing else, such as downloads, has triggered early rescheduling. The next time the WU gets crunch time, it completes immediately and uploads.
It's not causing any problems, but it's definitely different behavior, and after about 5 in a row (not counting one that errored out) it doesn't seem coincidental.
sample result |
3 WUs went fine, the 4th got stuck at 100% for hours: http://ralph.bakerlab.org/result.php?resultid=150036
3 more to go...
____________
|
(Too late to edit.) Another one is sitting idle at 100% - http://ralph.bakerlab.org/result.php?resultid=150039 so 2 of 6 got stuck at the finish in my case.
____________
|
Rom tells me it is waiting for the watchdog to finish for debugging.
Here is his response:
"When I added code .... to wait until the thread is finished, it stalls for up to 30 minutes waiting until watchdog makes its next check."
I think the watchdog can take up to 2x the CPU run time pref, which may explain the longer stalls.
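To illustrate why waiting for the thread can stall: if a watchdog thread sleeps for a fixed interval between checks, anything that waits for it to finish has to sit out the rest of that interval. Below is a minimal Python sketch of one common way around this. It is not the actual Rosetta/BOINC code; the `Watchdog` class, its names and the 30-minute interval are assumptions for illustration only. The thread waits on an event with a timeout instead of sleeping, so a shutdown request wakes it immediately.

```python
import threading
import time


class Watchdog:
    """Hypothetical watchdog thread; not the actual Rosetta/BOINC code."""

    def __init__(self, check_interval_s):
        self.check_interval_s = check_interval_s
        self.checks = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait() returns True as soon as stop() sets the event, so the
        # thread does not have to sleep out the rest of the interval.
        while not self._stop.wait(timeout=self.check_interval_s):
            self.checks += 1  # placeholder for the real health check

    def stop(self):
        """Signal the thread, wait for it, and return the seconds spent waiting."""
        started = time.monotonic()
        self._stop.set()
        self._thread.join()
        return time.monotonic() - started


if __name__ == "__main__":
    dog = Watchdog(check_interval_s=30 * 60)  # assumed 30-minute check interval
    dog.start()
    print(f"shutdown took {dog.stop():.3f} s")
```

With a plain `time.sleep(check_interval_s)` in the loop instead, the `join()` in `stop()` could block for up to the full interval, which matches the kind of stall described in the quote.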
____________
|
Rom tells me it is waiting for the watchdog to finish for debugging.
Here is his response:
"When I added code .... to wait until the thread is finished, it stalls for up to 30 minutes waiting until watchdog makes its next check."
I think the watchdog can take up to 2x the CPU run time pref, which may explain the longer stalls.
Does this mean it was intentionally implemented for debugging purposes? You could have saved us some investigation if you had told us. Anyway, it's good to know that the cause is known and won't delay any further development.
____________
|
How long should you wait before aborting WUs stuck at 100%? Why does my firewall show a lot of traffic for the BOINC Ralph client even though it's stuck at 100%? I have all other projects suspended to see if the WU will report.
____________
|
How long should you wait before aborting WUs stuck at 100%? Why does my firewall show a lot of traffic for the BOINC Ralph client even though it's stuck at 100%? I have all other projects suspended to see if the WU will report.
Wild guess: The client is downloading (BIIIG) symbol tables for the debug output??
Norbert
____________
|
Rosetta_betta_5.20 Windows
# This process generated 2 decoys from 2 attempts
BOINC :: Watchdog shutting down...
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x77F9193C
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
I aborted this result because it was running at 0.0000% CPU, i.e. STUCK
http://ralph.bakerlab.org/result.php?resultid=150083
With 5.19 I waited 6 hours to see what would happen, and rebooted too -:(
*My preferred run time for Ralph is 1 hour
But I will not do this anymore -:)
CPU temperature changes can crack silicon
and render my 7 GHz computer inoperable
*I am also crunching for Rosetta (CASP7) - in need of more CPU power!
Now,
if the CPU temperature drops below 60 C and the alarm sounds,
I act immediately to find the cause -:)
So, if I go to sleep, I stop crunching for Ralph first.
*A stuck condition might occur while I am asleep
Thanks
____________
Click signature for global team stats
  |
Rom tells me it is waiting for the watchdog to finish for debugging.
Here is his response:
"When I added code .... to wait until the thread is finished, it stalls for up to 30 minutes waiting until watchdog makes its next check."
I think the watchdog can take up to 2x the CPU run time pref, which may explain the longer stalls.
Yes, but my problem is that my computer goes into sleep mode after 15 minutes, and what then? Then it stays that way until I get to it and can start it again. And then, if I'm unlucky, I can sit and wait with an idle computer for an hour until the clock triggers the upload.
No, I'm still on No new work here. :-(
____________
"I'm trying to maintain a shred of dignity in this world." - Me
 |
Hi everybody: Rom and I fixed this silly watchdog thing. I'm sending out work now with ralph 5.21! Thanks for helping us out with this.
Rom tells me it is waiting for the watchdog to finish for debugging.
Here is his response:
"When I added code .... to wait until the thread is finished, it stalls for up to 30 minutes waiting until watchdog makes its next check."
I think the watchdog can take up to 2x the CPU run time pref, which may explain the longer stalls.
Yes, but my problem is that my computer goes into sleep mode after 15 minutes, and what then? Then it stays that way until I get to it and can start it again. And then, if I'm unlucky, I can sit and wait with an idle computer for an hour until the clock triggers the upload.
No, I'm still on No new work here. :-(
____________
|
Had 4-5 computing errors over the weekend with 5.20.
All with a similar error. See below.
WU: 132525
Outcome Client error
Client state Computing
Exit status 1 (0x1)
Computer ID 913
Report deadline 8 Jun 2006 23:40:23 UTC
CPU time 0.550792
stderr out <core_client_version>5.4.9</core_client_version>
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Exit at: .\fragments.cc line:767
</stderr_txt>
Validate state Invalid
--
RodEllery |
Had 4-5 computing errors over the weekend with 5.20.
All with a similar error. See below.
WU: 132525
Outcome Client error
Client state Computing
Exit status 1 (0x1)
Computer ID 913
Report deadline 8 Jun 2006 23:40:23 UTC
CPU time 0.550792
stderr out <core_client_version>5.4.9</core_client_version>
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Exit at: .\fragments.cc line:767
</stderr_txt>
Validate state Invalid
--
RodEllery
I got that error when I killed 5.20.exe in Windows Task Manager; I thought it was stuck. For the next unit I waited about 2 hours after it hit 100%, and it reported.
____________
|
Hi everybody: Rom and I fixed this silly watchdog thing. I'm sending out work now with ralph 5.21! Thanks for helping us out with this.
Ok, let me give it a try again.
____________
"I'm trying to maintain a shred of dignity in this world." - Me
 |
You could have saved us some investigation if you had told us. Anyway, it's good to know that the cause is known and won't delay any further development.
It wasn't intentional. We posted as soon as we found the cause.
edit: actually, I think it was intentional, to help diagnose issues with the watchdog. I think it was taking a bit longer than expected, and it has since been fixed by Rhiju and Rom.
____________
|
Rom tells me it is waiting for the watchdog to finish for debugging. edit: to debug the watchdog.
Here is his response:
"When I added code .... to wait until the thread is finished, it stalls for up to 30 minutes waiting until watchdog makes its next check."
I think the watchdog can take up to 2x the CPU run time pref, which may explain the longer stalls.
____________
|