Discussion of the "1% Hang" issue

Message boards : RALPH@home bug list : Discussion of the "1% Hang" issue

Dimitris Hatzopoulos
Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 341 - Posted: 20 Feb 2006, 0:05:57 UTC
Last modified: 20 Feb 2006, 0:27:48 UTC

Sorry for intervening, but I'm trying to understand how to tell the various bugs apart.

Carlos, does your Rosetta executable keep running, consuming 100% of CPU time (as seen via Windows Task Manager (Ctrl-Alt-Del etc.) or a tool like ProcessExplorer: free, standalone exe, no install required, I've been using it for years)?

Because I've never encountered a Rosetta WU that got "stuck" consuming 100% CPU time ad infinitum. The ones I've seen "stuck" were all stopped (loaded in memory; BOINC thought they were running, but "top" or "ps" revealed that Rosetta wasn't running, it was "SN" = stopped, nice).

And killing just the Rosetta task (not ./boinc or anything else, which has been happily running continuously for 1+ month now) will have BOINC restart the WU with a different random seed, and it'll finish OK this time (on the handful of occasions I've encountered so far).
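The "ps" check described above can be scripted. What follows is a minimal sketch, assuming a Linux host with procps `ps`; `describe_stat` and `looks_halted` are hypothetical helper names, not part of BOINC or Rosetta. Under procps conventions, `T` marks a stopped process and `N` a low-priority (nice) one, while `S` usually means interruptible sleep, so the exact letters seen may vary by platform.

```python
# Hypothetical helpers: decode the STAT column printed by Linux procps `ps`.
# Letter meanings follow procps conventions; other systems may differ.
PRIMARY_STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped by job control signal",
    "t": "stopped by debugger during tracing",
    "Z": "zombie",
}
MODIFIERS = {
    "N": "low priority (nice)",
    "<": "high priority",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def describe_stat(stat: str) -> list[str]:
    """Translate a STAT string such as 'SN' into readable descriptions."""
    if not stat:
        raise ValueError("empty STAT string")
    parts = [PRIMARY_STATES.get(stat[0], f"unknown state {stat[0]!r}")]
    parts += [MODIFIERS.get(ch, f"unknown flag {ch!r}") for ch in stat[1:]]
    return parts


def looks_halted(stat: str) -> bool:
    """True when the first STAT letter says the process is stopped."""
    return stat[:1] in ("T", "t")


print(describe_stat("SN"))
print(looks_halted("TN"))
```

A task that BOINC lists as running but whose STAT starts with `T` would match the "loaded in memory but not running" case, whereas a task pinned at `R` matches the 100% CPU case.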

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 342 - Posted: 20 Feb 2006, 0:14:37 UTC - in response to Message 340.  


*Isn't it better if I abort it now?

*What I expected from him was instructions on how to do an interactive trace
of the run, step by step

-or- using Dr. Watson to get a memory dump of my 512M of RAM and e-mailing him
that dump

*Never a brute-force test :-(

Carlos, you have a winner there. Please don't abort it; keep it in memory. You may have the WU we testers need to fix this. I'd wait until instructed what to do next. Remember, it's Sunday. Leaving Ralph or that WU suspended is important to Ralph and is the whole reason Ralph even exists.

I wish I had what you have, I really do.

tony

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 344 - Posted: 20 Feb 2006, 1:13:48 UTC - in response to Message 341.  

Sorry for intervening, but I'm trying to understand how to tell the various bugs apart.

Carlos, does your Rosetta executable keep running, consuming 100% of CPU time (as seen via Windows Task Manager (Ctrl-Alt-Del etc.) or a tool like ProcessExplorer: free, standalone exe, no install required, I've been using it for years)?

Because I've never encountered a Rosetta WU that got "stuck" consuming 100% CPU time ad infinitum. The ones I've seen "stuck" were all stopped (loaded in memory; BOINC thought they were running, but "top" or "ps" revealed that Rosetta wasn't running, it was "SN" = stopped, nice).

And killing just the Rosetta task (not ./boinc or anything else, which has been happily running continuously for 1+ month now) will have BOINC restart the WU with a different random seed, and it'll finish OK this time (on the handful of occasions I've encountered so far).


I use this one
http://www.iarsn.com/taskinfo.html
and YES, Rosetta is "stuck", consuming 100% CPU time, ad infinitum.

*Not exactly 100% but 99.98%; the remaining 0.02% is used by the network.

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 345 - Posted: 20 Feb 2006, 1:17:18 UTC

Carlos, go to the "Work" tab, highlight the stuck WU, then select "Suspend". It should stop it but keep it in memory until they get a chance to respond, and you can continue crunching other work.

tony

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 348 - Posted: 20 Feb 2006, 2:29:53 UTC - in response to Message 347.  
Last modified: 20 Feb 2006, 2:31:47 UTC

I have one that is stuck: WU, Computer. It has been going for 2 days, 20 hours, 58 minutes and 4 seconds of CPU time. This machine is currently estimating 8 hours for completion of other results.

Awaiting further instructions.

jm7

Hi John, two others had this issue, and Mod9 sent for help. David Kim responded with this. He hasn't advised further. You could read the whole thread to get a better feel for his intentions.

tony

[Edit] Mod9 wants to keep this thread just about reporting bugs. He started this thread for discussions about this bug. I have much material there.
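The figures quoted in this post (2 days, 20 hours, 58 minutes and 4 seconds of CPU time against an 8-hour estimate) can be turned into a rough overrun check. This is only a sketch: `overrun_factor`, `looks_stuck`, and the threshold of 3.0 are illustrative assumptions, not anything BOINC or RALPH@home defines.

```python
from datetime import timedelta


def overrun_factor(cpu_time: timedelta, estimate: timedelta) -> float:
    """How many times longer than the estimate the task has already run."""
    if estimate <= timedelta(0):
        raise ValueError("estimate must be positive")
    return cpu_time / estimate


def looks_stuck(cpu_time: timedelta, estimate: timedelta,
                threshold: float = 3.0) -> bool:
    """Flag a task whose CPU time exceeds the estimate by an assumed factor."""
    return overrun_factor(cpu_time, estimate) >= threshold


# Values reported in the quoted post.
cpu_time = timedelta(days=2, hours=20, minutes=58, seconds=4)
estimate = timedelta(hours=8)

print(f"{overrun_factor(cpu_time, estimate):.1f}x the estimate")  # 8.6x the estimate
print(looks_stuck(cpu_time, estimate))
```

A large overrun alone does not distinguish a true hang from a very slow WU, so a flag like this is only a prompt to look closer.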

Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 746 - Posted: 28 Feb 2006, 17:29:37 UTC
Last modified: 28 Feb 2006, 17:40:00 UTC

I have two ver 4.90 WUs that "appear" to hang @ 1%, but they actually just appear to have slowed down to a crawl. Both of them are acting similarly in that they race up to Step 34,000 (Model 1) in about 30 minutes and then sloooowly creep forward, accomplishing only 50-100 additional steps in 30 additional minutes of processing time.

I had rebooted both hosts when they "appeared" to be stuck (@ 4+ hours of processing time), and they both reset to 0:00 (since they must not have "checkpointed"). I will keep them running as long as they progress forward, and will report my results regardless.

They are still in Model 1 at this time.

WU10642

WU11437

BTW, is there a fixed number of steps in Model 1, i.e., a goal if you will, to know how close a WU is to completing Model 1 and checkpointing?
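To put the reported slowdown in numbers, here is a small worked calculation using only the figures from this post (34,000 steps in about 30 minutes, then 50-100 steps in another 30 minutes). The function name `steps_per_minute` is just an illustrative choice.

```python
def steps_per_minute(steps: float, minutes: float) -> float:
    """Average simulation steps completed per minute of processing time."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return steps / minutes


# Figures reported in the post above.
fast_rate = steps_per_minute(34_000, 30)     # initial phase
slow_rate_low = steps_per_minute(50, 30)     # slowest reported crawl
slow_rate_high = steps_per_minute(100, 30)   # fastest reported crawl

slowdown_min = fast_rate / slow_rate_high
slowdown_max = fast_rate / slow_rate_low

print(f"fast phase: {fast_rate:.0f} steps/min")
print(f"slow phase: {slow_rate_low:.1f} to {slow_rate_high:.1f} steps/min")
print(f"slowdown: roughly {slowdown_min:.0f}x to {slowdown_max:.0f}x")
```

A drop of this size, from about 1,133 steps per minute to a few steps per minute, would make the progress bar look frozen even though the WU is still advancing.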


Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 766 - Posted: 1 Mar 2006, 14:26:27 UTC - in response to Message 746.  
Last modified: 1 Mar 2006, 14:36:47 UTC

I have two ver 4.90 WUs that "appear" to hang @ 1%, but they actually just appear to have slowed down to a crawl. Both of them are acting similarly in that they race up to Step 34,000 (Model 1) in about 30 minutes and then sloooowly creep forward, accomplishing only 50-100 additional steps in 30 additional minutes of processing time.

I had rebooted both hosts when they "appeared" to be stuck (@ 4+ hours of processing time), and they both reset to 0:00 (since they must not have "checkpointed"). I will keep them running as long as they progress forward, and will report my results regardless.

They are still in Model 1 at this time.

WU10642

WU11437

BTW, is there a fixed number of steps in Model 1, i.e., a goal if you will, to know how close a WU is to completing Model 1 and checkpointing?


There is a specific number of steps for a model, but it is different for each combination of WU type and run parameter setup. So from the user side you can only determine how many steps are in a model when the WU finishes the first model.

Moderator9

Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 771 - Posted: 1 Mar 2006, 17:15:47 UTC

Thanks.

BTW, both WUs listed two posts ago have since completed, one in about 6 hours, the other in at least 3 times that.






©2024 University of Washington
http://www.bakerlab.org