Discussion of the "1% Hang" issue

UBT - Halifax--lad
Joined: 15 Feb 06
Posts: 29
Credit: 2,723
RAC: 0
Message 26 - Posted: 16 Feb 2006, 8:11:01 UTC - in response to Message 14.  

I have already aborted them, sorry. The next WU is already at 4.59%!

BYE H (?!?)

Please remember that this is a test of WUs, so if any do that, keep them going to see what happens.

ID: 26

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 291 - Posted: 19 Feb 2006, 4:56:01 UTC
Last modified: 19 Feb 2006, 5:03:56 UTC

Are the people with the "1% bug" ever getting past Model 1 and starting Model 2???

See what I found out in this post
I might have only discovered by myself what everyone else already knew.
tony
ID: 291

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 293 - Posted: 19 Feb 2006, 5:45:42 UTC - in response to Message 292.  
Last modified: 19 Feb 2006, 5:48:19 UTC


This does raise the question: if the default setting is for a 1-hour run, can a slow system run even one model? With Rosetta it became obvious to many that if the application was swapped out every 60 min, the work unit would never get past 1%. So clearly, if a checkpoint is only written when a model completes, then 1 hour may not be enough time for slow machines.


At least with 4.85 Barcode, my SLOOOW Celeron always (so far) has gotten to model two in one hour. How slow a processor would it take not to get past model 1? The exact second I passed a model is when my % done updated. I can see this easily on my Celeron because it's slow, and pretty easily on my P4 1.8; I can even see it on my AMD64 3700, which takes about a minute per model on an HBLR WU.
ID: 293

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 294 - Posted: 19 Feb 2006, 5:51:12 UTC

I don't think "switch" time has anything to do with the 1% bug, since many users report it stuck for multiple hours (I've seen 14 hours reported).

tony
ID: 294

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 296 - Posted: 19 Feb 2006, 6:08:28 UTC - in response to Message 295.  


Some of the 1% hangs were in fact a direct result of not keeping the application in memory and allowing swaps to occur at 60 min. This combination prevented slower machines from checkpointing, and at every swap the work unit would have to start over at 1%. This is the reason the Rosetta project has said that you should set the system to keep the application in memory and/or set application switching to more than 1:30.

Part of what Ralph is all about is fixing that particular problem. The rest of the 1% hangs are from a different cause.

OK, it's late. I can see your point that a PII 400 might not checkpoint in one hour, and I don't know which WUs are more prone to these errors, or whether certain CPU types, OSes, etc. are more affected. I hope the Project team can see what settings users have set up and what processors they have, and can try to draw some correlation. I don't and can't have that info. I can hope that what I've brought up lights a spark in one of the people looking into this, to say "hey, that isn't it, but it reminds me that...."
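
The checkpoint argument in this exchange can be made concrete with a toy simulation. This is only a sketch under stated assumptions: a checkpoint is written only when a model completes, the application is evicted from memory exactly at every switch interval unless "leave in memory" is set, and work since the last checkpoint is lost on eviction. The function name `models_completed` and its parameters are made up for illustration; nothing here is taken from the Rosetta or BOINC source.

```python
def models_completed(model_minutes, switch_minutes, keep_in_memory, total_minutes):
    """Count models finished after total_minutes of CPU time.

    Assumption (not from the project source): the only checkpoint is at the
    end of a model, and eviction at a switch discards un-checkpointed work.
    """
    completed = 0
    progress = 0  # minutes of work done on the current model
    for minute in range(1, total_minutes + 1):
        progress += 1
        if progress == model_minutes:
            completed += 1  # model done, checkpoint written
            progress = 0
        if not keep_in_memory and minute % switch_minutes == 0:
            progress = 0    # swapped out of RAM: restart the current model
    return completed


# Slow host: one model needs 90 minutes, switch interval is 60 minutes.
print(models_completed(90, 60, keep_in_memory=False, total_minutes=600))
print(models_completed(90, 60, keep_in_memory=True, total_minutes=600))
print(models_completed(90, 100, keep_in_memory=False, total_minutes=600))
```

In this toy model the first case never finishes a model, which matches the "never gets past 1%" description, while keeping the app in memory or raising the switch time above the model time lets it make progress.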
ID: 296

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 301 - Posted: 19 Feb 2006, 7:33:25 UTC
Last modified: 19 Feb 2006, 7:43:15 UTC

I'd be curious to know the model # and step number it's frozen at, but don't want you to lose the possibility of them asking you to do something first. This data is on the graphic.
Is it a 4.83, 4.85??
What's your switch between projects time?
Are you doing more than one project?
Is this a Hyperthreading host?
CPU type?
ID: 301

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 310 - Posted: 19 Feb 2006, 15:15:06 UTC

There are 3 types of 1% bug.

1) The obvious cause, a slow PC not checkpointing before the app is removed from RAM,
was already discussed in this thread.

Obvious fixes for this type:

a) Users increasing "Switch between applications every" to an adequate time
b) Users answering YES to "Leave applications in memory while preempted"
*and* not rebooting the PC too frequently
c) Developers reducing the checkpoint interval to what users
set in "Write to disk at most every" nn seconds.
*The default, 60 seconds, should be enough even for a slow PC

2) The second type I see on my Linux PC, kernel 2.4.x
*The app stops using CPU and the whole system goes IDLE !!!

*The successful fix I am using, until the developers come up with something better,
is:
pkill boinc
ps xu
Keep running ps xu until no more boinc or rosetta tasks, or any other boinc app,
appear in the output of ps xu
Then restart boinc, e.g.: ./boinc -redirectio &
*These WUs end successfully -:)
*Only too much babysitting until they end OK

3) The third type (so far I have had only one), on Windows:
the app stays in RAM consuming 99.98% of CPU, but does not progress
*After 4 hours or more of CPU time @ 5 GHz or more, those jobs are still at 1%

*This can only be fixed by the developers -> a new rosetta app version
*I believe this is some sort of endless loop in the ab initio routine

ps: I just reported this one type on the 1% bug thread -:}
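
The "keep running ps xu until nothing is left" step of the Linux workaround above could be scripted roughly as follows. This is a hedged sketch, not a tested replacement for the manual procedure: `wait_until_gone` and its parameters are hypothetical names, the caller has to supply `list_process_names` (for example by parsing `ps` output), and the code deliberately does not run `pkill` or restart BOINC itself.

```python
import time


def wait_until_gone(list_process_names, keywords=("boinc", "rosetta"),
                    poll_seconds=2.0, max_polls=30, sleep=time.sleep):
    """Poll until no process name contains any of the keywords.

    list_process_names is a caller-supplied function returning the current
    process names (an assumption of this sketch). Returns True once the
    list is clean, False if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        names = list_process_names()
        still_running = [n for n in names
                         if any(k in n.lower() for k in keywords)]
        if not still_running:
            return True
        sleep(poll_seconds)
    return False


# Example with canned snapshots instead of a real process list.
snapshots = iter([["boinc", "rosetta_4.85"], ["rosetta_4.85"], ["bash"]])
print(wait_until_gone(lambda: next(snapshots), sleep=lambda s: None))
```

Only after this returns True would one restart the client as described in the post; the `max_polls` limit is there so a process that refuses to exit does not make the loop wait forever.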


ID: 310

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 311 - Posted: 19 Feb 2006, 15:28:37 UTC

Thanks, I have so many questions, but don't want to seem like a pain asking them. That helps

tony
ID: 311

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 333 - Posted: 19 Feb 2006, 20:25:12 UTC

I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended into RAM.
Anything to do ?
ID: 333

David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 335 - Posted: 19 Feb 2006, 21:30:38 UTC - in response to Message 333.  
Last modified: 19 Feb 2006, 21:32:55 UTC

I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended into RAM.
Anything to do ?


Hi,

There is a request from dekim a few posts down in this thread:

https://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.



ID: 335

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 336 - Posted: 19 Feb 2006, 21:59:04 UTC - in response to Message 335.  

I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended into RAM.
Anything to do ?


Hi,

There is a request from dekim a few posts down in this thread:

https://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.


I am not sure!
I prefer to wait some more rather than risk removing this job from RAM.
*In case I am wrong, he will post again, following this post.

btw: the English name for the Roman (archaic Latin) name "argentum" is silver
ID: 336

Dimitris Hatzopoulos
Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 341 - Posted: 20 Feb 2006, 0:05:57 UTC
Last modified: 20 Feb 2006, 0:27:48 UTC

Sorry for intervening, but I'm trying to understand how to tell the various bugs apart.

Carlos, does your Rosetta executable keep running, consuming 100% of CPU time (as seen via Win Task Manager (alt-ctrl-del etc.) or using some tool like ProcessExplorer (free, standalone exe, no install required, I've been using it for years))?

Because I've never encountered a Rosetta WU that was "stuck" consuming 100% CPU time ad infinitum. The ones I've seen "stuck" were all stopped (loaded in memory, BOINC thought they were running, but "top" or "ps" revealed that Rosetta wasn't running, it was "SN" = stopped, nice).

And killing just the Rosetta task (not ./boinc or anything else, which has been happily running continuously for 1+ month now) will have BOINC restart the WU with a different random seed, and it'll finish OK this time (on the handful of occasions I've encountered so far).
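
To make the distinction being asked about here explicit, the cases described in this thread can be written down as a small decision rule. This is an illustrative sketch only: the function name `classify_stall`, the two inputs, and the CPU thresholds are assumptions chosen for the example, not values used by BOINC or Rosetta.

```python
IDLE_BELOW = 5.0   # percent CPU; assumed cutoff for "not really running"
BUSY_ABOVE = 90.0  # percent CPU; assumed cutoff for "pegging the CPU"


def classify_stall(cpu_percent, resets_on_switch):
    """Roughly categorise a WU that sits at 1%, from two observations.

    cpu_percent: CPU share of the Rosetta process (e.g. from top or a task manager).
    resets_on_switch: True if CPU time or progress drops back after each app switch.
    """
    if resets_on_switch:
        return "restart loop"  # swapped out before reaching a checkpoint
    if cpu_percent < IDLE_BELOW:
        return "stopped"       # in memory, but the process is not running
    if cpu_percent > BUSY_ABOVE:
        return "spinning"      # uses the CPU without making progress
    return "unclear"


print(classify_stall(99.98, resets_on_switch=False))
print(classify_stall(0.0, resets_on_switch=False))
```

The "stopped" case corresponds to the state reported by "top" or "ps" in the post above, and the "spinning" case to a process that keeps consuming close to 100% of the CPU.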

ID: 341

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 342 - Posted: 20 Feb 2006, 0:14:37 UTC - in response to Message 340.  


*Isn't it better if I abort it now?

*What I expected from him was instructions on how to do an interactive trace
of the run, step by step

-or- using Dr. Watson to get a memory dump of my 512M of RAM and e-mailing him
that dump

*Never a brute-force test -:(

Carlos, you have a winner there. Please don't abort it; keep it in memory. You may have the WU we testers need to fix this. I'd wait until instructed what to do next. Remember it's Sunday. Leaving Ralph or that WU suspended is important to Ralph and is the whole reason Ralph even exists.

I wish I had what you have, I really do.

tony
ID: 342

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 344 - Posted: 20 Feb 2006, 1:13:48 UTC - in response to Message 341.  

Sorry for intervening, but I'm trying to understand how to tell the various bugs apart.

Carlos, does your Rosetta executable keep running, consuming 100% of CPU time (as seen via Win Task Manager (alt-ctrl-del etc.) or using some tool like ProcessExplorer (free, standalone exe, no install required, I've been using it for years))?

Because I've never encountered a Rosetta WU that was "stuck" consuming 100% CPU time ad infinitum. The ones I've seen "stuck" were all stopped (loaded in memory, BOINC thought they were running, but "top" or "ps" revealed that Rosetta wasn't running, it was "SN" = stopped, nice).

And killing just the Rosetta task (not ./boinc or anything else, which has been happily running continuously for 1+ month now) will have BOINC restart the WU with a different random seed, and it'll finish OK this time (on the handful of occasions I've encountered so far).


I use this one
http://www.iarsn.com/taskinfo.html
and YES rosetta is "stuck", consuming 100% CPU time, ad infinitum

*Not exactly 100% but 99.98% ... remaining 0.02% are used by network.
ID: 344

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 345 - Posted: 20 Feb 2006, 1:17:18 UTC

Carlos, go to the "Work tab", highlight the stuck wu, then select "suspend". It should stop it, but keep it in memory until they get a chance to respond, and you can continue crunching other work.

tony
ID: 345

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 348 - Posted: 20 Feb 2006, 2:29:53 UTC - in response to Message 347.  
Last modified: 20 Feb 2006, 2:31:47 UTC

I have one that is stuck WU, Computer. It has been going for 2 days, 20 hours, 58 minutes and 4 seconds of CPU time. This machine is currently estimating 8 hours for completion of other results.

Awaiting further instructions.

jm7

Hi John, two others had this issue, and Mod9 sent for help. David Kim responded with this. He hasn't advised further. You could read the whole thread and get a better feel for his intentions.

tony

[Edit] Mod9 wants to keep this thread just about reporting bugs. He started this thread for discussions about this bug. I have much material there.
ID: 348

Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 746 - Posted: 28 Feb 2006, 17:29:37 UTC
Last modified: 28 Feb 2006, 17:40:00 UTC

I have two ver 4.90 WUs that "appear" to hang @ 1%, but they actually appear to have just slowed down to a crawl. Both of them are acting similarly in that they race up to Step 34,000 (Model 1) in about 30 minutes and then sloooowly creep forward, accomplishing only 50-100 additional steps in 30 additional minutes of processing time.
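
For a sense of scale, the step counts quoted above can be turned into rates. This is just arithmetic on the numbers in the post; the helper name `steps_per_minute` is made up for the example, and the figures are the approximate ones given ("about 30 minutes", "50-100 additional steps").

```python
def steps_per_minute(steps, minutes):
    """Average step rate over an interval."""
    return steps / minutes


fast = steps_per_minute(34_000, 30)   # race up to Step 34,000
slow_low = steps_per_minute(50, 30)   # lower bound of the crawl
slow_high = steps_per_minute(100, 30) # upper bound of the crawl

slowdown_min = fast / slow_high
slowdown_max = fast / slow_low

print(f"fast phase: {fast:.0f} steps/min")
print(f"slow phase: {slow_low:.1f} to {slow_high:.1f} steps/min")
print(f"slowdown:   {slowdown_min:.0f}x to {slowdown_max:.0f}x")
```

That works out to roughly 1,133 steps per minute in the fast phase against about 1.7 to 3.3 in the slow phase, i.e. something like 340 to 680 times slower, which is why the WU looks hung even though it is still advancing.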

I had rebooted both hosts when they "appeared" to be stuck (@ 4+ hours of processing time), and they both reset to 0:00 (since they must not have "checkpointed"). I will keep them running as long as they progress forward, and will report my results regardless.

They are still in Model 1 at this time.

WU10642

WU11437

BTW, is there a fixed number of steps in Model 1, i.e., a goal if you will, to know how close a WU is to completing Model 1 and checkpointing?


ID: 746

Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 771 - Posted: 1 Mar 2006, 17:15:47 UTC

Thanks.

BTW, both WU's listed two posts ago have since completed, one in about 6 hours, the other at least 3 times that.


ID: 771



©2024 University of Washington
http://www.bakerlab.org