Discussion of the "1% Hang" issue

Message boards : RALPH@home bug list : Discussion of the "1% Hang" issue

UBT - Halifax--lad
Joined: 15 Feb 06
Posts: 29
Credit: 2,723
RAC: 0
Message 26 - Posted: 16 Feb 2006, 8:11:01 UTC - in response to Message 14.  

I have already aborted them, sorry. The next WU is already at 4.59%!




BYE H (?!?)


Please remember that this is a test of WUs, so if any of them do that, keep them running to see what happens.

Join us in Chat (see the forum) Click the Sig


Join UBT
Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 291 - Posted: 19 Feb 2006, 4:56:01 UTC
Last modified: 19 Feb 2006, 5:03:56 UTC

Are the people with the "1% bug" ever getting past Model 1 and starting Model 2???

See what I found out in this post
I might have only discovered by myself what everyone else already knew.
tony
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 292 - Posted: 19 Feb 2006, 5:39:13 UTC - in response to Message 291.  

Astro wrote:
Are the people with the "1% bug" ever getting past Model 1 and starting Model 2???

See what I found out in this post
I might have only discovered by myself what everyone else already knew.
tony



I can verify at least some of your conjecture. When the work unit starts, it will always jump to 1%. At some point it will increment, as you say. With Rosetta that was an indication of checkpointing. With RALPH I do not know whether these two events are related. In any case, the speed of percentage movement and the speed of model generation have always depended on processor speed.

With the new RALPH process, the number of models completed in, say, an 8 hour run depends on processor speed. If one system is slow and another is fast, the slow system might take 30 min to generate a model, and the fast one might take 5 min. So in an 8 hour run the slow system will generate 16 models, while the fast one will generate 96. The number of models completed is therefore a function of the time setting and the speed of the machine.

When a work unit "hangs" at 1%, the user will not complete model 1. This raises the question: if the default setting is a 1 hour run, can a slow system finish even one model? With Rosetta it became obvious to many that if the application swapped every 60 min, the work unit would never get past 1%. So clearly, if a model completes at a checkpoint, 1 hour may not be enough time for slow machines.
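The models-per-run arithmetic in this post can be checked with a minimal Python sketch. The function name models_per_run and the 90 minute example are illustrative assumptions and are not part of RALPH or BOINC; the 30 min and 5 min figures are the ones given above.

```python
def models_per_run(run_hours: float, minutes_per_model: float) -> int:
    """Number of whole models a host finishes in a run of the given length."""
    if minutes_per_model <= 0:
        raise ValueError("minutes_per_model must be positive")
    # Only complete models count, so use floor division.
    return int(run_hours * 60 // minutes_per_model)


# Figures from the post: an 8 hour run on a slow and a fast host.
for label, minutes in [("slow (30 min/model)", 30), ("fast (5 min/model)", 5)]:
    print(label, models_per_run(8, minutes))

# Hypothetical very slow host: 90 min per model with the 1 hour default run.
print("very slow, 1 hour run:", models_per_run(1, 90))
```

Under these assumptions, a host that needs more than the configured run time for a single model finishes zero models, which is the concern raised about the 1 hour default.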


Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 293 - Posted: 19 Feb 2006, 5:45:42 UTC - in response to Message 292.  
Last modified: 19 Feb 2006, 5:48:19 UTC


Moderator9 wrote:
This raises the question: if the default setting is a 1 hour run, can a slow system finish even one model? With Rosetta it became obvious to many that if the application swapped every 60 min, the work unit would never get past 1%. So clearly, if a model completes at a checkpoint, 1 hour may not be enough time for slow machines.


At least with 4.85 Barcode, my SLOOOW Celeron has always (so far) gotten to model two in one hour. How slow would a processor have to be not to get past model 1? My % done updated at the exact second I passed a model. I can see this easily on my Celeron because it's slow, pretty easily on my P4 1.8, and even on my AMD64 3700, which takes about a minute per model on an HBLR WU.
Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 294 - Posted: 19 Feb 2006, 5:51:12 UTC

I don't think "switch" time has anything to do with the 1% bug, since many users report it stuck for multiple hours (I've seen 14 hours reported).

tony
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 295 - Posted: 19 Feb 2006, 6:01:32 UTC - in response to Message 294.  
Last modified: 19 Feb 2006, 6:09:42 UTC

Astro wrote:
I don't think "switch" time has anything to do with the 1% bug, since many users report it stuck for multiple hours (I've seen 14 hours reported).

tony


Some of the 1% hangs were in fact a direct result of not keeping the applications in memory and allowing swaps to occur at 60 min. This combination prevented slower machines from checkpointing, and at every swap the work unit would have to start over at 1%. This is why the Rosetta project has said that you should set the system to keep applications in memory and/or set application switching to more than 90 min. The increase in checkpoints in RALPH is part of the fix.

Part of what Ralph is all about is fixing that particular problem. The rest of the 1% hangs are from a different cause.
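The swap-related cause described in this post can be sketched in a few lines of Python. This is a simplified model and an assumption for illustration only, not BOINC's actual scheduler logic: it assumes the host really does switch to another project at every interval, and that a preempted app which is not left in memory loses everything since its last checkpoint. The function and parameter names are hypothetical.

```python
def stuck_at_one_percent(minutes_to_first_checkpoint: float,
                         switch_interval_minutes: float,
                         leave_in_memory: bool) -> bool:
    """True if, under this simplified model, the work unit never checkpoints.

    Assumption: when the client switches away and the app is removed from
    memory, all work since the last checkpoint is lost and the work unit
    starts over at 1%.
    """
    if leave_in_memory:
        # The preempted app stays in RAM, so no work is lost at a switch.
        return False
    # Removed from RAM before the first checkpoint: same restart every time.
    return minutes_to_first_checkpoint > switch_interval_minutes


# Hypothetical slow host that needs 75 minutes to reach its first checkpoint.
print(stuck_at_one_percent(75, 60, leave_in_memory=False))  # 60 min switching
print(stuck_at_one_percent(75, 90, leave_in_memory=False))  # longer switching
print(stuck_at_one_percent(75, 60, leave_in_memory=True))   # kept in memory
```

Either of the two recommended settings removes the condition in this model, and more frequent checkpoints shrink minutes_to_first_checkpoint, which has the same effect.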


Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 296 - Posted: 19 Feb 2006, 6:08:28 UTC - in response to Message 295.  


Moderator9 wrote:
Some of the 1% hangs were in fact a direct result of not keeping the applications in memory and allowing swaps to occur at 60 min. This combination prevented slower machines from checkpointing, and at every swap the work unit would have to start over at 1%. This is why the Rosetta project has said that you should set the system to keep applications in memory and/or set application switching to more than 1:30.

Part of what Ralph is all about is fixing that particular problem. The rest of the 1% hangs are from a different cause.

OK, it's late. I can see your point: a PII 400 might not checkpoint in one hour. I don't know which WUs are more prone to these errors, or whether certain CPU types, OSes, etc. are more affected. I hope the project team can see what settings users have chosen and what processors they have, and can try to draw some correlation. I don't and can't have that info. I can only hope that what I've brought up lights a spark in one of the people looking into this, so they say "hey, that isn't it, but it reminds me that...."
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 297 - Posted: 19 Feb 2006, 6:11:10 UTC - in response to Message 296.  
Last modified: 19 Feb 2006, 6:21:25 UTC


Moderator9 wrote:
Some of the 1% hangs were in fact a direct result of not keeping the applications in memory and allowing swaps to occur at 60 min. This combination prevented slower machines from checkpointing, and at every swap the work unit would have to start over at 1%. This is why the Rosetta project has said that you should set the system to keep applications in memory and/or set application switching to more than 1:30.

Part of what Ralph is all about is fixing that particular problem. The rest of the 1% hangs are from a different cause.

Astro wrote:
OK, it's late. I can see your point: a PII 400 might not checkpoint in one hour. I don't know which WUs are more prone to these errors, or whether certain CPU types, OSes, etc. are more affected. I hope the project team can see what settings users have chosen and what processors they have, and can try to draw some correlation. I don't and can't have that info. I can only hope that what I've brought up lights a spark in one of the people looking into this, so they say "hey, that isn't it, but it reminds me that...."



I think I am going to move our discussion to another thread to keep this one trim. We are really off topic for this thread, because it is for reporting hangs, not for discussing the problem. I will start a discussion thread for this issue here.

But they are looking at all aspects of the 1% hang, so please keep reporting.

Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 298 - Posted: 19 Feb 2006, 6:15:33 UTC

I have moved posts here from the reporting thread for the 1% hang problem. Please discuss the issue here and use the other thread only for reporting hangs.
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 301 - Posted: 19 Feb 2006, 7:33:25 UTC
Last modified: 19 Feb 2006, 7:43:15 UTC

I'd be curious to know the model # and step number it's frozen at, but I don't want you to lose the possibility of them asking you to do something first. This data is on the graphic.
Is it a 4.83 or a 4.85?
What's your "switch between projects" time?
Are you doing more than one project?
Is this a Hyperthreading host?
CPU type?
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 305 - Posted: 19 Feb 2006, 14:46:01 UTC
Last modified: 19 Feb 2006, 14:48:19 UTC

To the people with hung work units in memory waiting for instructions:

I will send a note to David Kim to draw his attention to this thread and have him provide you further instructions on what to do. As I write this it is 7:00 am Sunday on the West Coast, so assuming he checks his mail on Sunday mornings, he should get back to you soon. The information you can provide him is valuable, so please hang in there until he gets back to you.


Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 310 - Posted: 19 Feb 2006, 15:15:06 UTC

There are 3 types of 1% bug.

1) The obvious cause, a slow PC not checkpointing before the app is removed from RAM,
has already been discussed in this thread.

Obvious fixes for this type:

a) Users increasing "Switch between applications every" to an adequate time
b) Users answering YES to "Leave applications in memory while preempted"
*and *not* rebooting the PC too frequently
c) Developers reducing the checkpoint interval to what users
set in "Write to disk at most every" nn seconds.
*The default, 60 seconds, should be enough even for a slow PC

2) The second type I see on my Linux PC, kernel 2.4.x
*The app stops using CPU and the whole system goes IDLE !!!

*The successful fix I am using, until the developers come up with something better,
is:
pkill boinc
ps xu
keep running ps xu until no more boinc or rosetta tasks, or any other boinc app,
appear in the output of ps xu
Then restart boinc ... e.g.: ./boinc -redirectio &
*These WUs end successfully -:)
*It just takes too much babysitting until they end OK

3) The third type (so far I have had only one), on Windows:
the app stays in RAM consuming 99.98% of CPU, but does not progress
*After 4 hours or more of CPU time @ 5 GHz or more, those jobs are still at 1%

*This can only be fixed by the developers -> a new Rosetta app version
*I believe this is some sort of endless loop in the Ab initio routine

ps: I just reported one of this type on the 1% bug thread -:}
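The manual Linux workaround for the second type (kill boinc, keep checking ps until nothing is left, then restart) could be automated roughly as in the Python sketch below. Everything except the pkill boinc and ./boinc -redirectio commands quoted in the post is an assumption: the helper names are hypothetical, and ps -eo comm= is used here in place of ps xu because it prints bare command names that are easier to match.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def list_process_names() -> List[str]:
    """Command names of all running processes, via `ps -eo comm=` (Linux)."""
    result = subprocess.run(["ps", "-eo", "comm="],
                            capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def wait_until_gone(patterns: Iterable[str],
                    lister: Callable[[], List[str]] = list_process_names,
                    poll_seconds: float = 1.0,
                    max_polls: int = 60) -> bool:
    """Poll until no process name contains any pattern; False on timeout."""
    wanted = tuple(p.lower() for p in patterns)
    for _ in range(max_polls):
        names = lister()
        if not any(p in name.lower() for name in names for p in wanted):
            return True
        time.sleep(poll_seconds)
    return False


def restart_boinc() -> None:
    """Kill the client, wait for it and the science apps to exit, restart."""
    subprocess.run(["pkill", "boinc"])           # command from the post
    if wait_until_gone(["boinc", "rosetta"]):
        # Restart command from the post; run from the BOINC directory.
        subprocess.Popen(["./boinc", "-redirectio"])
    else:
        print("boinc or rosetta processes are still running; not restarting")
```

The lister parameter exists only so the polling loop can be exercised without touching real processes. Note that restarting this way frees the work unit, so it destroys the "trapped" state the project team may want to inspect.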


Click signature for global team stats
Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 311 - Posted: 19 Feb 2006, 15:28:37 UTC

Thanks, I have so many questions, but don't want to seem like a pain asking them. That helps

tony
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 314 - Posted: 19 Feb 2006, 15:48:32 UTC - in response to Message 304.  

I got a true 1% bug!
https://ralph.bakerlab.org/result.php?resultid=5090
Clicking the link above does not tell you much, except for CPU type, OS and the like.

The info below I read off the screen saver, and I am typing it here carefully:

CPU time 4 hours 2 minutes 36 seconds (this number is increasing at each second)

*all the rest of the screen is absolutely frozen

1% complete
Stage: Ab initio
model: 1 step : 2001
Accepted RMSD: 6.134
Accepted Energy: -11.31033....


Carlos:

I just looked at the two Work Units that failed to download. They seem to have failed on the same file. This is probably something on the server, like a dropped connection or bad file. I have brought a possible cause to the attention of the server administrator. I suspect you will not see this again, but please report it if you do. Just keep the WU that is hung warm until Dr Kim can post some instructions on what to do with it. Thanks for your help.

Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 315 - Posted: 19 Feb 2006, 16:04:04 UTC - in response to Message 310.  
Last modified: 19 Feb 2006, 16:08:56 UTC

Carlos Wrote:
There are 3 types of 1% bug.

1) The obvious cause, a slow PC not checkpointing before the app is removed from RAM,
has already been discussed in this thread.


Yes, this issue has been discussed a lot, but many people are just now focusing on it, so additional discussion is only natural.

Carlos wrote:
Obvious fixes for this type:

a) Users increasing "Switch between applications every" to an adequate time
b) Users answering YES to "Leave applications in memory while preempted"
*and *not* rebooting the PC too frequently
c) Developers reducing the checkpoint interval to what users
set in "Write to disk at most every" nn seconds.
*The default, 60 seconds, should be enough even for a slow PC


Assuming you meant to say 60 min (not seconds), these are the best answers for the problem at present.

Carlos wrote:
2) The second type I see on my Linux PC, kernel 2.4.x
*The app stops using CPU and the whole system goes IDLE !!!

*The successful fix I am using, until the developers come up with something better,
is:
pkill boinc
ps xu
keep running ps xu until no more boinc or rosetta tasks, or any other boinc app,
appear in the output of ps xu
Then restart boinc ... e.g.: ./boinc -redirectio &
*These WUs end successfully -:)
*It just takes too much babysitting until they end OK


While this will free up the work unit, and it will usually go on to complete successfully, the RALPH project would prefer that you "trap" the work unit if you can. Dr. Kim should show up soon with instructions on what to do with a work unit once you have it trapped.

Carlos wrote:
3) The third type (so far I have had only one), on Windows:
the app stays in RAM consuming 99.98% of CPU, but does not progress
*After 4 hours or more of CPU time @ 5 GHz or more, those jobs are still at 1%

*This can only be fixed by the developers -> a new Rosetta app version
*I believe this is some sort of endless loop in the Ab initio routine


This is the variation of the bug they are working on right now. For what it is worth, this is the only kind of hang that seems to affect ALL of the CPU types working on Rosetta. Generally the Macs are almost trouble free, but this bug affects even them. Some of the other types should go away as a result of the new time settings and other small adjustments to the application. But this type is more difficult to capture and solve. That is why what you are doing in the RALPH project is so important.

Carlos wrote:
ps: I just reported one of this type on the 1% bug thread -:}
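The three variations discussed in this exchange differ in what the host is doing while the percentage stays at 1%. A rough Python sketch of telling them apart from two observations of a work unit follows. The Sample fields, the restarted flag and the label strings are hypothetical names chosen for illustration, and the rules are a simplification of the descriptions above rather than any diagnostic the project uses.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_seconds: float   # accumulated CPU time shown for the work unit
    percent_done: float  # progress shown by the client or screen saver


def classify(prev: Sample, cur: Sample, restarted: bool = False) -> str:
    """Label what happened between two observations of one work unit.

    `restarted` means the app was removed from memory (swap or reboot)
    between the two samples.
    """
    if cur.percent_done > prev.percent_done:
        return "progressing"
    if restarted:
        return "type 1: work lost at swap, no checkpoint reached"
    if cur.cpu_seconds > prev.cpu_seconds:
        return "type 3: CPU busy but no progress"
    return "type 2: app idle, no CPU used"


# Values in the spirit of the reports above: 4 h 2 min 36 s of CPU time at 1%.
before = Sample(cpu_seconds=4 * 3600 + 2 * 60 + 36, percent_done=1.0)
after = Sample(cpu_seconds=before.cpu_seconds + 60, percent_done=1.0)
print(classify(before, after))
```

A healthy but slow host also shows rising CPU time with unchanged progress between checkpoints, so samples would need to be spaced well beyond the expected time per model before reading anything into a "type 3" label.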



Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 316 - Posted: 19 Feb 2006, 16:10:09 UTC - in response to Message 311.  

Astro wrote:
Thanks, I have so many questions, but don't want to seem like a pain asking them. That helps.

tony

No problem, I appreciate your understanding the need to keep the other thread uncluttered.
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 333 - Posted: 19 Feb 2006, 20:25:12 UTC

I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended in RAM.
Anything to do?
Click signature for global team stats
David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 335 - Posted: 19 Feb 2006, 21:30:38 UTC - in response to Message 333.  
Last modified: 19 Feb 2006, 21:32:55 UTC

Carlos_Pfitzner wrote:
I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended in RAM.
Anything to do?


Hi,

There is a request from dekim a few posts down in this thread:

https://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.



Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 336 - Posted: 19 Feb 2006, 21:59:04 UTC - in response to Message 335.  

Carlos_Pfitzner wrote:
I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended in RAM.
Anything to do?

David@home wrote:
Hi,

There is a request from dekim a few posts down in this thread:

https://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.


I am not sure!
I prefer to wait some more rather than risk removing this job from RAM.
*In case I am wrong, he will post again after this post.

btw: the English name for the Roman (archaic Latin) name "argentum" is silver
Click signature for global team stats
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 337 - Posted: 19 Feb 2006, 22:02:36 UTC - in response to Message 336.  

Carlos_Pfitzner wrote:
I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended in RAM.
Anything to do?

David@home wrote:
Hi,

There is a request from dekim a few posts down in this thread:

https://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.

Carlos_Pfitzner wrote:
I am not sure!
I prefer to wait some more rather than risk removing this job from RAM.
*In case I am wrong, he will post again after this post.

btw: the English name for the Roman (archaic Latin) name "argentum" is silver


The post from Dr. Kim was for both of you. He wants you to restart the work unit and see if it will run.

Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact



©2024 University of Washington
http://www.bakerlab.org