Message boards : RALPH@home bug list : Discussion of the "1% Hang" issue
Author | Message |
---|---|
UBT - Halifax--lad Joined: 15 Feb 06 Posts: 29 Credit: 2,723 RAC: 0 |
"I have already aborted them, sorry. The next WU is already at 4.59%!" Please remember this is a test of WUs, so if any do that, keep them running to see what happens. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Are the people with the "1% bug" ever getting past Model 1 and starting Model 2? See what I found out in this post. I might have only discovered for myself what everyone else already knew. tony |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
Astro wrote: "Are the people with the "1% bug" ever getting past Model 1 and starting Model 2???" I can verify at least some of your conjecture. When the work unit starts, it always jumps to 1%. At some point it will increment, as you say. With Rosetta that was an indication of checkpointing. With RALPH I do not know that these two events are related. In any case, the speed of percentage movement and the speed of model generation have always depended on processor speed. With the new RALPH process, the number of models completed in, say, an 8 hour run depends on processor speed. If one system is slow and another is fast, the slow system might take 30 min to generate a model, and the fast one might take 5 min. So in an 8 hour run the slow system will generate 16 models, while the fast one will generate 96. The number of models completed is therefore a function of the time setting and the speed of the machine. When a work unit "hangs" at 1%, the user will not complete model 1. This does raise the question: if the default setting is for a 1 hour run, can a slow system finish even one model? With Rosetta it became obvious to many that if the application swapped every 60 min, the work unit would never get past 1%. So clearly, if a model completes at a checkpoint, then 1 hour may not be enough time for slow machines. |
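The arithmetic in the post above can be checked with a short sketch. The function name `models_completed` and the 70 min per model figure for a very slow host are illustrative assumptions, not anything taken from the RALPH application. With the figures from the post, an 8 hour run is 480 minutes, so 480 / 30 gives 16 models and 480 / 5 gives 96.

```python
def models_completed(run_hours: float, minutes_per_model: float) -> int:
    """Whole models finished in a run of the given length (hypothetical helper)."""
    if minutes_per_model <= 0:
        raise ValueError("minutes_per_model must be positive")
    # Only completed models count, so round down.
    return int(run_hours * 60 // minutes_per_model)


slow = models_completed(8, 30)       # slow host from the post: 30 min per model
fast = models_completed(8, 5)        # fast host from the post: 5 min per model
very_slow = models_completed(1, 70)  # assumed host needing 70 min per model, 1 hour run
print(slow, fast, very_slow)
```

The last case illustrates the question raised in the post: a host that needs more than the run length for a single model finishes none at all.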
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
At least with 4.85 Barcode, my SLOOOW Celeron has always (so far) gotten to model two in one hour. How slow a processor would it take not to get past model 1? My % done updated at the exact second I passed a model. I can see this easily on my Celeron because it's slow, fairly easily on my P4 1.8, and even on my AMD64 3700, which takes about a minute per model on an HBLR WU. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
I don't think "switch" time has anything to do with the 1% bug, since many users report it stuck for multiple hours (I've seen 14 hours reported). tony |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
Astro wrote: "I don't think "switch" time has anything to do with the 1% bug, since many users report it stuck for multiple hours (I've seen 14 hours) reported." Some of the 1% hangs were in fact a direct result of not keeping the applications in memory and allowing swaps to occur at 60 min. This combination prevented slower machines from checkpointing, and at every swap the work unit would have to start over at 1%. This is the reason the Rosetta project has said that you should set the system to keep applications in memory and/or set application switching to more than 90 min. The increase in checkpoints in RALPH is part of the fix. Part of what RALPH is all about is fixing that particular problem. The rest of the 1% hangs are from a different cause. |
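The interaction between switch time and checkpointing described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name `retained_work` is hypothetical, checkpoints are assumed to land at fixed multiples of accumulated work, and the 70 min checkpoint spacing stands in for a slow machine. None of this is taken from the actual BOINC or Rosetta code.

```python
def retained_work(slices: int, switch_minutes: int,
                  checkpoint_minutes: int, keep_in_memory: bool) -> int:
    """Minutes of useful work still held after `slices` scheduling slices.

    Toy model: the app runs for `switch_minutes`, then is preempted.
    If it is unloaded from memory, only work up to the last checkpoint
    (a multiple of `checkpoint_minutes`) survives the swap.
    """
    progress = 0
    for _ in range(slices):
        progress += switch_minutes
        if not keep_in_memory:
            # Roll back to the most recent checkpoint boundary.
            progress = (progress // checkpoint_minutes) * checkpoint_minutes
    return progress


# Slow host (assumed 70 min between checkpoints), 60 min switch, app unloaded:
print(retained_work(8, 60, 70, keep_in_memory=False))  # never reaches a checkpoint
# Same host with "leave applications in memory" enabled:
print(retained_work(8, 60, 70, keep_in_memory=True))
# Same host, app unloaded, but switching every 90 min:
print(retained_work(3, 90, 70, keep_in_memory=False))
```

In this model the first case never banks any work, which matches the described symptom of a work unit restarting at 1% after every swap, while either recommended setting lets progress accumulate.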
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
OK, it's late. I can see your point that a PII 400 might not checkpoint in one hour. I don't know which WUs are more prone to these errors, or whether certain CPU types, operating systems, etc. are more affected. I hope the project team has access to see what settings the users have set up and what processors they have, and can try to draw some correlation. I don't and can't have that info. I can hope that what I've brought up lights a spark in one of the people looking into this, to say "hey, that isn't it, but it reminds me that...." |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
I think I am going to move our discussion to another thread to keep this one trim. We are really off topic for this thread, because it is for reporting hangs, not for discussing the problem. I will start a discussion thread for this issue here. But they are looking at all aspects of the 1% hang, so please keep reporting. |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
I have moved posts here from the reporting thread for the 1% hang problem. Please discuss the issue here and use the other thread only for reporting hangs. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
I'd be curious to know the model number and step number it's frozen at, but I don't want you to lose the possibility of them asking you to do something first. This data is on the graphic. Is it a 4.83 or a 4.85? What's your switch-between-projects time? Are you doing more than one project? Is this a hyperthreading host? What CPU type? |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
To the people with hung work units in memory waiting for instructions: I will send a note to David Kim to draw his attention to this thread so he can provide you with further instructions on what to do. As I write this it is 7:00 am Sunday on the West Coast, so assuming he checks his mail on Sunday mornings, he should get back to you soon. The information you can provide him is valuable, so please hang in there until he gets back to you. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
There are 3 types of 1% bug. 1) The obvious cause, a slow PC not checkpointing before the app is removed from RAM, was already discussed in this thread. The obvious fixes for this type are: a) users increasing "Switch between applications every" to an adequate time; b) users answering YES to "Leave applications in memory while preempted" *and* not rebooting the PC too frequently; c) developers reducing the checkpoint interval to what users set in "Write to disk at most every" nn seconds. The default, 60 seconds, should be enough even for a slow PC. 2) The second type I see on my Linux PC, kernel 2.4.x. The app stops using CPU and the whole system goes idle! The successful fix I am using, until the developers come up with something better, is: run `pkill boinc`, then `ps xu`, and keep running `ps xu` until no boinc or rosetta tasks, or any other BOINC app, appear in its output. Then restart BOINC, e.g. `./boinc -redirectio &`. These WUs end successfully -:) They only need too much babysitting until they end OK. 3) The third type (so far I have had only one), on Windows: the app stays in RAM consuming 99.98% of CPU but does not progress. After 4 hours or more of CPU time @ 5ghz or more, those jobs are still at 1%. This can only be fixed by the developers, with a new rosetta app version. I believe this is some sort of endless loop in the ab initio routine. PS: I just reported this one type on the 1% bug thread -:} |
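The polling step of the Linux workaround above (repeat `ps xu` until no boinc or rosetta task remains) could be sketched as follows. The function names `boinc_tasks` and `wait_until_gone` are hypothetical, and the process lister is passed in as a callable so the logic can be tried without touching a real BOINC installation. Only the commands `pkill boinc`, `ps xu`, and `./boinc -redirectio &` come from the post.

```python
import time
from typing import Callable, List


def boinc_tasks(ps_output: str) -> List[str]:
    """Return the lines of process-listing text that mention boinc or rosetta."""
    return [
        line for line in ps_output.splitlines()
        if "boinc" in line.lower() or "rosetta" in line.lower()
    ]


def wait_until_gone(list_processes: Callable[[], str],
                    poll_seconds: float = 1.0,
                    max_polls: int = 60) -> bool:
    """Poll until no boinc/rosetta task is listed; False if the limit is hit."""
    for _ in range(max_polls):
        if not boinc_tasks(list_processes()):
            return True  # safe to restart the client
        time.sleep(poll_seconds)
    return False
```

On a real host, `list_processes` might be something like `lambda: subprocess.run(["ps", "xu"], capture_output=True, text=True).stdout`, called after `pkill boinc`. Note that a simple substring match would also catch the polling script itself if its own command line contained "boinc".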
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Thanks. I have so many questions, but don't want to seem like a pain asking them. That helps. tony |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
"I get a truly 1% bug !" Carlos: I just looked at the two work units that failed to download. They seem to have failed on the same file. This is probably something on the server, like a dropped connection or a bad file. I have brought a possible cause to the attention of the server administrator. I suspect you will not see this again, but please report it if you do. Just keep the WU that is hung warm until Dr. Kim can post some instructions on what to do with it. Thanks for your help. |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
Carlos wrote: "There are 3 type(s) of 1% bug" Yes, this issue has been discussed a lot, but many people are just now focusing on it, so additional discussion is only natural. "Obvious fixes for this type." Assuming you meant to say 60 min (not seconds), these are the best answers for the problem at present. "2) The second type I see on my Linux PC, kernel 2.4.x" While this will free up the work unit, and it will usually go on to complete successfully, for the RALPH project they would prefer that you "trap" the work unit if you can. Dr. Kim should show up soon with instructions on what to do with a work unit once you have it trapped. "3) The third type (until now I get only one), on Windows" This is the variation of the bug they are working on right now. For what it is worth, this is the only kind of hang that seems to affect ALL of the CPU types working on Rosetta. Generally the Macs are almost trouble free, but this bug affects even them. Some of the other types should go away as a result of the new time settings and other small adjustments to the application. But this type is more difficult to capture and solve. That is why what you are doing in the RALPH project is so important. "ps: I just reported this one type, on the 1% bug thread -:}" |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
Astro wrote: "Thanks, I have so many questions, but don't want to seem like a pain asking them. That helps" No problem. I appreciate your understanding of the need to keep the other thread uncluttered. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
|
David@home Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
"I was out for about 1 hour to do love." Hi, there is a request from dekim a few posts down in this thread: https://ralph.bakerlab.org/forum_thread.php?id=1#328p My best guess is that this is for both of us. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
"I was out for about 1 hour to do love." I am not sure! I prefer to wait some more rather than risk removing this job from RAM. In case I am wrong, he will post again, following this post. BTW: the English name for the Roman (archaic Latin) name "argentum" is silver. |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
"I was out for about 1 hour to do love." The post from Dr. Kim was for both of you. He wants you to restart the work unit and see if it will run. |
©2024 University of Washington
http://www.bakerlab.org