Message boards : RALPH@home bug list : Discussion of the "1% Hang" issue
Author | Message |
---|---|
UBT - Halifax--lad Joined: 15 Feb 06 Posts: 29 Credit: 2,723 RAC: 0 |
"I have already aborted them, sorry. The next WU is already at 4.59%!" Please remember this is a test of WUs, so if any do that, keep them going to see what happens. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Are the people with the "1% bug" ever getting past Model 1 and starting Model 2? See what I found out in this post. I might have only discovered by myself what everyone else already knew. tony |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
At least with 4.85 Barcode, my SLOOOW Celeron has always (so far) gotten to Model 2 within one hour. How slow a processor would it take not to get past Model 1? My % done updated at the exact second I passed a model. I can see this easily on my Celeron because it's slow, and pretty easily on my P4 1.8; I can even see it on my AMD64 3700, which takes about a minute per model on an HBLR WU. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
I don't think "switch" time has anything to do with the 1% bug, since many users report it stuck for multiple hours (I've seen 14 hours reported). tony |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
OK, it's late. I can see your point: a PII 400 might not checkpoint in one hour. I don't know which WUs are more prone to these errors, or whether certain CPU types, OSes, etc. are more affected. I hope the project team can see what settings the users have set up and what processors they have, and can try to draw some correlation. I don't and can't have that info. I hope what I've brought up lights a spark in one of the people looking into this to say "hey, that isn't it, but it reminds me that...." |
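The interaction discussed here between model run time, the switch interval, and keeping tasks in memory can be sketched as a toy simulation. It rests on a loud assumption that the thread suggests but does not confirm: progress is saved only when a model finishes. The function name and parameters are hypothetical and are not part of BOINC or Rosetta.

```python
def models_completed(model_minutes, switch_minutes, slices, leave_in_memory):
    """Count models finished after `slices` scheduling turns.

    ASSUMPTION: a checkpoint is written only when a model finishes, so an
    unfinished model restarts from zero if the task was unloaded from RAM.
    """
    done = 0
    carried = 0.0  # minutes already spent on the current, unfinished model
    for _ in range(slices):
        available = carried + switch_minutes
        finished, leftover = divmod(available, model_minutes)
        done += int(finished)
        # Partial work survives preemption only if the task stays in memory.
        carried = leftover if leave_in_memory else 0.0
    return done
```

With a 90-minute model and a 60-minute switch interval, `models_completed(90, 60, 10, False)` stays at 0, which would look like a task pinned at its starting percentage, while passing `True` lets the same ten turns finish 6 models.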
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
I'd be curious to know the model number and step number it's frozen at, but I don't want you to lose the possibility of them asking you to do something first. This data is on the graphic. Is it a 4.83 or a 4.85? What's your "switch between projects" time? Are you running more than one project? Is this a hyperthreading host? What CPU type? |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
There are 3 types of 1% bug. 1) The obvious cause, a slow PC not checkpointing before the app is removed from RAM, was already discussed in this thread. Obvious fixes for this type: a) users increasing "Switch between applications every" to an adequate time; b) users answering YES to "Leave applications in memory while preempted" and not rebooting the PC too frequently; c) developers reducing the checkpoint interval to what users set in "Write to disk at most every" nn seconds. The default, 60 seconds, should be enough even for a slow PC. 2) The second type I see on my Linux PC, kernel 2.4.x: the app stops using CPU and the whole system goes IDLE! The successful fix I am using, until the developers come up with something better, is: pkill boinc, then ps xu; keep running ps xu until no boinc or rosetta tasks, or any other BOINC app, appear in its output. Then restart boinc, e.g.: ./boinc -redirectio & These WUs end successfully :-) It just takes too much babysitting until they end OK. 3) The third type (so far I have had only one), on Windows: the app stays in RAM consuming 99.98% of CPU, but does not progress. After 4 hours or more of CPU time @ 5 GHz or more, those jobs are still at 1%. This can only be fixed by the developers, with a new Rosetta app version. I believe this is some sort of endless loop in the ab initio routine. PS: I just reported this one type on the 1% bug thread :-} |
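The Linux workaround in the post above (stop the client, poll `ps xu` until nothing is left, then restart) could be scripted roughly as follows. This is a sketch, not a tested BOINC tool: `boinc_processes`, `restart_boinc`, the polling interval, and the retry limit are all hypothetical, and the commands are simply the ones quoted in the post.

```python
import subprocess
import time


def boinc_processes():
    """Return the `ps xu` lines that mention boinc or rosetta."""
    out = subprocess.run(["ps", "xu"], capture_output=True, text=True).stdout
    lines = out.splitlines()[1:]  # skip the header row
    return [ln for ln in lines if "boinc" in ln or "rosetta" in ln]


def restart_boinc(list_procs=boinc_processes, run=subprocess.run,
                  start=subprocess.Popen, sleep=time.sleep,
                  poll_seconds=2.0, max_polls=30):
    """Stop BOINC, wait until no related process remains, then restart it.

    Returns True if the client was restarted, False if processes were
    still present after `max_polls` checks (ASSUMED limits, tune freely).
    """
    run(["pkill", "boinc"])
    for _ in range(max_polls):
        if not list_procs():
            break
        sleep(poll_seconds)
    else:
        return False  # something is still running; do not start a second client
    start(["./boinc", "-redirectio"])  # command as given in the post
    return True
```

The process lister, runner, and sleeper are passed in as parameters so the waiting logic can be exercised without touching a real BOINC installation.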
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Thanks. I have so many questions, but don't want to seem like a pain asking them. That helps. tony |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
|
David@home Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
"I was out for about 1 hour to do love." Hi, there is a request from dekim a few posts down in this thread: https://ralph.bakerlab.org/forum_thread.php?id=1#328p My best guess is that this is for both of us. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
"I was out for about 1 hour to do love." I am not sure! I prefer to wait some more rather than risk removing this job from RAM. In case I am wrong, he will post again, following this post. BTW: the English name for the Roman (archaic Latin) "argentum" is silver. |
Dimitris Hatzopoulos Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
Sorry for intervening, but I'm trying to understand how to tell the various bugs apart. Carlos, does your Rosetta executable keep running, consuming 100% of CPU time (as seen via Windows Task Manager (Ctrl-Alt-Del etc.) or a tool like ProcessExplorer (free, standalone exe, no install required; I've been using it for years))? I ask because I've never encountered a Rosetta WU that was "stuck" consuming 100% CPU time ad infinitum. The ones I've seen "stuck" were all inactive: loaded in memory, and BOINC thought they were running, but "top" or "ps" revealed that Rosetta wasn't running (its state was "SN", i.e. sleeping and niced). Killing just the Rosetta task (not ./boinc or anything else, which has been happily running continuously for 1+ month now) makes BOINC restart the WU with a different random seed, and it finishes OK this time (on the handful of occasions I have encountered so far). |
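The distinction drawn in this post, a task that burns CPU without progressing versus one that sits in memory using no CPU, can be written down as a small classifier. This is a sketch under stated assumptions: the function names and the 90% threshold are hypothetical, and the state letters follow the procps `ps` convention, where `R` is running, `S` is interruptible sleep, `T` is stopped, and a trailing `N` marks a low-priority (niced) process.

```python
PS_STATES = {
    "R": "running",
    "S": "sleeping",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}


def describe_stat(stat):
    """Translate a `ps` STAT value such as "SN" into words."""
    state = PS_STATES.get(stat[:1], "unknown")
    return state + (", nice" if "N" in stat[1:] else "")


def classify_task(cpu_percent, progressing, busy_threshold=90.0):
    """Label a task from its CPU use and whether % done is moving.

    ASSUMPTION: 90% is an arbitrary cut-off for "busy"; adjust as needed.
    """
    if progressing:
        return "ok"
    if cpu_percent >= busy_threshold:
        return "spinning"  # busy but stuck, like the Windows report in this thread
    return "idle"          # loaded but not running, like the Linux reports
```

For example, `describe_stat("SN")` gives "sleeping, nice", and a task at 99.98% CPU with no progress is classified as "spinning".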
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Carlos, you have a winner there. Please don't abort it; keep it in memory. You may have the WU we testers need to fix this. I'd wait until instructed what to do next. Remember, it's Sunday. Leaving Ralph or that WU suspended is important to Ralph and is the whole reason Ralph even exists. I wish I had what you have, I really do. tony |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
"Sorry for intervening, but I'm trying to understand how to tell the difference of various bugs." I use this one: http://www.iarsn.com/taskinfo.html and YES, Rosetta is "stuck", consuming 100% CPU time, ad infinitum. Not exactly 100% but 99.98%; the remaining 0.02% is used by the network. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Carlos, go to the "Work" tab, highlight the stuck WU, then select "Suspend". That should stop it but keep it in memory until they get a chance to respond, and you can continue crunching other work. tony |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
"I have one that is stuck: WU, Computer. It has been going for 2 days, 20 hours, 58 minutes and 4 seconds of CPU time. This machine is currently estimating 8 hours for completion of other results." Hi John, two others had this issue, and Mod9 sent for help. David Kim responded with this. He hasn't advised further. You could read the whole thread to get a better feel for his intentions. tony [Edit] Mod9 wants to keep this thread just for reporting bugs. He started this thread for discussions about this bug. I have much material there. |
Stargazer257 Joined: 16 Feb 06 Posts: 6 Credit: 17,492 RAC: 0 |
I have two ver 4.90 WUs that "appear" to hang at 1%, but they actually just seem to have slowed to a crawl. Both behave similarly: they race up to step 34,000 (Model 1) in about 30 minutes and then sloooowly creep forward, accomplishing only 50-100 additional steps in 30 additional minutes of processing time. I had rebooted both hosts when they "appeared" to be stuck (at 4+ hours of processing time), and they both reset to 0:00 (since they must not have checkpointed). I will keep them running as long as they progress, and will report my results regardless. They are still in Model 1 at this time. WU10642 WU11437 BTW, is there a fixed number of steps in Model 1, i.e., a goal if you will, to know how close a WU is to completing Model 1 and checkpointing? |
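To put numbers on the slowdown reported in this post, the step counts and times can be turned into rates. The figures come straight from the post; the helper names are hypothetical, and this is plain arithmetic, not anything Rosetta itself reports.

```python
def steps_per_minute(steps, minutes):
    """Average step rate over a time window."""
    return steps / minutes


def slowdown_range(fast_rate, slow_steps_low, slow_steps_high, slow_minutes):
    """Return (smallest, largest) slowdown factor for a range of slow step counts."""
    slowest = steps_per_minute(slow_steps_low, slow_minutes)
    fastest = steps_per_minute(slow_steps_high, slow_minutes)
    return fast_rate / fastest, fast_rate / slowest


fast = steps_per_minute(34_000, 30)            # first 30 minutes
low, high = slowdown_range(fast, 50, 100, 30)  # next 30 minutes
print(f"early rate: {fast:.0f} steps/min")
print(f"slowdown: about {low:.0f}x to {high:.0f}x")
```

The early phase runs at roughly 1,133 steps per minute, against about 1.7 to 3.3 steps per minute afterwards, so the task is advancing some 340 to 680 times more slowly rather than being fully frozen.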
Stargazer257 Joined: 16 Feb 06 Posts: 6 Credit: 17,492 RAC: 0 |
|
©2024 University of Washington
http://www.bakerlab.org