Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here
Author | Message |
---|---|
STE\/E Send message Joined: 16 Feb 06 Posts: 27 Credit: 2,214,911 RAC: 0 |
I finally have (or had) a WU hang @ 1% for 6:30:20. I restarted BOINC to see if it would get past the 1% mark, but it has run for 30 minutes now & it's still hung @ 1%. I suspended it for now to run out the rest of the WUs on that computer. |
![]() ![]() Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Update: a lorry knocked down a lighting pole, and the electricity company took more than half an hour to make the repair. After power was restored and the machine rebooted, I resumed that WU ... Now it is at: CPU time: 0 hr 27 min 43 sec, 18.7% complete, Stage: Full Atom Relax, Model: 6, Step: 34000, Accepted RMSD: 11.27, Accepted Energy: -227.6638. I regret that we lost the "Captured Bug". However, the WU(s) of John McLeod and the one of PoorBoy remain, each with a "Captured Bug". Does anyone know WHY no additional instructions have been sent to any of the three of us so far??? At this rate, it is better to unconditionally abort any WU that stays at 1% for more than 5 minutes -:( *Maybe I will write a script to do this automatically, or to restart BOINC automatically, so we do not lose days of CPU power in an endless loop. |
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
To everyone on this thread with a stuck/hung WU that is Suspended! David Kim e-mailed me the following instructions for ALL of you - "If users suspect hitting the 1% bug, they should let it continue for a few hours or even a day on Ralph and then restart boinc to see if it continues on after a restart. They should also post the result id on the forum so we can look at them when we get a chance to and explain what happens. Thanks! David K This comes direct from the Project Team. Please record what you can and post it with links here BRIEFLY" I will be moving a few of the previous messages to the 1% hang discussion thread, to keep this thread trim. So if you posted something here that did not contain specific reporting information about a hung WU and can't find it, look there. Moderator9 RALPH@home FAQs RALPH@home Guidelines Moderator Contact |
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
"...Do someone knows WHY no additional instructions was sent to none of our 3" Instructions have been provided. You should follow instructions provided by David Kim, David Baker, or any of the forum Moderators. The Moderators are in direct contact with the project team and have been given the required guidance. Thank you. |
![]() ![]() Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Is it possible to increase that time? I realized that 5 minutes is too short ... I have one WU that spent 43 minutes at 1%, and it is not the 1% bug. *How to find out IF a WU is stuck: select it and press Show Graphics. If all the graphics are frozen and the only thing moving is the CPU time, then after watching the graphics for, say, 5 minutes, you can conclude that the WU is stuck. *IF the CPU time is not moving, the WU is either suspended or paused -> do not post those. Otherwise the WU is really stuck, and I suggest posting: CPU time, stage, model, step, Accepted RMSD, Accepted Energy, Rosetta version, workunit. *The last two are shown in the header of the graphics screen. Then do what Dr. Kim asked: kill BOINC, wait some time for the RAM to clear, and start BOINC again. If the WU remains stuck after this procedure, suspend it, post the results, and wait for additional instructions, which will hopefully be sent before your PC reboots again for some reason -:( IMHO: I believe that needing to restart BOINC so that Rosetta can get past its stuck point is a bug!!! However, this is what Dr. Kim recommends. *Other apps do not need BOINC to be restarted. |
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
... Of course it is a bug. That is what this project is trying to fix. The purpose of restarting and letting it run to the finish is so the WU will report in and they can look at it, not to work around the issue itself. They want the error data that reports back with the WU. That is why they want you to post a BRIEF explanation of the problem, and a LINK to the result. NEVER ABORT THE WUs. ALWAYS LET THEM RUN UNTIL DONE OR CRASH ON THEIR OWN AND REPORT BACK. |
John McLeod VII![]() Send message Joined: 16 Feb 06 Posts: 8 Credit: 39,560 RAC: 25 |
|
Snake Doctor Send message Joined: 16 Feb 06 Posts: 37 Credit: 998,880 RAC: 0 |
I don't know if I should report this here, but I just had a 1% hang in the Rosetta Project App version 4.82. The info for the WU is posted here. EDIT: Oops, I just found one in RALPH too. This one hung at 4.25%. Both of these are on Mac OS 10.4.5, and both machines are G4s: one is a laptop, one is a dual desktop. The dual is running Application 4.83. I reset the time parameter because my system went into EDF because I was testing the longer deadlines. When I changed that, one of the two WUs I had finished and uploaded, and this one stopped running for an app swap. I will watch it when it restarts. The WU is here, and the result will be here when it reports. |
![]() ![]() Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Stuck at 1%: https://ralph.bakerlab.org/result.php?resultid=5967 *Computer IDLE -> load average: 0.00, 0.00, 0.00. Restarted BOINC. |
![]() ![]() Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
The result from my previous post has now finished, uploaded, and been reported. "You can no longer edit this post." If the edit window were longer than 60 minutes, I could have added this information to the post itself. Thanks |
Dimitris Hatzopoulos Send message Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
"Stuck at 1%" Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand? I've had several WUs "stuck" (ps shows "SN" = sleeping, nice for the R task - I got it right this time :-)) under Linux. The CPU is idle, as in your example, and the BOINC queue will freeze until I kill the R task. Common things between your setup and mine are BOINC v5.2.14 (optimized), Linux kernel 2.4.x (x=27 in my case) and just 256MB RAM. In those cases, just kill the R process and let BOINC restart it. Every time but once, the WU completed fine (obviously with a different seed). Oddly, I've NEVER had R get stuck under WinXP. A month or more ago, I followed dekim/baker instructions to run R from the cmd-line by hand with the same WU/seed that bombed under BOINC, in R's own dir, and it finished OK. |
![]() ![]() Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
"Carlos, how exactly do you determine 'stuck at 1%' under your Linux host? Do you check stdout.txt by hand?" A yellow line in my BoincView, miles away from the host. "Oddly, I've NEVER had R get stuck under WinXP" See one now: stuck at 1% - New Rosetta 4.82, released Feb 18, 2006 https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712 06:30:59 hours of CPU time at 100%, Pentium IV 1800 MHz (stock speed). I have restarted BOINC two times by now, to no avail - still stuck at 1%. *I have rebooted too! Help! |
![]() ![]() Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
still stuck at 1% CPU Time 12:36:44 p4 1.8G stock speed - HELP ! https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712 |
River~~ Send message Joined: 20 Feb 06 Posts: 20 Credit: 503 RAC: 0 |
"... how exactly do you determine 'stuck at 1%' under your Linux host? Do you check stdout.txt by hand?" There are various ways: 1. Use the BOINC Manager (this works locally if you have a GUI, or can be used remotely - you can even use the standard manager from a Win box to monitor a Linux box - see details on remote control in the wiki). 2. Use BoincView from a Win box. 3. On the Linux command line, from the BOINC directory, use ./boinc_cmd --get_state|less and look for the fraction complete (presumably it will stick at 0.01 ??). 4. Again on the command line, from the BOINC directory, use less client_state.xml, then type /active and repeat the search with / until you get to the one you want to look at, and scroll down until you see the fraction done tag. (If you don't understand what I mean here, please see man less or info less for how to drive the less utility.) Hope that helps someone (& feel free to borrow for any FAQ or wiki). R~~ |
Dimitris Hatzopoulos Send message Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
River, thanks for the suggestion. So far I had just been grepping the stdout.txt file for "pct_co" (I think R has recently changed the location and now stores it under the WU description, so I have to look inside slots/x/stdout.txt to find the "real" filename), e.g. $ fgrep pct_comp ~boinc/BOINC/projects/ralph.bakerlab.org/BARCODE_30_4ubpA_215_6_0_0 | tail BOINC :: [2006-02-23 18:49:47] :: num_decoys: 65 :: number_of_output: 71 :: pct_complete: 0.907516 BOINC :: [2006-02-23 18:54:16] :: num_decoys: 66 :: number_of_output: 71 :: pct_complete: 0.916776 BOINC :: [2006-02-23 19:02:34] :: num_decoys: 67 :: number_of_output: 71 :: pct_complete: 0.933897 BOINC :: [2006-02-23 19:09:26] :: num_decoys: 68 :: number_of_output: 71 :: pct_complete: 0.948104 BOINC :: [2006-02-23 19:12:11] :: num_decoys: 69 :: number_of_output: 72 :: pct_complete: 0.953801 Edit: the prior example was for RALPH/R4.84; current R has stdout in the /slots dir: fgrep pct_comp ~boinc/BOINC/slots/0/stdout.txt BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01 BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882 BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625 |
genes![]() Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,300 RAC: 0 |
I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped: on the empty boxes the upper right corners are folded down and to the left. This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens. |
STE\/E Send message Joined: 16 Feb 06 Posts: 27 Credit: 2,214,911 RAC: 0 |
Right now I have 12 of them stuck @ 1% ... Some of them for as long as 3 hours, and none of them are making it past the 1% mark so far. I have my preferences set to 2 hours run time, so something is not right if they're still @ 1% after 3 hours, at least I would think so anyway. |
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
"I don't know how long it's supposed to spend 'initializing', but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the 'native' box and one in the 'searching' box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left." I saw one do that the other day, but it started and while it was initializing it did an application swap. This left the display just as you described it until the WU started up again. See if this is what is happening on your system. |
IceQueen41![]() Send message Joined: 22 Feb 06 Posts: 6 Credit: 9,473 RAC: 0 |
"I don't know how long it's supposed to spend 'initializing', but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the 'native' box and one in the 'searching' box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left." Mine started the same way (WU, Result). It should get past the initialization sooner or later (mine took over 10 minutes on a decently fast processor), and then it only does one trajectory, which is why it appears to be stuck at 1% (mine was at 1% literally until it finished). It also seems to use a much slower algorithm with long intervals between steps, so don't restart or abort it until you know it's not going anywhere. Hope this helps! |
genes![]() Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,300 RAC: 0 |
"I don't know how long it's supposed to spend 'initializing', but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the 'native' box and one in the 'searching' box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left." It is swapped out right now. We'll see what happens when it comes back. The machine is a Dual P3, 1GHz. Run-time prefs set to 4 hours. [edit] It's back. It has gotten past the "initializing" stage, and is on step 25000 or so. Still at 1%, but running (steps counting, graphics moving). Verrry slowwwly. I suspect debugging code has been put into it, much like when Seti Boinc first started out (and was crashing constantly). [/edit] |
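
The pct_complete records that Dimitris greps out of stdout.txt in this thread could also be checked by a small script instead of by eye. What follows is only a minimal Python sketch, not project code. It assumes the log format quoted in the thread (a bracketed timestamp followed later by "pct_complete: <number>"). The names parse_progress and looks_stuck, the 3-hour limit, and the 0.01 floor are hypothetical choices made for illustration. As IceQueen41 points out, some WUs legitimately sit at 1% until they finish, so a positive result is at most a hint to follow the reporting instructions above, never a reason to abort.

```python
import re
from datetime import datetime, timedelta

# One record = "[YYYY-MM-DD HH:MM:SS]" followed, before the next "[",
# by "pct_complete: <number>". This matches the lines quoted in the thread
# whether they sit on separate lines or were flattened onto one line.
RECORD = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\][^\[]*?"
    r"pct_complete:\s*(\d+(?:\.\d+)?)"
)


def parse_progress(text):
    """Return a list of (timestamp, fraction_complete) pairs, in file order."""
    samples = []
    for stamp, pct in RECORD.findall(text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        samples.append((when, float(pct)))
    return samples


def looks_stuck(samples, now, limit=timedelta(hours=3), floor=0.01):
    """Heuristic only (limit and floor are assumed values, not project rules).

    True when the newest record is still at or below `floor` and at least
    `limit` has passed since that record was written.
    """
    if not samples:
        return False
    last_time, last_pct = samples[-1]
    return last_pct <= floor and now - last_time >= limit


if __name__ == "__main__":
    # Two of the records quoted earlier in the thread.
    log = (
        "BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: "
        "number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01\n"
        "BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: "
        "number_of_output: 27 :: pct_complete: 0.0359882\n"
    )
    samples = parse_progress(log)
    print(samples[-1][1], looks_stuck(samples, datetime(2006, 2, 23, 23, 0, 0)))
```

The check deliberately looks only at the newest record: once any record above the floor appears, the WU has moved on and is no longer flagged. In real use the text would be read from the stdout.txt file mentioned in the thread and `now` would be `datetime.now()`.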
©2023 University of Washington
http://www.bakerlab.org