Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here
Author | Message |
---|---|
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Is it possible to increase that time?
I realized that 5 minutes is too short ... I have one WU at 43 minutes at 1%, and this is not the 1% bug. *How to find out IF a WU is stuck: select it and press Show graphics. If all graphics are frozen and the only thing moving is the CPU time, then after watching the graphics for, say, 5 minutes, you can conclude that the WU is stuck. *IF the CPU time is not moving, the WU is either suspended or paused -> do not post those. Otherwise the WU is really stuck, and I suggest posting: cpu time, stage, model, step, Accepted RMSD, Accepted Energy, rosetta version, workunit. *The last 2 are shown in the header of the graphics screen. Then do what Dr. Kim asked for: kill boinc, wait some time for the RAM to clear, and start boinc again. If the WU remains stuck after this procedure, suspend it, post the results, and wait for additional instructions, which hopefully will be sent before your PC reboots again for some reason -:( IMHO: I believe that needing to restart boinc so that rosetta can get past its stuck point is a bug !!! However, this is what Dr. Kim recommends. *Other apps do not need boinc to be restarted. |
John McLeod VII Send message Joined: 16 Feb 06 Posts: 8 Credit: 39,560 RAC: 0 |
|
Snake Doctor Send message Joined: 16 Feb 06 Posts: 37 Credit: 998,880 RAC: 0 |
I don't know if I should report this here, but I just had a 1% hang in the Rosetta Project App version 4.82. The info for the WU is posted here. EDIT: Oops, I just found one in RALPH too. This one hung at 4.25%. Both of these are on Mac OS 10.4.5, and both machines are G4s: one is a laptop, one is a dual desktop. The dual is running Application 4.83. I reset the time parameter because my system went into EDF because I was testing the longer deadlines. When I changed that, one of the two WUs I had finished and uploaded, and this one stopped running for an app swap. I will watch it when it restarts. The WU is here, and the result will be here when it reports. |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Stuck at 1% https://ralph.bakerlab.org/result.php?resultid=5967 *Computer IDLE -> load average: 0.00, 0.00, 0.00 Restarted boinc |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
The result from my previous post has now finished, uploaded, and been reported. "You can no longer edit this post." If the edit window were longer than 60 minutes, I could have included this information in the post itself. Thanks |
Dimitris Hatzopoulos Send message Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
Stuck at 1% Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand? I've had several WUs "stuck" (ps shows "SN" = sleeping, nice for the R task - I got it right this time :-)) under Linux. The CPU is idle, as in your example, and the BOINC queue freezes until I kill the R task. Things my setup has in common with yours are BOINC v5.2.14 (optimized), Linux kernel 2.4.x (x=27 in my case) and just 256MB RAM. In those cases, just kill the R process and let BOINC restart it. Every time but one, the WU then completed fine (obviously with a different seed). Oddly, I've NEVER had R get stuck under WinXP. A month or more ago, I followed the dekim/baker instructions to run R by hand from the command line, in R's own dir, with the same WU/seed that bombed under BOINC, and it finished OK. |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand? A yellow line in my BoincView, miles away from the host. Oddly, I've NEVER had R get stuck under WinXP See them now: stuck at 1% - New rosetta 4.82 released Feb 18, 2006 https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712 06:30:59 hours of CPU time at 100%, Pentium IV 1800 MHz (stock speed). I have restarted boinc twice by now, to no avail - still stuck at 1%. *I have rebooted too! Help! |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
still stuck at 1% CPU Time 12:36:44 p4 1.8G stock speed - HELP ! https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712 |
River~~ Send message Joined: 20 Feb 06 Posts: 20 Credit: 503 RAC: 0 |
... how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand? There are various ways: 1. Use the BOINC Manager (this works locally if you have a GUI, or can be used remotely - you can even use the standard manager from a Win box to monitor a Linux box - see the details on remote control in the wiki). 2. Use BoincView from a Win box. 3. On the Linux command line, from the BOINC directory, use ./boinc_cmd --get_state|less and look for the fraction complete (presumably it will stick at 0.01 ??). 4. Again on the command line, from the BOINC directory, use less client_state.xml, then search with /active and repeat the search with / until you get to the task you want to look at, and scroll down until you see the fraction done tag. (If you don't understand what I mean here, please see man less or info less for how to drive the less utility.) Hope that helps someone (& feel free to borrow for any FAQ or wiki). R~~ |
Dimitris Hatzopoulos Send message Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
River, thanks for the suggestion. So far I had just been grepping the stdout.txt file for "pct_co" (I think R has recently changed the location and now stores it under the WU description, so I have to look inside slots/x/stdout.txt to find the "real" filename), e.g. $ fgrep pct_comp ~boinc/BOINC/projects/ralph.bakerlab.org/BARCODE_30_4ubpA_215_6_0_0 | tail BOINC :: [2006-02-23 18:49:47] :: num_decoys: 65 :: number_of_output: 71 :: pct_complete: 0.907516 BOINC :: [2006-02-23 18:54:16] :: num_decoys: 66 :: number_of_output: 71 :: pct_complete: 0.916776 BOINC :: [2006-02-23 19:02:34] :: num_decoys: 67 :: number_of_output: 71 :: pct_complete: 0.933897 BOINC :: [2006-02-23 19:09:26] :: num_decoys: 68 :: number_of_output: 71 :: pct_complete: 0.948104 BOINC :: [2006-02-23 19:12:11] :: num_decoys: 69 :: number_of_output: 72 :: pct_complete: 0.953801 Edit: the prior example was for RALPH/R4.84; current R has stdout in the /slots dir: fgrep pct_comp ~boinc/BOINC/slots/0/stdout.txt BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01 BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882 BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625 |
genes Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left. This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens. |
STE\/E Send message Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
Right now I have 12 of them stuck @ 1% ... Some of them for as long as 3 hours, and none of them are making it past the 1% mark so far. I have my preferences set to 2 hours run time, so something is not right if they're still @ 1% after 3 hours, at least I would think so anyway. |
IceQueen41 Send message Joined: 22 Feb 06 Posts: 6 Credit: 9,473 RAC: 0 |
I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left. Mine started the same way (WU, Result). It should get past the initialization sooner or later (mine took over 10 minutes on a decently fast processor), and then it only does one trajectory, which is why it appears to be stuck at 1% (mine was at 1% literally until it finished). It also seems to use a much slower algorithm with long intervals between steps, so don't restart or abort it until you know it's not going anywhere. Hope this helps! |
genes Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left. It is swapped out right now. We'll see what happens when it comes back. The machine is a Dual P3, 1GHz. Run-time prefs set to 4 hours. [edit] It's back. It has gotten past the "initializing" stage, and is on step 25000 or so. Still at 1%, but running (steps counting, graphics moving). Verrry slowwwly. I suspect debugging code has been put into it, much like when Seti Boinc first started out (and was crashing constantly). [/edit] |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
stuck at 1% rosetta_beta_4.84 Linux https://ralph.bakerlab.org/result.php?resultid=12969 *load average: 0.01, 0.09, 0.46 crobertp [/home/boinc/BOINC] > ps xu USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND boinc 27682 0.0 0.4 2616 1036 ? SN Feb17 0:00 /bin/bash ./yasuc.sh boinc 24384 0.0 1.5 6244 3772 ? S Feb25 0:52 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme boinc 21886 0.0 1.0 7216 2496 ? S 01:08 0:00 /usr/sbin/sshd boinc 21887 0.0 0.8 3500 2052 pts/1 S 01:08 0:00 -bash boinc 22269 44.3 26.1 172160 64896 ? SN 01:53 11:20 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string boinc 22270 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string boinc 22271 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string boinc 22372 0.0 0.2 2084 624 ? SN 02:16 0:00 sleep 600 boinc 22380 0.0 0.2 2548 672 pts/1 R 02:19 0:00 ps xu crobertp [/home/boinc/BOINC] > Restarting boinc ... |
genes Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done. |
STE\/E Send message Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done. Yes, I would like some clarification on these v4.90's myself. Like genes asked, would setting the run time higher help get these WU's past the 1% mark or let them finish? So far I've only had 1 v4.90 WU finish & that one only ran for 1:10:30, then just abruptly finished and uploaded. It ran the whole time at 1% then just jumped to 100% ... |
IceQueen41 Send message Joined: 22 Feb 06 Posts: 6 Credit: 9,473 RAC: 0 |
Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done. I don't think it's that it's not getting much done, it's that it only runs one trajectory, or model, and from what I've seen, the percentage updates primarily after a trajectory has finished. This would explain why it's on 1% until it's done. |
STE\/E Send message Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done. How you doing IceQueen41, it's hard to tell what these v4.90 Wu's are doing, I have 1 Computer that has 1 Wu @ 5 hr's still showing 1% -- 1 Wu @ 2 hr's showing 47.95% & 1 Wu that finished @ 1 hr 11 min's never showing more than 1% ... Hard to figure them out when they run like that ... I have my Preferences set to run 2 hr's but these v4.90's don't seem to want to adhere to that Preference ... ??? PS: As I posted the above the WU that was @ 5 hr's finished @ 100% & Uploaded. Guess we just have to let them run their course & see what happens to them. |
IceQueen41 Send message Joined: 22 Feb 06 Posts: 6 Credit: 9,473 RAC: 0 |
Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done. Hmm, so I guess that kills my theory... interesting. I've only run a couple... my prefs are set to 2 hours as well, and one ran about 1:45, and the other ran almost 6 hours. Hopefully this will get figured out soon... |
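
The `pct_complete` lines quoted from stdout.txt earlier in the thread can also be checked by a script rather than by eye. The following is a minimal Python sketch of that idea, not anything provided by BOINC or Rosetta: the function names, the 0.01 threshold, and the 30-minute idle window are all assumptions chosen for illustration, and the sample text is copied from the log lines posted above. Because some 4.90 WUs were reported to sit at 1% until they finished, a positive result is only a hint to go and check CPU time and graphics, not proof of a hang.

```python
import re
from datetime import datetime, timedelta

# Matches the log format quoted in the thread, e.g.
# "BOINC :: [2006-02-23 19:48:12] :: ... :: pct_complete: 0.01"
LINE_RE = re.compile(
    r"BOINC :: \[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] ::"
    r".*?pct_complete: (?P<pct>[0-9]+(?:\.[0-9]+)?)"
)

# Sample copied from the slots/0/stdout.txt excerpt posted in the thread.
SAMPLE = """\
BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882
BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625
"""


def parse_progress(text):
    """Return a list of (timestamp, pct_complete) pairs found in the log text."""
    samples = []
    for match in LINE_RE.finditer(text):
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        samples.append((ts, float(match.group("pct"))))
    return samples


def looks_stuck(samples, now, threshold=0.01, max_idle=timedelta(minutes=30)):
    """Heuristic (assumed values): the latest entry is still at or below
    `threshold` and was written more than `max_idle` before `now`."""
    if not samples:
        return False  # no progress lines at all: nothing to judge from
    last_ts, last_pct = samples[-1]
    return last_pct <= threshold and (now - last_ts) > max_idle


if __name__ == "__main__":
    progress = parse_progress(SAMPLE)
    for ts, pct in progress:
        print(ts, pct)
    print("looks stuck:", looks_stuck(progress, now=datetime.now()))
```

To use it on a real host, read the file (for example `slots/0/stdout.txt`, as in the post above) and pass its contents to `parse_progress`. The idle window is deliberately longer than the 5 minutes discussed at the top of the thread, since several posters saw healthy WUs stay at 1% for much longer.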
©2024 University of Washington
http://www.bakerlab.org