Report "stuck at 1%" bugs here

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 395 - Posted: 20 Feb 2006, 20:43:12 UTC


You can no longer edit this post.
Posts can only be edited at most 60 minutes after they have been created.

Is it possible to increase that time?


As things stand, it is better to unconditionally abort
any WU that stays at 1% for more than 5 minutes -:(
*Maybe I will write a script to do this automatically, or to restart boinc
automatically, so we do not lose days of CPU power on an endless loop


I realized that 5 minutes is too short ... I have one WU that has been at 1% for 43 minutes, and this is not the 1% bug
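A minimal sketch of the kind of watchdog check described above, assuming the progress value is sampled periodically by some other means (BOINC Manager, BoincView, or the log files). The function names, the sample format, and the one-hour default threshold are all hypothetical and not a project recommendation; the only point is that the threshold should be adjustable, since 5 minutes proved too short.

```python
from datetime import datetime, timedelta


def flat_duration(samples):
    """Return how long the most recent progress value has been unchanged.

    samples: list of (datetime, fraction_done) tuples, oldest first.
    """
    if not samples:
        return timedelta(0)
    last_time, last_value = samples[-1]
    start = last_time
    # Walk backwards while the progress value stays the same.
    for when, value in reversed(samples):
        if value != last_value:
            break
        start = when
    return last_time - start


def looks_stalled(samples, threshold=timedelta(hours=1)):
    """True if progress has been flat for at least `threshold` (hypothetical default)."""
    return flat_duration(samples) >= threshold


if __name__ == "__main__":
    t0 = datetime(2006, 2, 20, 20, 0, 0)
    samples = [(t0 + timedelta(minutes=m), 0.01) for m in (0, 10, 43)]
    print(flat_duration(samples), looks_stalled(samples))
```

A script built on this would still need to confirm that CPU time is advancing before aborting or restarting anything, because a flat percentage alone does not prove the WU is stuck.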

*How to find out IF a WU is stuck
Select it and press Show graphics

If all the graphics are frozen, and the only thing moving is the CPU time:
after you have watched the graphics for, say, 5 minutes, you can conclude that the WU is stuck.

*IF the CPU time is not moving, the WU is either suspended or paused -> do not post these

Otherwise, the WU is really stuck, and I suggest posting:
cpu time
stage
model
step
Accepted RMSD
Accepted Energy

rosetta version:
workunit:
*These last 2 are shown in the header of the graphics screen
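The checks above could be expressed as the following sketch. It assumes two readings taken a few minutes apart, and it uses the step counter as a stand-in for "graphics moving"; that mapping, the field names, and the sample values are assumptions for illustration only.

```python
def classify(before, after):
    """Classify a WU from two readings taken a few minutes apart.

    Each reading is a dict with 'cpu_time' (seconds) and 'step' (int).
    """
    if after["cpu_time"] <= before["cpu_time"]:
        # CPU time not moving: suspended or paused, not worth reporting.
        return "suspended or paused"
    if after["step"] == before["step"]:
        # CPU time moving but nothing else is: treat as stuck.
        return "stuck"
    return "running"


# Fields the post suggests including in a report.
REPORT_FIELDS = [
    "cpu time", "stage", "model", "step",
    "Accepted RMSD", "Accepted Energy",
    "rosetta version", "workunit",
]


def format_report(values):
    """Build the text to post; fields that were not filled in are left blank."""
    return "\n".join(f"{name}: {values.get(name, '')}" for name in REPORT_FIELDS)


if __name__ == "__main__":
    first = {"cpu_time": 2580, "step": 0}   # hypothetical readings
    later = {"cpu_time": 2880, "step": 0}
    print(classify(first, later))
    print(format_report({"cpu time": "00:48:00", "step": 0}))
```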

Well,
then do what Dr. Kim asked for:
kill boinc,
wait a while for the RAM to clear,
and start boinc again

If the WU remains stuck after this procedure,
suspend it, post the results, and wait for further instructions,
which will hopefully arrive before your PC reboots again for some reason -:(

IMHO: I believe that needing to restart boinc so that rosetta can get past its stuck point is a bug !!!
However, this is what Dr. Kim recommends.
*Other apps do not need boinc to be restarted.
ID: 395
John McLeod VII
Joined: 16 Feb 06
Posts: 8
Credit: 39,560
RAC: 0
Message 400 - Posted: 20 Feb 2006, 21:32:42 UTC

The one that I reported continues on once restarted. The report will be up in a while.


ID: 400
Snake Doctor

Joined: 16 Feb 06
Posts: 37
Credit: 998,880
RAC: 0
Message 411 - Posted: 21 Feb 2006, 4:29:28 UTC
Last modified: 21 Feb 2006, 4:48:16 UTC

I don't know if I should report this here but I just had a 1% hang in the Rosetta Project App version 4.82. The info for the WU is posted here.

EDIT: Oops, I just found one in RALPH too. This one hung at 4.25%. Both of these are on Mac OS 10.4.5, and both machines are G4s: one is a laptop, one is a dual desktop. The dual is running Application 4.83. I reset the time parameter because my system went into EDF because I was testing the longer deadlines. When I changed that, one of the two WUs I had finished and uploaded, and this one stopped running for an app swap. I will watch it when it restarts.

The WU is here, and the result will be here when it reports.
ID: 411
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 481 - Posted: 22 Feb 2006, 15:38:19 UTC

Stuck at 1%
https://ralph.bakerlab.org/result.php?resultid=5967

*Computer IDLE -> load average: 0.00, 0.00, 0.00

Restarted boinc

ID: 481
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 506 - Posted: 22 Feb 2006, 20:52:14 UTC

The result from my previous post has now finished, uploaded, and been reported.

You can no longer edit this post.
Posts can only be edited at most 60 minutes after they have been created


If this limit were longer than 60 minutes, I could have added
this information to the original post itself. Thanks
ID: 506
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 515 - Posted: 23 Feb 2006, 3:40:49 UTC - in response to Message 481.  

Stuck at 1%
https://ralph.bakerlab.org/result.php?resultid=5967

*Computer IDLE -> load average: 0.00, 0.00, 0.00

Restarted boinc


Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?

I've had several WUs "stuck" (ps shows "SN"=sleeping,nice for R task - I got it right this time :-)) under Linux. CPU is idle, as per your example and BOINC queue will freeze until I kill R task.

Common things between your setup and mine are BOINC v5.2.14 (optimized), Linux kernel 2.4.x (x=27 in my case) and just 256MB RAM.

In those cases, just kill the R process and let BOINC restart it. Every time but once, the WU then completed fine (obviously with a different seed). Oddly, I've NEVER had R get stuck under WinXP.

A month or more ago, I followed dekim/baker's instructions to run R by hand from the command line, in R's own dir, with the same WU/seed that bombed under BOINC, and it finished OK.


ID: 515
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 520 - Posted: 23 Feb 2006, 8:35:25 UTC
Last modified: 23 Feb 2006, 8:42:46 UTC

Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?

By the yellow line in my BoincView, which I use to watch the host from miles away

Oddly, I've NEVER had R get stuck under WinXP

Well, look at this one then: stuck at 1% - new rosetta 4.82, released Feb 18, 2006
https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712

06:30:59 hours of CPU time at 100% on a Pentium IV 1800 MHz (stock speed)

I have restarted boinc twice by now, to no avail - still stuck at 1%

*I have rebooted too!

Help!
ID: 520
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 530 - Posted: 23 Feb 2006, 14:51:36 UTC

still stuck at 1%
CPU Time 12:36:44 p4 1.8G stock speed - HELP !
https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712
ID: 530
River~~

Joined: 20 Feb 06
Posts: 20
Credit: 503
RAC: 0
Message 533 - Posted: 23 Feb 2006, 17:16:15 UTC - in response to Message 515.  
Last modified: 23 Feb 2006, 17:17:38 UTC

... how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?


There are various ways

1. Use the BOINC Manager (this works locally if you have a GUI, or can be used remotely - you can even use the standard manager from a Win box to monitor a Linux box - see details on remote control in the wiki)

2. Use BoincView from a Win box

3. In linux command line from the BOINC directory, use

./boinc_cmd --get_state|less

and look for the fraction complete (presumably will stick at 0.01 ??)

4. again in command line, from the BOINC directory, use

less client_state.xml
/active
/ (repeat the search) until you get to the one you want to look at, then scroll down until you see the fraction done tag. (If you don't understand what I mean here, please see man less or info less for how to drive the less utility.)
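One way to read the same value without paging through the file by hand is sketched below. It assumes client_state.xml parses as well-formed XML and that each active task carries result_name and fraction_done tags as in the inline sample; the sample content and the result name are made up for illustration, so check them against your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
SAMPLE = """
<client_state>
  <active_task_set>
    <active_task>
      <result_name>EXAMPLE_RESULT_0</result_name>
      <fraction_done>0.010000</fraction_done>
    </active_task>
  </active_task_set>
</client_state>
"""


def fractions_done(xml_text):
    """Map each active task's result name to its fraction done."""
    root = ET.fromstring(xml_text)
    progress = {}
    for task in root.iter("active_task"):
        name = task.findtext("result_name", default="(unknown)")
        fraction = task.findtext("fraction_done")
        if fraction is not None:
            progress[name] = float(fraction)
    return progress


if __name__ == "__main__":
    for name, fraction in fractions_done(SAMPLE).items():
        print(f"{name}: {fraction:.2%}")
```

To use it on a real host, read the file from the BOINC directory and pass its text to fractions_done instead of SAMPLE.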

hope that helps someone (& feel free to borrow for any FAQ or wiki)

R~~
ID: 533
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 536 - Posted: 23 Feb 2006, 18:45:09 UTC
Last modified: 23 Feb 2006, 18:56:59 UTC

River, thanks for the suggestion. So far I had just been grepping the stdout.txt file for "pct_co" (I think R has recently changed the location and now stores it in a file named after the WU description, so I have to look inside slots/x/stdout.txt to find the "real" filename), e.g.

$ fgrep pct_comp ~boinc/BOINC/projects/ralph.bakerlab.org/BARCODE_30_4ubpA_215_6_0_0 | tail

BOINC :: [2006-02-23 18:49:47] :: num_decoys: 65 :: number_of_output: 71 :: pct_complete: 0.907516
BOINC :: [2006-02-23 18:54:16] :: num_decoys: 66 :: number_of_output: 71 :: pct_complete: 0.916776
BOINC :: [2006-02-23 19:02:34] :: num_decoys: 67 :: number_of_output: 71 :: pct_complete: 0.933897
BOINC :: [2006-02-23 19:09:26] :: num_decoys: 68 :: number_of_output: 71 :: pct_complete: 0.948104
BOINC :: [2006-02-23 19:12:11] :: num_decoys: 69 :: number_of_output: 72 :: pct_complete: 0.953801

Edit: prior example was for RALPH/R4.84, current R has stdout in /slots dir

fgrep pct_comp ~boinc/BOINC/slots/0/stdout.txt
BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882
BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625
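The log lines quoted above carry a timestamp, so the time between progress updates can be computed instead of eyeballed. The following is a sketch only: the regular expression is inferred from the lines shown here, not from any Rosetta documentation, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines; other Rosetta versions may differ.
LINE_RE = re.compile(
    r"BOINC :: \[(?P<ts>[^\]]+)\] :: .*?num_decoys: (?P<decoys>\d+)"
    r".*?pct_complete: (?P<pct>[0-9.]+)"
)

SAMPLE_LOG = """\
BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882
BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625
"""


def parse_progress(text):
    """Return (timestamp, num_decoys, pct_complete) for every matching line."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            records.append((when, int(match.group("decoys")), float(match.group("pct"))))
    return records


def update_gaps(records):
    """Seconds elapsed between consecutive progress updates."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(records, records[1:])]


if __name__ == "__main__":
    records = parse_progress(SAMPLE_LOG)
    print(update_gaps(records))
```

In this sample, pct_complete changes only when num_decoys increases, so a WU can sit at 0.01 for a long time while its first decoy is still being computed.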

ID: 536
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 712 - Posted: 28 Feb 2006, 2:46:44 UTC
Last modified: 28 Feb 2006, 2:47:58 UTC

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.


ID: 712
STE\/E

Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 715 - Posted: 28 Feb 2006, 3:40:54 UTC
Last modified: 28 Feb 2006, 4:15:58 UTC

Right now I have 12 of them stuck @ 1% ... Some of them for as long as 3 hours, and none of them are making it past the 1% mark so far. I have my preferences set to 2 hours run time, so something is not right if they're still @ 1% after 3 hours, at least I would think so anyway.

ID: 715
IceQueen41
Joined: 22 Feb 06
Posts: 6
Credit: 9,473
RAC: 0
Message 717 - Posted: 28 Feb 2006, 3:43:32 UTC - in response to Message 712.  

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.



Mine started the same way (WU, Result). It should get past the initialization sooner or later (mine took over 10 minutes on a decently fast processor), and then it only does one trajectory, which is why it appears to be stuck at 1% (mine was at 1% literally until it finished). It also seems to use a much slower algorithm with long intervals between steps, so don't restart or abort it until you know it's not going anywhere. Hope this helps!
ID: 717
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 718 - Posted: 28 Feb 2006, 4:07:30 UTC - in response to Message 716.  
Last modified: 28 Feb 2006, 4:37:51 UTC

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.




I saw one do that the other day, but it started and while it was initializing it did an application swap. This left the display just as you described it until the WU started up again.

See if this is what is happening on your system.


It is swapped out right now. We'll see what happens when it comes back. The machine is a Dual P3, 1GHz. Run-time prefs set to 4 hours.

[edit]
It's back. It has gotten past the "initializing" stage, and is on step 25000 or so. Still at 1%, but running (steps counting, graphics moving). Verrry slowwwly. I suspect debugging code has been put into it, much like when Seti Boinc first started out (and was crashing constantly).
[/edit]


ID: 718
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 721 - Posted: 28 Feb 2006, 5:11:09 UTC

stuck at 1% rosetta_beta_4.84 Linux
https://ralph.bakerlab.org/result.php?resultid=12969

*load average: 0.01, 0.09, 0.46

crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 27682 0.0 0.4 2616 1036 ? SN Feb17 0:00 /bin/bash ./yasuc.sh
boinc 24384 0.0 1.5 6244 3772 ? S Feb25 0:52 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 21886 0.0 1.0 7216 2496 ? S 01:08 0:00 /usr/sbin/sshd
boinc 21887 0.0 0.8 3500 2052 pts/1 S 01:08 0:00 -bash
boinc 22269 44.3 26.1 172160 64896 ? SN 01:53 11:20 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22270 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22271 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22372 0.0 0.2 2084 624 ? SN 02:16 0:00 sleep 600
boinc 22380 0.0 0.2 2548 672 pts/1 R 02:19 0:00 ps xu
crobertp [/home/boinc/BOINC] >
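A listing like the one above can also be checked mechanically. The sketch below parses `ps xu` text in the column layout shown (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) and picks out the rosetta processes; the function name is hypothetical, and matching on a command that starts with "rosetta" is an assumption based on this one listing.

```python
SAMPLE_PS = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 24384 0.0 1.5 6244 3772 ? S Feb25 0:52 ./boinc -redirectio -allow_remote_gui_rpc
boinc 22269 44.3 26.1 172160 64896 ? SN 01:53 11:20 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22270 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22271 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22372 0.0 0.2 2084 624 ? SN 02:16 0:00 sleep 600
"""


def rosetta_processes(ps_output):
    """Return pid, %CPU and STAT for each rosetta process in `ps xu` output."""
    found = []
    for line in ps_output.splitlines():
        # The first ten columns are single tokens; the rest is the command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue
        pid, cpu, stat, command = fields[1], fields[2], fields[7], fields[10]
        if command.startswith("rosetta"):
            found.append({"pid": int(pid), "cpu": float(cpu), "stat": stat})
    return found


if __name__ == "__main__":
    procs = rosetta_processes(SAMPLE_PS)
    # "S" = sleeping, "N" = nice (low priority), as noted earlier in the thread.
    print(all(p["stat"].startswith("S") for p in procs), procs)
```

All rosetta processes sleeping while the load average sits near zero matches the symptom reported earlier in the thread, but it is only a hint, not proof that the WU is stuck.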

Restarting boinc ...
ID: 721
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 730 - Posted: 28 Feb 2006, 12:19:07 UTC

Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.

ID: 730
STE\/E

Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 732 - Posted: 28 Feb 2006, 12:34:41 UTC - in response to Message 730.  

Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.


Yes, I would like some clarification on these v4.90's myself. Like genes asked, would setting the run time higher help get these WU's past the 1% mark or let them finish?

So far I've only had 1 v4.90 WU finish & that one only ran for 1:10:30 then just abruptly finished and Uploaded. It ran the whole time at 1% then just jumped to 100% ...
ID: 732
IceQueen41
Joined: 22 Feb 06
Posts: 6
Credit: 9,473
RAC: 0
Message 734 - Posted: 28 Feb 2006, 13:06:14 UTC - in response to Message 730.  

Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.



I don't think it's that it's not getting much done, it's that it only runs one trajectory, or model, and from what I've seen, the percentage updates primarily after a trajectory has finished. This would explain why it's on 1% until it's done.
ID: 734
STE\/E

Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 736 - Posted: 28 Feb 2006, 13:24:48 UTC - in response to Message 734.  
Last modified: 28 Feb 2006, 13:33:10 UTC

Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.



I don't think it's that it's not getting much done, it's that it only runs one trajectory, or model, and from what I've seen, the percentage updates primarily after a trajectory has finished. This would explain why it's on 1% until it's done.


How you doing IceQueen41, it's hard to tell what these v4.90 Wu's are doing, I have 1 Computer that has 1 Wu @ 5 hr's still showing 1% -- 1 Wu @ 2 hr's showing 47.95% & 1 Wu that finished @ 1 hr 11 min's never showing more than 1% ... Hard to figure them out when they run like that ... I have my Preferences set to run 2 hr's but these v4.90's don't seem to want to adhere to that Preference ... ???

PS: As I posted the above the WU that was @ 5 hr's finished @ 100% & Uploaded. Guess we just have to let them run their course & see what happens to them.

ID: 736
IceQueen41
Joined: 22 Feb 06
Posts: 6
Credit: 9,473
RAC: 0
Message 738 - Posted: 28 Feb 2006, 13:38:31 UTC - in response to Message 736.  

Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.



I don't think it's that it's not getting much done, it's that it only runs one trajectory, or model, and from what I've seen, the percentage updates primarily after a trajectory has finished. This would explain why it's on 1% until it's done.


How you doing IceQueen41, it's hard to tell what these v4.90 Wu's are doing, I have 1 Computer that has 1 Wu @ 5 hr's still showing 1% -- 1 Wu @ 2 hr's showing 47.95% & 1 Wu that finished @ 1 hr 11 min's never showing more than 1% ... Hard to figure them out when they run like that ... I have my Preferences set to run 2 hr's but these v4.90's don't seem to want to adhere to that Preference ... ???

PS: As I posted the above the WU that was @ 5 hr's finished @ 100% & Uploaded. Guess we just have to let them run their course & see what happens to them.



Hmm, so I guess that kills my theory... interesting. I've only run a couple... my prefs are set to 2 hours as well, and one ran about 1:45, and the other ran almost 6 hours. Hopefully this will get figured out soon...
ID: 738



©2024 University of Washington
http://www.bakerlab.org