Report "stuck at 1%" bugs here

Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here

STE\/E

Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 362 - Posted: 20 Feb 2006, 11:00:16 UTC
Last modified: 20 Feb 2006, 11:01:22 UTC

I finally have (or had) a WU hang @ 1% for 6:30:20. I restarted BOINC to see if it would get past the 1% mark, but it has now run for 30 minutes and it's still hung @ 1%. I suspended it for now to run out the rest of the WUs on that computer.
ID: 362
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 375 - Posted: 20 Feb 2006, 18:25:41 UTC - in response to Message 346.  
Last modified: 20 Feb 2006, 18:50:25 UTC


*Isn't it better if I abort it now?

*What I expected from him was instructions on how to do an interactive trace
of the run, step by step

-or- on using Dr. Watson to get a memory dump of my 512 MB of RAM and e-mailing him
that dump

*Never a brute-force test -:(

Carlos, you have a winner there, please don't abort it, keep it in memory, you may have the WU we testers need to fix this. I'd wait until instructed what to do next. Remember it's Sunday. Leaving Ralph or that WU suspended is important to Ralph and is the whole reason Ralph even exists.

I wish I had what you have, I really do.

tony


Calm down, I have a no-break (UPS); unless a power failure lasts more than half an hour,
this WU will be kept in RAM ad infinitum
*IF my diesel generator weren't being repaired, I could even guarantee that!

ps: I posted in this thread the "random seed" as well as other
technical details for this job.
*for example:
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: .1elwA.cst
Running without distance constraints
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
Searched for: .1elwA.dpl
Dipolar constraints will not be used

*thus, even if a system reboot occurs,
I am fairly sure of being able to reproduce the problem again -:)
-constant_seed -jran 3995108


Update
A lorry knocked down a utility pole, and the electric company
took more than half an hour to make the repair

After power was restored and the machine rebooted, I resumed that WU ...
Now it is at:
cpu time: 0 hr 27 min 43 sec
18.7% Complete
Stage: Full Atom Relax
Model: 6 Step: 34000
Accepted RMSD: 11.27
Accepted Energy: -227.6638

I regret that we lost the "Captured Bug"

There remain, however, the WU(s) of John McLeod and the one of PoorBoy,
each with a "Captured Bug"


Does anyone know WHY no additional instructions have been sent to any of the
three of us until now ???


This way, it is better to unconditionally abort
any WU that stays at 1% for more than 5 minutes -:(
*Maybe I will write a script to do this automatically, or to restart BOINC
automatically, so we do not lose days of CPU power on an endless loop
ID: 375
Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 378 - Posted: 20 Feb 2006, 18:30:09 UTC
Last modified: 20 Feb 2006, 18:59:27 UTC

To everyone on this thread with a stuck/hung WU that is Suspended!

David Kim e-mailed me the following instructions for ALL of you -

"If users suspect they have hit the 1% bug, they should let it continue for a few hours or even a day on Ralph and then restart BOINC to see if it continues on after a restart. They should also post the result ID on the forum so we can look at them when we get a chance and explain what happens.

Thanks!

David K"


This comes directly from the Project Team. Please record what you can and post it with links here BRIEFLY.

I will be moving a few of the previous messages to the 1% hang discussion thread, to keep this thread trim. So if you posted something here that did not contain specific reporting information about a hung WU and can't find it, look there.



Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
ID: 378
Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 381 - Posted: 20 Feb 2006, 18:42:07 UTC - in response to Message 375.  
Last modified: 20 Feb 2006, 18:44:27 UTC

...Does anyone know WHY no additional instructions have been sent to any of the
three of us until now ??? ...



Instructions have been provided. You should follow instructions provided by David Kim, David Baker, or any of the forum Moderators.

The Moderators are in direct contact with the project team and have been given the required guidance. Thank you.


ID: 381
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 395 - Posted: 20 Feb 2006, 20:43:12 UTC


You can no longer edit this post.
Posts can only be edited at most 60 minutes after they have been created.

Is it possible to increase that time?


This way, it is better to unconditionally abort
any WU that stays at 1% for more than 5 minutes -:(
*Maybe I will write a script to do this automatically, or to restart BOINC
automatically, so we do not lose days of CPU power on an endless loop


I realized that 5 minutes is too short ... I have one WU that has been at 1% for 43 minutes, and this is not the 1% bug

*How to find out IF a WU is stuck
Select it and press "Show graphics"

If all the graphics are frozen and the only thing moving is the CPU time:
after you watch the graphics for, say, 5 minutes, you can conclude that the WU is stuck.

*IF the CPU time is not moving, that WU is either suspended or paused -> do not post those

Else, WU is really stuck, Then I suggest posting:
cpu time
stage
model
step
Accepted RMSD
Accepted Energy

rosetta version:
workunit:
*These last two are shown in the header of the graphics screen
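The check described above can be sketched in a few lines of Python. This is only an illustration and not anything the project ships: `Sample`, `classify` and the returned labels are hypothetical names, the two inputs (CPU time and the Step counter) are assumed to be read off the graphics window by hand, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a WU, read by hand from the graphics window."""
    cpu_seconds: float  # CPU time shown for the WU
    step: int           # "Step" counter shown in the graphics


def classify(samples):
    """Classify a WU from observations taken a few minutes apart.

    Hypothetical heuristic following the post: CPU time that does not
    move means suspended/paused; CPU time that moves while the step
    counter stays frozen means stuck.
    """
    if len(samples) < 2:
        return "unknown"
    first, last = samples[0], samples[-1]
    if last.cpu_seconds <= first.cpu_seconds:
        return "suspended or paused"  # do not report these
    if last.step == first.step:
        return "stuck"                # worth reporting
    return "running"


# Two observations five minutes apart (illustrative values only).
observed = [Sample(cpu_seconds=1200.0, step=34000),
            Sample(cpu_seconds=1500.0, step=34000)]
print(classify(observed))
```

Only when the result is "stuck" would the report fields listed above be worth posting.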

Well,
then do what Dr. Kim asked for:
kill BOINC,
wait some time for the RAM to clear,
and start BOINC again

If, after this procedure, that WU remains stuck,
suspend it, post the results, and wait for additional instructions,
which hopefully will be sent before your PC reboots again for some reason -:(

IMHO: I believe that needing to restart BOINC to let Rosetta get past its stuck point is a bug !!!
However, this is what Dr. Kim recommends.
*Other apps do not need BOINC to be restarted.
ID: 395
Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 396 - Posted: 20 Feb 2006, 20:48:52 UTC - in response to Message 395.  
Last modified: 20 Feb 2006, 20:52:20 UTC

...
IMHO: I believe that needing to restart BOINC to let Rosetta get past its stuck point is a bug !!!
However, this is what Dr. Kim recommends.
*Other apps do not need BOINC to be restarted.



Of course it is a bug. That is what this project is trying to fix. The purpose of restarting and letting it run to the finish is so the WU will report in and they can look at it, not to work around the issue itself. They want the error data that reports back with the WU. That is why they want you to post a BRIEF explanation of the problem, and a LINK to the result. NEVER ABORT THE WUs. ALWAYS LET THEM RUN UNTIL DONE OR UNTIL THEY CRASH ON THEIR OWN AND REPORT BACK.

ID: 396
John McLeod VII
Joined: 16 Feb 06
Posts: 8
Credit: 39,560
RAC: 0
Message 400 - Posted: 20 Feb 2006, 21:32:42 UTC

The one that I reported continues on once restarted. The report will be up in a while.


ID: 400
Snake Doctor

Joined: 16 Feb 06
Posts: 37
Credit: 998,880
RAC: 0
Message 411 - Posted: 21 Feb 2006, 4:29:28 UTC
Last modified: 21 Feb 2006, 4:48:16 UTC

I don't know if I should report this here but I just had a 1% hang in the Rosetta Project App version 4.82. The info for the WU is posted here.

EDIT: Oops, I just found one in RALPH too. This one hung at 4.25%. Both of these are on Mac OS 10.4.5, and both machines are G4s: one is a laptop, one is a dual desktop. The dual is running application 4.83. I reset the time parameter because my system went into EDF because I was testing the longer deadlines. When I changed that, one of the two WUs I had finished and uploaded, and this one stopped running for an app swap. I will watch it when it restarts.

The WU is here, and the result will be here when it reports.
ID: 411
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 481 - Posted: 22 Feb 2006, 15:38:19 UTC

Stuck at 1%
https://ralph.bakerlab.org/result.php?resultid=5967

*Computer IDLE -> load average: 0.00, 0.00, 0.00

Restarted boinc

ID: 481
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 506 - Posted: 22 Feb 2006, 20:52:14 UTC

The result from my previous post has now finished, uploaded, and been reported

You can no longer edit this post.
Posts can only be edited at most 60 minutes after they have been created


If this limit were more than 60 minutes, I could have included
this information in that post itself. Thanks
ID: 506
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 515 - Posted: 23 Feb 2006, 3:40:49 UTC - in response to Message 481.  

Stuck at 1%
https://ralph.bakerlab.org/result.php?resultid=5967

*Computer IDLE -> load average: 0.00, 0.00, 0.00

Restarted boinc


Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?

I've had several WUs "stuck" (ps shows "SN"=sleeping,nice for R task - I got it right this time :-)) under Linux. CPU is idle, as per your example and BOINC queue will freeze until I kill R task.

Common things between your setup and mine are BOINC v5.2.14 (optimized), Linux kernel 2.4.x (x=27 in my case) and just 256MB RAM.

In those cases, I just kill the R process and let BOINC restart it. Every time but one, the WU completed fine (obviously with a different seed). Oddly, I've NEVER had R get stuck under WinXP.

A month or more ago, I followed dekim/baker instructions, to run R from cmd-line by hand with same WU/seed that bombed under BOINC, in R own dir, and it finished OK.
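For readers unfamiliar with the `SN` code mentioned above, here is a small Python sketch that decodes the STAT column printed by Linux `ps`. The letter meanings follow the `ps` manual page; `decode_stat`, `looks_hung` and the 0.05 load threshold are hypothetical and only mirror the symptom described here (a sleeping science task on an idle host).

```python
# Main process states and flags as documented in the Linux ps man page.
MAIN_STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}
FLAGS = {
    "N": "low priority (nice)",
    "<": "high priority",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def decode_stat(stat):
    """Split a STAT string such as 'SN' into main state and flags."""
    main = MAIN_STATES.get(stat[:1], "unknown")
    flags = [FLAGS.get(ch, "unknown flag") for ch in stat[1:]]
    return main, flags


def looks_hung(stat, load_average, idle_threshold=0.05):
    """Hypothetical heuristic: a compute task that is asleep while the
    host load average is near zero is probably not doing any work."""
    return stat[:1] == "S" and load_average < idle_threshold


print(decode_stat("SN"))
print(looks_hung("SN", 0.00))
```

A healthy Rosetta task would normally show `RN` (runnable, niced) with a load average near the number of busy CPUs.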


ID: 515
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 520 - Posted: 23 Feb 2006, 8:35:25 UTC
Last modified: 23 Feb 2006, 8:42:46 UTC

Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?

Yellow line on my boincview, miles away from host

Oddly, I've NEVER had R get stuck under WinXP

See this one, then: stuck at 1% - new Rosetta 4.82, released Feb 18, 2006
https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712

06:30:59 hours of CPU time at 100% on a Pentium IV 1800 MHz (stock speed)

I have restarted BOINC twice by now, to no avail - still stuck at 1%

*I have rebooted too!

Help!
ID: 520
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 530 - Posted: 23 Feb 2006, 14:51:36 UTC

still stuck at 1%
CPU Time 12:36:44 p4 1.8G stock speed - HELP !
https://boinc.bakerlab.org/rosetta/result.php?resultid=11877712
ID: 530
River~~

Joined: 20 Feb 06
Posts: 20
Credit: 503
RAC: 0
Message 533 - Posted: 23 Feb 2006, 17:16:15 UTC - in response to Message 515.  
Last modified: 23 Feb 2006, 17:17:38 UTC

... how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?


There are various ways:

1. Use the BOINC Manager (this works locally if you have a GUI, or can be used remotely - you can even use the standard manager from a Win box to monitor a Linux box - see details on remote control in the wiki)

2. Use BoincView from a Win box

3. In linux command line from the BOINC directory, use

./boinc_cmd --get_state|less

and look for the fraction complete (presumably will stick at 0.01 ??)

4. again in command line, from the BOINC directory, use

less client_state.xml
/active
/ (repeated) until you get to the task you want to look at, then scroll down until you see the fraction done tag. (If you don't understand what I mean here, please see man less or info less for how to drive the less utility.)
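The manual search in method 4 could also be scripted. The Python sketch below is a rough illustration under stated assumptions: the tag names `active_task`, `result_name`, `fraction_done` and `current_cpu_time` are assumed to match the layout of client_state.xml on your client version, and the embedded sample document is made up so the example can run without a BOINC installation.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for client_state.xml; the real file is much larger
# and its exact layout is an assumption here.
SAMPLE = """<client_state>
  <active_task_set>
    <active_task>
      <result_name>EXAMPLE_RESULT_0</result_name>
      <fraction_done>0.010000</fraction_done>
      <current_cpu_time>23420.000000</current_cpu_time>
    </active_task>
  </active_task_set>
</client_state>"""


def active_tasks(xml_text):
    """Return (result name, fraction done, CPU seconds) per active task."""
    root = ET.fromstring(xml_text)
    tasks = []
    for task in root.iter("active_task"):
        name = task.findtext("result_name", default="?")
        fraction = float(task.findtext("fraction_done", default="0"))
        cpu = float(task.findtext("current_cpu_time", default="0"))
        tasks.append((name, fraction, cpu))
    return tasks


for name, fraction, cpu in active_tasks(SAMPLE):
    print(f"{name}: {fraction:.0%} done after {cpu / 3600:.1f} h of CPU time")
```

To use it on a real host, read the file from the BOINC directory and pass its contents to `active_tasks` in place of `SAMPLE`.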

hope that helps someone (& feel free to borrow for any FAQ or wiki)

R~~
ID: 533
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 536 - Posted: 23 Feb 2006, 18:45:09 UTC
Last modified: 23 Feb 2006, 18:56:59 UTC

River, thanks for the suggestion. So far I had just been grepping the stdout.txt file for "pct_co" (I think R has recently changed the location and now stores it as the WU description, so I have to look inside slots/x/stdout.txt to find the "real" filename), e.g.

$ fgrep pct_comp ~boinc/BOINC/projects/ralph.bakerlab.org/BARCODE_30_4ubpA_215_6_0_0 | tail

BOINC :: [2006-02-23 18:49:47] :: num_decoys: 65 :: number_of_output: 71 :: pct_complete: 0.907516
BOINC :: [2006-02-23 18:54:16] :: num_decoys: 66 :: number_of_output: 71 :: pct_complete: 0.916776
BOINC :: [2006-02-23 19:02:34] :: num_decoys: 67 :: number_of_output: 71 :: pct_complete: 0.933897
BOINC :: [2006-02-23 19:09:26] :: num_decoys: 68 :: number_of_output: 71 :: pct_complete: 0.948104
BOINC :: [2006-02-23 19:12:11] :: num_decoys: 69 :: number_of_output: 72 :: pct_complete: 0.953801

Edit: prior example was for RALPH/R4.84, current R has stdout in /slots dir

fgrep pct_comp ~boinc/BOINC/slots/0/stdout.txt
BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882
BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625
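The same log lines can be checked for a stall automatically. The Python sketch below assumes the line format shown in the output above (a bracketed timestamp followed somewhere by `pct_complete: <number>`); `parse_progress` and `minutes_without_progress` are hypothetical helper names, and the "now" value in the usage lines is chosen only for illustration.

```python
import re
from datetime import datetime

# Matches the timestamp and pct_complete value of lines like those above.
LINE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*pct_complete: ([0-9.]+)"
)

SAMPLE_LINES = [
    "BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01",
    "BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882",
    "BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625",
]


def parse_progress(lines):
    """Extract (timestamp, pct_complete) pairs; other lines are skipped."""
    points = []
    for line in lines:
        match = LINE.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            points.append((when, float(match.group(2))))
    return points


def minutes_without_progress(points, now):
    """Minutes elapsed between the last change in pct_complete and `now`."""
    if not points:
        return None
    last_change = points[0][0]
    for (_, previous), (when, current) in zip(points, points[1:]):
        if current != previous:
            last_change = when
    return (now - last_change).total_seconds() / 60


points = parse_progress(SAMPLE_LINES)
print(minutes_without_progress(points, datetime(2006, 2, 23, 21, 0, 11)))
```

A WU that logs only the initial `pct_complete: 0.01` line and nothing further for hours would match the "stuck at 1%" symptom, though a long single trajectory can look the same from the log alone.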

ID: 536
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,300
RAC: 0
Message 712 - Posted: 28 Feb 2006, 2:46:44 UTC
Last modified: 28 Feb 2006, 2:47:58 UTC

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.


ID: 712
STE\/E

Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 715 - Posted: 28 Feb 2006, 3:40:54 UTC
Last modified: 28 Feb 2006, 4:15:58 UTC

Right now I have 12 of them stuck @ 1% ... Some of them for as long as 3 hours, and none of them are making it past the 1% mark so far. I have my preferences set to a 2-hour run time, so something is not right if they're still @ 1% after 3 hours, at least I would think so anyway.

ID: 715
Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 716 - Posted: 28 Feb 2006, 3:41:56 UTC - in response to Message 712.  

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.




I saw one do that the other day, but it started and while it was initializing it did an application swap. This left the display just as you described it until the WU started up again.

See if this is what is happening on your system.

ID: 716
IceQueen41
Joined: 22 Feb 06
Posts: 6
Credit: 9,473
RAC: 0
Message 717 - Posted: 28 Feb 2006, 3:43:32 UTC - in response to Message 712.  

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.



Mine started the same way (WU, Result). It should get past the initialization sooner or later (mine took over 10 minutes on a decently fast processor), and then it only does one trajectory, which is why it appears to be stuck at 1% (mine was at 1% literally until it finished). It also seems to use a much slower algorithm with long intervals between steps, so don't restart or abort it until you know it's not going anywhere. Hope this helps!
ID: 717
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,300
RAC: 0
Message 718 - Posted: 28 Feb 2006, 4:07:30 UTC - in response to Message 716.  
Last modified: 28 Feb 2006, 4:37:51 UTC

I don't know how long it's supposed to spend "initializing", but I got a new 4.90 WU which has been initializing (at 1%, with the dots blinking) now for over 30 minutes. There is a molecule in the "native" box and one in the "searching" box, but the other boxes are empty. The lines defining the edges of the boxes are also oddly shaped, on the empty boxes the upper right corners are folded down and to the left.

This machine has "leave in memory" set to YES. I'll let it keep running, we'll see what happens.




I saw one do that the other day, but it started and while it was initializing it did an application swap. This left the display just as you described it until the WU started up again.

See if this is what is happening on your system.


It is swapped out right now. We'll see what happens when it comes back. The machine is a Dual P3, 1GHz. Run-time prefs set to 4 hours.

[edit]
It's back. It has gotten past the "initializing" stage, and is on step 25000 or so. Still at 1%, but running (steps counting, graphics moving). Verrry slowwwly. I suspect debugging code has been put into it, much like when Seti Boinc first started out (and was crashing constantly).
[/edit]


ID: 718



©2024 University of Washington
http://www.bakerlab.org