Report "stuck at 1%" bugs here

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 721 - Posted: 28 Feb 2006, 5:11:09 UTC

stuck at 1% rosetta_beta_4.84 Linux
https://ralph.bakerlab.org/result.php?resultid=12969

*load average: 0.01, 0.09, 0.46

crobertp [/home/boinc/BOINC] > ps xu
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
boinc    27682  0.0  0.4   2616  1036 ?        SN   Feb17   0:00 /bin/bash ./yasuc.sh
boinc    24384  0.0  1.5   6244  3772 ?        S    Feb25   0:52 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc    21886  0.0  1.0   7216  2496 ?        S    01:08   0:00 /usr/sbin/sshd
boinc    21887  0.0  0.8   3500  2052 pts/1    S    01:08   0:00 -bash
boinc    22269 44.3 26.1 172160 64896 ?        SN   01:53  11:20 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc    22270  0.0 26.1 172160 64896 ?        SN   01:53   0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc    22271  0.0 26.1 172160 64896 ?        SN   01:53   0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc    22372  0.0  0.2   2084   624 ?        SN   02:16   0:00 sleep 600
boinc    22380  0.0  0.2   2548   672 pts/1    R    02:19   0:00 ps xu
crobertp [/home/boinc/BOINC] >

Restarting boinc ...

genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,300
RAC: 0
Message 730 - Posted: 28 Feb 2006, 12:19:07 UTC

Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.

STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 732 - Posted: 28 Feb 2006, 12:34:41 UTC - in response to Message 730.  

Yes, I would like some clarification on these v4.90s myself. As genes asked, would setting the run time higher help get these WUs past the 1% mark or let them finish?

So far I've only had 1 v4.90 WU finish & that one only ran for 1:10:30 then just abruptly finished and Uploaded. It ran the whole time at 1% then just jumped to 100% ...

IceQueen41
Joined: 22 Feb 06
Posts: 6
Credit: 9,473
RAC: 0
Message 734 - Posted: 28 Feb 2006, 13:06:14 UTC - in response to Message 730.  

I don't think it's that it's not getting much done, it's that it only runs one trajectory, or model, and from what I've seen, the percentage updates primarily after a trajectory has finished. This would explain why it's on 1% until it's done.

STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 736 - Posted: 28 Feb 2006, 13:24:48 UTC - in response to Message 734.  
Last modified: 28 Feb 2006, 13:33:10 UTC

How are you doing, IceQueen41? It's hard to tell what these v4.90 WUs are doing. I have one computer with 1 WU at 5 hrs still showing 1%, 1 WU at 2 hrs showing 47.95%, and 1 WU that finished at 1 hr 11 min never showing more than 1%. They are hard to figure out when they run like that. I have my preferences set to run 2 hrs, but these v4.90s don't seem to want to adhere to that preference ... ???

PS: As I posted the above, the WU that was at 5 hrs finished at 100% and uploaded. Guess we just have to let them run their course and see what happens to them.

IceQueen41
Joined: 22 Feb 06
Posts: 6
Credit: 9,473
RAC: 0
Message 738 - Posted: 28 Feb 2006, 13:38:31 UTC - in response to Message 736.  

Hmm, so I guess that kills my theory... interesting. I've only run a couple... my prefs are set to 2 hours as well, and one ran about 1:45, and the other ran almost 6 hours. Hopefully this will get figured out soon...

Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 741 - Posted: 28 Feb 2006, 15:27:51 UTC
Last modified: 28 Feb 2006, 15:34:27 UTC

I have one:

https://ralph.bakerlab.org/workunit.php?wuid=11108

Result: https://ralph.bakerlab.org/result.php?resultid=12738

I got it last night, and it ran for more than an hour at 1%. I opened the graphics to see what was going on, and it seemed to be "alive", with some very small wiggles and almost no movement of the curves. It was still running when I shut down before going to bed, as I usually do. I booted up again when I got up, and then went out. When I came home about 30 minutes ago, the other projects' WUs had run, so everything was reset to zero, both CPU time and percentage. It started again after I manually updated Ralph, and it seems to have started from scratch, as the CPU time has reset to zero and the percentage is at 1 again.

I have set them all to stay in memory, with the Target CPU run time set to the default (8 hours).

My computer is https://ralph.bakerlab.org/show_host_detail.php?hostid=797

But the stdout file looks interesting. David Kim, do you want me to mail it to you? It's very long, so I won't post it here.

In the graphics it looks totally dead, with no curves at all and no movement. It seems to have stopped at step 32509.

Shall I leave it running and see what happens? Or should I just put it out of its misery? :-(

EDIT:

2/28/2006 4:07:01 PM|ralph@home|Resuming result HOMSdi_homDB003_1di2__228_9_0 using rosetta_beta version 490


And it seems to be "alive" but very slow. It has now moved up to step 32521


"I'm trying to maintain a shred of dignity in this world." - Me

Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 745 - Posted: 28 Feb 2006, 16:53:23 UTC - in response to Message 741.  

And it finished without my noticing it.

Result: https://ralph.bakerlab.org/result.php?resultid=12738


[B^S] thierry@home
Joined: 15 Feb 06
Posts: 20
Credit: 17,624
RAC: 0
Message 753 - Posted: 28 Feb 2006, 22:29:17 UTC

I have a 4.90 WU that has been stuck at 1% for 1h05'. The graphics are more or less frozen. The protein shape moves a little bit every 20 seconds or so. What do I do with this WU?
I have suspended it until I know what to do.

WU number : HOMSb7_homDB005_1b72_226_2
CPU : P4 3.0Ghz HT
OS : XP SP2


Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 755 - Posted: 28 Feb 2006, 22:47:29 UTC - in response to Message 753.  
Last modified: 28 Feb 2006, 22:48:18 UTC

Continue to run it. I had two that were like that (got to step ~34,000 real quick and then appeared to stop). One of them has since completed at ~5 hrs, the other is still going at 6+ hrs. Check the graphics/screensaver and see if the steps slowly increment. The one I have that is still running has only done ~500 steps since it appeared to slow down/stop. As long as the steps continue to increment (albeit slowly), it is still running.

And BTW, the progress only showed 1% done until it finished. Then it went to 100%. Hope yours are like that.


[B^S] thierry@home
Joined: 15 Feb 06
Posts: 20
Credit: 17,624
RAC: 0
Message 756 - Posted: 28 Feb 2006, 22:53:17 UTC

OK, I've restarted it. Will see....
Thanks

Hickory Explorer [USA]
Joined: 15 Feb 06
Posts: 2
Credit: 9,562
RAC: 0
Message 757 - Posted: 1 Mar 2006, 0:04:47 UTC

I had a WU at 1% this morning. It finished while I was at work, so I didn't see it finish. It doesn't look like it completed much work in the 7.57 hours that it ran.

WU ID : 11353
WU name : HOMSdc_homDB008_1dcj__229_7
CPU : P4 3.0Ghz HT
OS : XP SP2

<core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3988759
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts

</stderr_txt>
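For anyone comparing their own results, here is a rough Python sketch that pulls the run-time preference and decoy count out of a stderr dump like the one above and compares the preference with the actual run time. The function name `parse_stderr` is made up for illustration, and reading "7.57 hours" as decimal hours (rather than 7 h 57 min) is an assumption.

```python
import re

# stderr text copied from the result above.
STDERR_TXT = """\
# random seed: 3988759
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts
"""


def parse_stderr(text):
    """Extract the run-time preference (seconds) and decoy counts, if present."""
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    return {
        "run_time_pref_s": int(pref.group(1)) if pref else None,
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
    }


info = parse_stderr(STDERR_TXT)

# Assumption: "7.57 hours" in the post means decimal hours.
actual_s = 7.57 * 3600
overrun = actual_s / info["run_time_pref_s"]

print(info)
print(f"ran about {overrun:.1f}x the preferred run time")
```

On these numbers the WU ran well past its 7200-second (2-hour) preference while producing a single decoy.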



Hickory Explorer [USA]
Joined: 15 Feb 06
Posts: 2
Credit: 9,562
RAC: 0
Message 761 - Posted: 1 Mar 2006, 2:22:32 UTC

I have a 4.90 unit on another PC that was stuck at 1%. It was on model 1 at step 34401. It had been running for 4 hours.

I stopped and restarted BOINC. When the WU restarted, it started at 0. It has been initializing now for 30+ minutes. Will let it run.

WU ID: 11340
Results ID: 12974
Result Name: HOMSdc_homDB025_1dcj__229_6_0
Computer ID: 100
CPU: Pentium M 1.73GHz
OS: XP SP2



Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 762 - Posted: 1 Mar 2006, 2:41:52 UTC - in response to Message 761.  

That's what mine did too (reset to 0:00 upon restart). It did this because the work hadn't reached a "checkpoint", as it were. Upon restart, it didn't have a place to resume from and had to begin anew. You will have to let it run longer (of the two WUs I had like that, one ran ~6 hrs, and the other is still running at 9+ hrs). Look at the screensaver/graphics and see if the steps increment (it may seem like it has stopped, but note the step, then check back later to see if it has changed). My WUs raced up to step 34,000 and then seemed to stop. The one still running has actually done 500-600 additional steps over the last 9 hours.

Good luck
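The "note the step, then check back later" test above can be written down as a small Python sketch. The step numbers are read by hand from the graphics window; nothing here talks to BOINC or Rosetta, and the names `StepSample`, `still_running` and `steps_per_hour` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepSample:
    hours: float  # elapsed hours when the step counter was read
    step: int     # step number shown in the graphics/screensaver


def still_running(earlier: StepSample, later: StepSample) -> bool:
    """A WU counts as alive if its step counter advanced between two readings."""
    return later.step > earlier.step


def steps_per_hour(earlier: StepSample, later: StepSample) -> float:
    """Average step rate between two hand-taken readings."""
    elapsed = later.hours - earlier.hours
    if elapsed <= 0:
        raise ValueError("the later sample must be taken after the earlier one")
    return (later.step - earlier.step) / elapsed


# Rough figures from this post: about step 34,000, then ~500 more steps in 9 hours.
first = StepSample(hours=0.0, step=34000)
second = StepSample(hours=9.0, step=34500)

print(still_running(first, second))
print(round(steps_per_hour(first, second), 1))
```

A positive rate, however small, means the WU is slow rather than dead.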


STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 765 - Posted: 1 Mar 2006, 13:24:26 UTC
Last modified: 1 Mar 2006, 13:58:49 UTC

How long should we let these WU's run ... ???

I have one now at over 11 hours and one at over 9 hours. Both are still at 1%, and the computers are 3.4 GHz. They should have been done by now, I would think ... ???

PS: The one WU that was @ over 9 hours finally finished @ 9:47 Hr's .. The one @ over 11 hr's is still running, now up close to 12 hours ... :0

Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 767 - Posted: 1 Mar 2006, 15:36:32 UTC - in response to Message 765.  
Last modified: 1 Mar 2006, 16:00:58 UTC

As most of you know, Ralph is a test project. While some of the testing is for developing the application itself, some is also for designing WU parameters and functions. This will cause a great deal of variation in the run length and speed of the test WUs.

The Ralph WUs are not being evaluated for science content beyond what is necessary to determine if the application is processing them correctly. However, it is important that, if they can complete a successful run, they be allowed to do so to produce valid application test results.

You should try to run all the WUs until they finish, but if one is truly hung (not running at all) after a long period of time, and you can absolutely determine that it is dead, you should report it here, restart BOINC, and see if it will run to completion. If it will not, then this is important test information.

ANY WU THAT IS RESTARTED FOR ANY REASON BEFORE IT REACHES THE FIRST CHECKPOINT WILL START OVER FROM 0%. (the first checkpoint occurs when the percent complete reaches any value GREATER than 1% complete)

Anything that removes the WU from memory before it reaches the first checkpoint is considered to be a restart. (Application swaps with "keep in memory" set to no, turning off the computer, restarting the computer, restarting BOINC, and suspending and restarting the project are all events that remove the WU from memory.)
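The restart rule above can be summarized as a toy Python model. This is only a sketch of the rule as stated in this post, not the application's actual code. The event labels are hypothetical names invented here, and the resume point after a checkpoint is assumed to equal the displayed progress, which the post does not specify.

```python
# Per the rule above: a checkpoint exists only once progress exceeds 1%.
FIRST_CHECKPOINT_THRESHOLD = 0.01

# Hypothetical labels for the events listed above as removing the WU from memory.
REMOVES_FROM_MEMORY = {
    "app_swap_without_keep_in_memory",
    "computer_off",
    "computer_restart",
    "boinc_restart",
    "project_suspend_and_restart",
}


def has_checkpoint(fraction_done: float) -> bool:
    """True once the displayed progress is strictly greater than 1%."""
    return fraction_done > FIRST_CHECKPOINT_THRESHOLD


def progress_after_event(fraction_done: float, event: str) -> float:
    """Progress shown after an event, under the rule described above."""
    if event not in REMOVES_FROM_MEMORY:
        return fraction_done  # WU stayed in memory, nothing is lost
    if not has_checkpoint(fraction_done):
        return 0.0  # no checkpoint yet, so the WU starts over
    # Assumption: the WU resumes at its displayed progress.
    return fraction_done


print(progress_after_event(0.01, "boinc_restart"))    # WU stuck at 1%
print(progress_after_event(0.4795, "boinc_restart"))  # WU already past 1%
```

Under this model, a WU showing 1% loses everything on a BOINC restart, while one showing 47.95% does not.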

The volume of production (i.e. the number of WUs run by a machine) is not important to Ralph testing. What is important is to get WUs processed on as many different platforms and configurations as possible, and to have any errors reported back. To accomplish this, the work flow for Ralph is set to restrict the number of WUs available to any particular machine. So you should not be concerned if you do not get additional work when you complete the work you have. Work will be made available as the testing requires.

In addition, there is more error and runtime information being kept and reported for Ralph than there would be in a production environment. At times this will cause the application to run slower than would otherwise be the case. Slower running is an expected side effect of testing. But it can make it appear that a WU is not running, when in fact it is just running very slowly.

As one might expect, slower running will cause a WU to run for a long period of time before reaching the first checkpoint (greater than 1% complete). Under these conditions the WUs will run significantly longer than the user-adjustable time setting in the preferences. When this happens the WU may indicate 1% complete for a very long time and suddenly jump to 100% and finish. While this may look unusual, it is in fact an expected and known side effect of the current testing process.
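One way to picture the overrun is the Python sketch below. It rests on an assumption that is not confirmed in this thread: that the run-time preference is only consulted when a model (trajectory) finishes. The function `simulate_run` and the model durations are invented for illustration.

```python
def simulate_run(model_hours, run_time_pref_hours):
    """Return (total_hours, models_done).

    Assumption: the time preference is checked only after a model completes,
    so a single slow model can overrun the preference by any amount.
    """
    elapsed = 0.0
    models_done = 0
    for hours in model_hours:
        elapsed += hours
        models_done += 1
        if elapsed >= run_time_pref_hours:
            break  # preference reached, stop after this model
    return elapsed, models_done


# Fast models: the run stops close to the 2-hour preference.
print(simulate_run([0.5] * 10, 2.0))

# One very slow model: the run lasts as long as that model takes.
print(simulate_run([20.0, 20.0], 2.0))
```

Under that assumption, a WU whose first model takes 20 hours would sit at 1% the whole time and then finish with a single model, whatever the preference says.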

If possible you should try NOT to abort Ralph WUs unless they simply cannot be made to run to completion, or you receive notification from the project to abort them. As new test versions are released in Ralph, you will periodically receive instructions to abort certain classes of WU to flush the system for the next series of tests. You may also be asked to reset Ralph occasionally. When this happens, it may take some time for your system to adjust and begin receiving work again. This is also a normal and expected side effect of the testing.


STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 768 - Posted: 1 Mar 2006, 15:45:27 UTC

As far as I can determine, the WU that was over 11 hours is still running according to the Process Manager. It shows 50% usage of the CPU for that WU; it's still running and at 13:30 hours now. I'll let it continue to run, see what happens to it, and report back on it one way or the other ...

[B^S] Dr. Bill Skiba
Joined: 15 Feb 06
Posts: 4
Credit: 6,496
RAC: 0
Message 774 - Posted: 1 Mar 2006, 19:39:06 UTC

I just aborted this wu. https://ralph.bakerlab.org/result.php?resultid=12982.

It reset itself to "0" time several times (yes, it was left in memory). I shut down BOINC, restarted the system and encountered the same behavior. After 3 more restarts from "0" time I gave up on it.

Bruno G. Olsen & ESEA @ greenholt
Joined: 16 Feb 06
Posts: 4
Credit: 45,078
RAC: 0
Message 775 - Posted: 1 Mar 2006, 20:21:40 UTC

work unit: https://ralph.bakerlab.org/workunit.php?wuid=11591
result: https://ralph.bakerlab.org/result.php?resultid=13442
host: https://ralph.bakerlab.org/show_host_detail.php?hostid=285

has been running for 1 hour and 44 minutes and reports 6 hours left

STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,214,911
RAC: 0
Message 778 - Posted: 1 Mar 2006, 22:44:45 UTC - in response to Message 768.  
Last modified: 1 Mar 2006, 22:47:09 UTC

PS: This WU just finally did finish successfully @ the 20:41 Hour Mark, it never did show more than 1% finished the whole time it ran ... :)