Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here
Author | Message |
---|---|
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Stuck at 1%: rosetta_beta_4.84, Linux
https://ralph.bakerlab.org/result.php?resultid=12969
*load average: 0.01, 0.09, 0.46
crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 27682 0.0 0.4 2616 1036 ? SN Feb17 0:00 /bin/bash ./yasuc.sh
boinc 24384 0.0 1.5 6244 3772 ? S Feb25 0:52 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 21886 0.0 1.0 7216 2496 ? S 01:08 0:00 /usr/sbin/sshd
boinc 21887 0.0 0.8 3500 2052 pts/1 S 01:08 0:00 -bash
boinc 22269 44.3 26.1 172160 64896 ? SN 01:53 11:20 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22270 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22271 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22372 0.0 0.2 2084 624 ? SN 02:16 0:00 sleep 600
boinc 22380 0.0 0.2 2548 672 pts/1 R 02:19 0:00 ps xu
crobertp [/home/boinc/BOINC] >
Restarting boinc ... |
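A note on reading the `ps xu` listing in the post above: the first `rosetta_beta` process shows 44.3% CPU and 11:20 of accumulated CPU time, which suggests the worker was still computing rather than sitting idle. The Python sketch below shows one way to pull that out of such a listing. It is illustrative only: the sample lines are copied from the post, and the helper names `parse_ps_line` and `busiest_process` are made up for this example.

```python
# Illustrative sketch: parse `ps xu`-style lines and find the busiest matching process.
# Column layout assumed: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
SAMPLE = """\
boinc 22269 44.3 26.1 172160 64896 ? SN 01:53 11:20 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22270 0.0 26.1 172160 64896 ? SN 01:53 0:00 rosetta_beta_4.84_i686-pc-linux-gnu xx 1dcj _ -abrelax -string
boinc 22372 0.0 0.2 2084 624 ? SN 02:16 0:00 sleep 600
"""


def parse_ps_line(line):
    """Split one line into its 11 columns; the command keeps its own spaces."""
    fields = line.split(None, 10)
    return {
        "pid": int(fields[1]),
        "cpu": float(fields[2]),
        "time": fields[9],
        "command": fields[10],
    }


def busiest_process(text, name_prefix):
    """Return the matching process with the highest %CPU, or None if none match."""
    rows = [parse_ps_line(line) for line in text.splitlines() if line.strip()]
    matches = [row for row in rows if row["command"].startswith(name_prefix)]
    return max(matches, key=lambda row: row["cpu"]) if matches else None


worker = busiest_process(SAMPLE, "rosetta_beta")
print(worker["pid"], worker["cpu"], worker["time"])
```

A non-zero %CPU with a growing TIME column is only a hint that the work unit is alive; it says nothing about how close it is to finishing.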
genes Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done. |
STE\/E Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
> Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.
Yes, I would like some clarification on these v4.90's myself. As genes asked, would setting the run time higher help get these WU's past the 1% mark, or let them finish? So far I've only had one v4.90 WU finish, and that one only ran for 1:10:30, then abruptly finished and uploaded. It ran the whole time at 1%, then just jumped to 100% ... |
IceQueen41 Joined: 22 Feb 06 Posts: 6 Credit: 9,473 RAC: 0 |
> Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.
I don't think it's that it's not getting much done, it's that it only runs one trajectory, or model, and from what I've seen, the percentage updates primarily after a trajectory has finished. This would explain why it's on 1% until it's done. |
STE\/E Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
> Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.
How are you doing, IceQueen41? It's hard to tell what these v4.90 WU's are doing. I have one computer with one WU at 5 hours still showing 1%, one WU at 2 hours showing 47.95%, and one WU that finished at 1 hr 11 min without ever showing more than 1% ... Hard to figure them out when they run like that. I have my preferences set to run 2 hours, but these v4.90's don't seem to want to adhere to that preference ... ???
PS: As I posted the above, the WU that was at 5 hours finished at 100% and uploaded. Guess we just have to let them run their course and see what happens to them. |
IceQueen41 Joined: 22 Feb 06 Posts: 6 Credit: 9,473 RAC: 0 |
> Question: Should we set our run-time preference higher for these 4.90 WU's? Since they seem to be running slowly (due to debugging code maybe?) they aren't going to get much done in the recommended 2 hours. I have mine set at 4 hours for my P3 machines and even they aren't getting much done.
Hmm, so I guess that kills my theory... interesting. I've only run a couple... my prefs are set to 2 hours as well, and one ran about 1:45, and the other ran almost 6 hours. Hopefully this will get figured out soon... |
Fuzzy Hollynoodles Joined: 19 Feb 06 Posts: 37 Credit: 2,089 RAC: 0 |
I have one: https://ralph.bakerlab.org/workunit.php?wuid=11108
Result: https://ralph.bakerlab.org/result.php?resultid=12738
I got it last night, and it ran for more than an hour at 1%. I opened the graphic to see what was going on, and it seemed to be "alive", with some very small wiggles and almost no movement of the curves. It was still running when I shut down before I went to bed, as I usually do, and I booted up again when I got up and went out. When I came home again about 30 minutes ago, the other projects' WU's had run, so everything was reset to zero, both CPU time and percentage. It started again after I manually updated Ralph, and it seems to have started from scratch, as the CPU time has reset to zero and the percentage is at 1 again.
I have set them all to stay in memory, with the Target CPU run time set to default (8 hours). My computer is https://ralph.bakerlab.org/show_host_detail.php?hostid=797
But the stdout file looks interesting. David Kim, do you want me to mail it to you? It's very long, so I won't post it here.
In the graphic it looks totally dead, with no curves at all and no movement. It seems stopped at step 32509. Shall I leave it running and see what happens? Or should I just put it out of its misery? :-(
EDIT: 2/28/2006 4:07:01 PM|ralph@home|Resuming result HOMSdi_homDB003_1di2__228_9_0 using rosetta_beta version 490
And it seems to be "alive" but very slow. It has now moved up to step 32521.
"I'm trying to maintain a shred of dignity in this world." - Me |
Fuzzy Hollynoodles Joined: 19 Feb 06 Posts: 37 Credit: 2,089 RAC: 0 |
> I have one:
And it finished without my noticing. Result: https://ralph.bakerlab.org/result.php?resultid=12738
"I'm trying to maintain a shred of dignity in this world." - Me |
[B^S] thierry@home Joined: 15 Feb 06 Posts: 20 Credit: 17,624 RAC: 0 |
I have a 4.90 WU stuck at 1% for 1h05'. The graphics are more or less frozen. The protein shape moves a little bit every +/- 20 seconds. What do I do with this WU? I have suspended it until I know what to do.
WU number: HOMSb7_homDB005_1b72_226_2
CPU: P4 3.0 GHz HT
OS: XP SP2 |
Stargazer257 Joined: 16 Feb 06 Posts: 6 Credit: 17,492 RAC: 0 |
> I have a WU 4.90 stuck at 1% for 1h05'. The graphics are more or less frozen. The protein shape moves a little bit every +/- 20 seconds. What do I do with this WU?
Continue to run it. I had two that were like that (got to step ~34,000 real quick and then appeared to stop). One of them has since completed at ~5 hrs; the other is still going at 6+ hrs. Check the graphics/screensaver and see if the steps slowly increment. The one I have that is still running has only done ~500 steps since it appeared to slow down/stop. As long as the steps continue to increment (albeit slowly), it is still running. And BTW, the progress only showed 1% done until it finished. Then it went to 100%. Hope yours are like that. |
[B^S] thierry@home Joined: 15 Feb 06 Posts: 20 Credit: 17,624 RAC: 0 |
OK, I've restarted it. Will see.... Thanks |
Hickory Explorer [USA] Joined: 15 Feb 06 Posts: 2 Credit: 9,562 RAC: 0 |
I had a WU at 1% this morning. It finished while I was at work, so I didn't see it finish. It doesn't look like it completed much work in the 7.57 hours that it ran.
WU ID: 11353
WU name: HOMSdc_homDB008_1dcj__229_7
CPU: P4 3.0 GHz HT
OS: XP SP2
<core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3988759
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts
</stderr_txt> |
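To put the numbers in the post above side by side: the stderr output reports `cpu_run_time_pref: 7200` (seconds, i.e. the 2-hour preference), while the work unit ran for 7.57 hours. The Python sketch below extracts the preference from the stderr text and computes how far the run overshot it. It assumes that "7.57 hours" means decimal hours, and the function names `run_time_pref_seconds` and `overrun_factor` are hypothetical, not part of BOINC or Rosetta.

```python
import re

# stderr text as quoted in the post above
STDERR = """\
# random seed: 3988759
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts
"""


def run_time_pref_seconds(stderr_text):
    """Return the cpu_run_time_pref value in seconds, or None if it is absent."""
    match = re.search(r"^# cpu_run_time_pref:\s*(\d+)", stderr_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def overrun_factor(actual_hours, pref_seconds):
    """Ratio of actual run time to the preferred run time."""
    return actual_hours * 3600 / pref_seconds


pref = run_time_pref_seconds(STDERR)
# Assumption: 7.57 is decimal hours, not 7 h 57 min.
print(pref, round(overrun_factor(7.57, pref), 1))  # 7200 3.8
```

Under that assumption the single model took a little under four times the preferred run time, which is in line with the other reports in this thread of one long trajectory outlasting the preference.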
Hickory Explorer [USA] Joined: 15 Feb 06 Posts: 2 Credit: 9,562 RAC: 0 |
I have a 4.90 unit on another PC that was stuck at 1%. It was on model 1 at step 34401. It had been running for 4 hours. I stopped and restarted BOINC. When the WU restarted, it started at 0. It has been initializing now for 30+ minutes. I will let it run.
WU ID: 11340
Results ID: 12974
Result Name: HOMSdc_homDB025_1dcj__229_6_0
Computer ID: 100
CPU: Pentium M 1.73 GHz
OS: XP SP2 |
Stargazer257 Joined: 16 Feb 06 Posts: 6 Credit: 17,492 RAC: 0 |
> Have a 4.90 unit on another PC that was stuck at 1%. It was on model 1 at step 34401. It had been running for 4 hours.
That's what mine did too (reset to 0:00 upon restart). The reason is that the work hadn't reached a "checkpoint", as it were. Upon reboot, it didn't have a place to start from and had to begin anew. You will have to let it run longer (of the two WU's I had like that, one ran ~6 hrs, and the other is still running at 9+ hrs). Look at the screensaver/graphic and see if the steps increment (it may seem like it is stopped, but check the step, then check back later to see if it has changed). My WU's raced up to step 34,000, then seemed to stop. The one still running has actually done 500-600 additional steps over the last 9 hours. Good luck. |
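The check described above (note the step number in the graphics, come back later, compare) can be written down as a small Python sketch. This is only an illustration of the reasoning: the step counts are read by eye from the screensaver, the function names are hypothetical, and the figure of 540 steps is an arbitrary value inside the "500-600 steps over 9 hours" range mentioned in the post.

```python
def is_progressing(step_before, step_after):
    """A work unit counts as alive if the step counter moved forward between checks."""
    return step_after > step_before


def steps_per_hour(step_before, step_after, hours_elapsed):
    """Average step rate between two manual readings of the step counter."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return (step_after - step_before) / hours_elapsed


# Readings reported earlier in this thread: step 32509, later step 32521.
print(is_progressing(32509, 32521))        # True: slow, but still moving

# Illustrative: 540 extra steps after step 34,000 over 9 hours.
print(steps_per_hour(34000, 34540, 9))     # 60.0
```

A counter that has not changed between two readings taken well apart is the case worth reporting; a slow but non-zero rate means the work unit is still running.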
STE\/E Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
How long should we let these WU's run ... ??? I have one now at over 11 hours and one at over 9 hours. Both are still at 1%, and the computers are 3.4 GHz. They should have been done by now, I would think ... ???
PS: The WU that was at over 9 hours finally finished at 9:47 hours. The one at over 11 hours is still running, now up close to 12 hours ... :0 |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
> How long should we let these WU's run ... ???
As most of you know, Ralph is a test project. While some of the testing is for developing the application itself, some is also for designing WU parameters and functions. This will cause a great deal of variation in the run length and speed of the test WUs. The Ralph WUs are not being evaluated for science content beyond what is necessary to determine if the application is processing them correctly. However, it is important that, if they can complete a successful run, they be allowed to do so to produce valid application test results.
You should try to run all the WUs until they finish, but if one is truly hung (not running at all) after a long period of time, and you can absolutely determine that it is dead, you should report it here, restart BOINC and see if it will run to completion. If it will not, then this is important test information.
ANY WU THAT IS RESTARTED FOR ANY REASON BEFORE IT REACHES THE FIRST CHECKPOINT WILL START OVER FROM 0%. (The first checkpoint occurs when the percent complete reaches any value GREATER than 1% complete.) Anything that removes the WU from memory before it reaches the first checkpoint is considered to be a restart. (Application swaps with "keep in memory" set to no, turning off the computer, restarting the computer, restarting BOINC, and suspending and restarting the project are all events that remove the WU from memory.)
The volume of production (i.e. the number of WUs run by a machine) is not important to Ralph testing. What is important is to get WUs processed on as many different platforms and configurations as possible, and to have any errors reported back. To accomplish this, the work flow for Ralph is set to restrict the number of WUs available to any particular machine. So you should not be concerned if you do not get additional work when you complete any work you have. Work will be made available as the testing requires.
In addition, there is more error and runtime information being kept and reported for Ralph than there would be in a production environment. At times this will cause the application to run slower than would otherwise be the case. Slower running is an expected side effect of testing, but it can make it appear that a WU is not running when in fact it is just running very slowly. As one might expect, slower running will cause a WU to run for a long period of time before reaching the first checkpoint (greater than 1% complete). Under these conditions the WUs will run significantly longer than the user-adjustable time setting in the preferences. When this happens the WU may indicate 1% complete for a very long time and suddenly jump to 100% and finish. While this may look unusual, it is in fact an expected and known side effect of the current testing process.
If possible, you should try NOT to abort Ralph WUs unless they simply cannot be made to run to completion, or you receive notification from the project to abort them. As new test versions are released in Ralph, you will periodically receive instructions to abort certain classes of WU to flush the system for the next series of tests. You may also be asked to reset Ralph occasionally. When this happens, it may take some time for your system to adjust and begin receiving work again. This is also a normal and expected side effect of the testing.
Moderator9 |
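The restart rule described in the post above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not BOINC or Rosetta code: the `WorkUnit` class and its methods are invented for illustration, and it only encodes the two statements from the post, namely that the first checkpoint exists once progress exceeds 1% and that a restart without a checkpoint starts over from 0%.

```python
class WorkUnit:
    """Toy model of the checkpoint/restart behaviour described in the post above."""

    FIRST_CHECKPOINT_THRESHOLD = 1.0  # percent; a checkpoint exists only above this

    def __init__(self):
        self.percent = 0.0       # progress held in memory
        self.checkpoint = None   # last saved progress; None until the first checkpoint

    def advance(self, new_percent):
        """Record progress; save a checkpoint only once progress exceeds 1%."""
        self.percent = new_percent
        if new_percent > self.FIRST_CHECKPOINT_THRESHOLD:
            self.checkpoint = new_percent

    def restart(self):
        """Anything that removes the WU from memory: resume from the checkpoint, else 0%."""
        self.percent = self.checkpoint if self.checkpoint is not None else 0.0
        return self.percent


stuck = WorkUnit()
stuck.advance(1.0)        # hours of work, display still shows 1%
print(stuck.restart())    # 0.0: no checkpoint yet, so the work starts over

saved = WorkUnit()
saved.advance(47.95)      # past the first checkpoint
print(saved.restart())    # 47.95: resumes from the saved progress
```

This is why a work unit that sits at 1% for hours loses all of that time if BOINC, the project, or the machine is restarted, while one that has moved past 1% does not.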
STE\/E Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
As far as I can determine, the WU that was over 11 hours is still running according to the Process Manager. It shows 50% usage of the CPU for that WU; it's still running and at 13:30 hours now. I'll let it continue to run, see what happens to it, and report back on it one way or the other ... |
[B^S] Dr. Bill Skiba Joined: 15 Feb 06 Posts: 4 Credit: 6,496 RAC: 0 |
I just aborted this wu. https://ralph.bakerlab.org/result.php?resultid=12982. It reset itself to "0" time several times (yes, it was left in memory). I shut down BOINC, restarted the system and encountered the same behavior. After 3 more restarts from "0" time I gave up on it. |
Bruno G. Olsen & ESEA @ greenholt Joined: 16 Feb 06 Posts: 4 Credit: 45,078 RAC: 0 |
work unit: https://ralph.bakerlab.org/workunit.php?wuid=11591
result: https://ralph.bakerlab.org/result.php?resultid=13442
host: https://ralph.bakerlab.org/show_host_detail.php?hostid=285
has been running for 1 hour and 44 minutes and reports 6 hours left |
STE\/E Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
> As far as I can determine, the WU that was over 11 hours is still running according to the Process Manager. It shows 50% usage of the CPU for that WU; it's still running and at 13:30 hours now. I'll let it continue to run, see what happens to it, and report back on it one way or the other ...
PS: This WU finally did finish successfully at the 20:41 hour mark. It never showed more than 1% finished the whole time it ran ... :) |
©2024 University of Washington
http://www.bakerlab.org