Report "stuck at 1%" bugs here

Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 741 - Posted: 28 Feb 2006, 15:27:51 UTC
Last modified: 28 Feb 2006, 15:34:27 UTC

I have one:

https://ralph.bakerlab.org/workunit.php?wuid=11108

Result: https://ralph.bakerlab.org/result.php?resultid=12738

I got it last night, and it sat at 1% for more than an hour. I opened the graphics to see what was going on, and it seemed to be "alive", with some very small wiggles and almost no movement of the curves. It was still running when I shut down before going to bed, as I usually do, and I booted up again when I got up and went out. When I came home about 30 minutes ago, the other project's WUs had run, and everything was reset to zero, both CPU time and percentage. It started again after I manually updated Ralph, and it seems to have started from scratch, as the CPU time has reset to zero and the percentage is at 1 again.

I have set them all to stay in memory, and the target CPU run time is set to the default (8 hours).

My computer is https://ralph.bakerlab.org/show_host_detail.php?hostid=797

But the stdout file looks interesting. David Kim, do you want me to mail it to you? It's very long, so I won't post it here.

In the graphics it looks totally dead, with no curves at all and no movement. It seems to have stopped at step 32509.

Shall I leave it running and see what happens? Or should I just put it out of its misery? :-(

EDIT:

2/28/2006 4:07:01 PM|ralph@home|Resuming result HOMSdi_homDB003_1di2__228_9_0 using rosetta_beta version 490
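For anyone collecting log lines like the one above, here is a minimal Python sketch of splitting it into fields. It assumes the line has the shape `timestamp|project|message` with a month/day/year, 12-hour timestamp, and that the trailing `490` is the 4.90 application version written without the dot. The function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    project: str
    message: str


def parse_log_line(line: str) -> LogEntry:
    """Split a 'timestamp|project|message' line into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the message text.
    stamp, project, message = line.strip().split("|", 2)
    # Assumed layout: month/day/year, 12-hour clock with AM/PM.
    timestamp = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return LogEntry(timestamp, project, message)


def format_app_version(raw: str) -> str:
    """Turn an integer-style version such as '490' into '4.90' (assumed convention)."""
    number = int(raw)
    return f"{number // 100}.{number % 100:02d}"


line = ("2/28/2006 4:07:01 PM|ralph@home|Resuming result "
        "HOMSdi_homDB003_1di2__228_9_0 using rosetta_beta version 490")
entry = parse_log_line(line)
print(entry.timestamp.isoformat(), entry.project)
print(format_app_version(entry.message.rsplit(" ", 1)[-1]))
```
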


And it seems to be "alive", but very slow. It has now moved up to step 32521.


[color=navy][b]"I'm trying to maintain a shred of dignity in this world." - Me[/b][/color]

ID: 741

Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 745 - Posted: 28 Feb 2006, 16:53:23 UTC - in response to Message 741.  

I have one:

https://ralph.bakerlab.org/workunit.php?wuid=11108

Result: https://ralph.bakerlab.org/result.php?resultid=12738

...


And it finished without my noticing it.

Result: https://ralph.bakerlab.org/result.php?resultid=12738


[color=navy][b]"I'm trying to maintain a shred of dignity in this world." - Me[/b][/color]

ID: 745

[B^S] thierry@home
Joined: 15 Feb 06
Posts: 20
Credit: 17,624
RAC: 0
Message 753 - Posted: 28 Feb 2006, 22:29:17 UTC

I have a 4.90 WU that has been stuck at 1% for 1h05. The graphics are more or less frozen. The protein shape moves a little bit every 20 seconds or so. What do I do with this WU?
I have suspended it until I know what to do.

WU number : HOMSb7_homDB005_1b72_226_2
CPU : P4 3.0Ghz HT
OS : XP SP2


ID: 753

Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 755 - Posted: 28 Feb 2006, 22:47:29 UTC - in response to Message 753.  
Last modified: 28 Feb 2006, 22:48:18 UTC

I have a 4.90 WU that has been stuck at 1% for 1h05. The graphics are more or less frozen. The protein shape moves a little bit every 20 seconds or so. What do I do with this WU?
I have suspended it until I know what to do.

WU number : HOMSb7_homDB005_1b72_226_2
CPU : P4 3.0Ghz HT
OS : XP SP2


Continue to run it. I had two that were like that (got to step ~34,000 real quick and then appeared to stop). One of them has since completed at ~5 hrs; the other is still going at 6+ hrs. Check the graphics/screensaver and see if the steps slowly increment. The one I have that is still running has only done ~500 steps since it appeared to slow down/stop. As long as the steps continue to increment (albeit slowly), it is still running.

And BTW, the progress only showed 1% done until it finished. Then it went to 100%. Hope yours are like that.
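The "check the step, then check back later" advice can be turned into a small calculation. The Python sketch below is only an illustration: the readings are made-up numbers in the same range as the ones mentioned in this thread, and `StepReading`, `steps_per_hour` and `still_progressing` are hypothetical names, not anything provided by BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass
class StepReading:
    hours_elapsed: float  # hours since the first time you looked
    step: int             # step counter read off the graphics window


def steps_per_hour(first: StepReading, last: StepReading) -> float:
    """Average number of steps completed per hour between two readings."""
    elapsed = last.hours_elapsed - first.hours_elapsed
    if elapsed <= 0:
        raise ValueError("readings must be taken at increasing times")
    return (last.step - first.step) / elapsed


def still_progressing(first: StepReading, last: StepReading) -> bool:
    """True if the step counter moved at all between the two readings."""
    return last.step > first.step


# Illustrative readings only: the counter raced to ~34,000, then crawled.
earlier = StepReading(hours_elapsed=0.0, step=34000)
later = StepReading(hours_elapsed=5.0, step=34500)
print(still_progressing(earlier, later), steps_per_hour(earlier, later))
```

A rate that is small but above zero matches the "slow, not dead" case described above; a rate of exactly zero over several hours is the case worth reporting.
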


ID: 755

[B^S] thierry@home
Joined: 15 Feb 06
Posts: 20
Credit: 17,624
RAC: 0
Message 756 - Posted: 28 Feb 2006, 22:53:17 UTC

OK, I've restarted it. Will see....
Thanks

ID: 756

Hickory Explorer [USA]
Joined: 15 Feb 06
Posts: 2
Credit: 9,562
RAC: 0
Message 757 - Posted: 1 Mar 2006, 0:04:47 UTC

I had a WU at 1% this morning. It finished while I was at work, so I didn't see it finish. It doesn't look like it completed much work in the 7.57 hours that it ran.

WU ID : 11353
WU name : HOMSdc_homDB008_1dcj__229_7
CPU : P4 3.0Ghz HT
OS : XP SP2

<core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3988759
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts

</stderr_txt>
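For comparing reports like this one, here is a rough Python sketch that pulls the numbers out of the stderr text above. The field names in the result are my own labels, and reading `cpu_run_time_pref` as seconds is an assumption (7200 would then be 2 hours).

```python
import re

STDERR_TEXT = """\
# random seed: 3988759
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts
"""

# One regular expression per value; each captures a single integer.
PATTERNS = {
    "random_seed": r"^# random seed: (\d+)$",
    "cpu_run_time_pref": r"^# cpu_run_time_pref: (\d+)$",
    "decoys": r"^# This process generated (\d+) decoys from \d+ attempts$",
    "attempts": r"^# This process generated \d+ decoys from (\d+) attempts$",
}


def parse_stderr(text: str) -> dict:
    """Return the integer fields found in the text; missing ones are skipped."""
    summary = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            summary[name] = int(match.group(1))
    return summary


summary = parse_stderr(STDERR_TEXT)
print(summary)
# Assumption: the preference is expressed in seconds.
print(summary["cpu_run_time_pref"] / 3600, "hours requested")
```
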



ID: 757

Hickory Explorer [USA]
Joined: 15 Feb 06
Posts: 2
Credit: 9,562
RAC: 0
Message 761 - Posted: 1 Mar 2006, 2:22:32 UTC

I have a 4.90 unit on another PC that was stuck at 1%. It was on model 1 at step 34401. It had been running for 4 hours.

I stopped and restarted BOINC. When the WU restarted, it started at 0. It has been initializing now for 30+ minutes. Will let it run.

WU ID: 11340
Results ID: 12974
Result Name: HOMSdc_homDB025_1dcj__229_6_0
Computer ID: 100
CPU: Pentium M 1.73GHz
OS: XP SP2



ID: 761

Stargazer257
Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 762 - Posted: 1 Mar 2006, 2:41:52 UTC - in response to Message 761.  

I have a 4.90 unit on another PC that was stuck at 1%. It was on model 1 at step 34401. It had been running for 4 hours.

I stopped and restarted BOINC. When the WU restarted, it started at 0. It has been initializing now for 30+ minutes. Will let it run.

WU ID: 11340
Results ID: 12974
Result Name: HOMSdc_homDB025_1dcj__229_6_0
Computer ID: 100
CPU: Pentium M 1.73GHz
OS: XP SP2



That's what mine did too (reset to 0:00 upon restart). It did this because the work hadn't reached a "checkpoint", as it were. Upon reboot, it didn't have a place to resume from and had to begin anew. You will have to let it run longer (of the two WUs I had like that, one ran ~6 hrs, and the other is still running at 9+ hrs). Look at the screensaver/graphics and see if the steps increment (it may seem like it is stopped, but check the step, then check back later to see if it has changed). My WUs raced up to step 34,000 and then seemed to stop. It has actually done 500-600 additional steps over the last 9 hours.

Good luck


ID: 762

STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 765 - Posted: 1 Mar 2006, 13:24:26 UTC
Last modified: 1 Mar 2006, 13:58:49 UTC

How long should we let these WU's run ... ???

I have one now at over 11 hours & 1 at over 9 hours, both are still at 1% and the Computers are 3.4 Ghz. They should have been done by now I would think ... ???

PS: The one WU that was @ over 9 hours finally finished @ 9:47 Hr's .. The one @ over 11 hr's is still running, now up close to 12 hours ... :0
ID: 765

STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 768 - Posted: 1 Mar 2006, 15:45:27 UTC

As far as I can determine, the WU that was over 11 hours is still running according to the Process Manager. It shows 50% CPU usage for that WU; it's still running & at 13:30 hours now. I'll let it continue to run & see what happens to it & will report back on it one way or the other ...
ID: 768

[B^S] Dr. Bill Skiba
Joined: 15 Feb 06
Posts: 4
Credit: 6,496
RAC: 0
Message 774 - Posted: 1 Mar 2006, 19:39:06 UTC

I just aborted this wu. https://ralph.bakerlab.org/result.php?resultid=12982.

It reset itself to "0" time several times (yes, it was left in memory). I shut down BOINC, restarted the system and encountered the same behavior. After 3 more restarts from "0" time I gave up on it.

ID: 774

Bruno G. Olsen & ESEA @ greenholt
Joined: 16 Feb 06
Posts: 4
Credit: 45,078
RAC: 0
Message 775 - Posted: 1 Mar 2006, 20:21:40 UTC

work unit: https://ralph.bakerlab.org/workunit.php?wuid=11591
result: https://ralph.bakerlab.org/result.php?resultid=13442
host: https://ralph.bakerlab.org/show_host_detail.php?hostid=285

has been running for 1 hour and 44 minutes and reports 6 hours left
ID: 775

STE\/E
Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 778 - Posted: 1 Mar 2006, 22:44:45 UTC - in response to Message 768.  
Last modified: 1 Mar 2006, 22:47:09 UTC

As far as I can determine, the WU that was over 11 hours is still running according to the Process Manager. It shows 50% CPU usage for that WU; it's still running & at 13:30 hours now. I'll let it continue to run & see what happens to it & will report back on it one way or the other ...


PS: This WU just finally did finish successfully @ the 20:41 Hour Mark, it never did show more than 1% finished the whole time it ran ... :)
ID: 778

genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 779 - Posted: 2 Mar 2006, 0:37:46 UTC - in response to Message 767.  
Last modified: 2 Mar 2006, 0:42:01 UTC


ANY WU THAT IS RESTARTED FOR ANY REASON BEFORE IT REACHES THE FIRST CHECKPOINT WILL START OVER FROM 0%. (the first checkpoint occurs when the percent complete reaches any value GREATER than 1% complete)

Anything that removes the WU from memory before it reaches the first checkpoint is considered to be a restart. (Application swaps with keep in memory set to no, Turning off the computer, Restarting the computer, restarting BOINC, and suspending and restarting the project are all events that remove the WU from memory).
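The rule quoted above can be written down as a toy model. This Python sketch only illustrates the behaviour described in the quote (no checkpoint is written until progress passes 1%, and a restart resumes from the last checkpoint or from zero); the `WorkUnit` class is hypothetical and is not how Rosetta itself is implemented.

```python
class WorkUnit:
    """Toy model of a work unit that only checkpoints once it passes 1%."""

    def __init__(self) -> None:
        self.cpu_seconds = 0
        self.percent = 0.0
        self._saved = None  # (cpu_seconds, percent) at the last checkpoint

    def run(self, seconds: int, percent_reached: float) -> None:
        """Accumulate CPU time; write a checkpoint only above 1% complete."""
        self.cpu_seconds += seconds
        self.percent = percent_reached
        if self.percent > 1.0:
            self._saved = (self.cpu_seconds, self.percent)

    def restart(self) -> None:
        """Model removal from memory: resume from the checkpoint, if any."""
        if self._saved is None:
            self.cpu_seconds, self.percent = 0, 0.0
        else:
            self.cpu_seconds, self.percent = self._saved


stuck = WorkUnit()
stuck.run(4 * 3600, 1.0)   # four hours of work, still showing 1%
stuck.restart()            # no checkpoint yet, so everything is lost
print(stuck.cpu_seconds, stuck.percent)

healthy = WorkUnit()
healthy.run(3600, 10.0)    # passed 1%, so a checkpoint exists
healthy.restart()
print(healthy.cpu_seconds, healthy.percent)
```

In this model the "stuck at 1%" units are exactly the ones that lose all their CPU time on every restart, which matches the resets to zero reported earlier in the thread.
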



Well, the machine I had set up to test "leave in memory = NO" has restarted a bunch of times, basically every time that the apps switch. I just changed that to "leave in memory = YES".

I would guess that we can't do that test anymore while 4.90 WU's are being sent out.

[edit]
BTW, I'm now running BOINC ver. 5.3.22, since it has the ability to use a "global_prefs_override.xml" file to quickly change preferences like Leave Apps In Memory without worrying what venue a machine belongs to or what other machines the change might affect. FINALLY!
[/edit]
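A minimal Python sketch of generating such an override file follows. The element names `global_preferences` and `leave_apps_in_memory` are assumptions based on later BOINC documentation and may not match client 5.3.22, so check them against your client before relying on this. The function name is made up.

```python
import xml.etree.ElementTree as ET


def build_override(leave_apps_in_memory: bool) -> str:
    """Return the text of an override file with a single preference set."""
    # Assumed element names; verify against your BOINC client version.
    root = ET.Element("global_preferences")
    setting = ET.SubElement(root, "leave_apps_in_memory")
    setting.text = "1" if leave_apps_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


override_text = build_override(True)
print(override_text)
# To use it, this text would be saved as global_prefs_override.xml
# in the BOINC data directory (location assumed, not confirmed here).
```
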


ID: 779

Brotherbard
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 787 - Posted: 2 Mar 2006, 18:36:29 UTC

The WU # 11525 1vdi_loop_1m5xA__1001_233_5 has been hung at 1% for 13 hours now.

In the graphics the model is not changing and the stats show: Stage: Relax, Model: 1, Step: 0.

The stderr file is filled with "Could not identify element type from chemical symbol. Setting as undefined". Neither the stderr nor the stdout file has been modified since about half an hour after the start of the WU.

It is still running.

--Nathan
ID: 787

Brotherbard
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 790 - Posted: 2 Mar 2006, 19:26:39 UTC - in response to Message 789.  

If it is a version 4.90 WU, abort it. If it is a 4.91 WU then try restarting it by restarting BOINC.


It's on Mac OS X 10.4.5, RALPH v4.85

--Nathan

ID: 790

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 885 - Posted: 16 Mar 2006, 21:18:22 UTC

stuck at 1.00%
https://ralph.bakerlab.org/result.php?resultid=17832
Rosetta_beta 4.84 Linux

CPU 98% IDLE

*Restarting boinc
ID: 885

Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 894 - Posted: 17 Mar 2006, 16:03:38 UTC
Last modified: 17 Mar 2006, 16:32:12 UTC

rosetta_beta 4.93 Core Client is 5.2.15

I have a "Stuck at 1%" in progress right now. I have the app set to leave workunits in memory. It's on a Win2000 SP4 machine.

Here's the Result ID: https://ralph.bakerlab.org/result.php?resultid=17137
Workunit ID: https://ralph.bakerlab.org/workunit.php?wuid=11522

It's been running for 16 hours of CPU time.


Is there any info I can gather to help with this one while it's in progress? I noticed you include the .pdb file. I can do remote debugging of VS2005 apps on this machine; I just need some clues as to what to look for.

ID: 894

Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 901 - Posted: 18 Mar 2006, 4:38:09 UTC

Probably the best thing to do is get this tool:
http://www.sysinternals.com/Utilities/ProcessExplorer.html

Open up process explorer.
Right-Click on the Rosetta process and bring up the properties.
Switch to the threads tab.
For each thread that is eating CPU time click on the stack button.
Click on the copy button.

Do that a few times and post the results here.

----- Rom
ID: 901

Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 904 - Posted: 18 Mar 2006, 8:26:30 UTC - in response to Message 901.  
Last modified: 18 Mar 2006, 8:29:04 UTC


Do that a few times and post the results here.


Rom,

There are 3 threads:

Pass 1

for CSwitchDelta approx 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe+0x68efb
ntoskrnl.exe+0xe3ad2
rosetta_beta_4.93_windows_intelx86.exe+0x32f6c8

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe+0x68e35
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys+0x75693
ntoskrnl.exe+0x65014
ntoskrnl.exe+0xe3ad2
USER32.DLL+0x31eb3
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll+0x28989


for CSwitchDelta 1 StartAddress WINMM.dll+0x927f

Stack:
ntoskrnl.exe+0x68e35
ntoskrnl.exe+0x4fc50
ntoskrnl.exe+0x65014
ntdll.dll+0x8f03


Pass 2

for CSwitchDelta approx 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe+0x68efb
ntoskrnl.exe+0xe3ad2
rosetta_beta_4.93_windows_intelx86.exe+0x32f656


for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe+0x68e35
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys+0x75693
ntoskrnl.exe+0x65014
ntoskrnl.exe+0xe3ad2
USER32.DLL+0x31eb3
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll+0x28989


for CSwitchDelta 1 StartAddress WINMM.dll+0x927f

Stack:
ntoskrnl.exe+0x68e35
ntoskrnl.exe+0x4fc50
ntoskrnl.exe+0x65014
ntdll.dll+0x8f03

Pass 3

for CSwitchDelta approx 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe+0x68efb
ntoskrnl.exe+0xe3ad2
rosetta_beta_4.93_windows_intelx86.exe+0x32f6b6

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe+0x68e35
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys+0x75693
ntoskrnl.exe+0x65014
ntoskrnl.exe+0xe3ad2
USER32.DLL+0x31eb3
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll+0x28989


for CSwitchDelta 1 StartAddress WINMM.dll+0x927f

Stack:
ntoskrnl.exe+0x68e35
ntoskrnl.exe+0x4fc50
ntoskrnl.exe+0x65014
ntdll.dll+0x8f03


By suspending a much higher priority project I can get this work unit to run at will. For right now it's in suspended animation and left in virtual memory. It currently has 17 hrs 53 min 7 sec of CPU time on it and is still at 1.00%. Let me know if there is anything else I can do. I suspect you can get my email address if you need more detailed conversations. I would also be willing to call you.

Additional info. BOINC is running as a service.
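One way to compare the three passes is to look at the innermost rosetta_beta frame of the busiest thread (the one with CSwitchDelta of about 90) and see how far apart the sampled offsets are. The Python sketch below does only that; the frames are copied from the stacks above, and without symbols it says nothing about which function is involved.

```python
# Innermost rosetta_beta frame of the busiest thread, copied from each pass.
FRAMES = {
    "Pass 1": "rosetta_beta_4.93_windows_intelx86.exe+0x32f6c8",
    "Pass 2": "rosetta_beta_4.93_windows_intelx86.exe+0x32f656",
    "Pass 3": "rosetta_beta_4.93_windows_intelx86.exe+0x32f6b6",
}


def frame_offset(frame: str) -> int:
    """Return the numeric offset from a 'module+0xOFFSET' frame string."""
    _module, _, offset = frame.rpartition("+")
    return int(offset, 16)


offsets = {name: frame_offset(frame) for name, frame in FRAMES.items()}
spread = max(offsets.values()) - min(offsets.values())

for name, value in offsets.items():
    print(name, hex(value))
print("spread in bytes:", spread)
```

The offset differs in every pass but stays within a small range, which would be consistent with the thread spinning in one small region of code rather than sitting blocked. That is a guess from three samples, not a diagnosis.
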

Mike


ID: 904



©2024 University of Washington
http://www.bakerlab.org