Report "stuck at 1%" bugs here

genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,300
RAC: 0
Message 779 - Posted: 2 Mar 2006, 0:37:46 UTC - in response to Message 767.  
Last modified: 2 Mar 2006, 0:42:01 UTC


ANY WU THAT IS RESTARTED FOR ANY REASON BEFORE IT REACHES THE FIRST CHECKPOINT WILL START OVER FROM 0%. (the first checkpoint occurs when the percent complete reaches any value GREATER than 1% complete)

Anything that removes the WU from memory before it reaches the first checkpoint is considered to be a restart. (Application swaps with keep in memory set to no, Turning off the computer, Restarting the computer, restarting BOINC, and suspending and restarting the project are all events that remove the WU from memory).
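The rule quoted above can be sketched as a toy model. This is only an illustration of the behaviour as described in this thread, not the actual Rosetta or BOINC checkpoint code; the class name, the 1% threshold constant, and the assumption that a checkpoint is written on every step past the threshold are all made up for the example.

```python
from dataclasses import dataclass

# Assumption taken from the rule above: the first checkpoint is only
# written once progress is strictly greater than 1%.
FIRST_CHECKPOINT_THRESHOLD = 1.0


@dataclass
class WorkUnit:
    """Hypothetical stand-in for a work unit; not a BOINC data structure."""
    progress: float = 0.0    # percent complete held in memory
    checkpoint: float = 0.0  # percent complete last saved to disk

    def advance(self, percent: float) -> None:
        # Simplification: once past the threshold, every step is checkpointed.
        self.progress = min(100.0, self.progress + percent)
        if self.progress > FIRST_CHECKPOINT_THRESHOLD:
            self.checkpoint = self.progress

    def restart(self) -> None:
        # Removing the WU from memory keeps only what was checkpointed.
        self.progress = self.checkpoint


wu = WorkUnit()
wu.advance(1.0)   # sitting at exactly 1%: no checkpoint yet
wu.restart()      # app swap, reboot, BOINC restart, ...
print(wu.progress)
```

In this model a WU that is swapped out while sitting at exactly 1% goes back to 0%, which matches the behaviour reported here with "leave in memory = NO".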



Well, the machine I had set up to test "leave in memory = NO" has restarted a bunch of times, basically every time that the apps switch. I just changed that to "leave in memory = YES".

I would guess that we can't do that test anymore while 4.90 WUs are being sent out.

[edit]
BTW, I'm now running BOINC ver. 5.3.22, since it has the ability to use a "global_prefs_override.xml" file to quickly change preferences like Leave Apps In Memory without worrying what venue a machine belongs to or what other machines the change might affect. FINALLY!
[/edit]
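For anyone who wants to try the override file, here is a minimal sketch that builds its contents. The element names `global_preferences` and `leave_apps_in_memory` are my assumption about what the client expects; check the BOINC documentation for your client version before relying on them. The function name is invented for the example.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory: bool) -> str:
    """Return the text of a minimal override file.

    The tag names are assumptions, not confirmed against BOINC 5.3.22.
    """
    root = ET.Element("global_preferences")
    setting = ET.SubElement(root, "leave_apps_in_memory")
    setting.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


print(build_override(True))
```

The resulting text would be saved as "global_prefs_override.xml" where the client looks for it, and the client would need to re-read its preferences before the change takes effect.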


ID: 779
Brotherbard
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 787 - Posted: 2 Mar 2006, 18:36:29 UTC

The WU # 11525 1vdi_loop_1m5xA__1001_233_5 has been hung at 1% for 13 hours now.

In the graphics the model is not changing and the stats show: Stage: Relax, Model: 1, Step: 0.

The stderr file is filled with "Could not identify element type from chemical symbol. Setting as undefined". And both the stderr and stdout files have not been modified since about a half hour from the start of the WU.

It is still running.

--Nathan
ID: 787
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 789 - Posted: 2 Mar 2006, 19:22:00 UTC - in response to Message 787.  

If it is a version 4.90 WU, abort it. If it is a 4.91 WU then try restarting it by restarting BOINC.
Moderator9
ID: 789
Brotherbard
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 790 - Posted: 2 Mar 2006, 19:26:39 UTC - in response to Message 789.  

It's on Mac OS X 10.4.5, RALPH v 4.85

--Nathan

ID: 790
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 885 - Posted: 16 Mar 2006, 21:18:22 UTC

stuck at 1.00%
https://ralph.bakerlab.org/result.php?resultid=17832
Rosetta_beta 4.84 Linux

CPU 98% IDLE

*Restarting boinc
ID: 885
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 894 - Posted: 17 Mar 2006, 16:03:38 UTC
Last modified: 17 Mar 2006, 16:32:12 UTC

rosetta_beta 4.93 Core Client is 5.2.15

I have a "Stuck at 1%" in progress right now. I have the app set to leave workunits in memory. It's on a Win2000 SP4 machine.

Here's the Result ID: https://ralph.bakerlab.org/result.php?resultid=17137
Workunit ID: https://ralph.bakerlab.org/workunit.php?wuid=11522

It's been running for 16 hours of CPU time.


Is there any info I can gather to help with this one while it's in progress? I noticed you include the .pdb file. I can do remote debugging of VS2005 apps on this machine; I just need some clues as to what to look for.

ID: 894
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 901 - Posted: 18 Mar 2006, 4:38:09 UTC

Probably the best thing to do is get this tool:
http://www.sysinternals.com/Utilities/ProcessExplorer.html

Open Process Explorer.
Right-click the Rosetta process and bring up its Properties.
Switch to the Threads tab.
For each thread that is eating CPU time, click the Stack button.
Click the Copy button.

Do that a few times and post the results here.

----- Rom
ID: 901
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 904 - Posted: 18 Mar 2006, 8:26:30 UTC - in response to Message 901.  
Last modified: 18 Mar 2006, 8:29:04 UTC


Rom,

There are 3 threads:

Pass 1

for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe+0x68efb
ntoskrnl.exe+0xe3ad2
rosetta_beta_4.93_windows_intelx86.exe+0x32f6c8

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe+0x68e35
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys+0x75693
ntoskrnl.exe+0x65014
ntoskrnl.exe+0xe3ad2
USER32.DLL+0x31eb3
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll+0x28989


for CSwitchDelta 1 StartAddress WINMM.dll+0x927f

Stack:
ntoskrnl.exe+0x68e35
ntoskrnl.exe+0x4fc50
ntoskrnl.exe+0x65014
ntdll.dll+0x8f03


Pass 2

for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe+0x68efb
ntoskrnl.exe+0xe3ad2
rosetta_beta_4.93_windows_intelx86.exe+0x32f656


for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe+0x68e35
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys+0x75693
ntoskrnl.exe+0x65014
ntoskrnl.exe+0xe3ad2
USER32.DLL+0x31eb3
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll+0x28989


for CSwitchDelta 1 StartAddress WINMM.dll+0x927f

Stack:
ntoskrnl.exe+0x68e35
ntoskrnl.exe+0x4fc50
ntoskrnl.exe+0x65014
ntdll.dll+0x8f03

Pass 3

for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe+0x68efb
ntoskrnl.exe+0xe3ad2
rosetta_beta_4.93_windows_intelx86.exe+0x32f6b6

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe+0x68e35
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys+0x75693
ntoskrnl.exe+0x65014
ntoskrnl.exe+0xe3ad2
USER32.DLL+0x31eb3
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll+0x28989


for CSwitchDelta 1 StartAddress WINMM.dll+0x927f

Stack:
ntoskrnl.exe+0x68e35
ntoskrnl.exe+0x4fc50
ntoskrnl.exe+0x65014
ntdll.dll+0x8f03


By suspending a much higher priority project I can get this work unit to run at will. For right now it's in suspended animation and left in virtual memory. It currently has 17hrs 53min 7sec of CPU time on it and is still at 1.00%. Let me know if there is anything else I can do. I suspect you can get my email address if you need more detailed conversations. I would also be willing to call you.

Additional info. BOINC is running as a service.

Mike
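One way to read the three passes above is to compare the topmost Rosetta frame of the busy thread across samples. The sketch below parses the unsymbolized `module+0xoffset` form and reports how far apart the sampled addresses are. The function names are made up for this example, and a small spread is only a hint that the thread keeps executing in the same small region of code; it is not a diagnosis.

```python
import re

# Matches frames in the "module+0xoffset" form that Process Explorer
# shows when no symbols are loaded.
FRAME_RE = re.compile(r"^(?P<module>.+?)\+0x(?P<offset>[0-9a-fA-F]+)$")


def parse_frame(frame: str) -> tuple:
    """Split 'module+0xoffset' into (module, integer offset)."""
    match = FRAME_RE.match(frame.strip())
    if match is None:
        raise ValueError(f"unrecognised frame: {frame!r}")
    return match.group("module"), int(match.group("offset"), 16)


def offset_spread(frames: list) -> int:
    """Distance in bytes between the lowest and highest sampled offsets."""
    parsed = [parse_frame(frame) for frame in frames]
    modules = {module for module, _ in parsed}
    if len(modules) != 1:
        raise ValueError("frames come from different modules")
    offsets = [offset for _, offset in parsed]
    return max(offsets) - min(offsets)


# Top Rosetta frame of the busy thread in passes 1, 2 and 3.
samples = [
    "rosetta_beta_4.93_windows_intelx86.exe+0x32f6c8",
    "rosetta_beta_4.93_windows_intelx86.exe+0x32f656",
    "rosetta_beta_4.93_windows_intelx86.exe+0x32f6b6",
]
print(hex(offset_spread(samples)))
```

For these three samples the spread is 0x72 bytes, so all three landed within about a hundred bytes of each other.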


ID: 904
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 905 - Posted: 18 Mar 2006, 16:14:35 UTC
Last modified: 18 Mar 2006, 16:16:56 UTC

Oops, forgot to ask you to do one additional thing....

In Process Explorer there is an Options menu... Configure Symbols...

Can you set the Dbghelp.dll path to:

C:\Program Files\BOINC\DbgHelp.dll

After that could you rerun the tests again?

When things are working right you'll get something that looks like this:
rosetta_beta_4.93_windows_intelx86.exe!pairenergy+0x126
rosetta_beta_4.93_windows_intelx86.exe!fullatom_energy+0x1979
rosetta_beta_4.93_windows_intelx86.exe!scorefxn+0xb4e

TIA.

----- Rom
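A quick way to tell whether the symbols were picked up is to check whether the frames carry a function name. The sketch below handles both the `module!function+0xoffset` form shown above and the bare `module+0xoffset` form. The names `Frame` and `parse_symbolized_frame` are invented for this example and are not part of Process Explorer or any BOINC tool.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    module: str
    function: Optional[str]  # None when the frame has no symbol
    offset: int


SYMBOL_RE = re.compile(
    r"^(?P<module>[^!+]+)"        # e.g. rosetta_beta_4.93_windows_intelx86.exe
    r"(?:!(?P<function>[^+]+))?"  # optional !function part
    r"\+0x(?P<offset>[0-9a-fA-F]+)$"
)


def parse_symbolized_frame(text: str) -> Frame:
    """Parse one stack line copied from the Threads tab."""
    match = SYMBOL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised frame: {text!r}")
    return Frame(
        module=match.group("module"),
        function=match.group("function"),
        offset=int(match.group("offset"), 16),
    )


good = parse_symbolized_frame(
    "rosetta_beta_4.93_windows_intelx86.exe!pairenergy+0x126"
)
bare = parse_symbolized_frame("rosetta_beta_4.93_windows_intelx86.exe+0x32f6c8")
print(good.function, bare.function)
```

If the Rosetta frames still come back with no function name after setting the path, the symbols are probably not being loaded for that module.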

ID: 905
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 906 - Posted: 18 Mar 2006, 17:08:22 UTC - in response to Message 905.  


Rom,

Data with Symbols:

Pass 1


for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe!KiDispatchInterrupt+0x7b
ntoskrnl.exe!PsSetLegoNotifyRoutine+0x83a
rosetta_beta_4.93_windows_intelx86.exe+0x32f6b6

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe!KiUnexpectedInterrupt+0x183
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys!EngGetCurrentCodePage+0x3654
ntoskrnl.exe!KiReleaseSpinLock+0xae4
!local_unwind2+0x5fe830bb
ntoskrnl.exe!PsSetLegoNotifyRoutine+0x83a
USER32.DLL!DispatchMessageW+0x40
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll!ProcessIdToSessionId+0x17d

for CSwitchDelta 1 StartAddress WINMM.dll!timeSetEvent+0x2b0

Stack:
ntoskrnl.exe!KiUnexpectedInterrupt+0x183
ntoskrnl.exe!ObSetSecurityDescriptorInfo+0x62c
ntoskrnl.exe!KiReleaseSpinLock+0xae4
ntdll.dll!ZwWaitForMultipleObjects+0xb


Pass 2

for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe!KiDispatchInterrupt+0x7b
!local_unwind2+0x5fe830bb
ntoskrnl.exe!PsSetLegoNotifyRoutine+0x83a
rosetta_beta_4.93_windows_intelx86.exe+0x49aeda
rosetta_beta_4.93_windows_intelx86.exe+0x256bb5

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe!KiUnexpectedInterrupt+0x183
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys!EngGetCurrentCodePage+0x3654
ntoskrnl.exe!KiReleaseSpinLock+0xae4
!local_unwind2+0x5fe830bb
ntoskrnl.exe!PsSetLegoNotifyRoutine+0x83a
USER32.DLL!DispatchMessageW+0x40
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll!ProcessIdToSessionId+0x17d

for CSwitchDelta 1 StartAddress WINMM.dll!timeSetEvent+0x2b0

Stack:
ntoskrnl.exe!KiUnexpectedInterrupt+0x183
ntoskrnl.exe!ZwYieldExecution+0x35f
ntoskrnl.exe!KiUnexpectedInterrupt+0x1ba
ntdll.dll!ZwWaitForMultipleObjects+0xb



Pass 3

for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

Stack:
ntoskrnl.exe!KiDispatchInterrupt+0x7b
!local_unwind2+0x5fe830bb
ntoskrnl.exe!PsSetLegoNotifyRoutine+0x83a
rosetta_beta_4.93_windows_intelx86.exe+0x256b92

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

Stack:
ntoskrnl.exe!KiUnexpectedInterrupt+0x183
win32k.sys+0x19c2
win32k.sys+0xb72
win32k.sys!EngGetCurrentCodePage+0x3654
ntoskrnl.exe!KiReleaseSpinLock+0xae4
!local_unwind2+0x5fe830bb
ntoskrnl.exe!PsSetLegoNotifyRoutine+0x83a
USER32.DLL!DispatchMessageW+0x40
rosetta_beta_4.93_windows_intelx86.exe+0x47b2fb
rosetta_beta_4.93_windows_intelx86.exe+0x26c504
KERNEL32.dll!ProcessIdToSessionId+0x17d

for CSwitchDelta 1 StartAddress WINMM.dll!timeSetEvent+0x2b0

Stack:
ntoskrnl.exe!KiUnexpectedInterrupt+0x183
ntoskrnl.exe!ZwYieldExecution+0x35f
ntoskrnl.exe!KiUnexpectedInterrupt+0x1ba
ntdll.dll!ZwWaitForMultipleObjects+0xb


Good luck with this!
Mike

ID: 906
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 908 - Posted: 19 Mar 2006, 0:35:50 UTC

Mike,

Using Process Explorer again, can you look at the thread state for each thread?

What is the base priority and dynamic priority for each thread in your list?

It should be visible on the Threads tab on the process properties dialog box.

TIA.

----- Rom
ID: 908
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 913 - Posted: 19 Mar 2006, 7:08:07 UTC - in response to Message 908.  

More Info:

for CSwitchDelta approx. 90 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x1de550

ThreadID 2716
State Ready
Kernel Time 0:00:01.131 not moving
User Time 18:34:50.250 and climbing fast
Base Priority 1
Dynamic Priority 1

for CSwitchDelta 31 StartAddress rosetta_beta_4.93_windows_intelx86.exe+0x49fcf

ThreadID 2680
State Ready
Kernel Time 0:00:00.828 not moving
User Time 0:00:00.187 not moving
Base Priority 4
Dynamic Priority 6

for CSwitchDelta 1 StartAddress WINMM.dll!timeSetEvent+0x2b0

ThreadID 2720
State Wait:UserRequest
Kernel Time 0:00:00.000 not moving
User Time 0:00:00.000 not moving
Base Priority 15
Dynamic Priority 15
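To compare the threads numerically, the User Time strings above can be converted to seconds. This is just a small sketch: the `H:MM:SS.mmm` format is assumed from the values posted here, and the function name is invented for the example.

```python
def cpu_seconds(text: str) -> float:
    """Convert an 'H:MM:SS.mmm' time string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# User Time per thread ID, copied from the listing above.
user_time = {
    2716: "18:34:50.250",
    2680: "0:00:00.187",
    2720: "0:00:00.000",
}

busiest = max(user_time, key=lambda tid: cpu_seconds(user_time[tid]))
print(busiest, cpu_seconds(user_time[busiest]))
```

Thread 2716 accounts for essentially all of the user-mode CPU time, and it is the one running at base priority 1.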


ID: 913
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 915 - Posted: 19 Mar 2006, 7:37:58 UTC

Mike,

Are you familiar with the Windows debugging tools?

The reason I ask is that if I could get a dump of the process, this might go quite a bit quicker.

Would you be game for trying to get me a dump?

ID: 915
BennyRop
Joined: 11 Mar 06
Posts: 14
Credit: 674
RAC: 0
Message 916 - Posted: 19 Mar 2006, 8:50:40 UTC

Or you could temporarily open two holes in your firewall/router so that the system could be taken over through RealVNC (emailing Rom the IP address, RealVNC name and password). Granted, it's something I'd only do with someone I trusted. :)
ID: 916
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 924 - Posted: 19 Mar 2006, 17:30:09 UTC - in response to Message 915.  
Last modified: 19 Mar 2006, 17:34:32 UTC

This is why I was suggesting direct contact. I am familiar with VS tools for remote debugging, but I always have the source where I can attach to a remote process and set breakpoints and such. How to debug without source is something I'm not sure about. (Never had to, so I never figured it out.)

ID: 924
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 925 - Posted: 19 Mar 2006, 17:32:39 UTC - in response to Message 916.  

I'm sorry, direct access is not possible. I'm stretching the rules just running foreign code.
ID: 925
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 926 - Posted: 19 Mar 2006, 18:29:27 UTC - in response to Message 924.  

Sweet.

Attach to the process with Visual Studio.
Break all threads.
From the Debug menu, select Save Dump As.
Be sure to change the dump type to dump with heap.
And give it some sort of name.

With WinZip compression the file should shrink to 20MB or so.

Do you have a web server I would be able to download it from? Or should we try email?

----- Rom
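If WinZip isn't handy, the compression step can be sketched with Python's standard `zipfile` module instead. The function name and the "rosetta.dmp" file name used in the usage note are placeholders, and how much the dump shrinks depends entirely on its contents.

```python
import zipfile
from pathlib import Path


def zip_dump(dump_path: Path) -> Path:
    """Compress a dump file into '<name>.zip' next to it and return that path."""
    archive = dump_path.with_suffix(dump_path.suffix + ".zip")
    # ZIP_DEFLATED enables compression; the default mode only stores the file.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(dump_path, arcname=dump_path.name)
    return archive
```

Calling `zip_dump(Path("rosetta.dmp"))` would write "rosetta.dmp.zip" in the same folder, ready to upload or email.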
ID: 926
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 927 - Posted: 19 Mar 2006, 19:21:24 UTC - in response to Message 926.  
Last modified: 19 Mar 2006, 19:22:23 UTC


Rom,

OK, the latest. Like I said, I'm unfamiliar with debugging without source code. So I attached to the process and broke all threads. I looked for Save Dump As. It wasn't in the Debug menu, so I did some checking in Help and discovered a passage that essentially said the symbols had to be loaded to allow a dump. So I did a “Continue” and detached from the process to investigate how to load the symbols. After figuring that out, I looked at the run time for the Rosetta Beta process and discovered it had started over at 0 CPU time. Do you know if this represents a true restart? If so, I may no longer be stuck at 0. Anyway, I now have the dump file; it's zipped and its size is under 13 MB, easy enough for me to email.

1) Is it possible this is of no more value because I might no longer be stuck?
2) Should I allow it to keep running and see? (I have it swapped out at the moment with 11 minutes of run time according to Task Manager.)
3) Do you still want the file?
4) Where to?

Mike

ID: 927
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 928 - Posted: 19 Mar 2006, 20:36:20 UTC

Looking at the stdout file, it appears that it did indeed restart due to a failed heartbeat.
It is, however, using the exact same command line, including the seed. So I am going to let it run and see if it's still stuck at 0.
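The reason the same seed matters: a seeded pseudo-random generator replays the same sequence, so a rerun with an identical command line would be expected to follow the same path and, if the hang is deterministic, get stuck in the same place. The sketch below shows the idea with Python's own generator, which is only a stand-in; Rosetta's random number generator is not shown anywhere in this thread.

```python
import random


def trajectory(seed: int, steps: int = 5) -> list:
    """Return the first few pseudo-random draws for a given seed."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return [rng.random() for _ in range(steps)]


first_run = trajectory(seed=1234)
rerun = trajectory(seed=1234)
print(first_run == rerun)
```

If the rerun does not get stuck, that would point towards something outside the seeded calculation, such as timing or machine state.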

ID: 928
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 10 Mar 06
Posts: 21
Credit: 5,515
RAC: 0
Message 929 - Posted: 19 Mar 2006, 21:28:14 UTC

Ah, okay...

Well hopefully it'll do it again...

Let me know how it goes...

ID: 929



©2024 University of Washington
http://www.bakerlab.org