Report - Previously Unclassified Work Unit Errors

Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 785 - Posted: 2 Mar 2006, 13:47:14 UTC
Last modified: 2 Mar 2006, 13:50:13 UTC

Ananas also gets error exit 1 (0x1) in the BBC climate prediction app.

In that case, he traces the error to a missing DLL in c:\winnt\system32:
mscoree.dll

It seems, then, that the Windows app needs to be statically linked too.

I never saw that DLL on my Windows 3.11 or 95/98, hehe.

Read the full thread here, if you want:
http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?t=248
ID: 785
Robert Everly
Joined: 16 Feb 06
Posts: 10
Credit: 2,333
RAC: 0
Message 796 - Posted: 3 Mar 2006, 2:21:05 UTC

Haven't seen this error before.

I checked in on how things were going, saw a 4.91, and gave the graphics a shot. It ran for a couple of minutes, then went all wacky. The accepted protein model disappeared, as did both graphs. It advanced a couple of steps and hard locked the computer. I also got a bunch of runtime error popup boxes. No screenshots, though, because of the lockup. Had to do a cold reboot.

Anyway, here is the wu.

https://ralph.bakerlab.org/result.php?resultid=13679

and the error result.

<core_client_version>5.2.12</core_client_version>
<message>The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# random seed: 3988164
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200

</stderr_txt>


ID: 796
doc :)
Joined: 16 Feb 06
Posts: 46
Credit: 4,437
RAC: 0
Message 799 - Posted: 3 Mar 2006, 3:30:52 UTC

got this one with 4.91:

03/03/2006 02:06:15|ralph@home|Unrecoverable error for result BARCODE_30_1iibA_227_10_1 ( - exit code -1073741811 (0xc000000d))

Graphics were open in a window (for more than an hour or so), then it simply crashed. That's the same error I am getting with Rosetta 4.82 when I have graphics open; all seems to work fine when I do not open graphics.
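
In case it helps anyone comparing reports: the negative exit code and the hex value in that log line are the same 32-bit number, printed once as a signed integer and once as unsigned hex. Here is a minimal Python sketch of the conversion. The helper names and the small lookup table are made up for illustration and are not part of BOINC or Rosetta; the two names in the table are the standard Windows NTSTATUS names for those values.

```python
# Hypothetical helpers, not part of BOINC: map the signed exit code from the
# client log to the unsigned 32-bit hex form that is printed next to it.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def to_ntstatus_hex(signed_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{signed_code & 0xFFFFFFFF:08x}"


def describe(signed_code: int) -> str:
    """Return the hex form plus a name, if the value is in the small table."""
    unsigned = signed_code & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"{to_ntstatus_hex(signed_code)} ({name})"


# The two codes reported in this thread.
print(describe(-1073741811))
print(describe(-1073741819))
```

This only names the codes; it says nothing about why the app ran into them.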

WU - result
ID: 799
David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 807 - Posted: 3 Mar 2006, 18:29:04 UTC
Last modified: 3 Mar 2006, 18:35:58 UTC

This WU finished using v 4.92 and claimed credit but contained some interesting messages so worth a look by the experts:

# Exception caught in nstruct loop ii=1 i=40
# num_decoys:39 attempts:40 cpu_run_time:26311.8

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C910E03 write attempt to address 0x00000000

# cpu_run_time_pref: 28800

WU result is resultid 14051

ID: 807
Spare_Cycles
Joined: 16 Feb 06
Posts: 17
Credit: 12,942
RAC: 0
Message 811 - Posted: 3 Mar 2006, 20:43:14 UTC - in response to Message 807.  

This WU finished using v 4.92 and claimed credit but contained some interesting messages so worth a look by the experts:

Looks like the WU errored out and you would have gotten zero credit, but the new code that we're now testing kicked in and salvaged the WU.
ID: 811
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 812 - Posted: 4 Mar 2006, 7:52:26 UTC
Last modified: 4 Mar 2006, 7:57:09 UTC

I crunched 6 WUs using rosetta_beta_4.92 (windows) and have NO errors

However with rosetta_beta_4.84 (Linux) I have several WUs with errors

ALL with the same error -> SIGSEGV
https://ralph.bakerlab.org/result.php?resultid=12969
https://ralph.bakerlab.org/result.php?resultid=13093
https://ralph.bakerlab.org/result.php?resultid=13267
https://ralph.bakerlab.org/result.php?resultid=13987
https://ralph.bakerlab.org/result.php?resultid=14057
https://ralph.bakerlab.org/result.php?resultid=14534
ID: 812
David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 818 - Posted: 5 Mar 2006, 22:20:55 UTC
Last modified: 5 Mar 2006, 22:22:06 UTC

This WU had an unrecoverable error: result 13723

in BOINC log:

05/03/2006 20:37:03|ralph@home|Unrecoverable error for result BARCODE_30_1c8cA_236_4_0 ( - exit code -1073741819 (0xc0000005))

XP Pro SP2, Intel P4 single CPU no HT. BOINC 5.2.13


ID: 818
Dimitris Hatzopoulos
Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 820 - Posted: 6 Mar 2006, 4:46:42 UTC - in response to Message 812.  
Last modified: 6 Mar 2006, 4:51:12 UTC

Carlos, I think the most probable explanation for the SIGSEGV is that your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM.

Rosetta needs a lot of memory relative to other apps. The WinXP PC next to me has 2 Rosetta tasks: one has a 125MByte working set, the other consumes just 45MBytes. So, if your Linux PC got the former, it'd probably crash with SIGSEGV; if it got the latter, it'd probably run it fine.

With 256MB RAM on a PC, it's a coin toss. I hope that eventually the BOINC/R@h system will become "smarter" so it can send smaller proteins to PCs with less RAM.

Do a

# free

on your Linux machine before running boinc/rosetta and after and let us know.
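
If you would rather compare numbers than eyeball two `free` outputs, here is a rough Python sketch that diffs two snapshots of /proc/meminfo (the file `free` reads on Linux). The function names and the snapshot values are invented for illustration; on a real machine you would read /proc/meminfo once before starting BOINC and once while Rosetta is running.

```python
# Sketch only: parse /proc/meminfo-style text so two snapshots can be compared.
def parse_meminfo(text: str) -> dict:
    """Return {field: number} for lines like 'MemFree:  12345 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_delta_kb(before: str, after: str, field: str = "MemFree") -> int:
    """How much the given field dropped between the two snapshots, in kB."""
    return parse_meminfo(before)[field] - parse_meminfo(after)[field]


# Made-up snapshots for illustration, not measurements from any real PC.
BEFORE = "MemTotal:  256000 kB\nMemFree:  150000 kB\nSwapFree:  500000 kB\n"
AFTER = "MemTotal:  256000 kB\nMemFree:   30000 kB\nSwapFree:  480000 kB\n"

print(free_delta_kb(BEFORE, AFTER))              # drop in free RAM
print(free_delta_kb(BEFORE, AFTER, "SwapFree"))  # drop in free swap
```

A large drop in SwapFree alongside MemFree would suggest the task is being pushed out to disk.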

ID: 820
Spare_Cycles
Joined: 16 Feb 06
Posts: 17
Credit: 12,942
RAC: 0
Message 821 - Posted: 6 Mar 2006, 15:53:21 UTC - in response to Message 820.  

Carlos, I think the most probable explanation for SIGSEGV is because your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM.


The lack of physical memory will never cause a SIGSEGV on a properly functioning modern PC. Programs run in virtual memory, and the virtual memory will look exactly the same regardless of how much physical memory there is.

If there isn't enough physical memory then there will be a lot of swapping to disk, which can slow things way down. That can cause a problem if the computer is doing something like burning a CD. It will never cause an error in a crunching program like ralph/rosetta.

That assumes the PC is working. If, for instance, there are errors when reading the hard disk, then pages will be corrupted when they are swapped back in.
ID: 821
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 824 - Posted: 6 Mar 2006, 20:23:37 UTC
Last modified: 6 Mar 2006, 20:30:13 UTC

Carlos, I think the most probable explanation for SIGSEGV is because your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM.


However, I believe the most probable cause is that the app
is not statically linked

and is using my old libc.so.6
 crobertp [/home/boinc/BOINC] > ls /lib/libc* -lha
-rw-r--r--    1 root     root         1.2M Oct 13  2004 /lib/libc-2.3.2.so
lrwxrwxrwx    1 root     root           13 Oct 18  2004 /lib/libc.so.6 -> libc-2.3.2.so
lrwxrwxrwx    1 root     root           14 May  3  2003 /lib/libcap.so.1 -> libcap.so.1.10
-rw-r--r--    1 root     root         9.2k Jan 31  2003 /lib/libcap.so.1.10
lrwxrwxrwx    1 root     root           17 May  3  2003 /lib/libcom_err.so.2 -> libcom_err.so.2.0
-rw-r--r--    1 root     root         5.3k Jan  6  2003 /lib/libcom_err.so.2.0
-rw-r--r--    1 root     root          18k Oct 13  2004 /lib/libcrypt-2.3.2.so
lrwxrwxrwx    1 root     root           17 Oct 18  2004 /lib/libcrypt.so.1 -> libcrypt-2.3.2.so
crobertp [/home/boinc/BOINC] >

*These libs were not old when I first booted my PC in mid-2004.
However, I know a couple of newer libc.so.6 versions have been developed since then,
and they contain newer functions that were not even imagined in 2004.
*Sure, you get a SIGSEGV
each time you use one of these newer libc calls that does not exist in my libc.
*Of course you can use newer calls without problems IF your app is statically linked.

BTW: I got an Exit status 1 (0x1)
running Rosetta 4.82 on the same PC where rosetta_beta_4.92 had run OK:
https://boinc.bakerlab.org/rosetta/result.php?resultid=12625437



ID: 824
Dimitris Hatzopoulos
Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 825 - Posted: 7 Mar 2006, 3:58:54 UTC
Last modified: 7 Mar 2006, 4:00:19 UTC

Carlos, on second thought, you and SpareCycles are probably correct about the 256M RAM not being the reason for SIGSEGV, but on the other hand, my version of RALPH for Linux seems to be statically linked:

$ ldd rosetta_beta_4.84_i686-pc-linux-gnu
not a dynamic executable
$ file rosetta_beta_4.84_i686-pc-linux-gnu
rosetta_beta_4.84_i686-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, statically linked, stripped
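
For the curious, what that check boils down to can be approximated in a few lines of Python: a dynamically linked ELF executable normally carries a PT_INTERP program header naming the loader, and a statically linked one normally does not. This is only a sketch under that assumption, not how `ldd` or `file` are actually implemented; the function names and the fake header built for the demo are hypothetical.

```python
import struct

PT_LOAD = 1    # ordinary loadable segment
PT_INTERP = 3  # segment naming the dynamic loader


def has_interp_segment(data: bytes) -> bool:
    """True if the ELF image contains a PT_INTERP program header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = data[4] == 2                   # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    if is_64:
        (phoff,) = struct.unpack_from(endian + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        (phoff,) = struct.unpack_from(endian + "I", data, 28)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 42)
    for i in range(phnum):
        # p_type is the first 4-byte field of every program header entry.
        (p_type,) = struct.unpack_from(endian + "I", data, phoff + i * phentsize)
        if p_type == PT_INTERP:
            return True
    return False


def fake_elf32(segment_types) -> bytes:
    """Build a minimal little-endian ELF32 header plus empty program headers."""
    header = bytearray(52)                      # ELF32 header is 52 bytes
    header[:4] = b"\x7fELF"
    header[4] = 1                               # 32-bit
    header[5] = 1                               # little-endian
    struct.pack_into("<I", header, 28, 52)      # e_phoff: right after header
    struct.pack_into("<HH", header, 42, 32, len(segment_types))
    body = b"".join(struct.pack("<I", t) + bytes(28) for t in segment_types)
    return bytes(header) + body


print(has_interp_segment(fake_elf32([PT_LOAD])))             # static-looking
print(has_interp_segment(fake_elf32([PT_LOAD, PT_INTERP])))  # dynamic-looking
```

To try it on a real binary, pass the file's bytes, e.g. `has_interp_segment(open(path, "rb").read())`.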

ID: 825
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 826 - Posted: 7 Mar 2006, 4:12:05 UTC - in response to Message 825.  


The debug routines are using more memory than would be the case with Rosetta, and in fact the debug information was turned on after 4.82. So this may be the reason you are having the errors, or at least contributing to them.

Moderator9
ID: 826
hugothehermit
Joined: 17 Feb 06
Posts: 17
Credit: 2,170
RAC: 0
Message 827 - Posted: 7 Mar 2006, 6:25:50 UTC
Last modified: 7 Mar 2006, 6:37:15 UTC

(SIGSEGV SIGnal SEGmentation Violation) The quote is from here

Signal: Segmentation Violation (SegmentationFault)

This is raised when the program makes a bad memory reference, such as:

* The pointer is NULL.
* Address not mapped to object (eg, the memory is unallocated, and unmapped by the OS)
* Invalid Permission for mapped object (accessing memory that permissions deny).

This is almost invariably a programming fault.

The default action for this signal is to cause the program to terminate and dump core.

A classic example is to dereference a pointer in C that is either uninitialised, or has already been freed. Here is some C code:

#include <stdio.h>

int main(void) {
    int *pointer;

    pointer = 0;  /* NULL pointer */

    printf("Value pointed to by pointer is %d\n",
           *pointer /* this will cause SEGV */
    );
    return 0;
}
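
The same failure, seen from the outside roughly as a parent process would see it, can be sketched in Python. This assumes a POSIX system such as Linux; the child reads address 0 through ctypes, which is the Python analogue of the NULL dereference above, and `run_null_deref` is just an illustrative name.

```python
import signal
import subprocess
import sys

# Child program: read from address 0 through ctypes. Assumes POSIX; on other
# systems the crash may be reported differently or not happen at all.
CHILD = "import ctypes; ctypes.string_at(0)"


def run_null_deref() -> int:
    """Run the crashing child and return its return code as the parent sees it."""
    result = subprocess.run([sys.executable, "-c", CHILD],
                            stderr=subprocess.DEVNULL)
    return result.returncode


code = run_null_deref()
# On POSIX, subprocess reports death by signal as the negated signal number.
if code < 0:
    print("child killed by", signal.Signals(-code).name)
else:
    print("child exited with status", code)
```

The parent only learns which signal killed the child, not which pointer was bad, which is why these reports are hard to classify from the exit status alone.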


Is anyone else getting this error?

If not, I would check your hard disk and memory, and reset the project (to re-download the app), as it would be very strange for only one computer to "find" a misallocated pointer or an array going past its limit, etc. Though stranger things have happened :?


Edit: to fix up spelling and a bit of formatting ... and a bit more ... and again
ID: 827
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 829 - Posted: 7 Mar 2006, 15:35:53 UTC - in response to Message 827.  


Thanks; however, more than *one* PC has errored out on some of my results too.
*Maybe not every alpha tester has had the patience and time to report here.
Click on the results I posted, and then, on each one, click Workunit.
I did this only for a few ... this one for example ...
https://ralph.bakerlab.org/result.php?resultid=14553
*It reports some stackwalker as uninitialized, *a sure cause of SIGSEGV
btw: my smartd daemon is not reporting any errors on my hda -:)
ID: 829
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 831 - Posted: 8 Mar 2006, 1:28:42 UTC
Last modified: 8 Mar 2006, 1:32:14 UTC

Unless it's old news, wu 11798 might be worth a look. I was surprised to see it complete for me after erroring out for two others: one on 4.91, the other on 4.92 like me; different CPUs, all Windows XP variants. The errors were access violations, one read and one write. My run-time pref was shorter than theirs, but one of the errors happened faster than it took for my wu to complete.

Looks nasty to figure out, good luck.

p.s. Ah, one significant difference I forgot! I had to exit and restart the client for an unrelated reason (did not log out nor reboot however) when the wu was about 90 minutes done. I remember being surprised that it didn't start over; guess Rosetta checkpoints are more implemented than I realized.
ID: 831
hugothehermit
Joined: 17 Feb 06
Posts: 17
Credit: 2,170
RAC: 0
Message 844 - Posted: 11 Mar 2006, 8:25:43 UTC
Last modified: 11 Mar 2006, 8:51:45 UTC

Thanks; however, more than *one* PC has errored out on some of my results too.
*Maybe not every alpha tester has had the patience and time to report here.
Click on the results I posted, and then, on each one, click Workunit.
I did this only for a few ... this one for example ...
https://ralph.bakerlab.org/result.php?resultid=14553
*It reports some stackwalker as uninitialized, *a sure cause of SIGSEGV
btw: my smartd daemon is not reporting any errors on my hda -:)


You may well be right, as it seems sensible that the "swap app out of memory" would almost always find a misallocated / misused piece of memory, such as an un(re)defined pointer or an array overrun, where you could sometimes get away with it if the memory hasn't been changed.

As it has happened on both of your OSes, you could assume that it is the code, not the compiler; a line-by-line search of the code is in order, I think.

ID: 844
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 847 - Posted: 11 Mar 2006, 10:24:25 UTC

app Rosetta_beta_4.84 Linux

Exit status 2 (0x2)
https://ralph.bakerlab.org/result.php?resultid=15867
Exit status 131 (0x83)
https://ralph.bakerlab.org/result.php?resultid=15886


ID: 847
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 852 - Posted: 11 Mar 2006, 20:48:53 UTC

Rosetta Beta 4.92 under Win 2000 SP 4

https://ralph.bakerlab.org/result.php?resultid=15965

Unexplained error. I have the system set to leave the app in memory, so I don't think it's that.

Text from File:

3/11/2006 12:04:54 PM|ralph@home|Pausing result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 (left in memory)
3/11/2006 12:04:57 PM||Running CPU benchmarks
3/11/2006 12:05:55 PM||Benchmark results:
3/11/2006 12:05:55 PM|| Number of CPUs: 1
3/11/2006 12:05:55 PM|| 1166 double precision MIPS (Whetstone) per CPU
3/11/2006 12:05:55 PM|| 2274 integer MIPS (Dhrystone) per CPU
3/11/2006 12:05:55 PM||Finished CPU benchmarks
3/11/2006 12:05:56 PM||Resuming computation and network activity
3/11/2006 12:05:56 PM||request_reschedule_cpus: Resuming activities
3/11/2006 12:05:56 PM|ralph@home|Resuming result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 using rosetta_beta version 492
3/11/2006 12:20:37 PM|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
3/11/2006 12:20:37 PM|ralph@home|Reason: To fetch work
3/11/2006 12:20:37 PM|ralph@home|Requesting 96635 seconds of new work
3/11/2006 12:20:41 PM|ralph@home|Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
3/11/2006 12:20:41 PM|ralph@home|No work from project
3/11/2006 12:53:07 PM|ralph@home|Unrecoverable error for result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 ( - exit code -1073741811 (0xc000000d))
3/11/2006 12:53:07 PM||request_reschedule_cpus: process exited
3/11/2006 12:53:07 PM|ralph@home|Computation for result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 finished
3/11/2006 12:53:07 PM|SETI@home|Starting result 11ap03ab.5070.14416.928404.1.62_0 using setiathome version 418

ID: 852
Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 855 - Posted: 11 Mar 2006, 22:36:17 UTC - in response to Message 852.  
Last modified: 11 Mar 2006, 22:39:53 UTC


Mike,

Correct me if I am wrong, but I thought I saw a post from you before indicating that you were running a BOINC version later than 5.2.13. If so you are correct that the error makes no sense. If you are running 5.2.13, then the Work Unit was removed from memory when the benchmark ran and that is why it errored out.


Moderator9
ID: 855
UBT - Halifax--lad
Joined: 15 Feb 06
Posts: 29
Credit: 2,723
RAC: 0
Message 859 - Posted: 12 Mar 2006, 8:38:26 UTC - in response to Message 855.  

Well, if that were the case, would it not have errored out straight after the benchmarks?

But seeing as BOINC left it in memory at the benchmark, that can't be the reason for the failure, can it?




ID: 859
©2024 University of Washington
http://www.bakerlab.org