Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors
Author | Message |
---|---|
David@home Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
This WU finished using v 4.92 and claimed credit, but it contained some interesting messages, so it is worth a look by the experts: # Exception caught in nstruct loop ii=1 i=40 # num_decoys:39 attempts:40 cpu_run_time:26311.8 ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x7C910E03 write attempt to address 0x00000000 # cpu_run_time_pref: 28800 The WU result is resultid 14051. |
Spare_Cycles Joined: 16 Feb 06 Posts: 17 Credit: 12,942 RAC: 0 |
Quoting David@home: "This WU finished using v 4.92 and claimed credit but contained some interesting messages so worth a look by the experts" It looks like the WU errored out and you would have gotten zero credit, but the new code that we're now testing kicked in and salvaged the WU. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
I crunched 6 WUs using rosetta_beta_4.92 (Windows) and have NO errors. However, with rosetta_beta_4.84 (Linux) I have several WUs with errors, ALL with the same error -> SIGSEGV: https://ralph.bakerlab.org/result.php?resultid=12969 https://ralph.bakerlab.org/result.php?resultid=13093 https://ralph.bakerlab.org/result.php?resultid=13267 https://ralph.bakerlab.org/result.php?resultid=13987 https://ralph.bakerlab.org/result.php?resultid=14057 https://ralph.bakerlab.org/result.php?resultid=14534 |
David@home Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
This WU had an unrecoverable error: result 13723. From the BOINC log: 05/03/2006 20:37:03|ralph@home|Unrecoverable error for result BARCODE_30_1c8cA_236_4_0 ( - exit code -1073741819 (0xc0000005)) System: XP Pro SP2, Intel P4 single CPU, no HT, BOINC 5.2.13. |
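For anyone comparing these reports: the negative exit code in the BOINC log and the hex value in parentheses are the same 32-bit number, printed once as a signed integer and once as unsigned hex. Below is a minimal Python sketch of that conversion. The helper names `ntstatus_hex` and `describe` are made up for illustration and are not part of BOINC or Rosetta; the lookup table only covers the two Windows status codes that appear in this thread.

```python
# Illustrative helpers (not part of BOINC): reinterpret a signed 32-bit
# process exit code as the unsigned status value shown in hex in the log.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}

def ntstatus_hex(exit_code: int) -> str:
    """Return the exit code as unsigned 32-bit hex, e.g. '0xc0000005'."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"

def describe(exit_code: int) -> str:
    """Look up a symbolic name for the codes seen in this thread."""
    return KNOWN_STATUS.get(exit_code & 0xFFFFFFFF, "unknown")
```

Calling `ntstatus_hex(-1073741819)` gives the 0xc0000005 seen in the log line above, which `describe` maps to an access violation.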
Dimitris Hatzopoulos Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
Carlos, I think the most probable explanation for the SIGSEGV is that your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM. Rosetta needs a lot of memory relative to other apps. The WinXP PC next to me is running 2 Rosetta tasks: one has a 125 MB working set, the other consumes just 45 MB. So if your Linux PC got the former, it would probably crash with SIGSEGV; if it got the latter, it would probably run fine. With 256MB RAM on a PC, it's a coin toss. I hope that eventually the BOINC/R@h system will become "smarter" so it can send smaller proteins to PCs with less RAM. Run "# free" on your Linux machine before and after running boinc/rosetta and let us know. (In reply to Carlos_Pfitzner: "I crunched 6 WUs using rosetta_beta_4.92 (windows) and have NO errors") |
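As an alternative to eyeballing the output of free twice, here is a rough Python sketch that summarizes the same numbers from /proc/meminfo-style text, so the before and after snapshots can be compared. The function names `parse_meminfo`, `summarize` and `read_local` are hypothetical, and the sketch assumes the usual "Name: value kB" layout of that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def summarize(text):
    """Reduce the parsed fields to total, free and swap-in-use, in MB."""
    info = parse_meminfo(text)
    swap_used_kb = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "total_mb": info.get("MemTotal", 0) // 1024,
        "free_mb": info.get("MemFree", 0) // 1024,
        "swap_used_mb": swap_used_kb // 1024,
    }

def read_local(path="/proc/meminfo"):
    """Convenience wrapper for a Linux host; the path is the usual default."""
    with open(path) as handle:
        return summarize(handle.read())
```

Running `read_local()` once before starting boinc/rosetta and once while a task is crunching would show how much free memory and swap the task is actually taking.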
Spare_Cycles Joined: 16 Feb 06 Posts: 17 Credit: 12,942 RAC: 0 |
Quoting Dimitris Hatzopoulos: "Carlos, I think the most probable explanation for SIGSEGV is because your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM." A lack of physical memory will never cause a SIGSEGV on a properly functioning modern PC. Programs run in virtual memory, and the virtual memory will look exactly the same regardless of how much physical memory there is. If there isn't enough physical memory, there will be a lot of swapping to disk, which can slow things way down. That can cause a problem if the computer is doing something like burning a CD. It will never cause an error in a crunching program like ralph/rosetta. That assumes the PC is working. If, for instance, there are errors when reading the hard disk, then pages will be corrupted when they are swapped back in. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Quoting Dimitris Hatzopoulos: "Carlos, I think the most probable explanation for SIGSEGV is because your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM." However, I believe that the most probable cause is that the app is not statically linked and is using my old libc.so.6: crobertp [/home/boinc/BOINC] > ls /lib/libc* -lha -rw-r--r-- 1 root root 1.2M Oct 13 2004 /lib/libc-2.3.2.so lrwxrwxrwx 1 root root 13 Oct 18 2004 /lib/libc.so.6 -> libc-2.3.2.so lrwxrwxrwx 1 root root 14 May 3 2003 /lib/libcap.so.1 -> libcap.so.1.10 -rw-r--r-- 1 root root 9.2k Jan 31 2003 /lib/libcap.so.1.10 lrwxrwxrwx 1 root root 17 May 3 2003 /lib/libcom_err.so.2 -> libcom_err.so.2.0 -rw-r--r-- 1 root root 5.3k Jan 6 2003 /lib/libcom_err.so.2.0 -rw-r--r-- 1 root root 18k Oct 13 2004 /lib/libcrypt-2.3.2.so lrwxrwxrwx 1 root root 17 Oct 18 2004 /lib/libcrypt.so.1 -> libcrypt-2.3.2.so crobertp [/home/boinc/BOINC] > *These libs were not old when I booted my PC in the middle of 2004. However, I know a couple of newer versions of libc.so.6 have been developed since then, and they contain newer functions that were not even imagined in 2004. *Sure, you get a SIGSEGV each time you use one of these newer libc calls that does not exist in my libc. *Of course, you can use newer calls without problems IF your app is statically linked. BTW: I get an Exit status 1 (0x1) running rosetta 4.82 on the same PC; rosetta_beta_4.92 had run OK: https://boinc.bakerlab.org/rosetta/result.php?resultid=12625437 |
Dimitris Hatzopoulos Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
Carlos, on second thought, you and Spare_Cycles are probably correct that the 256MB of RAM is not the reason for the SIGSEGV. On the other hand, my version of RALPH for Linux seems to be statically linked: $ ldd rosetta_beta_4.84_i686-pc-linux-gnu not a dynamic executable $ file rosetta_beta_4.84_i686-pc-linux-gnu rosetta_beta_4.84_i686-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, statically linked, stripped |
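If ldd or file is not at hand, roughly the same check can be sketched in Python by reading the ELF program headers and looking for the PT_INTERP and PT_DYNAMIC entries that a dynamically linked executable normally carries. This is only a heuristic under that assumption (it is not how ldd or file work internally), and the function names are invented for this example.

```python
import struct

PT_DYNAMIC = 2  # program header type: dynamic linking information
PT_INTERP = 3   # program header type: path of the runtime loader

def elf_program_header_types(data: bytes):
    """Return the p_type value of every program header in an ELF image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = data[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    if is_64:
        (phoff,) = struct.unpack_from(endian + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        (phoff,) = struct.unpack_from(endian + "I", data, 28)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 42)
    # p_type is the first 4-byte field of each program header entry.
    return [
        struct.unpack_from(endian + "I", data, phoff + i * phentsize)[0]
        for i in range(phnum)
    ]

def looks_static(data: bytes) -> bool:
    """Heuristic: no loader and no dynamic section means statically linked."""
    types = elf_program_header_types(data)
    return PT_DYNAMIC not in types and PT_INTERP not in types
```

Reading the binary with `open(path, "rb").read()` and passing the bytes to `looks_static` should agree with the "statically linked" verdict from file for an executable like the one above.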
hugothehermit Joined: 17 Feb 06 Posts: 17 Credit: 2,170 RAC: 0 |
(SIGSEGV SIGnal SEGmentation Violation) The quote is from here
Is anyone else getting this error? If not, I would check your hard disk and memory, and reset the project (to re-download the app). It would be very strange for only one computer to "find" a misallocated pointer, an array going past its limit, etc. Though stranger things have happened :? Edit: to fix up spelling and a bit of formatting ... and a bit more ... and again |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Quoting hugothehermit: "(SIGSEGV SIGnal SEGmentation Violation) The quote is from here" Thanks. However, more than *one* PC has errored out on some of my results too. *Maybe not every alpha tester had the patience and time to report here. Click on the results I posted, and then, on each one, click Workunit. I did this only for a few; this one, for example: https://ralph.bakerlab.org/result.php?resultid=14553 *It reports some stackwalker as uninitialized, *surely a cause of the SIGSEGV. BTW: my smartd daemon is not reporting any errors on my hda -:) |
[B^S] sTrey Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
Unless it's old news, WU 11798 might be worth a look. I was surprised to see it complete for me after erroring out for two others, one on 4.91 and the other on 4.92 like me. The CPUs were different, all were Windows XP variants, and the errors were access violations, one read and one write. My run-time pref was shorter than theirs, but one of the errors happened faster than it took for my WU to complete. Looks nasty to figure out, good luck. P.S. Ah, one significant difference I forgot! I had to exit and restart the client for an unrelated reason (I did not log out or reboot, however) when the WU was about 90 minutes done. I remember being surprised that it didn't start over; I guess Rosetta checkpoints are more fully implemented than I realized. |
hugothehermit Joined: 17 Feb 06 Posts: 17 Credit: 2,170 RAC: 0 |
Quoting Carlos_Pfitzner: "Thanks, however more than *one* pc has erroed out some of my results too" You may well be right. It seems sensible that swapping the app out of memory would almost always expose a misallocated or misused piece of memory, such as an undefined (or redefined) pointer or an array overrun, whereas you could sometimes get away with it if the memory hasn't been changed. As it has happened on both of your OSes, you could assume that the fault is in the code and not the compiler; a line-by-line code search is in order, I think. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
App: Rosetta_beta_4.84 (Linux). Exit status 2 (0x2): https://ralph.bakerlab.org/result.php?resultid=15867 Exit status 131 (0x83): https://ralph.bakerlab.org/result.php?resultid=15886 |
Mike Gelvin Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
Rosetta Beta 4.92 under Win 2000 SP 4: https://ralph.bakerlab.org/result.php?resultid=15965 Unexplained error. I have the system set to leave the app in memory, so I don't think it's that. Text from file: 3/11/2006 12:04:54 PM|ralph@home|Pausing result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 (left in memory) 3/11/2006 12:04:57 PM||Running CPU benchmarks 3/11/2006 12:05:55 PM||Benchmark results: 3/11/2006 12:05:55 PM|| Number of CPUs: 1 3/11/2006 12:05:55 PM|| 1166 double precision MIPS (Whetstone) per CPU 3/11/2006 12:05:55 PM|| 2274 integer MIPS (Dhrystone) per CPU 3/11/2006 12:05:55 PM||Finished CPU benchmarks 3/11/2006 12:05:56 PM||Resuming computation and network activity 3/11/2006 12:05:56 PM||request_reschedule_cpus: Resuming activities 3/11/2006 12:05:56 PM|ralph@home|Resuming result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 using rosetta_beta version 492 3/11/2006 12:20:37 PM|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi 3/11/2006 12:20:37 PM|ralph@home|Reason: To fetch work 3/11/2006 12:20:37 PM|ralph@home|Requesting 96635 seconds of new work 3/11/2006 12:20:41 PM|ralph@home|Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded 3/11/2006 12:20:41 PM|ralph@home|No work from project 3/11/2006 12:53:07 PM|ralph@home|Unrecoverable error for result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 ( - exit code -1073741811 (0xc000000d)) 3/11/2006 12:53:07 PM||request_reschedule_cpus: process exited 3/11/2006 12:53:07 PM|ralph@home|Computation for result 7449_fullatom_relax_evdec00_2_0001.pdb_246_1_0 finished 3/11/2006 12:53:07 PM|SETI@home|Starting result 11ap03ab.5070.14416.928404.1.62_0 using setiathome version 418 |
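Since these logs get long, here is a small Python sketch for pulling just the "Unrecoverable error" entries out of a pipe-delimited BOINC log like the one above. It assumes the "timestamp|project|message" layout and the error wording seen in this thread; the regular expression and the function name `find_unrecoverable` are illustrative and are not taken from BOINC itself.

```python
import re

# Matches the error messages seen in this thread, for example:
# "Unrecoverable error for result NAME ( - exit code -1073741811 (0xc000000d))"
ERROR_RE = re.compile(
    r"Unrecoverable error for result (\S+) \(.*?code (-?\d+) \((0x[0-9a-fA-F]+)\)\)"
)

def find_unrecoverable(log_text):
    """Return (timestamp, project, result_name, exit_code) per error line."""
    hits = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue  # not a pipe-delimited log line
        match = ERROR_RE.search(parts[2])
        if match:
            hits.append(
                (parts[0], parts[1], match.group(1), int(match.group(2)))
            )
    return hits
```

Feeding it the log text, one entry per line, would return a single tuple for the 12:53:07 PM failure, with the exit code as a signed integer that can then be compared across reports.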
UBT - Halifax--lad Joined: 15 Feb 06 Posts: 29 Credit: 2,723 RAC: 0 |
Well, if that was the case, would it not have errored out straight after the benchmarks? But seeing as BOINC left it in memory during the benchmark, that can't be the reason for the failure, can it? |
hugothehermit Joined: 17 Feb 06 Posts: 17 Credit: 2,170 RAC: 0 |
Quoting UBT - Halifax--lad: "well if that was the case would it not have errored out straight after the benchmarks??" Yep, it can. 5.2.13 fixed applications being turfed out of memory during a benchmark no matter what your options were. Is this a definite case? No. It just looks probable, as BOINC could benchmark, then ask for work, then find out the app is stuffed. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Stuck at 10.32%: https://ralph.bakerlab.org/result.php?resultid=16410 Rosetta_beta_4.84 (Linux). Load average: 0.00, 0.00, 0.18 (whole system). *Restarting BOINC. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Exit status 131 (0x83): https://ralph.bakerlab.org/result.php?resultid=16558 Rosetta_beta 4.84 (Linux). |
Mike Gelvin Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
I am running 5.2.15 |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
SIGSEGV: segmentation violation. Stack trace (11 frames): https://ralph.bakerlab.org/result.php?resultid=18330 Rosetta_beta 4.84 (Linux). *App swapping has not occurred, nor benchmarking -:( Date Host Project ID Message 3/16/2006 5:49:39 PM crobertp.cp3 ralph@home 1098 Finished download of cc1opd_03_05.200_v1_3.gz 3/16/2006 5:49:39 PM crobertp.cp3 ralph@home 1099 Throughput 31053 bytes/sec 3/16/2006 5:50:14 PM crobertp.cp3 ralph@home 1100 Finished download of cc1opd_09_05.200_v1_3.gz 3/16/2006 5:50:14 PM crobertp.cp3 ralph@home 1101 Throughput 32962 bytes/sec 3/16/2006 5:50:15 PM crobertp.cp3 --- 1102 request_reschedule_cpus: files downloaded 3/16/2006 6:25:20 PM crobertp.cp3 ralph@home 1103 Unrecoverable error for result HB_BARCODE_30_1b3aA_351_30_0 (process exited with code 131 (0x83)) 3/16/2006 6:25:20 PM crobertp.cp3 --- 1104 request_reschedule_cpus: process exited 3/16/2006 6:25:20 PM crobertp.cp3 ralph@home 1105 Computation for result HB_BARCODE_30_1b3aA_351_30_0 finished 3/16/2006 6:25:20 PM crobertp.cp3 QMC@HOME 1106 Restarting result one_pwcdna_nodelete.1998_0 using Amolqc-alpha version 505 3/16/2006 6:26:21 PM crobertp.cp3 ralph@home 1107 Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi 3/16/2006 6:26:21 PM crobertp.cp3 ralph@home 1108 Reason: To report results |
©2024 University of Washington
http://www.bakerlab.org