Message boards : RALPH@home bug list : Report "failure when switching projects without keeping applications in memory" bugs here
Author | Message |
---|---|
Pieface Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0 |
this seems similar to the one reported in message nr 42: By George, I think you've got it! Running two ralph 4.85's this time, still 1/3 each with seti and albert. both ralph units were running and swapped out simultaneously with no abend. I'll watch to make sure both finish OK. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Signal 11: https://ralph.bakerlab.org/result.php?resultid=4058
Zero credits: https://ralph.bakerlab.org/result.php?resultid=3962
SIGNAL(7) Linux Programmer's Manual SIGNAL(7)
NAME
signal - list of available signals
DESCRIPTION
Linux supports both POSIX reliable signals (hereinafter "standard signals") and POSIX real-time signals.
Standard Signals
Linux supports the standard signals listed below. Several signal numbers are architecture dependent, as indicated in the "Value" column. (Where three values are given, the first one is usually valid for alpha and sparc, the middle one for i386, ppc and sh, and the last one for mips. A - denotes that a signal is absent on the corresponding architecture.)
The entries in the "Action" column of the table specify the default action for the signal, as follows:
Term Default action is to terminate the process.
Ign Default action is to ignore the signal.
Core Default action is to terminate the process and dump core.
Stop Default action is to stop the process.
First the signals described in the original POSIX.1 standard.
Signal Value Action Comment
-------------------------------------------------------------------------
SIGHUP 1 Term Hangup detected on controlling terminal or death of controlling process
SIGINT 2 Term Interrupt from keyboard
SIGQUIT 3 Core Quit from keyboard
SIGILL 4 Core Illegal Instruction
SIGABRT 6 Core Abort signal from abort(3)
SIGFPE 8 Core Floating point exception
SIGKILL 9 Term Kill signal
SIGSEGV 11 Core Invalid memory reference
SIGPIPE 13 Term Broken pipe: write to pipe with no readers
SIGALRM 14 Term Timer signal from alarm(2) |
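For anyone who wants to look up a signal number such as the 11 reported above without paging through signal(7), here is a minimal Python sketch using the standard `signal` module. The helper name `describe_signal` is hypothetical, and the names it returns are those of the platform it runs on, which can differ from the i386 column of the man page on other architectures.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # 11 is the value reported for the failed result above.
    for n in (6, 8, 11):
        print(n, describe_signal(n))
```

The lookup only names the signal; it says nothing about why the application received it.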
[B^S] Doug Worrall Joined: 16 Feb 06 Posts: 10 Credit: 1,515 RAC: 0 |
OK, I have just set that machine's "Leave in Memory" to YES. It has had 2/2 failures. Hopefully it'll get some more work soon. Hello, Running Linux OS from a live distro. I have the "save in memory" tab set to yes and have received 2 WUs so far. Both worked fine. Sincerely Doug |
Hans Sveen Joined: 17 Feb 06 Posts: 11 Credit: 386,241 RAC: 51 |
Hello! Just got a lot of "exit error 1" wrong function errors, all on my hostid 476. It also errored out on Einstein@home (4 different errors on 4 different WUs). Even BBC's climate change project errored out on 2 WUs, exit status 88. Hope this will help you in one way or another! Hans Sveen Oslo, Norway |
Psycodad Joined: 16 Feb 06 Posts: 14 Credit: 2,157 RAC: 0 |
So, the first WU has been finished correctly, after setting the preferences to "Left in memory" Yes |
Contact Joined: 16 Feb 06 Posts: 19 Credit: 137,458 RAC: 2 |
This host most often switches tasks OK:
18/02/06 5:39:11 PM|climateprediction.net|Pausing task 1kco_100093782_0 (removed from memory)
18/02/06 5:39:11 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
18/02/06 5:44:11 PM|climateprediction.net|Restarting task 1kco_100093782_0 using hadsm3 version 413
18/02/06 5:44:11 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)
18/02/06 5:49:11 PM|climateprediction.net|Pausing task 1kco_100093782_0 (removed from memory)
18/02/06 5:49:11 PM|boincsimap|Restarting task 200602094.002266_2 using simap version 507
18/02/06 5:54:11 PM|boincsimap|Pausing task 200602094.002266_2 (removed from memory)
18/02/06 5:54:11 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
The errors I was able to produce (but not reproduce during many attempts):
1) After manually resuming another previously suspended project:
18/02/06 6:58:26 AM||Rescheduling CPU: project resumed by user
18/02/06 6:58:26 AM|SZTAKI Desktop Grid|Resuming task 893dfca3-f62a-4648-839a-b03728a734f3_0 using search version 101
18/02/06 6:58:26 AM|ralph@home|Pausing task HBLR_1.0_1ogw_206_36_0 (removed from memory)
18/02/06 6:58:27 AM|ralph@home|Unrecoverable error for result HBLR_1.0_1ogw_206_36_0 ( - exit code -1073741819 (0xc0000005))
18/02/06 6:58:27 AM||Rescheduling CPU: application exited
18/02/06 6:58:27 AM|ralph@home|Computation for task HBLR_1.0_1ogw_206_36_0 finished
2) Shortly after manual scheduler request to another project: (coincidence?)
18/02/06 5:19:32 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
18/02/06 5:19:32 PM|SETI@home|Reason: Requested by user
18/02/06 5:19:32 PM|SETI@home|(not requesting new work or reporting completed tasks)
18/02/06 5:19:37 PM|SETI@home|Scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded
18/02/06 5:22:50 PM|climateprediction.net|Restarting task 1kco_100093782_0 using hadsm3 version 413
18/02/06 5:22:50 PM|ralph@home|Pausing task HBLR_1.0_1mky_206_29_0 (removed from memory)
18/02/06 5:22:51 PM|ralph@home|Unrecoverable error for result HBLR_1.0_1mky_206_29_0 ( - exit code -1073741819 (0xc0000005))
18/02/06 5:22:51 PM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
18/02/06 5:22:51 PM||Rescheduling CPU: application exited
18/02/06 5:22:51 PM|ralph@home|Computation for task HBLR_1.0_1mky_206_29_0 finished
3) No activity other than switch:
18/02/06 7:21:09 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
18/02/06 7:26:09 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)
18/02/06 7:26:09 PM|boincsimap|Restarting task 200602094.002274_0 using simap version 507
18/02/06 7:31:09 PM|climateprediction.net|Restarting task 1kco_100093782_0 using hadsm3 version 413
18/02/06 7:31:09 PM|boincsimap|Pausing task 200602094.002274_0 (removed from memory)
18/02/06 7:36:09 PM|climateprediction.net|Pausing task 1kco_100093782_0 (removed from memory)
18/02/06 7:36:09 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
18/02/06 7:41:09 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)
18/02/06 7:41:09 PM|boincsimap|Restarting task 200602094.002274_0 using simap version 507
18/02/06 7:41:10 PM|ralph@home|Unrecoverable error for result HBLR_1.0_2tif_206_29_0 ( - exit code -1073741819 (0xc0000005))
18/02/06 7:41:10 PM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
18/02/06 7:41:10 PM||Rescheduling CPU: application exited
18/02/06 7:41:10 PM|ralph@home|Computation for task HBLR_1.0_2tif_206_29_0 finished |
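The logs in this thread print each exit code twice, as a signed decimal and as hex, for example -1073741819 (0xc0000005) and -164 (0xffffff5c). A minimal Python sketch of that conversion follows, assuming the client shows a 32-bit value reinterpreted as signed. The function names and the one-entry lookup table are hypothetical; 0xC0000005 is the Windows access-violation status, while the meaning of -164 is not established in this thread and is deliberately left unlabelled.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


# Hypothetical lookup table for this sketch; only one well-known value is listed.
KNOWN_CODES = {
    0xC0000005: "Windows access violation (STATUS_ACCESS_VIOLATION)",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code the way the logs do, plus a label if one is known."""
    unsigned = to_unsigned32(code)
    label = KNOWN_CODES.get(unsigned, "no label in this sketch")
    return f"{code} = 0x{unsigned:08x} ({label})"


if __name__ == "__main__":
    for code in (-1073741819, -164):
        print(describe_exit_code(code))
```

An access violation on Windows is roughly the counterpart of the signal 11 (SIGSEGV) failures reported from Linux hosts earlier in the thread, though the thread itself does not confirm that the two share a cause.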
genes Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
My machine that has been failing WU's with "Leave in Memory = NO" has completed a WU successfully with it set to YES. I believe it has demonstrated that it can complete a WU without crashing. The WU: https://ralph.bakerlab.org/result.php?resultid=5292 The machine: https://ralph.bakerlab.org/show_host_detail.php?hostid=76 It's currently half finished with a Rosetta WU, so I'll leave it set to YES until the Rosetta finishes, then I'll switch it back. It's a Dual P3 1GHz, also processing CPDN, Einstein, S@H, and S@H Beta. None of those projects seem to be affected by the "Leave in Memory" setting so far. |
Aaron Finney Joined: 16 Feb 06 Posts: 56 Credit: 1,457 RAC: 0 |
Got a bug here..
2/19/2006 6:24:26 PM||Suspending computation and network activity - running CPU benchmarks
2/19/2006 6:24:26 PM|ralph@home|Pausing result BARCODE_30_1fna__209_15_0 (removed from memory)
2/19/2006 6:24:26 PM|ralph@home|Pausing result BARCODE_30_1cc8A_209_16_0 (removed from memory)
2/19/2006 6:24:27 PM|ralph@home|Unrecoverable error for result BARCODE_30_1fna__209_15_0 ( - exit code -1073741819 (0xc0000005))
2/19/2006 6:24:27 PM||request_reschedule_cpus: process exited
2/19/2006 6:24:27 PM|ralph@home|Computation for result BARCODE_30_1fna__209_15_0 finished
2/19/2006 6:24:28 PM||Running CPU benchmarks
2/19/2006 6:25:27 PM||Benchmark results:
2/19/2006 6:25:27 PM|| Number of CPUs: 2
2/19/2006 6:25:27 PM|| 1320 double precision MIPS (Whetstone) per CPU
2/19/2006 6:25:27 PM|| 1249 integer MIPS (Dhrystone) per CPU
2/19/2006 6:25:27 PM||Finished CPU benchmarks
2/19/2006 6:25:28 PM||Resuming computation and network activity
2/19/2006 6:25:28 PM||request_reschedule_cpus: Resuming activities
2/19/2006 6:25:28 PM|ralph@home|Restarting result BARCODE_30_1cc8A_209_16_0 using rosetta_beta version 484
2/19/2006 6:25:28 PM|ralph@home|Starting result BARCODE_30_1a19A_209_16_0 using rosetta_beta version 484
Seems that the problem happened when it was running benchmarks. :( that was a workunit that had been crunching for 25 hours. Now, granted, it was with the 4.84 application version, but I can't seem to get any more work here. |
pisi78 Joined: 16 Feb 06 Posts: 7 Credit: 2,020 RAC: 0 |
i had this crash
2006-02-18 13:50:28 [ralph@home] Restarting result BARCODE_30_2chf__NATIVE_210_15_0 using rosetta_beta version 484
2006-02-18 13:50:29 [---] request_reschedule_cpus: process exited
2006-02-18 14:50:29 [LHC@home] Restarting result wjan1D_v6s4hvnom_mqx_nc__9__64.313_59.323__2_4__6__15_1_sixvf_boinc98654_1 using sixtrack version 467
2006-02-18 14:50:29 [ralph@home] Pausing result BARCODE_30_2chf__NATIVE_210_15_0 (removed from memory)
2006-02-18 14:50:30 [ralph@home] Unrecoverable error for result BARCODE_30_2chf__NATIVE_210_15_0 ( - exit code -164 (0xffffff5c))
2006-02-18 14:50:30 [---] request_reschedule_cpus: process exited
2006-02-18 14:50:30 [ralph@home] Computation for result BARCODE_30_2chf__NATIVE_210_15_0 finished
result https://ralph.bakerlab.org/result.php?resultid=3163 |
pisi78 Joined: 16 Feb 06 Posts: 7 Credit: 2,020 RAC: 0 |
another wu failed https://ralph.bakerlab.org/result.php?resultid=3164 |
River~~ Joined: 20 Feb 06 Posts: 20 Credit: 503 RAC: 0 |
Got one here that survived a reboot, was restarted OK and ran after restart OK, but then died when pre-empted by Einstein. Interesting point for me was that it bombed out at the point of removal from memory rather than when re-loaded (or is this the usual experience with these???) EDIT: This survived a reboot, as stated above, but before reboot keep in mem = YES, after reboot keep in mem = NO, and it failed on first swap out after NO setting. But that seems bizarre - it implies it can be swapped out for a reboot but not for a pre-empt. Does that give our coders any clues, or is it a red herring? Sorry for the non-standard log format, this is a BoincView listing, the machine is in another building and I can't get to it to give you the proper log, and with the work already having reported back the /slot directories will have gone already. If this style of feedback is no use at all to you, please say so and I will take this box away from Ralph... bt-gw is the machine, and times are in UTC. Machine is running Debian Linux and not running any graphics. 
bt-gw 22/02/2006 19:55:56 --- Starting BOINC client version 5.2.8 for i686-pc-linux-gnu
bt-gw 22/02/2006 19:55:56 --- libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
bt-gw 22/02/2006 19:55:56 --- Data directory: /usr/local/BOINC
bt-gw 22/02/2006 19:55:57 --- get_local_network_info(): gethostbyname failed
bt-gw 22/02/2006 19:55:57 --- Processor: 1 GenuineIntel Pentium III (Katmai)
bt-gw 22/02/2006 19:55:57 --- Memory: 377.75 MB physical, 737.32 MB virtual
bt-gw 22/02/2006 19:55:57 --- Disk: 4.07 GB total, 3.24 GB free
bt-gw 22/02/2006 19:55:57 LHC@home Computer ID: 79658; location: work; project prefs: default
bt-gw 22/02/2006 19:55:57 Einstein@Home Computer ID: 469573; location: work; project prefs: default
bt-gw 22/02/2006 19:55:57 ralph@home Computer ID: 896; location: work; project prefs: default
bt-gw 22/02/2006 19:55:57 --- General prefs: from Einstein@Home (last modified 2006-02-22 16:24:36)
bt-gw 22/02/2006 19:55:57 --- General prefs: using separate prefs for work
bt-gw 22/02/2006 19:55:57 --- Remote control allowed
bt-gw 22/02/2006 19:55:57 Einstein@Home Resuming computation for result r1_0937.0__80_S4R2a_1 using albert version 440
bt-gw 22/02/2006 19:55:57 ralph@home Deferring computation for result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0
bt-gw 22/02/2006 19:55:57 Einstein@Home Pausing result r1_0937.0__80_S4R2a_1 (removed from memory)
bt-gw 22/02/2006 19:55:57 ralph@home Restarting result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 using rosetta_beta version 484
bt-gw 22/02/2006 19:56:00 --- request_reschedule_cpus: process exited
bt-gw 22/02/2006 20:56:01 Einstein@Home Restarting result r1_0937.0__80_S4R2a_1 using albert version 440
bt-gw 22/02/2006 20:56:01 ralph@home Pausing result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 (removed from memory)
bt-gw 22/02/2006 20:56:02 ralph@home Unrecoverable error for result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 (process exited with code 131 (0x83))
bt-gw 22/02/2006 20:56:02 --- request_reschedule_cpus: process exited
bt-gw 22/02/2006 20:56:02 ralph@home Computation for result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 finished |
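One possible reading of the Linux exit code 131 (0x83) in this listing: under both the shell's 128+N convention and a raw wait status with the core-dump bit set, the low seven bits carry a signal number. Whether the BOINC client reports the code in either form is an assumption, not something the thread confirms. The hypothetical helper below just does the bit arithmetic.

```python
def split_status(code: int) -> tuple[int, bool]:
    """Split a status byte into (signal number, high-bit flag).

    Assumes the low 7 bits hold a signal number and bit 0x80 is either
    the shell's +128 offset or the wait-status core-dump flag.
    """
    signal_number = code & 0x7F
    high_bit_set = bool(code & 0x80)
    return signal_number, high_bit_set


if __name__ == "__main__":
    number, flag = split_status(131)
    print(f"signal {number}, high bit set: {flag}")
```

Under that assumption 131 decodes to signal 3, which the signal(7) table quoted earlier in the thread lists as SIGQUIT with default action Core.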
River~~ Joined: 20 Feb 06 Posts: 20 Credit: 503 RAC: 0 |
btw Contact - cool sig ;-) |
Psycodad Joined: 16 Feb 06 Posts: 14 Credit: 2,157 RAC: 0 |
|
AMD-USR_JL Joined: 17 Feb 06 Posts: 2 Credit: 1,040 RAC: 0 |
2/3 crash. I had requested the 4 day flavor. I had both of them crash on my dually. One only got to 10,000s, but one got to 20,000. The one on my laptop is still going though. I found that in my BOINC manager it was switching projects when it errored out, which is why I posted it here. |
Contact Joined: 16 Feb 06 Posts: 19 Credit: 137,458 RAC: 2 |
So if you are having a lot of errors please reset your Time setting to 2 hours and see if that helps. I had a 120 min switch interval when I started running ralph and had no errors, because BOINC never switched apps during ralph computations. It was only after I set a 5 min switch interval that the error was produced, on a small percentage of the switches. |
Pieface Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0 |
I had three errors overnight - asked for 4 hr units, don't leave in memory, 1 hr swaps, Win XP: wu-7057 was running with wu-7095; when they swapped at 6:55 GMT, 7057 died with a -164. It then picked up wu-6994 (w/wu-7095); when they swapped at 9:55 GMT, wu-6994 died (0xc0000005). wu-7095 finally 'finished' at 12:42 GMT, but hit a file size/xfer error. I have since set the run time back to 2 hrs as requested in the other thread. |
STE\/E Joined: 16 Feb 06 Posts: 27 Credit: 2,226,442 RAC: 783 |
Abort the WU's you have left Pieface, I checked your computer & the WU's you have left already show cancelled in your Account & they will do nothing but error out too ... See the (4.87 - result exceeds size limit) Thread ... |
pisi78 Joined: 16 Feb 06 Posts: 7 Credit: 2,020 RAC: 0 |
i had this error
24/02/2006 13.35.41|ralph@home|Pausing result BARCODE_30_1a19A_215_6_1 (removed from memory)
24/02/2006 13.35.43|ralph@home|Unrecoverable error for result BARCODE_30_1a19A_215_6_1 ( - exit code -164 (0xffffff5c))
24/02/2006 13.35.44||request_reschedule_cpus: process exited
24/02/2006 13.35.44|ralph@home|Computation for result BARCODE_30_1a19A_215_6_1 finished
24/02/2006 13.35.44|ralph@home|Output file BARCODE_30_1a19A_215_6_1_0 for result BARCODE_30_1a19A_215_6_1 exceeds size limit.
24/02/2006 13.35.44|ralph@home|File size: 67515879.000000 bytes. Limit: 25000000.000000 bytes |
Pieface Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0 |
thanks PoorBoy... I had looked at the 'list' page and didn't see anything odd, but went back after your note and drilled down on the individual WU's and sure enough there was a 'cancelled' message in the error area. All cleaned up now! |
Colin Porter Joined: 16 Feb 06 Posts: 3 Credit: 24 RAC: 0 |
Here is one where I lost power to my laptop (BOINC set to suspend computation if on batteries).
24/02/2006 21:00:38|ralph@home|Starting result BARCODE_30_1acf__215_10_1 using rosetta_beta version 487
24/02/2006 22:53:31||Suspending computation and network activity - on batteries
24/02/2006 22:53:31|ralph@home|Pausing result BARCODE_30_1acf__215_10_1 (removed from memory)
24/02/2006 22:53:43||Resuming computation and network activity
24/02/2006 22:53:43||request_reschedule_cpus: Resuming activities
24/02/2006 22:53:56|ralph@home|Unrecoverable error for result BARCODE_30_1acf__215_10_1 ( - exit code -1073741819 (0xc0000005))
24/02/2006 22:53:56||request_reschedule_cpus: process exited
24/02/2006 22:53:56|ralph@home|Computation for result BARCODE_30_1acf__215_10_1 finished
24/02/2006 22:54:01|ralph@home|Started upload of BARCODE_30_1acf__215_10_1_0
24/02/2006 22:54:21||request_reschedule_cpus: result op WU result |
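Most reports in this thread share one pattern: a "Pausing ... (removed from memory)" line followed by an "Unrecoverable error" line for the same result. A minimal Python sketch for spotting that pattern in a pasted message log follows. The function name is hypothetical and the regular expressions are assumptions based only on the message wording quoted in this thread (both the "task" and "result" variants); other client versions may word these lines differently.

```python
import re

# Patterns assumed from the log lines quoted in this thread.
PAUSE = re.compile(r"Pausing (?:task|result) (\S+) \(removed from memory\)")
RESTART = re.compile(r"(?:Restarting|Resuming) (?:task|result) (\S+)")
ERROR = re.compile(r"Unrecoverable error for result (\S+)")


def failed_after_swap_out(lines):
    """Return results that errored after being removed from memory."""
    swapped_out = set()
    failed = []
    for line in lines:
        if (m := PAUSE.search(line)):
            swapped_out.add(m.group(1))
        elif (m := RESTART.search(line)):
            # A restart line means the earlier swap-out did not kill the result.
            swapped_out.discard(m.group(1))
        elif (m := ERROR.search(line)) and m.group(1) in swapped_out:
            failed.append(m.group(1))
    return failed


SAMPLE = [
    "24/02/2006 22:53:31|ralph@home|Pausing result BARCODE_30_1acf__215_10_1 (removed from memory)",
    "24/02/2006 22:53:43||Resuming computation and network activity",
    "24/02/2006 22:53:56|ralph@home|Unrecoverable error for result BARCODE_30_1acf__215_10_1 ( - exit code -1073741819 (0xc0000005))",
]

if __name__ == "__main__":
    print(failed_after_swap_out(SAMPLE))
```

The scan matches on message text only, so it works for the pipe-delimited, bracketed, and BoincView log styles posted above, as long as each log entry sits on its own line.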
©2024 University of Washington
http://www.bakerlab.org