Report "failure when switching projects without keeping applications in memory" bugs here

Message boards : RALPH@home bug list : Report "failure when switching projects without keeping applications in memory" bugs here

Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 225 - Posted: 18 Feb 2006, 17:19:37 UTC - in response to Message 161.  

This seems similar to the one reported in message no. 42:

I'm running seti/albert/ralph at 1/3 each, swapping out every hour, on a Win XP P4 with HT. I was watching one ralph WU on the screensaver when the swap hit. The screensaver went away (and that unit seems to be OK), but the other ralph WU that was running concurrently bombed out:
Unrecoverable error for result barcode_30_1bq9a_native_208_2_0
( - exit code -164 (0xffffff5c))

CPID 263, WU 1920, Res ID 1972


I double-checked my settings on this one: my general prefs came from ralph, and I have "leave in memory" set to 'no'.


Just happened again. It looks like when two ralph WUs are running at the same time and get swapped out simultaneously, one of them dies a terrible death.

2006-02-17 16:20:54 [ralph@home] Unrecoverable error for result BARCODE_30_1aiu__NATIVE_210_28_0 ( - exit code -1073741819 (0xc0000005))


By George, I think you've got it!
Running two ralph 4.85 units this time, still 1/3 each with seti and albert.
Both ralph units were running and swapped out simultaneously with no abend. I'll watch to make sure both finish OK.
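
The two failure codes quoted above are signed decimal values printed next to their hexadecimal form. A minimal Python sketch of that conversion follows; the helper name as_hex32 is made up for illustration and is not part of BOINC. On Windows, 0xc0000005 is the access violation status code.

```python
def as_hex32(exit_code: int) -> str:
    """Return the 32-bit two's-complement hex form of a signed exit code."""
    # Masking with 0xFFFFFFFF reinterprets a negative value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    # The two exit codes reported in this thread.
    for code in (-1073741819, -164):
        print(code, "->", as_hex32(code))
```

This makes it easy to check that a decimal code and a hex code in two different reports are the same failure.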
ID: 225
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 235 - Posted: 18 Feb 2006, 19:08:33 UTC
Last modified: 18 Feb 2006, 19:10:20 UTC

Signal 11
https://ralph.bakerlab.org/result.php?resultid=4058
Zero credits
https://ralph.bakerlab.org/result.php?resultid=3962

SIGNAL(7) Linux Programmer's Manual SIGNAL(7)

NAME
signal - list of available signals

DESCRIPTION
Linux supports both POSIX reliable signals (hereinafter "standard signals") and POSIX real-time signals.

Standard Signals
Linux supports the standard signals listed below. Several signal numbers are architecture dependent, as indicated in the "Value" column. (Where three values are given, the first one is usually valid for alpha and sparc, the middle one for i386, ppc and sh, and the last one for mips. A - denotes that a signal is absent on the corresponding architecture.)

The entries in the "Action" column of the table specify the default
action for the signal, as follows:

Term Default action is to terminate the process.

Ign Default action is to ignore the signal.

Core Default action is to terminate the process and dump core.

Stop Default action is to stop the process.

First the signals described in the original POSIX.1 standard.

Signal     Value   Action   Comment
-------------------------------------------------------------------------
SIGHUP        1    Term     Hangup detected on controlling terminal
                            or death of controlling process
SIGINT        2    Term     Interrupt from keyboard
SIGQUIT       3    Core     Quit from keyboard
SIGILL        4    Core     Illegal Instruction
SIGABRT       6    Core     Abort signal from abort(3)
SIGFPE        8    Core     Floating point exception
SIGKILL       9    Term     Kill signal
SIGSEGV      11    Core     Invalid memory reference
SIGPIPE      13    Term     Broken pipe: write to pipe with no readers
SIGALRM      14    Term     Timer signal from alarm(2)
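
The same lookup can be done without paging through the manual. This is a small sketch using Python's standard signal module; the helper name signal_name is hypothetical, and the result depends on the platform the script runs on, just as the "Value" column does.

```python
import signal


def signal_name(number: int) -> str:
    """Map a signal number to its symbolic name on the current platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # Signal 11 is the one reported for the failed result above.
    print(signal_name(11))
```

On Linux i386 this reports SIGSEGV for 11, matching the table row for an invalid memory reference.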
ID: 235
[B^S] Doug Worrall
Joined: 16 Feb 06
Posts: 10
Credit: 1,515
RAC: 0
Message 239 - Posted: 18 Feb 2006, 20:22:16 UTC - in response to Message 183.  

OK, I have just set that machine's "Leave in Memory" to YES. It has had 2/2 failures. Hopefully it'll get some more work soon.

I have another machine which has *always* had Leave in Memory set to YES return a good result. This one:

https://ralph.bakerlab.org/show_host_detail.php?hostid=81



Hello,
Running Linux from a live distro. I have the "leave in memory" setting at Yes.
Received 2 WUs so far. Both worked fine.

Sincerely

Doug
ID: 239
Hans Sveen

Joined: 17 Feb 06
Posts: 11
Credit: 386,241
RAC: 51
Message 248 - Posted: 18 Feb 2006, 22:17:26 UTC

Hello!
Just got a lot of "exit error 1" (wrong function) failures, all on my host ID 476. It also errored out on Einstein@Home (4 different errors on 4 different WUs). Even the BBC's climate change project errored out on 2 WUs, with exit status 88.
Hope this helps you in one way or another!


Hans Sveen
Oslo, Norway

ID: 248
Psycodad

Joined: 16 Feb 06
Posts: 14
Credit: 2,157
RAC: 0
Message 263 - Posted: 18 Feb 2006, 23:42:35 UTC

So, the first WU finished correctly after I set the "Leave in memory" preference to Yes.
ID: 263
Contact
Joined: 16 Feb 06
Posts: 20
Credit: 137,458
RAC: 2
Message 287 - Posted: 19 Feb 2006, 2:13:03 UTC

This host most often switches tasks OK:

18/02/06 5:39:11 PM|climateprediction.net|Pausing task 1kco_100093782_0 (removed from memory)
18/02/06 5:39:11 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
18/02/06 5:44:11 PM|climateprediction.net|Restarting task 1kco_100093782_0 using hadsm3 version 413
18/02/06 5:44:11 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)
18/02/06 5:49:11 PM|climateprediction.net|Pausing task 1kco_100093782_0 (removed from memory)
18/02/06 5:49:11 PM|boincsimap|Restarting task 200602094.002266_2 using simap version 507
18/02/06 5:54:11 PM|boincsimap|Pausing task 200602094.002266_2 (removed from memory)
18/02/06 5:54:11 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483


The errors I was able to produce (but not reproduce, despite many attempts):
1) After manually resuming another previously suspended project:

18/02/06 6:58:26 AM||Rescheduling CPU: project resumed by user
18/02/06 6:58:26 AM|SZTAKI Desktop Grid|Resuming task 893dfca3-f62a-4648-839a-b03728a734f3_0 using search version 101
18/02/06 6:58:26 AM|ralph@home|Pausing task HBLR_1.0_1ogw_206_36_0 (removed from memory)
18/02/06 6:58:27 AM|ralph@home|Unrecoverable error for result HBLR_1.0_1ogw_206_36_0 ( - exit code -1073741819 (0xc0000005))
18/02/06 6:58:27 AM||Rescheduling CPU: application exited
18/02/06 6:58:27 AM|ralph@home|Computation for task HBLR_1.0_1ogw_206_36_0 finished


2) Shortly after a manual scheduler request to another project (coincidence?):

18/02/06 5:19:32 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
18/02/06 5:19:32 PM|SETI@home|Reason: Requested by user
18/02/06 5:19:32 PM|SETI@home|(not requesting new work or reporting completed tasks)
18/02/06 5:19:37 PM|SETI@home|Scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded
18/02/06 5:22:50 PM|climateprediction.net|Restarting task 1kco_100093782_0 using hadsm3 version 413
18/02/06 5:22:50 PM|ralph@home|Pausing task HBLR_1.0_1mky_206_29_0 (removed from memory)
18/02/06 5:22:51 PM|ralph@home|Unrecoverable error for result HBLR_1.0_1mky_206_29_0 ( - exit code -1073741819 (0xc0000005))
18/02/06 5:22:51 PM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
18/02/06 5:22:51 PM||Rescheduling CPU: application exited
18/02/06 5:22:51 PM|ralph@home|Computation for task HBLR_1.0_1mky_206_29_0 finished


3) No activity other than switch:

18/02/06 7:21:09 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
18/02/06 7:26:09 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)
18/02/06 7:26:09 PM|boincsimap|Restarting task 200602094.002274_0 using simap version 507
18/02/06 7:31:09 PM|climateprediction.net|Restarting task 1kco_100093782_0 using hadsm3 version 413
18/02/06 7:31:09 PM|boincsimap|Pausing task 200602094.002274_0 (removed from memory)
18/02/06 7:36:09 PM|climateprediction.net|Pausing task 1kco_100093782_0 (removed from memory)
18/02/06 7:36:09 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483
18/02/06 7:41:09 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)
18/02/06 7:41:09 PM|boincsimap|Restarting task 200602094.002274_0 using simap version 507
18/02/06 7:41:10 PM|ralph@home|Unrecoverable error for result HBLR_1.0_2tif_206_29_0 ( - exit code -1073741819 (0xc0000005))
18/02/06 7:41:10 PM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
18/02/06 7:41:10 PM||Rescheduling CPU: application exited
18/02/06 7:41:10 PM|ralph@home|Computation for task HBLR_1.0_2tif_206_29_0 finished
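
For anyone with a long message log, the pattern in the three cases above (a task is removed from memory and errors out right afterwards) can be picked out automatically. This is a rough sketch, assuming the pipe-delimited "timestamp|project|message" layout shown in these excerpts; the function name and regular expressions are made up for illustration and are not part of BOINC.

```python
import re

# Message fragments as they appear in the log excerpts in this thread.
PAUSE = re.compile(r"Pausing (?:task|result) (\S+) \(removed from memory\)")
RESTART = re.compile(r"(?:Restarting|Resuming|Starting) (?:task|result) (\S+) using")
ERROR = re.compile(r"Unrecoverable error for result (\S+) \( - exit code (-?\d+)")


def failed_after_removal(lines):
    """Return (task, exit_code) pairs for tasks that errored while removed from memory."""
    removed = set()   # tasks currently paused and removed from memory
    failures = []
    for line in lines:
        # Keep only the message part of "timestamp|project|message".
        message = line.split("|", 2)[-1]
        m = PAUSE.search(message)
        if m:
            removed.add(m.group(1))
            continue
        m = RESTART.search(message)
        if m:
            removed.discard(m.group(1))
            continue
        m = ERROR.search(message)
        if m and m.group(1) in removed:
            failures.append((m.group(1), int(m.group(2))))
    return failures


SAMPLE = [
    "18/02/06 7:36:09 PM|ralph@home|Restarting task HBLR_1.0_2tif_206_29_0 using rosetta_beta version 483",
    "18/02/06 7:41:09 PM|ralph@home|Pausing task HBLR_1.0_2tif_206_29_0 (removed from memory)",
    "18/02/06 7:41:09 PM|boincsimap|Restarting task 200602094.002274_0 using simap version 507",
    "18/02/06 7:41:10 PM|ralph@home|Unrecoverable error for result HBLR_1.0_2tif_206_29_0 ( - exit code -1073741819 (0xc0000005))",
]

if __name__ == "__main__":
    for task, code in failed_after_removal(SAMPLE):
        print(task, code)
```

Logs in other layouts (for example the BoincView listing format) would need a different way of isolating the message text.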
ID: 287
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 349 - Posted: 20 Feb 2006, 3:24:29 UTC

My machine that has been failing WU's with "Leave in Memory = NO" has completed a WU successfully with it set to YES. I believe it has demonstrated that it can complete a WU without crashing.

The WU:

https://ralph.bakerlab.org/result.php?resultid=5292

The machine:

https://ralph.bakerlab.org/show_host_detail.php?hostid=76

It's currently half finished with a Rosetta WU, so I'll leave it set to YES until the Rosetta finishes, then I'll switch it back. It's a Dual P3 1GHz, also processing CPDN, Einstein, S@H, and S@H Beta. None of those projects seem to be affected by the "Leave in Memory" setting so far.

ID: 349
Aaron Finney

Joined: 16 Feb 06
Posts: 56
Credit: 1,457
RAC: 0
Message 355 - Posted: 20 Feb 2006, 7:05:01 UTC
Last modified: 20 Feb 2006, 7:07:48 UTC

Got a bug here..

2/19/2006 6:24:26 PM||Suspending computation and network activity - running CPU benchmarks
2/19/2006 6:24:26 PM|ralph@home|Pausing result BARCODE_30_1fna__209_15_0 (removed from memory)
2/19/2006 6:24:26 PM|ralph@home|Pausing result BARCODE_30_1cc8A_209_16_0 (removed from memory)
2/19/2006 6:24:27 PM|ralph@home|Unrecoverable error for result BARCODE_30_1fna__209_15_0 ( - exit code -1073741819 (0xc0000005))
2/19/2006 6:24:27 PM||request_reschedule_cpus: process exited
2/19/2006 6:24:27 PM|ralph@home|Computation for result BARCODE_30_1fna__209_15_0 finished
2/19/2006 6:24:28 PM||Running CPU benchmarks
2/19/2006 6:25:27 PM||Benchmark results:
2/19/2006 6:25:27 PM|| Number of CPUs: 2
2/19/2006 6:25:27 PM|| 1320 double precision MIPS (Whetstone) per CPU
2/19/2006 6:25:27 PM|| 1249 integer MIPS (Dhrystone) per CPU
2/19/2006 6:25:27 PM||Finished CPU benchmarks
2/19/2006 6:25:28 PM||Resuming computation and network activity
2/19/2006 6:25:28 PM||request_reschedule_cpus: Resuming activities
2/19/2006 6:25:28 PM|ralph@home|Restarting result BARCODE_30_1cc8A_209_16_0 using rosetta_beta version 484
2/19/2006 6:25:28 PM|ralph@home|Starting result BARCODE_30_1a19A_209_16_0 using rosetta_beta version 484


It seems the problem happened when it was running benchmarks. :( That was a workunit that had been crunching for 25 hours. Now, granted, it was with the 4.84 application version, but I can't seem to get any more work here.
ID: 355
pisi78

Joined: 16 Feb 06
Posts: 7
Credit: 2,020
RAC: 0
Message 367 - Posted: 20 Feb 2006, 12:06:49 UTC

I had this crash:

2006-02-18 13:50:28 [ralph@home] Restarting result BARCODE_30_2chf__NATIVE_210_15_0 using rosetta_beta version 484
2006-02-18 13:50:29 [---] request_reschedule_cpus: process exited
2006-02-18 14:50:29 [LHC@home] Restarting result wjan1D_v6s4hvnom_mqx_nc__9__64.313_59.323__2_4__6__15_1_sixvf_boinc98654_1 using sixtrack version 467
2006-02-18 14:50:29 [ralph@home] Pausing result BARCODE_30_2chf__NATIVE_210_15_0 (removed from memory)
2006-02-18 14:50:30 [ralph@home] Unrecoverable error for result BARCODE_30_2chf__NATIVE_210_15_0 ( - exit code -164 (0xffffff5c))
2006-02-18 14:50:30 [---] request_reschedule_cpus: process exited
2006-02-18 14:50:30 [ralph@home] Computation for result BARCODE_30_2chf__NATIVE_210_15_0 finished

result https://ralph.bakerlab.org/result.php?resultid=3163


ID: 367
pisi78

Joined: 16 Feb 06
Posts: 7
Credit: 2,020
RAC: 0
Message 417 - Posted: 21 Feb 2006, 8:54:27 UTC

ID: 417
River~~

Joined: 20 Feb 06
Posts: 20
Credit: 503
RAC: 0
Message 507 - Posted: 22 Feb 2006, 20:56:40 UTC
Last modified: 22 Feb 2006, 21:50:25 UTC

Got one here that survived a reboot, was restarted OK and ran after restart OK, but then died when pre-empted by Einstein. Interesting point for me was that it bombed out at the point of removal from memory rather than when re-loaded (or is this the usual experience with these???)

EDIT: This survived a reboot, as stated above, but before reboot keep in mem = YES, after reboot keep in mem = NO, and it failed on first swap out after NO setting.

But that seems bizarre - it implies it can be swapped out for a reboot but not for a pre-empt. Does that give our coders any clues, or is it a red herring?


Sorry for the non-standard log format; this is a BoincView listing. The machine is in another building and I can't get to it to give you the proper log, and since the work has already reported back, the slot directories will have gone already.

If this style of feedback is no use at all to you, please say so and I will take this box away from Ralph...

bt-gw is the machine, and times are in UTC. Machine is running Debian Linux and not running any graphics.

bt-gw 22/02/2006 19:55:56 --- Starting BOINC client version 5.2.8 for i686-pc-linux-gnu
bt-gw 22/02/2006 19:55:56 --- libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
bt-gw 22/02/2006 19:55:56 --- Data directory: /usr/local/BOINC
bt-gw 22/02/2006 19:55:57 --- get_local_network_info(): gethostbyname failed
bt-gw 22/02/2006 19:55:57 --- Processor: 1 GenuineIntel Pentium III (Katmai)
bt-gw 22/02/2006 19:55:57 --- Memory: 377.75 MB physical, 737.32 MB virtual
bt-gw 22/02/2006 19:55:57 --- Disk: 4.07 GB total, 3.24 GB free
bt-gw 22/02/2006 19:55:57 LHC@home Computer ID: 79658; location: work; project prefs: default
bt-gw 22/02/2006 19:55:57 Einstein@Home Computer ID: 469573; location: work; project prefs: default
bt-gw 22/02/2006 19:55:57 ralph@home Computer ID: 896; location: work; project prefs: default
bt-gw 22/02/2006 19:55:57 --- General prefs: from Einstein@Home (last modified 2006-02-22 16:24:36)
bt-gw 22/02/2006 19:55:57 --- General prefs: using separate prefs for work
bt-gw 22/02/2006 19:55:57 --- Remote control allowed


bt-gw 22/02/2006 19:55:57 Einstein@Home Resuming computation for result r1_0937.0__80_S4R2a_1 using albert version 440
bt-gw 22/02/2006 19:55:57 ralph@home Deferring computation for result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0
bt-gw 22/02/2006 19:55:57 Einstein@Home Pausing result r1_0937.0__80_S4R2a_1 (removed from memory)
bt-gw 22/02/2006 19:55:57 ralph@home Restarting result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 using rosetta_beta version 484
bt-gw 22/02/2006 19:56:00 --- request_reschedule_cpus: process exited
bt-gw 22/02/2006 20:56:01 Einstein@Home Restarting result r1_0937.0__80_S4R2a_1 using albert version 440
bt-gw 22/02/2006 20:56:01 ralph@home Pausing result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 (removed from memory)
bt-gw 22/02/2006 20:56:02 ralph@home Unrecoverable error for result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 (process exited with code 131 (0x83))
bt-gw 22/02/2006 20:56:02 --- request_reschedule_cpus: process exited
bt-gw 22/02/2006 20:56:02 ralph@home Computation for result TEST_HOMOLOG_ABINITIO_hom001_1fna__214_54_0 finished
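
One possible reading of "code 131 (0x83)" on Linux, offered as a guess that the log itself does not confirm: under both the shell's 128 + N convention and a raw wait status with the core-dump bit set, the low 7 bits give the signal number, which here is 3. A short sketch under that assumption (the helper name is hypothetical):

```python
import signal


def signal_from_exit_code(code: int):
    """Guess the signal behind an exit code, assuming the low 7 bits hold the signal number."""
    number = code & 0x7F
    if number == 0:
        return None  # no signal encoded under this assumption
    try:
        return signal.Signals(number).name
    except ValueError:
        return None  # not a signal known to this platform


if __name__ == "__main__":
    print(signal_from_exit_code(131))
```

If that reading is right, the process went down on SIGQUIT rather than on the SIGSEGV seen in the other Linux report.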
ID: 507
River~~

Joined: 20 Feb 06
Posts: 20
Credit: 503
RAC: 0
Message 508 - Posted: 22 Feb 2006, 21:09:24 UTC - in response to Message 287.  

btw Contact - cool sig ;-)
ID: 508
Psycodad

Joined: 16 Feb 06
Posts: 14
Credit: 2,157
RAC: 0
Message 522 - Posted: 23 Feb 2006, 9:27:58 UTC

Another WU crashed :(

Result

Workunit


Preference was set to "Leave in memory"
ID: 522
AMD-USR_JL

Joined: 17 Feb 06
Posts: 2
Credit: 1,040
RAC: 0
Message 537 - Posted: 23 Feb 2006, 21:09:10 UTC
Last modified: 23 Feb 2006, 21:09:46 UTC

2 of 3 crashed. I had requested the 4-day flavor. Both of them crashed on my dually. One only got to 10,000 s, but the other got to 20,000. The one on my laptop is still going, though.
I found in my BOINC Manager that it was switching projects when it errored out, which is why I posted it here.
ID: 537
Contact
Joined: 16 Feb 06
Posts: 20
Credit: 137,458
RAC: 2
Message 567 - Posted: 24 Feb 2006, 11:45:26 UTC - in response to Message 551.  
Last modified: 24 Feb 2006, 11:47:45 UTC

So if you are having a lot of errors please reset your Time setting to 2 hours and see if that helps.

I had a 120 min switch interval when I started running ralph and had no errors, because BOINC never switched apps during ralph computations.
It was only after I set a 5 min switch interval that the error was produced, on a small percentage of the switches.


ID: 567
Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 573 - Posted: 24 Feb 2006, 14:26:33 UTC

I had three errors overnight - asked for 4 hr units, don't leave in memory, 1 hr swaps, Win XP:

wu-7057 was running with wu-7095, when they swapped at 6:55 GMT 7057 died with a -164.
then picked up wu-6994 (w/wu-7095), when they swapped at 9:55 GMT wu-6994 died (0xc0000005).
wu-7095 finally 'finished' at 12:42 GMT, but hit a file size/xfer error.

I have since set the run time back to 2 hrs as requested in the other thread.
ID: 573
STE\/E

Joined: 16 Feb 06
Posts: 27
Credit: 2,226,442
RAC: 783
Message 574 - Posted: 24 Feb 2006, 14:30:50 UTC
Last modified: 24 Feb 2006, 14:32:52 UTC

Abort the WUs you have left, Pieface. I checked your computer, and the WUs you have left already show as cancelled in your account; they will do nothing but error out too ... See the (4.87 - result exceeds size limit) thread ...
ID: 574
pisi78

Joined: 16 Feb 06
Posts: 7
Credit: 2,020
RAC: 0
Message 576 - Posted: 24 Feb 2006, 15:03:24 UTC

I had this error:

24/02/2006 13.35.41|ralph@home|Pausing result BARCODE_30_1a19A_215_6_1 (removed from memory)
24/02/2006 13.35.43|ralph@home|Unrecoverable error for result BARCODE_30_1a19A_215_6_1 ( - exit code -164 (0xffffff5c))
24/02/2006 13.35.44||request_reschedule_cpus: process exited
24/02/2006 13.35.44|ralph@home|Computation for result BARCODE_30_1a19A_215_6_1 finished
24/02/2006 13.35.44|ralph@home|Output file BARCODE_30_1a19A_215_6_1_0 for result BARCODE_30_1a19A_215_6_1 exceeds size limit.
24/02/2006 13.35.44|ralph@home|File size: 67515879.000000 bytes. Limit: 25000000.000000 bytes
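
The last two log lines give enough to see how far over the limit the output file was. A minimal sketch that parses that message and computes the ratio, assuming the exact "File size: ... Limit: ..." wording shown above (the helper name size_overrun is made up for illustration):

```python
import re

# Matches the size/limit message as it appears in the log excerpt above.
SIZE_LINE = re.compile(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes")


def size_overrun(message: str):
    """Return (size, limit, ratio) parsed from a 'File size ... Limit ...' log message."""
    m = SIZE_LINE.search(message)
    if m is None:
        return None
    size, limit = float(m.group(1)), float(m.group(2))
    return size, limit, size / limit


if __name__ == "__main__":
    line = ("24/02/2006 13.35.44|ralph@home|File size: 67515879.000000 bytes. "
            "Limit: 25000000.000000 bytes")
    size, limit, ratio = size_overrun(line)
    print(f"{size:.0f} / {limit:.0f} = {ratio:.2f}x the limit")
```

For this result the output file comes out at roughly 2.7 times the allowed size.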

ID: 576
Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 579 - Posted: 24 Feb 2006, 15:43:37 UTC

Thanks, PoorBoy...
I had looked at the 'list' page and didn't see anything odd, but went back after your note and drilled down on the individual WUs, and sure enough there was a 'cancelled' message in the error area. All cleaned up now!
ID: 579
Colin Porter

Joined: 16 Feb 06
Posts: 3
Credit: 24
RAC: 0
Message 591 - Posted: 24 Feb 2006, 23:00:27 UTC
Last modified: 24 Feb 2006, 23:08:56 UTC

Here is one where I lost power to my laptop (BOINC set to suspend computation if on batteries).

24/02/2006 21:00:38|ralph@home|Starting result BARCODE_30_1acf__215_10_1 using rosetta_beta version 487
24/02/2006 22:53:31||Suspending computation and network activity - on batteries
24/02/2006 22:53:31|ralph@home|Pausing result BARCODE_30_1acf__215_10_1 (removed from memory)
24/02/2006 22:53:43||Resuming computation and network activity
24/02/2006 22:53:43||request_reschedule_cpus: Resuming activities
24/02/2006 22:53:56|ralph@home|Unrecoverable error for result BARCODE_30_1acf__215_10_1 ( - exit code -1073741819 (0xc0000005))
24/02/2006 22:53:56||request_reschedule_cpus: process exited
24/02/2006 22:53:56|ralph@home|Computation for result BARCODE_30_1acf__215_10_1 finished
24/02/2006 22:54:01|ralph@home|Started upload of BARCODE_30_1acf__215_10_1_0
24/02/2006 22:54:21||request_reschedule_cpus: result op


WU result
ID: 591



©2024 University of Washington
http://www.bakerlab.org