Report "failure when switching projects without keeping applications in memory" bugs here

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 2 - Posted: 15 Feb 2006, 18:44:24 UTC

We would like volunteers to test this known bug.
ID: 2
Profile William Senn
Joined: 16 Feb 06
Posts: 4
Credit: 30,895
RAC: 0
Message 35 - Posted: 16 Feb 2006, 11:51:07 UTC - in response to Message 2.  

We would like volunteers to test this known bug.


I could do this for you, if you want me to. I've got everything set up, though nothing has been downloaded yet...

William Senn..


ID: 35
Profile Krunchin-Keith [USA]
Joined: 15 Feb 06
Posts: 6
Credit: 638
RAC: 0
Message 42 - Posted: 16 Feb 2006, 13:36:00 UTC

I am testing this with host 34.
I have a 30 minute time slice, so all work units should get swapped out at least once.

The first, wuid 1410 result 1445, finished after 00:57:53 but has not been reported yet.

The second, wuid 1373 result 1408, shows a computational error after 00:55:38 run time. At 07:53:33 AM it was paused (removed from memory), then at 07:53:34 AM it showed an unrecoverable error ( - exit code -164 (0xffffff5c)) and then finished.

I have 15 more to process. I am going on vacation (sorry, bad timing), so I will let them run and they will be reported when finished, but I will not be able to comment on them. I'm set for no more work, so only those now showing for my one host will run. When I return I can do more and put some of my other krunchers on this too.
ID: 42
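
For anyone comparing the codes in these reports: the number in parentheses is just the signed exit code shown as an unsigned 32-bit hex value. A minimal sketch of the conversion, assuming 32-bit two's complement (the helper names are made up for illustration and are not part of BOINC):

```python
def exit_code_to_hex(code: int, bits: int = 32) -> str:
    """Show a signed exit code as the unsigned hex value printed next to it."""
    mask = (1 << bits) - 1          # 0xffffffff for 32 bits
    return f"0x{code & mask:08x}"


def hex_to_exit_code(text: str, bits: int = 32) -> int:
    """Turn the hex form back into the signed exit code."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):    # top bit set means a negative number
        value -= 1 << bits
    return value


print(exit_code_to_hex(-164))          # 0xffffff5c
print(exit_code_to_hex(-1073741819))   # 0xc0000005
print(hex_to_exit_code("0xc0000005"))  # -1073741819
```

On Windows, 0xc0000005 is the access violation status code, so that exit code most likely means the application crashed rather than exited on its own.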
KWSN Sir Clark
Joined: 16 Feb 06
Posts: 4
Credit: 21
RAC: 0
Message 74 - Posted: 16 Feb 2006, 21:28:33 UTC

This unit errored out when I took Sztaki off No New Work.........it seemed to be dumped out of the memory for some reason even though I'm set up to keep apps in memory.

https://ralph.bakerlab.org/result.php?resultid=1611

	Host	Project	Date	Message
CK	ck	---	16/02/2006 21:31:31	request_reschedule_cpus: project op
CK	ck	BBC Climate Change Experiment	16/02/2006 21:31:31	Restarting result hadcm3l_0h9f_00022824_0 using hadcm3l version 507
CK	ck	ralph@home	16/02/2006 21:31:31	Pausing result HBLR_1.0_1dtj_206_25_0 (removed from memory)
CK	ck	ralph@home	16/02/2006 21:31:32	Unrecoverable error for result HBLR_1.0_1dtj_206_25_0 ( - exit code -1073741819 (0xc0000005))
CK	ck	---	16/02/2006 21:31:33	request_reschedule_cpus: process exited
CK	ck	ralph@home	16/02/2006 21:31:33	Computation for result HBLR_1.0_1dtj_206_25_0 finished
CK	ck	SZTAKI Desktop Grid	16/02/2006 21:31:35	Sending scheduler request to http://szdg.lpds.sztaki.hu/szdg/cgi-bin/scheduler


Messages are from BoincView........


www.chris-kent.co.uk aka Chief.com
ID: 74
Profile UBT - Halifax--lad

Joined: 15 Feb 06
Posts: 29
Credit: 2,723
RAC: 0
Message 88 - Posted: 16 Feb 2006, 23:02:25 UTC - in response to Message 74.  

This unit errored out when I took Sztaki off No New Work.........it seemed to be dumped out of the memory for some reason even though I'm set up to keep apps in memory.

https://ralph.bakerlab.org/result.php?resultid=1611

	Host	Project	Date	Message
CK	ck	---	16/02/2006 21:31:31	request_reschedule_cpus: project op
CK	ck	BBC Climate Change Experiment	16/02/2006 21:31:31	Restarting result hadcm3l_0h9f_00022824_0 using hadcm3l version 507
CK	ck	ralph@home	16/02/2006 21:31:31	Pausing result HBLR_1.0_1dtj_206_25_0 (removed from memory)
CK	ck	ralph@home	16/02/2006 21:31:32	Unrecoverable error for result HBLR_1.0_1dtj_206_25_0 ( - exit code -1073741819 (0xc0000005))
CK	ck	---	16/02/2006 21:31:33	request_reschedule_cpus: process exited
CK	ck	ralph@home	16/02/2006 21:31:33	Computation for result HBLR_1.0_1dtj_206_25_0 finished
CK	ck	SZTAKI Desktop Grid	16/02/2006 21:31:35	Sending scheduler request to http://szdg.lpds.sztaki.hu/szdg/cgi-bin/scheduler


Messages are from BoincView........


Double check your preempt in memory setting on the preferences on this project, for some reason mine was set to No when I joined up, usually it automatically says Yes on other projects I join, that may be why it was removed from memory
ID: 88
Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 109 - Posted: 17 Feb 2006, 2:57:42 UTC

this seems similar to the one reported in message nr 42:

i'm running seti/albert/ralph 1/3 each swapping out every hour on a WIN XP P4 with HT. I was watching one ralph WU on screensaver when the swap hit. The screensaver went away (and that unit seems to be ok), but the other ralph WU that was running concurrently bombed out :
Unrecoverable error for result barcode_30_1bq9a_native_208_2_0
( - exit code -164 (0xffffff5c))

CPID 263, WU 1920, Res ID 1972
ID: 109
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 110 - Posted: 17 Feb 2006, 3:12:25 UTC - in response to Message 88.  

Double check your preempt in memory setting on the preferences on this project, for some reason mine was set to No when I joined up, usually it automatically says Yes on other projects I join, that may be why it was removed from memory


"Leave in memory when preempted" is a BOINC GLOBAL default that "propagates" across all projects. BOINC uses the config from the project with the newest time-stamp.

i.e. if you run SETI/Rosetta/RALPH and change RALPH today, those RALPH settings (e.g. "Leave app in mem" set to NO) will be used by all other projects.

It wasn't quite clear to me either, and I had to look for this "detail".
ID: 110
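
The "newest time-stamp wins" rule described in this post can be pictured with a small sketch. It only illustrates the behaviour as described here and is not BOINC's actual code; the `GlobalPrefs` class, its field names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GlobalPrefs:
    source_project: str         # project whose website the prefs were edited on
    mod_time: int               # hypothetical: last edit time, seconds since epoch
    leave_apps_in_memory: bool


def effective_prefs(candidates):
    """Pick the copy of the global prefs that was modified most recently."""
    return max(candidates, key=lambda p: p.mod_time)


# Made-up timestamps: RALPH's copy was edited last, so it wins everywhere.
prefs = [
    GlobalPrefs("SETI@home", 1139900000, True),
    GlobalPrefs("Rosetta@home", 1140000000, True),
    GlobalPrefs("RALPH@home", 1140100000, False),
]

winner = effective_prefs(prefs)
print(winner.source_project, winner.leave_apps_in_memory)  # RALPH@home False
```

Under that assumption, a "No" saved on RALPH after the other projects' settings would apply to SETI and Rosetta on the same machine too.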
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 113 - Posted: 17 Feb 2006, 3:56:20 UTC

Oops -- I should have put this here, but instead I put it in the "current tests" area. Well, here it is anyway.

OK, the machine I had set "leave in memory" to OFF had an error on its one WU that it got:

https://ralph.bakerlab.org/result.php?resultid=1666

It's not getting any more at the moment (no work from project). It also just had a Rosetta WU error out. I set Rosetta to NNW on that machine for now so I won't lose any more work.

This is the machine BTW:

https://ralph.bakerlab.org/show_host_detail.php?hostid=76


ID: 113
Aaron Finney

Joined: 16 Feb 06
Posts: 56
Credit: 1,457
RAC: 0
Message 139 - Posted: 17 Feb 2006, 6:24:24 UTC - in response to Message 2.  
Last modified: 17 Feb 2006, 6:34:53 UTC

We would like volunteers to test this known bug.


WORKS HERE - at least when forced, using "suspend" under the projects tab.

Didn't work on the old rosetta!

Le roi est mort! Vive le roi!

Now I'll let it churn for a few days and see if it does it on its own.. hehe..
ID: 139
Profile UBT - Halifax--lad

Joined: 15 Feb 06
Posts: 29
Credit: 2,723
RAC: 0
Message 145 - Posted: 17 Feb 2006, 8:10:52 UTC - in response to Message 110.  

Double check your preempt in memory setting on the preferences on this project, for some reason mine was set to No when I joined up, usually it automatically says Yes on other projects I join, that may be why it was removed from memory


"Leave in memory when preempted" is a BOINC GLOBAL default that "propagates" across all projects. BOINC uses the config from the project with the newest time-stamp.

i.e. if you run SETI/Rosetta/RALPH and change RALPH today, those RALPH settings (e.g. "Leave app in mem" set to NO) will be used by all other projects.

It wasn't quite clear to me either, and I had to look for this "detail".


Yes, I know that, but it defaults to No on this project, so people need to be aware of that and possibly set RALPH up on a different preference (home, school or work) with preempt set to No; that way it won't interfere with other project preferences.
ID: 145
Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 148 - Posted: 17 Feb 2006, 14:48:40 UTC - in response to Message 109.  

this seems similar to the one reported in message nr 42:

i'm running seti/albert/ralph 1/3 each swapping out every hour on a WIN XP P4 with HT. I was watching one ralph WU on screensaver when the swap hit. The screensaver went away (and that unit seems to be ok), but the other ralph WU that was running concurrently bombed out :
Unrecoverable error for result barcode_30_1bq9a_native_208_2_0
( - exit code -164 (0xffffff5c))

CPID 263, WU 1920, Res ID 1972


I double checked my settings on this one, my general pref's came from ralph and I have leave in memory set to 'no'.
ID: 148
Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 161 - Posted: 17 Feb 2006, 21:27:33 UTC - in response to Message 148.  

this seems similar to the one reported in message nr 42:

i'm running seti/albert/ralph 1/3 each swapping out every hour on a WIN XP P4 with HT. I was watching one ralph WU on screensaver when the swap hit. The screensaver went away (and that unit seems to be ok), but the other ralph WU that was running concurrently bombed out :
Unrecoverable error for result barcode_30_1bq9a_native_208_2_0
( - exit code -164 (0xffffff5c))

CPID 263, WU 1920, Res ID 1972


I double checked my settings on this one, my general pref's came from ralph and I have leave in memory set to 'no'.


Just happened again. Looks like when two ralph's are running at the same time and get swapped out simultaneously one of them dies a terrible death.

2006-02-17 16:20:54 [ralph@home] Unrecoverable error for result BARCODE_30_1aiu__NATIVE_210_28_0 ( - exit code -1073741819 (0xc0000005))
ID: 161
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 163 - Posted: 17 Feb 2006, 22:13:25 UTC - in response to Message 145.  


Yes, I know that, but it defaults to No on this project, so people need to be aware of that and possibly set RALPH up on a different preference (home, school or work) with preempt set to No; that way it won't interfere with other project preferences.



AFAIK "Leave in memory" is a global default, not per project or per location (work/home/school), so setting it independently isn't as easy if you share the same PC between Rosetta and Ralph. See my deciding on resource share
ID: 163
Psycodad

Joined: 16 Feb 06
Posts: 14
Credit: 2,157
RAC: 0
Message 168 - Posted: 17 Feb 2006, 23:28:25 UTC

Got my first WU and it errored out after an hour of crunching :(

18.02.2006 00:01:10|SETI@home|Restarting result 18dc04aa.7932.30433.467306.1.218_1 using setiathome version 418
18.02.2006 00:01:10|ralph@home|Pausing result BARCODE_30_1c9oA_NATIVE_210_6_0 (removed from memory)
18.02.2006 00:01:11|ralph@home|Unrecoverable error for result BARCODE_30_1c9oA_NATIVE_210_6_0 ( - exit code -164 (0xffffff5c))
18.02.2006 00:01:12||request_reschedule_cpus: process exited
18.02.2006 00:01:12|ralph@home|Computation for result BARCODE_30_1c9oA_NATIVE_210_6_0 finished




Result

Workunit
ID: 168
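
The pattern in these logs (a result is paused and removed from memory, then fails a second later) can be picked out of a saved message log automatically. A rough sketch, assuming the pipe-separated `date|project|message` layout shown in this post; the function name and the 5-second window are arbitrary choices for illustration:

```python
import re
from datetime import datetime, timedelta

PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) \( - exit code (-?\d+)")


def find_preemption_crashes(lines, window=timedelta(seconds=5)):
    """Return (result name, exit code) for results that failed right after being paused."""
    paused_at = {}
    crashes = []
    for line in lines:
        stamp, _project, message = line.split("|", 2)
        when = datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S")
        if m := PAUSED.search(message):
            paused_at[m.group(1)] = when
        elif m := FAILED.search(message):
            name, code = m.group(1), int(m.group(2))
            started = paused_at.get(name)
            if started is not None and when - started <= window:
                crashes.append((name, code))
    return crashes


# Lines copied from the log in the post above.
sample = [
    "18.02.2006 00:01:10|ralph@home|Pausing result BARCODE_30_1c9oA_NATIVE_210_6_0 (removed from memory)",
    "18.02.2006 00:01:11|ralph@home|Unrecoverable error for result BARCODE_30_1c9oA_NATIVE_210_6_0 ( - exit code -164 (0xffffff5c))",
    "18.02.2006 00:01:12||request_reschedule_cpus: process exited",
]

crashes = find_preemption_crashes(sample)
print(crashes)  # [('BARCODE_30_1c9oA_NATIVE_210_6_0', -164)]
```

Logs exported from other tools, or with a different date format, would need the timestamp pattern adjusted.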
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 178 - Posted: 18 Feb 2006, 2:46:20 UTC

Oh, well, just had a 4.84 WU crash:

2/17/2006 8:39:37 PM|ralph@home|Unrecoverable error for result BARCODE_30_1tig__NATIVE_210_3_0 ( - exit code -1073741819 (0xc0000005))

This one:

https://ralph.bakerlab.org/result.php?resultid=2785

This computer:

https://ralph.bakerlab.org/show_host_detail.php?hostid=76

So far both WU's this machine has had have crashed. It has "leave in Memory" set to "NO".

ID: 178
Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 182 - Posted: 18 Feb 2006, 3:38:50 UTC

Looks like we are still seeing this problem on some machines. Can people who are having crashes upon preemption check whether keeping applications in memory fixes the problem? And can people who are not seeing this problem respond as well?

Thanks!
ID: 182
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 183 - Posted: 18 Feb 2006, 3:55:08 UTC
Last modified: 18 Feb 2006, 3:56:14 UTC

OK, I have just set that machine's "Leave in Memory" to YES. It has had 2/2 failures. Hopefully it'll get some more work soon.

Another machine of mine, which has *always* had Leave in Memory set to YES, returned a good result. This one:

https://ralph.bakerlab.org/show_host_detail.php?hostid=81



ID: 183
Profile Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 205 - Posted: 18 Feb 2006, 10:26:27 UTC
Last modified: 18 Feb 2006, 10:28:30 UTC

I switched "keep app in memory" to NO
*and I am getting this bug
https://ralph.bakerlab.org/forum_thread.php?id=33
*However when I used YES - the bug was the same?
ID: 205
Psycodad

Joined: 16 Feb 06
Posts: 14
Credit: 2,157
RAC: 0
Message 206 - Posted: 18 Feb 2006, 10:30:36 UTC

Unfortunately, my 2nd WU crashed during this night.

Result

Workunit


I have just set "Leave in Memory" to yes in my preferences.
I hope there will be some more work soon.
ID: 206
Pieface

Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 225 - Posted: 18 Feb 2006, 17:19:37 UTC - in response to Message 161.  

this seems similar to the one reported in message nr 42:

i'm running seti/albert/ralph 1/3 each swapping out every hour on a WIN XP P4 with HT. I was watching one ralph WU on screensaver when the swap hit. The screensaver went away (and that unit seems to be ok), but the other ralph WU that was running concurrently bombed out :
Unrecoverable error for result barcode_30_1bq9a_native_208_2_0
( - exit code -164 (0xffffff5c))

CPID 263, WU 1920, Res ID 1972


I double checked my settings on this one, my general pref's came from ralph and I have leave in memory set to 'no'.


Just happened again. Looks like when two ralph's are running at the same time and get swapped out simultaneously one of them dies a terrible death.

2006-02-17 16:20:54 [ralph@home] Unrecoverable error for result BARCODE_30_1aiu__NATIVE_210_28_0 ( - exit code -1073741819 (0xc0000005))


By George, I think you've got it!
Running two ralph 4.85's this time, still 1/3 each with seti and albert.
both ralph units were running and swapped out simultaneously with no abend. I'll watch to make sure both finish OK.
ID: 225



©2024 University of Washington
http://www.bakerlab.org