Switching between projects with applications removed from memory

Aglarond

Joined: 16 Feb 06
Posts: 11
Credit: 1,094
RAC: 0
Message 514 - Posted: 23 Feb 2006, 2:30:18 UTC - in response to Message 513.  

I have now looked at the WU that was running when I tried to switch apps in BOINC (without leaving them in memory) and also when I put my laptop into standby. This is part of its output:

<stderr_txt>
...
No heartbeat from core client for 31 sec - exiting
...
</stderr_txt>

Do you think this could be the reason why Rosetta exits after my system wakes up from standby? It doesn't exit when I wake my laptop after just a few seconds. The behavior is similar with other BOINC projects.
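
One way to picture the heartbeat message in the stderr above: the science application expects a periodic heartbeat from the core client and exits if none arrives within some timeout. The sketch below is a hypothetical model, not BOINC's actual code. The class name `HeartbeatWatchdog` and the 30-second timeout are assumptions chosen to match the "31 sec" in the log, and it assumes the gap is measured on a clock that keeps advancing while the machine is in standby.

```python
class HeartbeatWatchdog:
    """Hypothetical model of an app that exits when heartbeats stop."""

    def __init__(self, timeout_sec=30.0):
        # Assumed timeout; the log only shows that 31 s was too long.
        self.timeout_sec = timeout_sec
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        # Record the time of the most recent heartbeat from the client.
        self.last_heartbeat = now

    def should_exit(self, now):
        # Exit once the silence exceeds the timeout.
        return (now - self.last_heartbeat) > self.timeout_sec


watchdog = HeartbeatWatchdog()
watchdog.heartbeat(now=100.0)
short_nap = watchdog.should_exit(now=105.0)     # woken after 5 s
long_standby = watchdog.should_exit(now=131.0)  # 31 s without a heartbeat
print(short_nap, long_standby)
```

Under these assumptions a wake-up within a few seconds stays below the timeout, while a longer standby looks to the application like a dead core client.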
ID: 514
tgm

Joined: 19 Feb 06
Posts: 5
Credit: 1,066
RAC: 0
Message 558 - Posted: 24 Feb 2006, 6:06:30 UTC

Removing Rosetta beta 4.87 work units from memory on one of my Windows machines is definitely FAILING with the end state "client error". This machine is a DUAL PROCESSOR P3 750 w/ 512MB RAM running Windows Server 2003.

I have three examples:

https://ralph.bakerlab.org/workunit.php?wuid=5559
https://ralph.bakerlab.org/workunit.php?wuid=5560
https://ralph.bakerlab.org/workunit.php?wuid=5561

I have now switched my configuration to keep wu's in memory and performed an update. We'll see what happens.

Curiously, I have another WU running on a Fedora box that is showing some other bizarre behavior, but I'll start a new post for that one.
ID: 558
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 649 - Posted: 25 Feb 2006, 19:21:13 UTC - in response to Message 558.  

Removing Rosetta beta 4.87 work units from memory on one of my Windows machines is definitely FAILING with the end state "client error". This machine is a DUAL PROCESSOR P3 750 w/ 512MB RAM running Windows Server 2003.

I have now switched my configuration to keep wu's in memory and performed an update. We'll see what happens.

Curiously, I have another WU running on a Fedora box that is showing some other bizarre behavior, but I'll start a new post for that one.


I think this is a case where a slower machine (P3/750) takes too long to complete the first model, so the WU gets pre-empted and removed from RAM / VM before even the first checkpoint is reached.

In that case you need to keep the WU in RAM while pre-empted and/or increase the time between app switches from the default 60min to a higher value, e.g. 4hr in your case.
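
The reasoning above can be sketched as a small check. This is only an illustration of the argument, not how BOINC schedules work: the function name `work_lost_on_preempt` and the 90-minute time to the first checkpoint are made-up assumptions; only the 60min default and the 4hr suggestion come from the post.

```python
def work_lost_on_preempt(minutes_to_first_checkpoint,
                         switch_interval_min,
                         keep_in_memory):
    """Return True if a pre-empted WU would restart from scratch.

    Assumes progress survives pre-emption only if the app stays in
    memory or a checkpoint was written before the switch.
    """
    if keep_in_memory:
        return False
    return minutes_to_first_checkpoint > switch_interval_min


# Hypothetical slow machine: 90 minutes to reach the first checkpoint.
default_switch = work_lost_on_preempt(90, 60, keep_in_memory=False)
longer_switch = work_lost_on_preempt(90, 240, keep_in_memory=False)
kept_in_ram = work_lost_on_preempt(90, 60, keep_in_memory=True)
print(default_switch, longer_switch, kept_in_ram)
```

With these numbers, only the default 60min switch combined with "remove from memory" loses the work; either suggested change avoids it.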
ID: 649
tgm

Joined: 19 Feb 06
Posts: 5
Credit: 1,066
RAC: 0
Message 691 - Posted: 27 Feb 2006, 3:42:37 UTC - in response to Message 649.  

I think this is a case where a slower machine (P3/750) takes too long to complete the first model, so the WU gets pre-empted and removed from RAM / VM before even the first checkpoint is reached.

In that case you need to keep the WU in RAM while pre-empted and/or increase the time between app switches from the default 60min to a higher value, e.g. 4hr in your case.


I sort of doubt this is the case. I know one of the wu's got up to more than 60% before it crashed.
ID: 691
Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 701 - Posted: 27 Feb 2006, 10:10:08 UTC - in response to Message 691.  

I sort of doubt this is the case. I know one of the wu's got up to more than 60% before it crashed.


Due to the way "new" Rosetta WUs work (a variable number of Models during a fixed time period, e.g. 8hr), you might want to focus more on the Model / Step statistic rather than % progress.

In that regard, the WU stderr outputs provided aren't very helpful for remote diagnostics. In my case, I got errors similar to yours (for R@h, not RALPH) on a machine which had multiple reboots over the previous 3 days due to power problems.
ID: 701
Aaron Finney

Joined: 16 Feb 06
Posts: 56
Credit: 1,457
RAC: 0
Message 875 - Posted: 14 Mar 2006, 16:21:02 UTC - in response to Message 4.  
Last modified: 14 Mar 2006, 16:21:17 UTC

Had a problem with this on a workunit that had run for 60 hours, application version 4.92:

3/13/2006 7:40:03 PM||Suspending computation and network activity - user request
3/13/2006 7:40:03 PM|climateprediction.net|Pausing result sulphur_id14_000856696_0 (removed from memory)
3/13/2006 7:40:03 PM|ralph@home|Pausing result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 (removed from memory)
3/13/2006 7:40:04 PM|ralph@home|Unrecoverable error for result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 ( - exit code -1073741819 (0xc0000005))
3/13/2006 7:40:04 PM||request_reschedule_cpus: process exited
3/13/2006 7:40:04 PM|ralph@home|Computation for result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 finished
3/13/2006 7:40:05 PM||request_reschedule_cpus: process exited
3/13/2006 7:40:07 PM||Resuming computation and network activity
3/13/2006 7:40:07 PM||request_reschedule_cpus: Resuming activities
3/13/2006 7:40:07 PM||Allowing work fetch again.
3/13/2006 7:40:07 PM||Resuming round-robin CPU scheduling.
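
The log prints each exit code twice, as a signed decimal and as hex. A minimal sketch of how the two relate, assuming the code is a 32-bit value: masking with `0xFFFFFFFF` gives the unsigned form. The function name `decode_exit_code` is made up for illustration; the two names in the table are standard Windows NTSTATUS constants, and any other code is simply reported as unknown.

```python
# Standard Windows NTSTATUS values; the table is deliberately tiny.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def decode_exit_code(code):
    """Return (hex string, name) for a signed 32-bit exit code."""
    unsigned = code & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x}", name


decoded = decode_exit_code(-1073741819)
print(decoded)
```

For the exit code in the log above this yields `0xc0000005`, which Windows defines as an access violation.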

ID: 875
dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 883 - Posted: 16 Mar 2006, 18:29:37 UTC

The current Windows application has a fix for this issue that we want to test. The last batch of work units has a default CPU run time of 8 hours. Please let us know whether the Windows app version 4.93 continues to crash when it is switched out for another app and not left in memory, or whether the fix helps.
ID: 883
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 887 - Posted: 16 Mar 2006, 23:07:20 UTC - in response to Message 38.  

I'm having a problem with this, but not the one you're trying to fix. BOINC simply does not have enough "venues" to set up custom situations to either test specific things or to tune resources for specific machines. And since it doesn't allow "local control", we have to balance carefully.


Duh, thanks genes for pointing out the fact that different venues, few as they are, can be used in this way, even with the same host. With one machine and multiple projects I wasn't going to change my memory settings for this test, but on seeing this I reconfigured to help out. It also alleviates a bit of the strain on my box's vmem since I'm running cpdn's seasonal attribution project and it's quite a hog.
ID: 887
Aglarond

Joined: 16 Feb 06
Posts: 11
Credit: 1,094
RAC: 0
Message 888 - Posted: 17 Mar 2006, 0:14:09 UTC - in response to Message 887.  

It also alleviates a bit of the strain on my box's vmem since I'm running cpdn's seasonal attribution project and it's quite a hog.


Be careful with cpdn's seasonal attribution project. This is from their forums:
If you have the option 'remove from memory' when preempting, and the boinc default of 1 hour between swapping, the chances are that you have thrown away the model each time you preempt. This project's defaults are 2 hours and 'keep in memory' for obvious reasons.

ID: 888
scottLobster

Joined: 17 Feb 06
Posts: 1
Credit: 826
RAC: 0
Message 889 - Posted: 17 Mar 2006, 0:36:17 UTC - in response to Message 883.  

The current Windows application has a fix for this issue that we want to test. The last batch of work units has a default CPU run time of 8 hours. Please let us know whether the Windows app version 4.93 continues to crash when it is switched out for another app and not left in memory, or whether the fix helps.


Just did a few switches between Rosetta and Ralph with leave in memory disabled. Seems to work fine. Rosetta didn't crash either. I'll leave it like this overnight and see what happens.

ID: 889
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 890 - Posted: 17 Mar 2006, 2:45:03 UTC - in response to Message 888.  
Last modified: 17 Mar 2006, 2:46:37 UTC

Be careful with cpdn's seasonal attribution project. This is from their forums:
If you have the option 'remove from memory' when preempting, and the boinc default of 1 hour between swapping, the chances are that you have thrown away the model each time you preempt. This project's defaults are 2 hours and 'keep in memory' for obvious reasons.


Thanks for the warning. I keep all my projects in memory and will continue to do so with everything except this project during this test. Just happy to have it pointed out that I can use venues to have one project get tossed from memory on suspend, and the rest left in.

OTOH I'm not sure it's working. I added prefs for "school" and changed my computer to that venue, then did an update and saw the new venue message. My Ralph wu had not yet run. However it's since run for 2 hrs and been suspended, but rosetta beta is still in memory.

p.s. I keep meaning to take out the sig but can't edit it out once posted, I'll go change my default.

ID: 890
Stargazer257

Joined: 16 Feb 06
Posts: 6
Credit: 17,492
RAC: 0
Message 892 - Posted: 17 Mar 2006, 6:26:26 UTC
Last modified: 17 Mar 2006, 6:29:10 UTC

So far, so good. Have run about 10 WUs on five different hosts (all WinXP SP2). No problems while changing settings to not stay resident in memory, and none so far with applications switching in and out. Knock on wood....


ID: 892
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 895 - Posted: 17 Mar 2006, 16:22:37 UTC - in response to Message 892.  
Last modified: 17 Mar 2006, 16:43:58 UTC

So Aglarond was right to warn me.
I added separate prefs for "school" and changed my computer's venue on this project only, then updated, hoping to have Ralph removed from memory when suspended while everything else stayed resident. Overnight all my projects were removed from memory, not just Ralph [even though it reported the venue correctly per project]. So apparently one can't fool around claiming one computer is in two places at once... Ralph has behaved fine so far, for the 6 hours it's run, but I have switched back to keeping everything in memory.
ID: 895
KB7RZF

Joined: 16 Feb 06
Posts: 7
Credit: 1,426
RAC: 0
Message 896 - Posted: 17 Mar 2006, 18:17:40 UTC

Did some playing around with just RALPH running. I changed prefs to take everything out of memory, then exited BOINC, restarted, suspended, rebooted, everything I could think of, and so far RALPH has not errored out on me. Seems to be working well so far.

Jeremy
ID: 896
doc :)

Joined: 16 Feb 06
Posts: 46
Credit: 4,437
RAC: 0
Message 900 - Posted: 18 Mar 2006, 1:52:19 UTC

no crash through removing from memory here so far either (changed my prefs for rosetta to 1h workunits and put my app switch time to 90 minutes to avoid removing rosettas from memory :))
i still get random crashes when i do have the graphics open though (the exit code -1073741811 (0xc000000d) thing)
ID: 900
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 909 - Posted: 19 Mar 2006, 0:57:20 UTC
Last modified: 19 Mar 2006, 0:59:19 UTC

FWIW Ralph had behaved fine both when swapped and not, but it didn't survive a PC restart forced by a Windows lockup. It was not the active project at the time, chkdsk found nothing scrambled, and none of the other projects lost their work (even cpdn seasonal!), but the Ralph WU, which was at hour 14 of 16, has restarted at zero. Bummer.
ID: 909
Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 914 - Posted: 19 Mar 2006, 7:12:48 UTC
Last modified: 19 Mar 2006, 7:14:25 UTC

Rosetta crashed BIG time!

3/19/2006 8:07:15 AM|rosetta@home|Pausing result HOMSti_homDB019_1tif__352_1732_1 (removed from memory)
3/19/2006 8:07:15 AM|SETI@home Beta Test|Restarting result 01jl01ab.16610.114.798576.3.175_4 using setiathome_enhanced version 506
3/19/2006 8:07:16 AM||Rescheduling CPU: project op

...

3/19/2006 8:07:24 AM|rosetta@home|Unrecoverable error for result HOMSti_homDB019_1tif__352_1732_1 ( - exit code -164 (0xffffff5c))
3/19/2006 8:07:24 AM||Rescheduling CPU: process exited
3/19/2006 8:07:24 AM|rosetta@home|Computation for result HOMSti_homDB019_1tif__352_1732_1 finished

This WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10786875
Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=13549302

I see, though, that this WU has also crashed for somebody else, so maybe it's a coincidence? Though I don't think it is.

The Ralph WU runs fine. I've tried to force it to run by suspending the others and then resuming them, so that the Ralph WU is preempted, and no crashes (so far).


ID: 914
Marky-UK

Joined: 16 Feb 06
Posts: 5
Credit: 1,530
RAC: 0
Message 917 - Posted: 19 Mar 2006, 12:02:40 UTC

Rosetta's just crashed for me too: https://ralph.bakerlab.org/workunit.php?wuid=17490

Unrecoverable error for result HB_BARCODE_30_1enh__352_83_0 ( - exit code -1073741811 (0xc000000d))
ID: 917
Fuzzy Hollynoodles
Joined: 19 Feb 06
Posts: 37
Credit: 2,089
RAC: 0
Message 918 - Posted: 19 Mar 2006, 16:31:09 UTC

It happened again.

3/19/2006 5:36:43 PM|rosetta@home|Pausing result HB_BARCODE_30_1bq9A_351_14302_0 (removed from memory)
3/19/2006 5:36:44 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1bq9A_351_14302_0 ( - exit code -164 (0xffffff5c))
3/19/2006 5:36:44 PM||Rescheduling CPU: process exited
3/19/2006 5:36:44 PM|rosetta@home|Computation for result HB_BARCODE_30_1bq9A_351_14302_0 finished
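
For anyone collecting examples like the one above, here is a rough sketch that scans client messages for results that failed after being removed from memory. It assumes the `timestamp|project|message` layout seen in the pasted logs; the function name `crashes_after_removal` and the regular expressions are illustrative guesses based on these few lines, not an official log format.

```python
import re

# Sample taken from the log lines quoted in this post.
LOG = """\
3/19/2006 5:36:43 PM|rosetta@home|Pausing result HB_BARCODE_30_1bq9A_351_14302_0 (removed from memory)
3/19/2006 5:36:44 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1bq9A_351_14302_0 ( - exit code -164 (0xffffff5c))
3/19/2006 5:36:44 PM||Rescheduling CPU: process exited
"""

PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) .*exit code (-?\d+)")


def crashes_after_removal(log_text):
    """List (project, result, exit code) for results that errored
    after a 'removed from memory' pause earlier in the same log."""
    removed = set()
    crashes = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        _timestamp, project, message = parts
        paused = PAUSED.search(message)
        if paused:
            removed.add(paused.group(1))
            continue
        failed = FAILED.search(message)
        if failed and failed.group(1) in removed:
            crashes.append((project, failed.group(1), int(failed.group(2))))
    return crashes


result = crashes_after_removal(LOG)
print(result)
```

The check only pairs a pause with a later error for the same result name; it does not prove the removal caused the crash.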

Rosetta WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11537937
Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=14251499

So this is it, I'm changing back to keeping WUs in memory while preempted until you get this bug fixed. Otherwise you devs should say that we can't crunch Rosetta and Ralph WUs on the same computer!



ID: 918
Contact
Joined: 16 Feb 06
Posts: 20
Credit: 137,458
RAC: 2
Message 930 - Posted: 20 Mar 2006, 1:14:21 UTC

Looks good. No matter what I do, I can't get Ralph to fail under Win98 or XP while switching with the app removed from memory.
ID: 930



©2024 University of Washington
http://www.bakerlab.org