Switching between projects with applications removed from memory

Aglarond Joined: 16 Feb 06 Posts: 11 Credit: 1,094 RAC: 0	Message 514 - Posted: 23 Feb 2006, 2:30:18 UTC - in response to Message 513.
I have now looked into the WU that was running both when I tried to switch apps in BOINC (without leaving them in memory) and when I put my laptop into standby. This is part of its output:
<stderr_txt>
...
No heartbeat from core client for 31 sec - exiting
...
</stderr_txt>
Do you think this could be the reason why Rosetta exits after my system wakes up from standby? It doesn't exit when I wake the laptop up after just a few seconds. The behavior is similar with other BOINC projects.
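
One way to read the stderr message above is as an application-side watchdog that gives up when the core client has been silent for too long. The sketch below is only an illustration of that idea, not the actual BOINC implementation: the class name, the method names and the 30-second threshold are assumptions (the log only shows that the exit happened at 31 seconds). It shows why a long standby could trigger the exit while a wake-up after a few seconds would not, provided the application checks the wall clock before the client's next heartbeat arrives.

```python
HEARTBEAT_TIMEOUT_SEC = 30.0  # assumed threshold; the log only reports "31 sec"


class HeartbeatWatchdog:
    """Hypothetical watchdog inside a science application (not the BOINC API)."""

    def __init__(self, now: float) -> None:
        # Wall-clock time (seconds) at which the last heartbeat was seen.
        self.last_heartbeat = now

    def on_heartbeat(self, now: float) -> None:
        # Called whenever the core client signals that it is still alive.
        self.last_heartbeat = now

    def should_exit(self, now: float) -> bool:
        # Exit once the client has been silent for longer than the timeout.
        return now - self.last_heartbeat > HEARTBEAT_TIMEOUT_SEC


watchdog = HeartbeatWatchdog(now=0.0)
print(watchdog.should_exit(now=5.0))    # laptop woken after 5 s of standby
print(watchdog.should_exit(now=600.0))  # laptop woken after 10 min of standby
```

Under these assumptions the first check stays below the threshold and the second exceeds it, which would match the behavior described in the post.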

tgm Joined: 19 Feb 06 Posts: 5 Credit: 1,066 RAC: 0	Message 558 - Posted: 24 Feb 2006, 6:06:30 UTC
Removing Rosetta beta 4.87 work units from memory on one of my Windows machines is definitely FAILING with an end state of client error. This machine is a DUAL PROCESSOR P3 750 with 512MB RAM running Windows Server 2003. I have three examples:
https://ralph.bakerlab.org/workunit.php?wuid=5559
https://ralph.bakerlab.org/workunit.php?wuid=5560
https://ralph.bakerlab.org/workunit.php?wuid=5561
I have now switched my configuration to keep WUs in memory and performed an update. We'll see what happens. Curiously, I have another WU running on a Fedora box that is showing some other bizarre behavior, but I'll start a new post for that one.

Dimitris Hatzopoulos Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0	Message 649 - Posted: 25 Feb 2006, 19:21:13 UTC - in response to Message 558.
"Removing Rosetta beta 4.87 work units from memory on one of my Windows machines is definitely FAILING with an end state of client error. This machine is a DUAL PROCESSOR P3 750 with 512MB RAM running Windows Server 2003. I have now switched my configuration to keep WUs in memory and performed an update. We'll see what happens. Curiously, I have another WU running on a Fedora box that is showing some other bizarre behavior, but I'll start a new post for that one."
I think this is the case where a slower machine (P3/750) takes too long to complete the first model, so the task gets pre-empted and removed from RAM / VM before even the first checkpoint is reached. In that case you need to keep it in RAM while pre-empted and/or increase the time between app switches from the default 60 min to a higher value, e.g. 4 hr in your case.
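
The reasoning in the post above can be sketched as a small model. Everything here is a simplifying assumption rather than BOINC's real scheduler: the function name is hypothetical, checkpoints are taken at fixed intervals of completed work, the task always runs for one full switch interval before being pre-empted, and the 90-minute checkpoint interval is an invented figure standing in for a slow first model on a P3/750.

```python
def cpu_minutes_to_finish(work_min, checkpoint_min, switch_min, keep_in_memory):
    """Return the CPU minutes needed to finish, or None if the task never gets there.

    work_min        total work in the task, in CPU minutes
    checkpoint_min  assumed interval between checkpoints, in minutes of work
    switch_min      "switch between applications" interval, in minutes
    keep_in_memory  True if pre-empted tasks stay in memory
    """
    done = 0   # minutes of work that survive pre-emption
    spent = 0  # CPU minutes consumed so far, including redone work
    while True:
        remaining = work_min - done
        if remaining <= switch_min:
            return spent + remaining  # finishes within this time slice
        spent += switch_min
        reached = done + switch_min
        if keep_in_memory:
            new_done = reached  # nothing is lost on pre-emption
        else:
            # Removed from memory: roll back to the last checkpoint.
            new_done = (reached // checkpoint_min) * checkpoint_min
        if new_done <= done:
            return None  # every slice is thrown away; no progress is possible
        done = new_done


# 8 hr of work, hypothetical 90 min between checkpoints
print(cpu_minutes_to_finish(480, 90, 60, keep_in_memory=False))   # default 60 min switch
print(cpu_minutes_to_finish(480, 90, 240, keep_in_memory=False))  # 4 hr switch
print(cpu_minutes_to_finish(480, 90, 60, keep_in_memory=True))    # keep in memory
```

In this model the default 60-minute switch never reaches a checkpoint, while either remedy suggested in the post (a longer switch interval or keeping the task in memory) lets the task complete.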

tgm Joined: 19 Feb 06 Posts: 5 Credit: 1,066 RAC: 0	Message 691 - Posted: 27 Feb 2006, 3:42:37 UTC - in response to Message 649.
"I think this is the case where a slower machine (P3/750) takes too long to complete the first model, so the task gets pre-empted and removed from RAM / VM before even the first checkpoint is reached. In that case you need to keep it in RAM while pre-empted and/or increase the time between app switches from the default 60 min to a higher value, e.g. 4 hr in your case."
I sort of doubt this is the case. I know one of the WUs got to more than 60% before it crashed.

Dimitris Hatzopoulos Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0	Message 701 - Posted: 27 Feb 2006, 10:10:08 UTC - in response to Message 691.
"I sort of doubt this is the case. I know one of the WUs got to more than 60% before it crashed."
Due to the way "new" Rosetta WUs work (a variable number of models during a fixed time period, e.g. 8 hr), you might want to focus more on the Model / Step statistic than on % progress. In that regard, the WU stderr outputs provided aren't very helpful for remote diagnostics. In my case, I got errors similar to yours (for R@h, not RALPH) on a machine which had multiple reboots over the previous 3 days due to power problems.

Aaron Finney Joined: 16 Feb 06 Posts: 56 Credit: 1,457 RAC: 0	Message 875 - Posted: 14 Mar 2006, 16:21:02 UTC - in response to Message 4. Last modified: 14 Mar 2006, 16:21:17 UTC
Had a problem with this on a workunit that had run for 60 hours, application version 4.92:
3/13/2006 7:40:03 PM||Suspending computation and network activity - user request
3/13/2006 7:40:03 PM|climateprediction.net|Pausing result sulphur_id14_000856696_0 (removed from memory)
3/13/2006 7:40:03 PM|ralph@home|Pausing result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 (removed from memory)
3/13/2006 7:40:04 PM|ralph@home|Unrecoverable error for result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 ( - exit code -1073741819 (0xc0000005))
3/13/2006 7:40:04 PM||request_reschedule_cpus: process exited
3/13/2006 7:40:04 PM|ralph@home|Computation for result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 finished
3/13/2006 7:40:05 PM||request_reschedule_cpus: process exited
3/13/2006 7:40:07 PM||Resuming computation and network activity
3/13/2006 7:40:07 PM||request_reschedule_cpus: Resuming activities
3/13/2006 7:40:07 PM||Allowing work fetch again.
3/13/2006 7:40:07 PM||Resuming round-robin CPU scheduling.
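
The two numbers in the "exit code" part of the log are the same 32-bit value, printed once as a signed integer and once in hexadecimal. The short sketch below shows the conversion; the helper name and the lookup table are illustrative additions, and the table lists only the two standard Windows NTSTATUS values that appear in this thread. The -164 code reported for Rosetta is not an NTSTATUS value, and its meaning is not given in the thread, so it is left unnamed.

```python
# Standard Windows NTSTATUS values seen in this thread.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def describe_exit_code(signed_code: int) -> str:
    """Show a signed 32-bit exit code next to its unsigned hexadecimal form."""
    unsigned = signed_code & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"{signed_code} = 0x{unsigned:08x} ({name})"


for code in (-1073741819, -1073741811, -164):
    print(describe_exit_code(code))
```

For the log above this gives 0xc0000005, the Windows access violation status, meaning the application touched memory it was not allowed to access at the moment it was being removed from memory.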

dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0	Message 883 - Posted: 16 Mar 2006, 18:29:37 UTC
The current Windows application has a fix for this issue that we want to test. The last batch of work units has default CPU run times of 8 hours. Please let us know whether Windows app version 4.93 continues to crash when switching to another app without being left in memory, or whether the fix helps.

[B^S] sTrey Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0	Message 887 - Posted: 16 Mar 2006, 23:07:20 UTC - in response to Message 38.
"I'm having a problem with this, but not the one you're trying to fix. BOINC simply does not have enough 'venues' to set up custom situations to either test specific things or to tune resources for specific machines. And since it doesn't allow 'local control', we have to balance carefully."
Duh, thanks genes for pointing out the fact that different venues, few as they are, can be used in this way, even with the same host. With one machine and multiple projects I wasn't going to change my memory settings for this test, but on seeing this I reconfigured to help out. It also alleviates a bit of the strain on my box's vmem since I'm running cpdn's seasonal attribution project and it's quite a hog.

Aglarond Joined: 16 Feb 06 Posts: 11 Credit: 1,094 RAC: 0	Message 888 - Posted: 17 Mar 2006, 0:14:09 UTC - in response to Message 887.
"It also alleviates a bit of the strain on my box's vmem since I'm running cpdn's seasonal attribution project and it's quite a hog."
Be careful with cpdn's seasonal attribution project. This is from their forums:
"If you have the option 'remove from memory' when preempting, and the boinc default of 1 hour between swapping, the chances are that you have thrown away the model each time you preempt. This project's defaults are 2 hours and 'keep in memory' for obvious reasons."

scottLobster Joined: 17 Feb 06 Posts: 1 Credit: 826 RAC: 0	Message 889 - Posted: 17 Mar 2006, 0:36:17 UTC - in response to Message 883.
"The current Windows application has a fix for this issue that we want to test. The last batch of work units has default CPU run times of 8 hours. Please let us know whether Windows app version 4.93 continues to crash when switching to another app without being left in memory, or whether the fix helps."
Just did a few switches between Rosetta and Ralph with leave in memory disabled. Seems to work fine. Rosetta didn't crash either. I'll leave it like this overnight and see what happens.

[B^S] sTrey Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0	Message 890 - Posted: 17 Mar 2006, 2:45:03 UTC - in response to Message 888. Last modified: 17 Mar 2006, 2:46:37 UTC
"Be careful with cpdn's seasonal attribution project. This is from their forums: If you have the option 'remove from memory' when preempting, and the boinc default of 1 hour between swapping, the chances are that you have thrown away the model each time you preempt. This project's defaults are 2 hours and 'keep in memory' for obvious reasons."
Thanks for the warning. I keep all my projects in memory and will continue to do so with everything except this project during this test. Just happy to have it pointed out that I can use venues to have one project get tossed from memory on suspend, and the rest left in.
OTOH I'm not sure it's working. I added prefs for "school" and changed my computer to that venue, then did an update and saw the new venue message. My Ralph WU had not yet run. However, it has since run for 2 hrs and been suspended, but Rosetta beta is still in memory.
P.S. I keep meaning to take out the sig but can't edit it out once posted; I'll go change my default.

Stargazer257 Joined: 16 Feb 06 Posts: 6 Credit: 17,492 RAC: 0	Message 892 - Posted: 17 Mar 2006, 6:26:26 UTC Last modified: 17 Mar 2006, 6:29:10 UTC
So far, so good. Have run about 10 WUs on five different hosts (all WinXP SP2). No problems while changing settings to not stay resident in memory, and none so far with applications switching in and out. Knock on wood....

[B^S] sTrey Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0	Message 895 - Posted: 17 Mar 2006, 16:22:37 UTC - in response to Message 892. Last modified: 17 Mar 2006, 16:43:58 UTC
So Aglarond was right to warn me. I added separate prefs for "school", changed my computer's venue on this project only, and updated, hoping to have Ralph removed from memory when suspended but everything else stay resident. Overnight all my projects were removed from memory, not just Ralph. [Even though it reported the venue correctly per project.] So apparently one can't fool around claiming one computer is in two places at once...
Ralph has behaved fine so far for the 6 hours it has run, but I have switched back to keeping everything in memory.

KB7RZF Joined: 16 Feb 06 Posts: 7 Credit: 1,426 RAC: 0	Message 896 - Posted: 17 Mar 2006, 18:17:40 UTC
Did some playing around with just RALPH running. I changed prefs to take everything out of memory, then exited BOINC, restarted, suspended, rebooted, everything I could think of, and so far RALPH has not errored out on me. Seems to be working well so far.
Jeremy

doc :) Joined: 16 Feb 06 Posts: 46 Credit: 4,437 RAC: 0	Message 900 - Posted: 18 Mar 2006, 1:52:19 UTC
No crash from removing from memory here so far either (I changed my prefs for Rosetta to 1h workunits and set my app switch time to 90 minutes to avoid removing Rosetta WUs from memory :)).
I still get random crashes when I have the graphics open, though (the exit code -1073741811 (0xc000000d) thing).

[B^S] sTrey Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0	Message 909 - Posted: 19 Mar 2006, 0:57:20 UTC Last modified: 19 Mar 2006, 0:59:19 UTC
FWIW Ralph had behaved fine both when swapped and not, but it didn't survive a PC restart forced by a Windows lockup. It was not the active project at the time, chkdsk found nothing scrambled, and none of the other projects lost their work (even cpdn seasonal!) -- but the Ralph WU, which was at hour 14 of 16, has restarted at zero. Bummer.

Fuzzy Hollynoodles Joined: 19 Feb 06 Posts: 37 Credit: 2,089 RAC: 0	Message 914 - Posted: 19 Mar 2006, 7:12:48 UTC Last modified: 19 Mar 2006, 7:14:25 UTC
Rosetta crashed BIG time!
3/19/2006 8:07:15 AM|rosetta@home|Pausing result HOMSti_homDB019_1tif__352_1732_1 (removed from memory)
3/19/2006 8:07:15 AM|SETI@home Beta Test|Restarting result 01jl01ab.16610.114.798576.3.175_4 using setiathome_enhanced version 506
3/19/2006 8:07:16 AM||Rescheduling CPU: project op ...
3/19/2006 8:07:24 AM|rosetta@home|Unrecoverable error for result HOMSti_homDB019_1tif__352_1732_1 ( - exit code -164 (0xffffff5c))
3/19/2006 8:07:24 AM||Rescheduling CPU: process exited
3/19/2006 8:07:24 AM|rosetta@home|Computation for result HOMSti_homDB019_1tif__352_1732_1 finished
This WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10786875
Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=13549302
I see, though, that this WU has also crashed for somebody else, so maybe it's a coincidence? Although I don't think it is.
The Ralph WU runs fine. I've tried forcing it to run by suspending the others and then resuming them so that the Ralph WU is preempted, and no crashes (so far).
"I'm trying to maintain a shred of dignity in this world." - Me

Marky-UK Joined: 16 Feb 06 Posts: 5 Credit: 1,530 RAC: 0	Message 917 - Posted: 19 Mar 2006, 12:02:40 UTC
Rosetta's just crashed for me too:
https://ralph.bakerlab.org/workunit.php?wuid=17490
Unrecoverable error for result HB_BARCODE_30_1enh__352_83_0 ( - exit code -1073741811 (0xc000000d))

Fuzzy Hollynoodles Joined: 19 Feb 06 Posts: 37 Credit: 2,089 RAC: 0	Message 918 - Posted: 19 Mar 2006, 16:31:09 UTC
It happened again.
3/19/2006 5:36:43 PM|rosetta@home|Pausing result HB_BARCODE_30_1bq9A_351_14302_0 (removed from memory)
3/19/2006 5:36:44 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1bq9A_351_14302_0 ( - exit code -164 (0xffffff5c))
3/19/2006 5:36:44 PM||Rescheduling CPU: process exited
3/19/2006 5:36:44 PM|rosetta@home|Computation for result HB_BARCODE_30_1bq9A_351_14302_0 finished
Rosetta WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11537937
Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=14251499
So this is it, I'm changing back to keeping WUs in memory while preempted until you get this bug fixed. Otherwise you devs should say that we can't crunch Rosetta and Ralph WUs on the same computer!
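
The pattern reported in this thread (a result is paused with "removed from memory" and then fails with an unrecoverable error) can be picked out of a message log mechanically. The sketch below assumes the three-field layout visible in the pasted logs, timestamp, project and message separated by pipes; the function name and regular expressions are illustrative and are not part of BOINC. It also accepts pipes that were pasted with a backslash in front of them.

```python
import re
from typing import List, Tuple

PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) \( - exit code (-?\d+)")


def crashes_after_removal(log_text: str) -> List[Tuple[str, str, int]]:
    """Return (project, result, exit_code) for results that failed after removal."""
    removed = set()  # (project, result) pairs that were removed from memory
    crashes = []
    for line in log_text.splitlines():
        # Accept both "|" and the escaped "\|" form seen in some pasted logs.
        parts = line.replace("\\|", "|").split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        _timestamp, project, message = parts
        match = PAUSED.search(message)
        if match:
            removed.add((project, match.group(1)))
            continue
        match = FAILED.search(message)
        if match and (project, match.group(1)) in removed:
            crashes.append((project, match.group(1), int(match.group(2))))
    return crashes


SAMPLE = """\
3/19/2006 5:36:43 PM|rosetta@home|Pausing result HB_BARCODE_30_1bq9A_351_14302_0 (removed from memory)
3/19/2006 5:36:44 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1bq9A_351_14302_0 ( - exit code -164 (0xffffff5c))
3/19/2006 5:36:44 PM||Rescheduling CPU: process exited
"""
print(crashes_after_removal(SAMPLE))
```

Run over a full message log, a helper like this would list each project and result that failed after being swapped out, which is the information the developers were asking testers to report.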

Contact Joined: 16 Feb 06 Posts: 20 Credit: 137,458 RAC: 2	Message 930 - Posted: 20 Mar 2006, 1:14:21 UTC
Looks good. No matter what I do, I can't get Ralph to fail under Win98 or XP while switching with the app removed from memory.