Message boards : Current tests : Switching between projects with applications removed from memory
Author | Message |
---|---|
Aglarond Send message Joined: 16 Feb 06 Posts: 11 Credit: 1,094 RAC: 0 |
I have now looked at the WU that was running when I tried to switch apps in BOINC (without leaving them in memory), and also when I put my laptop into standby. This is part of its output: <stderr_txt> ... No heartbeat from core client for 31 sec - exiting ... </stderr_txt> Do you think this could be the reason why Rosetta exits after my system wakes up from standby? It doesn't exit when I wake my laptop up after just a few seconds. The behavior is similar with other BOINC projects. |
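To illustrate why a long standby could produce the "No heartbeat" message while a short one does not, here is a minimal sketch of a heartbeat watchdog. It is not BOINC's actual code: the function name, the constant, and the 30-second threshold are assumptions (the threshold is only inferred from the "31 sec" in the stderr above), and it assumes the gap is measured in wall-clock time.

```python
# Hypothetical heartbeat watchdog, NOT taken from the BOINC source.
# Assumption: the science app exits once the wall-clock gap since the
# last heartbeat from the core client exceeds a fixed timeout.
HEARTBEAT_TIMEOUT_SEC = 30.0  # assumed value, inferred from the "31 sec" message


def should_exit(last_heartbeat: float, now: float,
                timeout: float = HEARTBEAT_TIMEOUT_SEC) -> bool:
    """Return True when the time since the last heartbeat exceeds the timeout."""
    return (now - last_heartbeat) > timeout


# Laptop woken after a few seconds: the gap stays under the timeout.
short_standby = should_exit(last_heartbeat=100.0, now=105.0)

# Laptop woken after a longer standby: the wall clock has jumped past the timeout.
long_standby = should_exit(last_heartbeat=100.0, now=131.0)

print(short_standby, long_standby)
```

Under these assumptions the app keeps running after a few seconds of standby but exits after a longer one, which matches the behavior described in the post.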
tgm Send message Joined: 19 Feb 06 Posts: 5 Credit: 1,066 RAC: 0 |
Removing rosetta beta 4.87 work units from memory on one of my windows machines is definitely FAILING with end state client error. This machine is a DUAL PROCESSOR P3 750 w/ 512MB ram running on Windows Server 2003. I have three examples: https://ralph.bakerlab.org/workunit.php?wuid=5559 https://ralph.bakerlab.org/workunit.php?wuid=5560 https://ralph.bakerlab.org/workunit.php?wuid=5561 I have now switched my configuration to keep wu's in memory and performed an update. We'll see what happens. Curiously, I have another wu running on a Fedora box that is showing some other bizarre behavior, but I'll start a new post for that one. |
Dimitris Hatzopoulos Send message Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
Removing rosetta beta 4.87 work units from memory on one of my windows machines is definitely FAILING with end state client error. This machine is a DUAL PROCESSOR P3 750 w/ 512MB ram running on Windows Server 2003. I think this is the case when a slower machine (P3/750) takes too long to complete the first model and it gets pre-empted and removed from RAM / VM before even the first checkpoint is reached. In that case you need to keep it in RAM while pre-empted and/or increase the time between app switches from the default 60 min to a higher value, e.g. 4 hr in your case. |
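The hypothesis above can be put as a small rule of thumb. The sketch below is only a model of that reasoning, not of how BOINC or Rosetta actually schedule work; the function name and the 90-minute figure are made up for illustration.

```python
def loses_progress_on_preempt(minutes_to_first_checkpoint: float,
                              switch_interval_minutes: float,
                              keep_in_memory: bool) -> bool:
    """Model of the hypothesis: work done so far is lost if the app is
    removed from memory before it has written its first checkpoint."""
    if keep_in_memory:
        # A suspended app that stays in RAM resumes where it stopped.
        return False
    return minutes_to_first_checkpoint > switch_interval_minutes


# Hypothetical slow machine that needs 90 minutes to reach the first checkpoint.
default_switch = loses_progress_on_preempt(90, 60, keep_in_memory=False)
longer_switch = loses_progress_on_preempt(90, 240, keep_in_memory=False)
kept_in_ram = loses_progress_on_preempt(90, 60, keep_in_memory=True)

print(default_switch, longer_switch, kept_in_ram)
```

With these made-up numbers, the default 60-minute switch interval loses the work, while either a 4-hour interval or keeping the app in memory avoids it.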
tgm Send message Joined: 19 Feb 06 Posts: 5 Credit: 1,066 RAC: 0 |
I think this is the case when a slower machine (P3/750) takes too long to complete the first model and it gets pre-empted and removed from RAM / VM before even the first checkpoint is reached. I sort of doubt this is the case. I know one of the wu's got up to more than 60% before it crashed. |
Dimitris Hatzopoulos Send message Joined: 16 Feb 06 Posts: 31 Credit: 2,308 RAC: 0 |
I sort of doubt this is the case. I know one of the wu's got up to more than 60% before it crashed. Due to the way "new" Rosetta WUs work (a variable number of Models during a fixed time period, e.g. 8 hr), you might want to focus more on the Model / Step statistic than on % progress. In that regard, the WU stderr output provided isn't very helpful for remote diagnostics. In my case, I got errors similar to yours (for R@h, not RALPH) on a machine which had multiple reboots over the previous 3 days, due to power problems. |
Aaron Finney Send message Joined: 16 Feb 06 Posts: 56 Credit: 1,457 RAC: 0 |
Had a problem with this on a workunit that had run for 60 hours, application version 4.92 3/13/2006 7:40:03 PM||Suspending computation and network activity - user request 3/13/2006 7:40:03 PM|climateprediction.net|Pausing result sulphur_id14_000856696_0 (removed from memory) 3/13/2006 7:40:03 PM|ralph@home|Pausing result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 (removed from memory) 3/13/2006 7:40:04 PM|ralph@home|Unrecoverable error for result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 ( - exit code -1073741819 (0xc0000005)) 3/13/2006 7:40:04 PM||request_reschedule_cpus: process exited 3/13/2006 7:40:04 PM|ralph@home|Computation for result TEST_HOMOLOG_ABINITIO_hom008_1fna__220_3_2 finished 3/13/2006 7:40:05 PM||request_reschedule_cpus: process exited 3/13/2006 7:40:07 PM||Resuming computation and network activity 3/13/2006 7:40:07 PM||request_reschedule_cpus: Resuming activities 3/13/2006 7:40:07 PM||Allowing work fetch again. 3/13/2006 7:40:07 PM||Resuming round-robin CPU scheduling. |
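The log prints each exit code twice: as a signed decimal and as the same 32-bit value in hex. The short sketch below shows the conversion for the codes reported in this thread. The helper and table names are hypothetical. The two names in the table are the standard Windows NTSTATUS names for those values; -164 is left unnamed because the thread does not say what it means.

```python
# Standard Windows NTSTATUS names for two of the values seen in this thread.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def describe(code: int) -> str:
    """Return the hex form plus a name when the value is in KNOWN_STATUS."""
    name = KNOWN_STATUS.get(code & 0xFFFFFFFF, "unknown")
    return f"{code} = {exit_code_to_hex(code)} ({name})"


for code in (-1073741819, -1073741811, -164):
    print(describe(code))
```

An access violation on being removed from memory would suggest the crash happens inside the application while it shuts down, though the log alone does not prove that.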
dekim Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
The current windows application has a fix that we want to test for this issue. The last batch of work units have default cpu run times of 8 hours. Please let us know if the windows app version 4.93 continues to crash when switching to another app and not left in memory or if the fix helps. |
[B^S] sTrey Send message Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
I'm having a problem with this, but not the one you're trying to fix. BOINC simply does not have enough "venues" to set up custom situations to either test specific things or to tune resources for specific machines. And since it doesn't allow "local control", we have to balance carefully. Duh, thanks genes for pointing out the fact that different venues, few as they are, can be used in this way, even with the same host. With one machine and multiple projects I wasn't going to change my memory settings for this test, but on seeing this I reconfigured to help out. It also alleviates a bit of the strain on my box's vmem since I'm running cpdn's seasonal attribution project and it's quite a hog. |
Aglarond Send message Joined: 16 Feb 06 Posts: 11 Credit: 1,094 RAC: 0 |
It also alleviates a bit of the strain on my box's vmem since I'm running cpdn's seasonal attribution project and it's quite a hog. Be careful with cpdn's seasonal attribution project. This is from their forums: If you have the option 'remove from memory' when preempting, and the boinc default of 1 hour between swapping, the chances are that you have thrown away the model each time you preempt. This project's defaults are 2 hours and 'keep in memory' for obvious reasons. |
scottLobster Send message Joined: 17 Feb 06 Posts: 1 Credit: 826 RAC: 0 |
The current windows application has a fix that we want to test for this issue. The last batch of work units have default cpu run times of 8 hours. Please let us know if the windows app version 4.93 continues to crash when switching to another app and not left in memory or if the fix helps. Just did a few switches between Rosetta and Ralph with leave in memory disabled. Seems to work fine. Rosetta didn't crash either. I'll leave it like this overnight and see what happens. |
[B^S] sTrey Send message Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
Be careful with cpdn's seasonal attribution project. This is from their forums: Thanks for the warning. I keep all my projects in memory and will continue to do so with everything except this project during this test. Just happy to have it pointed out that I can use venues to have one project get tossed from memory on suspend, and the rest left in. OTOH I'm not sure it's working. I added prefs for "school" and changed my computer to that venue, then did an update and saw the new venue message. My Ralph wu had not yet run. However it's since run for 2 hrs and been suspended, but rosetta beta is still in memory. p.s. I keep meaning to take out the sig but can't edit it out once posted, I'll go change my default. |
Stargazer257 Send message Joined: 16 Feb 06 Posts: 6 Credit: 17,492 RAC: 0 |
|
[B^S] sTrey Send message Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
So Aglarond was right to warn me. I added separate prefs for "school", changed my computer's venue on this project only, and updated, hoping to have Ralph removed from memory when suspended while everything else stayed resident. Overnight all my projects were removed from memory, not just Ralph. [Even though it reported the venue correctly per project.] So apparently one can't fool around claiming one computer is in two places at once... Ralph has behaved fine so far, for the 6 hours it's run, but I have switched back to keeping everything in memory. |
KB7RZF Send message Joined: 16 Feb 06 Posts: 7 Credit: 1,426 RAC: 0 |
Did some playing around with just RALPH running. I changed pref's to take everything out of memory, I exited BOINC, restarted, suspended, rebooted, everything I could think of, and so far RALPH has not errored out on me. Seems to be working good so far. Jeremy |
doc :) Send message Joined: 16 Feb 06 Posts: 46 Credit: 4,437 RAC: 0 |
no crash through removing from memory here so far either (changed my prefs for rosetta to 1h workunits and put my app switch time to 90 minutes to avoid removing rosettas from memory :)) i still get random crashes when i do have the graphics open though (the exit code -1073741811 (0xc000000d) thing) |
[B^S] sTrey Send message Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
FWIW Ralph had behaved fine both when swapped and not, but it didn't survive a pc restart forced by a windows lockup. It was not the active project at the time, chkdsk found nothing scrambled, and none of the other projects lost their work (even cpdn seasonal!) -- but the Ralph wu which was at hour 14 of 16, has restarted at zero. Bummer. |
Fuzzy Hollynoodles Send message Joined: 19 Feb 06 Posts: 37 Credit: 2,089 RAC: 0 |
Rosetta crashed BIG time! 3/19/2006 8:07:15 AM|rosetta@home|Pausing result HOMSti_homDB019_1tif__352_1732_1 (removed from memory) 3/19/2006 8:07:15 AM|SETI@home Beta Test|Restarting result 01jl01ab.16610.114.798576.3.175_4 using setiathome_enhanced version 506 3/19/2006 8:07:16 AM||Rescheduling CPU: project op ... 3/19/2006 8:07:24 AM|rosetta@home|Unrecoverable error for result HOMSti_homDB019_1tif__352_1732_1 ( - exit code -164 (0xffffff5c)) 3/19/2006 8:07:24 AM||Rescheduling CPU: process exited 3/19/2006 8:07:24 AM|rosetta@home|Computation for result HOMSti_homDB019_1tif__352_1732_1 finished This WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10786875 Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=13549302 I see though that this WU has crashed for somebody else, so maybe a coincidence? Though I don't think it is. The Ralph WU runs fine. I've tried to force it to run by suspending the others and then resuming them, so that the Ralph WU is preempted, and no crashes (so far). [color=navy][b]"I'm trying to maintain a shred of dignity in this world." - Me[/b][/color] |
Marky-UK Send message Joined: 16 Feb 06 Posts: 5 Credit: 1,530 RAC: 0 |
Rosetta's just crashed for me too: https://ralph.bakerlab.org/workunit.php?wuid=17490 Unrecoverable error for result HB_BARCODE_30_1enh__352_83_0 ( - exit code -1073741811 (0xc000000d)) |
Fuzzy Hollynoodles Send message Joined: 19 Feb 06 Posts: 37 Credit: 2,089 RAC: 0 |
It happened again. 3/19/2006 5:36:43 PM|rosetta@home|Pausing result HB_BARCODE_30_1bq9A_351_14302_0 (removed from memory) 3/19/2006 5:36:44 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1bq9A_351_14302_0 ( - exit code -164 (0xffffff5c)) 3/19/2006 5:36:44 PM||Rescheduling CPU: process exited 3/19/2006 5:36:44 PM|rosetta@home|Computation for result HB_BARCODE_30_1bq9A_351_14302_0 finished Rosetta WU: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11537937 Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=14251499 So this is it, I'm changing back to keeping WU's in memory while preempted until you get this bug fixed. Otherwise you devs should say that we can't crunch Rosetta and Ralph WU's on the same computer! [color=navy][b]"I'm trying to maintain a shred of dignity in this world." - Me[/b][/color] |
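The logs posted in this thread share a pattern: a "Pausing result ... (removed from memory)" line followed shortly by an "Unrecoverable error" line for the same result. A rough way to pick such pairs out of a saved message log is sketched below. It assumes the `timestamp|project|message` layout seen in the posts above; the function name and regular expressions are illustrative and not part of any BOINC tool.

```python
import re

# Patterns based on the message wording seen in the logs posted in this thread.
PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) \( - exit code (-?\d+)")


def crashed_after_removal(log_lines):
    """Return (project, result, exit_code) for results that reported an
    unrecoverable error after being paused and removed from memory."""
    paused = set()
    hits = []
    for line in log_lines:
        parts = line.split("|", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue
        _, project, message = parts
        match = PAUSED.search(message)
        if match:
            paused.add((project, match.group(1)))
            continue
        match = FAILED.search(message)
        if match and (project, match.group(1)) in paused:
            hits.append((project, match.group(1), int(match.group(2))))
    return hits


SAMPLE = [
    "3/19/2006 5:36:43 PM|rosetta@home|Pausing result HB_BARCODE_30_1bq9A_351_14302_0 (removed from memory)",
    "3/19/2006 5:36:44 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1bq9A_351_14302_0 ( - exit code -164 (0xffffff5c))",
    "3/19/2006 5:36:44 PM||Rescheduling CPU: process exited",
]

hits = crashed_after_removal(SAMPLE)
print(hits)
```

Run over a full message log, this would list every result that failed after being swapped out, which makes it easier to report affected work units to the developers.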
Contact Send message Joined: 16 Feb 06 Posts: 20 Credit: 137,458 RAC: 2 |
|
©2024 University of Washington
http://www.bakerlab.org