Rosetta mini beta and/or android 3.61-3.83

Author	Message
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0	Message 6042 - Posted: 5 Feb 2016, 15:58:53 UTC I'm getting quick client/computer errors for the backrub_design tasks. From the stderr out:
minirosetta_3.71_x86_64-apple-darwin(50310,0x7fff732a2300) malloc: * error for object 0x4b4fc3ef02e87d9a: pointer being freed was not allocated
* set a breakpoint in malloc_error_break to debug
Also, gaurav_rsmn_0161_65_daa2_2_SAVE_ALL_OUT_20296_50_0 is claiming a file transfer error:
# cpu_run_time_pref: 14400
reached end of minirosetta::main()
======================================================
DONE :: 2 starting structures 13443.3 cpu seconds
This process generated 13 decoys from 13 attempts
======================================================
BOINC :: WS_max 2.65622e+08
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>gaurav_rsmn_0161_65_daa2_2_SAVE_ALL_OUT_20296_50_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
Are those results truly lost? ID: 6042

dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0	Message 6043 - Posted: 5 Feb 2016, 19:19:53 UTC - in response to Message 6042.
I'm getting quick client/computer errors for the backrub_design tasks. From the stderr out:
minirosetta_3.71_x86_64-apple-darwin(50310,0x7fff732a2300) malloc: * error for object 0x4b4fc3ef02e87d9a: pointer being freed was not allocated
* set a breakpoint in malloc_error_break to debug
Also gaurav_rsmn_0161_65_daa2_2_SAVE_ALL_OUT_20296_50_0 is claiming a file transfer error:
# cpu_run_time_pref: 14400
reached end of minirosetta::main()
======================================================
DONE :: 2 starting structures 13443.3 cpu seconds
This process generated 13 decoys from 13 attempts
======================================================
BOINC :: WS_max 2.65622e+08
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
upload failure: gaurav_rsmn_0161_65_daa2_2_SAVE_ALL_OUT_20296_50_0_0 -161 (not found)
Are those results truly lost?
I'm not sure what is causing the backrub error, but the gaurav jobs have a filter that may sometimes remove all models, so the result is as expected for that test. I think the filter has been updated so that at least 1 model is generated in the next test batch, but I'm not sure. Vikram, the one submitting those jobs, is testing this. ID: 6043

[VENETO] boboviz Joined: 9 Apr 08 Posts: 929 Credit: 1,892,541 RAC: 294	Message 6049 - Posted: 11 Feb 2016, 8:02:32 UTC The first kind of Android WUs ("simple_cycpep_predict_") seems to be OK on my smartphone. Now I'm downloading a new type: "db_design5_". ID: 6049

Trotador Joined: 7 May 10 Posts: 33 Credit: 14,751,677 RAC: 0	Message 6051 - Posted: 13 Feb 2016, 19:22:39 UTC - in response to Message 6037.
The current Ralph WUs use huge amounts of RAM, I've seen up to 4 GB per unit, is it on purpose? Any new kind of simulation? Thanks for the info.
Yes, I'm running a test of a new type of job that runs small perturbations of the protein backbone and then does a round of design. The design protocol can use a lot of memory. I realize that this will be problematic and will see if we can distribute these jobs to high memory machines. We may just not be able to run these on R@h.
I've crunched a lot of these backrub units; they are tough due to the large memory requirements. It is necessary to limit the number of units being crunched simultaneously, and they need a lot of babysitting, but it is also fun :). Most of them don't usually go over 4 GB, but I got half a dozen reaching almost 7 GB on the same host. It has 32 GB but also 72 threads :), so in short it stalled for lack of memory. I finally had to abort them, and a few more because they were nearly over the deadline. ID: 6051
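As a rough illustration of the memory limit described in the post above, the arithmetic can be sketched as follows. This is only a sketch under stated assumptions: the function name `max_concurrent_tasks` and the optional `reserve_gb` parameter are hypothetical and are not BOINC settings; the 32 GB, 72-thread, 4 GB, and 7 GB figures are the ones quoted in the post.

```python
def max_concurrent_tasks(total_ram_gb, peak_task_gb, threads, reserve_gb=0.0):
    """Return how many tasks fit in RAM at once, capped by the thread count.

    Hypothetical helper for illustration only; it is not part of BOINC.
    """
    if peak_task_gb <= 0:
        raise ValueError("peak_task_gb must be positive")
    # RAM left for tasks after setting aside an optional reserve for the OS.
    usable_gb = max(total_ram_gb - reserve_gb, 0.0)
    # Whole tasks that fit in memory, never more than one per thread.
    return min(threads, int(usable_gb // peak_task_gb))


# Figures from the post: 32 GB host, 72 threads.
print(max_concurrent_tasks(32, 7, 72))  # tasks peaking near 7 GB
print(max_concurrent_tasks(32, 4, 72))  # tasks peaking near 4 GB
```

On those numbers only a handful of tasks fit in memory at the same time, far fewer than the 72 available threads, which is consistent with the host stalling when left unrestricted.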

siunik Joined: 16 Mar 16 Posts: 1 Credit: 0 RAC: 0	Message 6055 - Posted: 16 Mar 2016, 4:04:25 UTC - in response to Message 5918. Yeah me too.. Don't understand. ID: 6055

dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0	Message 6056 - Posted: 17 Mar 2016, 18:30:24 UTC I just updated the minirosetta_beta application to 3.72. The 32-bit Linux version has not been updated yet due to some memory issues while compiling. I hope to have it available soon. ID: 6056

Dr. Merkwürdigliebe Joined: 12 Jun 15 Posts: 16 Credit: 23,473 RAC: 0	Message 6057 - Posted: 17 Mar 2016, 19:20:10 UTC - in response to Message 6056. Just a short question: Why does ralph@home also download minirosetta_3.71? ID: 6057

[VENETO] boboviz Joined: 9 Apr 08 Posts: 929 Credit: 1,892,541 RAC: 294	Message 6058 - Posted: 18 Mar 2016, 6:36:57 UTC Some memory errors on my Win10:
3752038 - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x015FC9A4 read attempt to address 0x2F551088
3752039 - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x015FCA02 read attempt to address 0x30A68058
3752805 - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x015FCA02 read attempt to address 0x2194F048
ID: 6058

Dr. Merkwürdigliebe Joined: 12 Jun 15 Posts: 16 Credit: 23,473 RAC: 0	Message 6059 - Posted: 18 Mar 2016, 15:19:30 UTC Lots of validation errors, e.g. Validation error ID: 6059

Trotador Joined: 7 May 10 Posts: 33 Credit: 14,751,677 RAC: 0	Message 6060 - Posted: 18 Mar 2016, 19:52:16 UTC On one of my hosts, all "des5ralph_design5" units are failing after finishing crunching OK, with:
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>
This host has a processing time above the default; all units have been crunched for 9-12 hours and generated a lot of decoys, but end with this error. Wingmen crunching for just an hour and generating few decoys are uploading OK. ID: 6060

Trotador Joined: 7 May 10 Posts: 33 Credit: 14,751,677 RAC: 0	Message 6061 - Posted: 19 Mar 2016, 0:30:06 UTC All units are erroring on all my Linux hosts. Some of the WUs fail after finishing crunching OK, with this error (these WUs were downloaded yesterday):
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>
Others fail after several hours, or after restarting BOINC, and report 0 seconds of computed time, with this error (these were downloaded today):
ERROR: ERROR: Option matching -cyclic_peptide:user_set_alph_dihedral_perturbation not found in command line top-level context
I'm seeing that most of the Windows hosts seem to finish the WU OK and report success, but that is not conclusive. Stopping crunching until I know more. ID: 6061

BlisteringSheep Joined: 3 Nov 15 Posts: 4 Credit: 2,231,667 RAC: 8	Message 6062 - Posted: 19 Mar 2016, 2:21:01 UTC - in response to Message 5861. With 3.72, no successful work units on any Linux hosts. Tested across multiple distributions (all 64-bit). They are running to completion, but then reporting output file missing. ID: 6062

robertmiles Joined: 13 Jan 09 Posts: 103 Credit: 331,865 RAC: 0	Message 6063 - Posted: 19 Mar 2016, 2:41:04 UTC Last modified: 19 Mar 2016, 2:42:06 UTC These workunits gave a computation error at about the same time that a workunit from another BOINC project reached a point with a rather high memory demand, over a gigabyte. So they might be due to running out of memory, rather than anything else.
https://ralph.bakerlab.org/result.php?resultid=3762275
https://ralph.bakerlab.org/result.php?resultid=3761810
https://ralph.bakerlab.org/result.php?resultid=3761801
https://ralph.bakerlab.org/result.php?resultid=3757576
https://ralph.bakerlab.org/result.php?resultid=3756003
However, my other computer running BOINC rarely runs out of memory, and gave a different error for some recent workunits.
https://ralph.bakerlab.org/result.php?resultid=3757706
https://ralph.bakerlab.org/result.php?resultid=3753036
https://ralph.bakerlab.org/result.php?resultid=3752853
The application was shown as Rosetta Mini Beta, with no version number I could find after the workunits finished. The second computer shows three workunits that may be this type, still marked as version 3.72 while still on the computer.
https://ralph.bakerlab.org/result.php?resultid=3763701
https://ralph.bakerlab.org/result.php?resultid=3762417
https://ralph.bakerlab.org/result.php?resultid=3763972
I've already looked into adding more memory for each of my computers that run BOINC. Their motherboards are not compatible with adding more. ID: 6063

keputnam Joined: 17 Feb 06 Posts: 2 Credit: 48,278 RAC: 0	Message 6064 - Posted: 19 Mar 2016, 2:43:31 UTC Add me to the "no more till it's fixed" list. Four WUs, 0 successes. 14 more stacked up that I will abort. ID: 6064

[VENETO] boboviz Joined: 9 Apr 08 Posts: 929 Credit: 1,892,541 RAC: 294	Message 6065 - Posted: 19 Mar 2016, 8:38:42 UTC - in response to Message 6063.
I've already looked into adding more memory for each of my computers that run BOINC. Their motherboards are not compatible with adding more.
My 6-core machine has 16 GB of RAM and I also have WU failures. I think it's not a question of "how much" memory; it seems to be an allocation problem. A 3.73 version will be welcome! P.S. 3.72 uses from 40 to 90 MB of RAM on my machines.... ID: 6065

[VENETO] boboviz Joined: 9 Apr 08 Posts: 929 Credit: 1,892,541 RAC: 294	Message 6066 - Posted: 19 Mar 2016, 9:19:29 UTC Strange behaviour. Some WUs fail after a few minutes, others after 2 hours.... ID: 6066

Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0	Message 6068 - Posted: 19 Mar 2016, 17:41:43 UTC Same here. A LOT of random WU crashes on v3.72. Different hosts, different CPUs (4/6/8 cores), different OSes (Win 7 x64 and WinXP x32): all are getting a lot of failed WUs with "Unhandled Exception Detected..." in the logs. ID: 6068

Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0	Message 6069 - Posted: 19 Mar 2016, 17:52:46 UTC So far all "des5ralph_design5" tasks have failed, and two of the three currently processing are exhibiting some curious behavior. Those that failed ended with:
std::cerr: Exception was thrown: Cannot normalize xyzVector of length() zero
My target runtime is four hours. All of the tasks currently processing have exceeded that, by two, eight and twenty-seven hours. According to the properties tab, no checkpoints have been taken. I have confirmed via the computers' Activity Monitors that all tasks are currently using the CPU. In the stderr out of the tasks that failed, the lines "Starting watchdog...Watchdog active." do appear, so presumably the watchdog is set but not working in the tasks I'm running now. Even more curious, two of the tasks on two different machines, with different versions of the Mac OS and different versions of BOINC, are recording elapsed times of less than the CPU times. Even my usually creative imagination is stumped by this. It seems fairly obvious that these tasks will have to be aborted, but I'll hold off a bit in case anyone has any questions or DEK wants to try and retrieve a file for closer examination. Best, Snags ID: 6069

Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0	Message 6070 - Posted: 19 Mar 2016, 23:04:23 UTC - in response to Message 6062.
With 3.72, no successful work units on any Linux hosts. Tested across multiple distributions (all 64-bit). They are running to completion, but then reporting output file missing.
I am seeing the same thing, NO successful work units at all. Most run to completion (for me that is a 6 hour run time) but a number are also failing in less than an hour. This is on a 64-bit Linux host. Conan ID: 6070

[VENETO] boboviz Joined: 9 Apr 08 Posts: 929 Credit: 1,892,541 RAC: 294	Message 6071 - Posted: 21 Mar 2016, 20:22:39 UTC An error also with the T0599_ batch, WU:
3322377 - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x013CC270 write attempt to address 0x017D7EC1
ID: 6071