Rosetta mini beta and/or android 3.61-3.83

Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83



Author Message
Snagletooth

Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 6042 - Posted: 5 Feb 2016, 15:58:53 UTC

I'm getting quick client/computer errors for the backrub_design tasks. From the stderr out:

minirosetta_3.71_x86_64-apple-darwin(50310,0x7fff732a2300) malloc: *** error for object 0x4b4fc3ef02e87d9a: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug


Also gaurav_rsmn_0161_65_daa2_2_SAVE_ALL_OUT_20296_50_0 is claiming a file transfer error:

# cpu_run_time_pref: 14400
reached end of minirosetta::main()
======================================================
DONE :: 2 starting structures 13443.3 cpu seconds
This process generated 13 decoys from 13 attempts
======================================================
BOINC :: WS_max 2.65622e+08

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>gaurav_rsmn_0161_65_daa2_2_SAVE_ALL_OUT_20296_50_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>


Are those results truly lost?
ID: 6042
Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 6043 - Posted: 5 Feb 2016, 19:19:53 UTC - in response to Message 6042.  

I'm not sure what is causing the backrub error, but the gaurav jobs have a filter that may sometimes remove all models, so the result is as expected for that test. I think the filter has been updated so that at least 1 model is generated in the next test batch, but I'm not sure. Vikram, the one submitting those jobs, is testing this.
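
To illustrate the failure mode described above, here is a minimal Python sketch. It is not Rosetta's actual filter: the `Decoy` type, the scores, and the cutoff are all made up for illustration. A strict filter can reject every model, which leaves nothing to write to the output file; a fallback that keeps the single best-scoring model guarantees at least one result.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decoy:
    """Hypothetical stand-in for one generated model."""
    name: str
    score: float  # lower is better in this sketch


def strict_filter(decoys: List[Decoy], max_score: float) -> List[Decoy]:
    """Keep only decoys at or below the cutoff; may return an empty list."""
    return [d for d in decoys if d.score <= max_score]


def filter_keep_at_least_one(decoys: List[Decoy], max_score: float) -> List[Decoy]:
    """Same cutoff, but fall back to the best decoy if none pass."""
    passed = strict_filter(decoys, max_score)
    if not passed and decoys:
        passed = [min(decoys, key=lambda d: d.score)]
    return passed


decoys = [Decoy("model_1", -10.0), Decoy("model_2", -25.0), Decoy("model_3", -18.0)]
cutoff = -50.0  # assumed cutoff that none of the decoys reach

print(len(strict_filter(decoys, cutoff)))                # models left to upload
print(filter_keep_at_least_one(decoys, cutoff)[0].name)  # best model kept as fallback
```

With the strict version, a task can finish cleanly and still have no output file to upload, which would be consistent with the "-161 (not found)" upload failure reported above.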

ID: 6043
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 910
Credit: 1,892,541
RAC: 294
Message 6049 - Posted: 11 Feb 2016, 8:02:32 UTC

This first kind of Android WU ("simple_cycpep_predict_") seems to be OK on my smartphone. Now I'm downloading a new type: "db_design5_".
ID: 6049
Trotador

Joined: 7 May 10
Posts: 33
Credit: 14,751,677
RAC: 0
Message 6051 - Posted: 13 Feb 2016, 19:22:39 UTC - in response to Message 6037.  

The current Ralph WUs use huge amounts of RAM, I've seen up to 4 Gb per unit, is it on purpose? any new kind of simulation?

thanks for the info




Yes, I'm running a test of a new type of job that runs small perturbations of the protein backbone and then does a round of design. The design protocol can use a lot of memory. I realize that this will be problematic and will see if we can distribute these jobs to high memory machines. We may just not be able to run these on R@h.



I've crunched a lot of these backrub units. They are tough because of their large memory requirements: you have to limit the number of units crunched simultaneously and do a lot of babysitting, but it is also fun :).

Most of them don't usually go over 4 GB, but I had half a dozen reach almost 7 GB on the same host. It has 32 GB of RAM but also 72 threads :). In short, it stalled for lack of memory, so I finally had to abort them, plus a few more that were nearly past the deadline.
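
As a rough back-of-the-envelope check on the numbers above (32 GB of RAM, 72 threads, tasks peaking at about 4 GB and occasionally almost 7 GB), this sketch estimates how many such tasks fit in memory at once. The 4 GB reserve for the operating system and other processes is an assumption, not a measured value.

```python
def max_concurrent_tasks(total_ram_gb: float, per_task_gb: float,
                         reserved_gb: float = 4.0) -> int:
    """Number of tasks that fit in RAM after setting aside a reserve."""
    usable = total_ram_gb - reserved_gb
    if usable <= 0 or per_task_gb <= 0:
        return 0
    return int(usable // per_task_gb)


total_ram_gb = 32.0  # from the post
threads = 72         # from the post

for per_task_gb in (4.0, 7.0):
    n = max_concurrent_tasks(total_ram_gb, per_task_gb)
    print(f"{per_task_gb:.0f} GB per task: {n} tasks at once "
          f"({threads - n} of {threads} threads left for other work)")
```

Under these assumptions only a handful of the 72 threads can run one of these units at a time. The BOINC client's `app_config.xml` supports a per-application `max_concurrent` setting, which is one way to enforce such a limit without manual babysitting.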



ID: 6051
siunik

Joined: 16 Mar 16
Posts: 1
Credit: 0
RAC: 0
Message 6055 - Posted: 16 Mar 2016, 4:04:25 UTC - in response to Message 5918.  

Yeah me too.. Don't understand.
ID: 6055
Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 6056 - Posted: 17 Mar 2016, 18:30:24 UTC

I just updated the minirosetta_beta application to 3.72. The 32-bit Linux version has not been updated yet due to some memory issues while compiling. I hope to have it available soon.
ID: 6056
Dr. Merkwürdigliebe

Joined: 12 Jun 15
Posts: 16
Credit: 23,473
RAC: 0
Message 6057 - Posted: 17 Mar 2016, 19:20:10 UTC - in response to Message 6056.  

Just a short question:



Why does ralph@home also download minirosetta_3.71?
ID: 6057
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 910
Credit: 1,892,541
RAC: 294
Message 6058 - Posted: 18 Mar 2016, 6:36:57 UTC

Some memory errors on my Win10 machine:

3752038

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x015FC9A4 read attempt to address 0x2F551088


3752039

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x015FCA02 read attempt to address 0x30A68058


3752805

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x015FCA02 read attempt to address 0x2194F048
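
A small sketch for comparing crash reports like the three above: it pulls the faulting address and the access type out of each "Unhandled Exception Record" line and counts how often each faulting address recurs. The regular expression is a guess at the line format based only on these three samples.

```python
import re
from collections import Counter

log = """\
Reason: Access Violation (0xc0000005) at address 0x015FC9A4 read attempt to address 0x2F551088
Reason: Access Violation (0xc0000005) at address 0x015FCA02 read attempt to address 0x30A68058
Reason: Access Violation (0xc0000005) at address 0x015FCA02 read attempt to address 0x2194F048
"""

# Groups: exception code, faulting instruction address, access type, target address.
pattern = re.compile(
    r"Access Violation \((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+) "
    r"(read|write) attempt to address (0x[0-9a-fA-F]+)"
)

crashes = [m.groups() for m in pattern.finditer(log)]
by_fault_address = Counter(fault for _code, fault, _kind, _target in crashes)

for fault, count in by_fault_address.most_common():
    print(fault, count)
```

Two of the three crashes share the same faulting address and the third lies only 0x5E bytes away, which hints at (but does not prove) a single faulty code path rather than random memory corruption.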
ID: 6058
Dr. Merkwürdigliebe

Joined: 12 Jun 15
Posts: 16
Credit: 23,473
RAC: 0
Message 6059 - Posted: 18 Mar 2016, 15:19:30 UTC

Lots of validation errors, e.g.

Validation error
ID: 6059
Trotador

Joined: 7 May 10
Posts: 33
Credit: 14,751,677
RAC: 0
Message 6060 - Posted: 18 Mar 2016, 19:52:16 UTC

On one of my hosts, all "des5ralph_design5" units are failing after finishing crunching OK, with:

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

This host has its processing time set above the default; all units were crunched for 9-12 hours and generated a lot of decoys, but they end with this error.

Wingmen crunching for just an hour and generating few decoys are uploading OK.
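
For anyone collecting these failures across several hosts, here is a hedged sketch that extracts the file name and error code from a message fragment like the one above. The helper name `parse_xfer_errors` is hypothetical; it uses regular expressions rather than an XML parser because the fragment, as pasted from the task page, is not a well-formed XML document on its own.

```python
import re

message = """\
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
"""


def parse_xfer_errors(text):
    """Return (file_name, numeric_code, description) for each error block."""
    block_re = re.compile(r"<file_xfer_error>(.*?)</file_xfer_error>", re.DOTALL)
    name_re = re.compile(r"<file_name>(.*?)</file_name>")
    code_re = re.compile(r"<error_code>\s*(-?\d+)\s*(?:\((.*?)\))?\s*</error_code>")
    results = []
    for block in block_re.findall(text):
        name = name_re.search(block)
        code = code_re.search(block)
        if name and code:
            results.append((name.group(1), int(code.group(1)), code.group(2)))
    return results


errors = parse_xfer_errors(message)
for file_name, code, description in errors:
    print(file_name, code, description)
```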
ID: 6060
Trotador

Joined: 7 May 10
Posts: 33
Credit: 14,751,677
RAC: 0
Message 6061 - Posted: 19 Mar 2016, 0:30:06 UTC

All units are erroring on all my Linux hosts.

Some of the WUs fail after finishing crunching OK, with this error (these WUs were downloaded yesterday):

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

Others fail after several hours, or after restarting BOINC, and report 0 seconds of computed time, with this error (these ones were downloaded today):

ERROR: ERROR: Option matching -cyclic_peptide:user_set_alph_dihedral_perturbation not found in command line top-level context

I see that most of the Windows hosts seem to finish the WU OK and report success, but that is not conclusive.

I'm stopping crunching until I know more.


ID: 6061
BlisteringSheep

Joined: 3 Nov 15
Posts: 4
Credit: 2,231,667
RAC: 8
Message 6062 - Posted: 19 Mar 2016, 2:21:01 UTC - in response to Message 5861.  

With 3.72, no successful work units on any Linux hosts. Tested across multiple distributions (all 64-bit). They are running to completion, but then reporting output file missing.
ID: 6062
Profile robertmiles

Joined: 13 Jan 09
Posts: 103
Credit: 331,865
RAC: 0
Message 6063 - Posted: 19 Mar 2016, 2:41:04 UTC
Last modified: 19 Mar 2016, 2:42:06 UTC

These workunits gave a computation error at about the same time that a workunit from another BOINC project reached a point with a rather high memory demand - over a gigabyte. So they might be due to running out of memory, rather than anything else.

https://ralph.bakerlab.org/result.php?resultid=3762275

https://ralph.bakerlab.org/result.php?resultid=3761810

https://ralph.bakerlab.org/result.php?resultid=3761801

https://ralph.bakerlab.org/result.php?resultid=3757576

https://ralph.bakerlab.org/result.php?resultid=3756003

However, my other computer running BOINC rarely runs out of memory, and gave a different error for some recent workunits.

https://ralph.bakerlab.org/result.php?resultid=3757706

https://ralph.bakerlab.org/result.php?resultid=3753036

https://ralph.bakerlab.org/result.php?resultid=3752853

The application was shown as Rosetta Mini Beta, with no version number I could find after the workunits finished. The second computer shows three workunits that may be this type, still marked as version 3.72 while still on the computer.

https://ralph.bakerlab.org/result.php?resultid=3763701

https://ralph.bakerlab.org/result.php?resultid=3762417

https://ralph.bakerlab.org/result.php?resultid=3763972

I've already looked into adding more memory for each of my computers that run BOINC. Their motherboards are not compatible with adding more.
ID: 6063
keputnam

Joined: 17 Feb 06
Posts: 2
Credit: 48,278
RAC: 0
Message 6064 - Posted: 19 Mar 2016, 2:43:31 UTC

Add me to the "no more till it's fixed" list.

Four WUs, 0 successes, and 14 more stacked up that I will abort.
ID: 6064
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 910
Credit: 1,892,541
RAC: 294
Message 6065 - Posted: 19 Mar 2016, 8:38:42 UTC - in response to Message 6063.  

I've already looked into adding more memory for each of my computers that run BOINC. Their motherboards are not compatible with adding more.


My 6-core machine has 16 GB of RAM and I also have WU failures.
I think it's not a question of "how much" memory; it seems to be an allocation problem.
A 3.73 version would be welcome!

P.S.
3.72 uses from 40 to 90 MB of RAM on my machines....
ID: 6065
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 910
Credit: 1,892,541
RAC: 294
Message 6066 - Posted: 19 Mar 2016, 9:19:29 UTC

Strange behaviour.
Some WUs fail after a few minutes, others after 2 hours....
ID: 6066
Mad_Max

Joined: 15 Nov 12
Posts: 15
Credit: 404,700
RAC: 0
Message 6068 - Posted: 19 Mar 2016, 17:41:43 UTC

Same here. A LOT of random WU crashes on v3.72.
Different hosts, different CPUs (4/6/8 cores), different OSes (Win 7 x64 and WinXP x32) - all getting a lot of failed WUs with "Unhandled Exception Detected..." in the logs.
ID: 6068
Snagletooth

Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 6069 - Posted: 19 Mar 2016, 17:52:46 UTC

So far all "des5ralph_design5" tasks have failed and two of the three currently processing are exhibiting some curious behavior. Those that failed ended with:

std::cerr: Exception was thrown:
Cannot normalize xyzVector of length() zero


My target runtime is four hours. All of the tasks currently processing have exceeded that by two, eight and twenty-seven hours. According to the properties tab no checkpoints have been taken. I have confirmed via the computers' Activity Managers that all tasks are currently using the cpu. In the stderr out of the tasks that failed the lines "Starting watchdog...Watchdog active." do appear so presumably the watchdog is set but not working in the tasks I'm running now.

Even more curious, two of the tasks on two different machines, with different versions of the Mac OS and different versions of BOINC, are recording elapsed times of less than the cpu times. Even my usually creative imagination is stumped by this.

It seems fairly obvious that these tasks will have to be aborted but I'll hold off a bit in case anyone has any questions or DEK wants to try and retrieve a file for closer examination.

Best,
Snags
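
For readers unfamiliar with the "Cannot normalize xyzVector of length() zero" exception quoted above, here is a generic illustration. It is not Rosetta's `xyzVector` code, and the function names are hypothetical. It shows one common way such an error can arise: two points with identical coordinates produce a zero-length difference vector, which has no direction to normalize.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Scale v to unit length; a zero-length vector has no direction."""
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    if length == 0.0:
        raise ValueError("Cannot normalize vector of length zero")
    return (v[0] / length, v[1] / length, v[2] / length)


def direction_between(a: Vec3, b: Vec3) -> Vec3:
    """Unit vector pointing from point a to point b."""
    return normalize((b[0] - a[0], b[1] - a[1], b[2] - a[2]))


print(direction_between((0.0, 0.0, 0.0), (3.0, 0.0, 4.0)))

try:
    # Two points at identical coordinates give a zero-length difference.
    direction_between((1.0, 2.0, 3.0), (1.0, 2.0, 3.0))
except ValueError as err:
    print("error:", err)
```

Whether overlapping coordinates are what actually triggered the exception in these tasks is not something the stderr output alone can confirm.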
ID: 6069
Profile Conan

Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 6070 - Posted: 19 Mar 2016, 23:04:23 UTC - in response to Message 6062.  

With 3.72, no successful work units on any Linux hosts. Tested across multiple distributions (all 64-bit). They are running to completion, but then reporting output file missing.


I am seeing the same thing, NO successful work units at all. Most run to completion (for me that is a 6 hour run time) but a number are also failing in less than an hour.
This is on a 64 bit Linux host.

Conan

ID: 6070
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 910
Credit: 1,892,541
RAC: 294
Message 6071 - Posted: 21 Mar 2016, 20:22:39 UTC

Also an error with the T0599_ batch, WU 3322377:

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x013CC270 write attempt to address 0x017D7EC1
ID: 6071



©2024 University of Washington
http://www.bakerlab.org