Posts by Trotador

1) Message boards : RALPH@home bug list : Rosetta 4.12+ (Message 6781)
Posted 1 May 2020 by Trotador
Post:
4.20 still failing this way
https://ralph.bakerlab.org/workunit.php?wuid=4518120

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/ralph.bakerlab.org/rosetta_4.20_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--il1r_design_boinc_v1_mod.xml @flags_il6r2 -in:file:silent Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f.zip @Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f.flags -nstruct 10000 -cpu_run_time 3600 -boinc:max_nstruct 5000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3976176
Using database: database_357d5d93529_n_methyl/minirosetta_database
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0
06:39:06 (90949): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f_32_771_1_r543859209_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>
2) Message boards : RALPH@home bug list : Some tasks never finish (Message 6754)
Posted 28 Apr 2020 by Trotador
Post:
I'm observing that some tasks reach 98.1% completion and do not progress any further. They are now over 9 hours of processing time and still running, when I think the batch time to completion is around two hours.

My hosts run Ubuntu. They have many threads but plenty of RAM for this batch.

I'm seeing it on two different hosts, all WUs downloaded today. One of the hosts has not been able to complete any task from the batch; I closed BOINC Manager and restarted it, and all WUs started from zero. The other host has completed and is still completing tasks in two hours, but it also has a group of tasks now over nine hours that do not go above 98.3% completion.

On a third host, tasks are now at around 98% completion with two hours of processing time, and my feeling is that they will continue processing for hours.

I've suspended all tasks except a few to see how they evolve.

Anyone else?
3) Message boards : News : Testing a switch to using SSL (Secure Socket Layer) (Message 6745)
Posted 28 Apr 2020 by Trotador
Post:
All units seem to be failing across all hosts


Well, many of them, not all
4) Message boards : News : Testing a switch to using SSL (Secure Socket Layer) (Message 6744)
Posted 28 Apr 2020 by Trotador
Post:
All units seem to be failing across all hosts
5) Message boards : News : Testing a switch to using SSL (Secure Socket Layer) (Message 6738)
Posted 25 Apr 2020 by Trotador
Post:
I queued up more jobs. I'm not sure what is happening with the very short run failures.


Still happening in the last batch

https://ralph.bakerlab.org/workunit.php?wuid=4484345
6) Message boards : News : Testing a switch to using SSL (Secure Socket Layer) (Message 6731)
Posted 23 Apr 2020 by Trotador
Post:
I've checked and it also occurs on hosts not connected via SSL. It claims that a file to upload is not available.

I had several errors on one host because I ran out of disk space, so that was my fault. These units use a lot of space. Corrected.
7) Message boards : News : Testing a switch to using SSL (Secure Socket Layer) (Message 6728)
Posted 23 Apr 2020 by Trotador
Post:
Working OK for me. I changed several hosts and all of them have received units, as have the ones I did not change.

Edit: also having the errors described in previous post. It does not seem to me to be related to SSL, but who knows.
8) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6538)
Posted 18 Apr 2018 by Trotador
Post:
More errors

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/ralph.bakerlab.org/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_04_15_85_134__t000__0_C4_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_04_15_85_134__t000__0_C4_robetta.zip -nstruct 10000 -cpu_run_time 3600 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2126819
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 18506.1s, 14400s + 3600s[2018- 4-18 2:40:41:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 18506.1 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
02:40:41 (20103): called boinc_finish(0)

</stderr_txt>
]]>
9) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6535)
Posted 16 Apr 2018 by Trotador
Post:
Many units are taking over 1GB RAM (again), I've seen up to 1.6GB

some examples:

https://ralph.bakerlab.org/result.php?resultid=4610974
https://ralph.bakerlab.org/result.php?resultid=4611547
https://ralph.bakerlab.org/result.php?resultid=4610940
https://ralph.bakerlab.org/result.php?resultid=4611488
https://ralph.bakerlab.org/result.php?resultid=4613637
https://ralph.bakerlab.org/result.php?resultid=4613672
https://ralph.bakerlab.org/result.php?resultid=4613678
https://ralph.bakerlab.org/result.php?resultid=4613261
https://ralph.bakerlab.org/result.php?resultid=4616237
https://ralph.bakerlab.org/result.php?resultid=4615718
https://ralph.bakerlab.org/result.php?resultid=4614760
https://ralph.bakerlab.org/result.php?resultid=4614886
https://ralph.bakerlab.org/result.php?resultid=4614837
https://ralph.bakerlab.org/result.php?resultid=4616251
https://ralph.bakerlab.org/result.php?resultid=4616952
https://ralph.bakerlab.org/result.php?resultid=4614243
https://ralph.bakerlab.org/result.php?resultid=4614244
10) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6533)
Posted 16 Apr 2018 by Trotador
Post:
Many units are taking over 1GB RAM (again), I've seen up to 1.6GB
11) Message boards : Number crunching : High memory usage by v4.07 task rb_03_24_16_25__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_20741 (Message 6519)
Posted 3 Apr 2018 by Trotador
Post:
I'm seeing many units using over 1 GB RAM, many up to 1.2 GB, and one now up to 1.6 GB.
12) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6513)
Posted 28 Mar 2018 by Trotador
Post:
4281026

File: C:\cygwin\home\boinc\Rosetta\main\source\src\core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)


Several of those ones as well.


A lot of them actually
13) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6512)
Posted 27 Mar 2018 by Trotador
Post:
4281026

File: C:\cygwin\home\boinc\Rosetta\main\source\src\core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)


Several of those ones as well.
14) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6239)
Posted 21 Nov 2017 by Trotador
Post:
It seems that there are lots of units to download but I can't download any on my hosts (Linux). Is it only for Windows?
15) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6074)
Posted 23 Mar 2016 by Trotador
Post:
All WUs continue erroring on Linux. W7 seems OK; the rest of the Windows hosts are a mix of failure and success.
16) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6061)
Posted 19 Mar 2016 by Trotador
Post:
All units are erroring on all my Linux hosts.

Some of the WUs fail after finishing crunching OK, with this error (these WUs were downloaded yesterday):

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

Others fail after several hours, or after restarting BOINC, and report 0 seconds of computed time, with this error (these ones were downloaded today):

ERROR: ERROR: Option matching -cyclic_peptide:user_set_alph_dihedral_perturbation not found in command line top-level context

I'm seeing that most of the Windows hosts seem to finish the WU OK and report success, but this is not conclusive.

I'm stopping crunching until I know more.


17) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6060)
Posted 18 Mar 2016 by Trotador
Post:
On one of my hosts, all "des5ralph_design5" units are failing after finishing crunching OK, with:

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>des5ralph_design5_hydrophobic32_test1_buriedtrp_S_0095_SAVE_ALL_OUT_20313_229_0_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

This host has a processing time above the default. All units were crunched for 9-12 hours and generated a lot of decoys, but ended with this error.

Wingmen crunching for just an hour and generating few decoys are uploading OK.
18) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6051)
Posted 13 Feb 2016 by Trotador
Post:
The current Ralph WUs use huge amounts of RAM, I've seen up to 4 Gb per unit, is it on purpose? any new kind of simulation?

thanks for the info




Yes, I'm running a test of a new type of job that runs small perturbations of the protein backbone and then does a round of design. The design protocol can use a lot of memory. I realize that this will be problematic and will see if we can distribute these jobs to high memory machines. We may just not be able to run these on R@h.



I've crunched a lot of these backrub units; they are tough because of the large memory requirements. It is necessary to limit the number of units being crunched simultaneously and they need a lot of babysitting, but it is also fun :).

Most of them don't usually go over 4 GB, but I got half a dozen reaching almost 7 GB on the same host. It has 32 GB but also 72 threads :), so in short it stalled for lack of memory. I finally had to abort them, and a few more because they were nearly over the deadline.
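
The memory arithmetic behind that stall can be sketched roughly as follows. This is only an illustration, not anything BOINC or Rosetta actually runs: the function name `max_concurrent_tasks` and the 2 GB reserve for the operating system are assumptions, while the 32 GB, 72 threads, 4 GB and 7 GB figures are the ones quoted in this post.

```python
def max_concurrent_tasks(total_ram_gb, per_task_gb, threads, reserve_gb=2.0):
    """Return how many tasks fit in RAM at once, capped by the thread count.

    reserve_gb is a hypothetical allowance for the OS and other processes.
    """
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    usable = max(total_ram_gb - reserve_gb, 0.0)
    return min(threads, int(usable // per_task_gb))


# Figures from the post: 32 GB host, 72 threads, tasks at 4 GB or ~7 GB each.
typical = max_concurrent_tasks(32, 4, 72)
worst = max_concurrent_tasks(32, 7, 72)
print(typical, worst)
```

Under these assumptions only about 7 of the 4 GB tasks, or 4 of the 7 GB ones, fit in memory at the same time, far fewer than the 72 available threads.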



19) Message boards : RALPH@home bug list : Win10 3.71 Unhandled Exception: Reason: Out Of Memory (backrub) (Message 6047)
Posted 7 Feb 2016 by Trotador
Post:
Quote from thread http://ralph.bakerlab.org/forum_thread.php?id=567

Trotador
The current Ralph WUs use huge amounts of RAM, I've seen up to 4 Gb per unit, is it on purpose? any new kind of simulation?

thanks for the info


Dekim

Yes, I'm running a test of a new type of job that runs small perturbations of the protein backbone and then does a round of design. The design protocol can use a lot of memory. I realize that this will be problematic and will see if we can distribute these jobs to high memory machines. We may just not be able to run these on R@h.
20) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6038)
Posted 4 Feb 2016 by Trotador
Post:
The current Ralph WUs use huge amounts of RAM, I've seen up to 4 Gb per unit, is it on purpose? any new kind of simulation?

thanks for the info




Yes, I'm running a test of a new type of job that runs small perturbations of the protein backbone and then does a round of design. The design protocol can use a lot of memory. I realize that this will be problematic and will see if we can distribute these jobs to high memory machines. We may just not be able to run these on R@h.


R@H means Rosetta, doesn't it? It would be good for the research if you could look for ways of distributing these units. The only effective way I can think of is limiting the number of units downloaded, either by the user, as in the CEP project in WCG, or by the project. Distributing them only to hosts with a lot of memory may not be enough if the hosts also have a lot of available threads (like mine).

A good thing I'm seeing with these units is that BOINC/Ralph seems to take into account the amount of available system memory and limits the memory used, even limiting the number of units in execution to below the number of available threads, so the system does not stall and hang as used to happen in these cases. Is that correct?


