Rosetta 4.12+

Message boards : RALPH@home bug list : Rosetta 4.12+

Mad_Max
Joined: 15 Nov 12
Posts: 15
Credit: 404,700
RAC: 0
Message 6712 - Posted: 8 Apr 2020, 15:51:09 UTC - in response to Message 6698.  
Last modified: 8 Apr 2020, 16:07:42 UTC

I aborted some of the older test batches. I'm not sure why your client is getting confused and running the wrong app. It should be running the 64bit version on your 64bit computer.

I was writing about receiving the 64-bit "wrapper" app on 32-bit machines, including old ones running WinXP.
Of course all such WUs fail, as Win32 systems cannot execute any 64-bit apps. They produce the error "Application is not a valid Win32 app" right at the start.

I don't have any problems on 64-bit Windows systems currently. The latest problem was download failures of small files, but it looks resolved now, as I haven't seen such errors for about a week.

If older systems are no longer supported by the project, you should adjust the server scheduler accordingly, so that it does not send tasks to such machines and instead responds with an error or warning. Otherwise it keeps sending work to hosts doomed to a 100% error rate, wasting internet bandwidth and adding excess server load.
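
The scheduler-side check asked for here can be sketched roughly as follows. This is not BOINC scheduler code: the function name and the compatibility table are hypothetical, and only the platform strings follow BOINC's usual naming. The table also assumes that a 64-bit host can run 32-bit apps, which on Linux additionally depends on 32-bit libraries being installed.

```python
# Hypothetical sketch, not taken from the BOINC scheduler.
# Maps a host platform to the app platforms it is assumed to be able to run.
RUNNABLE_ON = {
    "windows_intelx86": {"windows_intelx86"},
    "windows_x86_64": {"windows_x86_64", "windows_intelx86"},
    "i686-pc-linux-gnu": {"i686-pc-linux-gnu"},
    "x86_64-pc-linux-gnu": {"x86_64-pc-linux-gnu", "i686-pc-linux-gnu"},
}


def pick_app_platform(host_platform, available):
    """Return the first available app platform the host can execute, else None."""
    runnable = RUNNABLE_ON.get(host_platform, set())
    for platform in available:
        if platform in runnable:
            return platform
    return None  # caller should refuse to send work instead of dooming the task
```

With only a 64-bit Windows build available, `pick_app_platform("windows_intelx86", ["windows_x86_64"])` returns `None`, which is the case where the server would answer with a warning rather than a task.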
ID: 6712

xii5ku
Joined: 8 Apr 20
Posts: 2
Credit: 23,307
RAC: 0
Message 6714 - Posted: 10 Apr 2020, 8:22:48 UTC
Last modified: 10 Apr 2020, 9:06:43 UTC

Linux i686 application version problem in v4.12 + v4.15
(100% reproducible on my Linux EMT64 hosts, problem not reproducible with Linux x86-64 application version)

On April 7 at Rosetta@home, I reported that all "Rosetta v4.12 i686-pc-linux-gnu" tasks got stuck at 1 decoy and finished after target CPU time + 4 h watchdog overtime, whereas all "Rosetta v4.12 x86_64-pc-linux-gnu" ran normally on the same hosts. (Rosetta forum thread "Rosetta v4.12 i686-pc-linux-gnu" : fixed 20 h CPU time, fixed 20 credits)

Last night I received a bunch of tasks from Ralph on 4 computers of the same set.
I had the default target CPU time configured at Ralph, which is 1 hour.

I have 257 valid results, of 257 tasks received:

  • All 169 "Rosetta v4.15 x86_64-pc-linux-gnu" tasks finished after 1 hour and generated on the order of 20 to 40 decoys each, according to spot checks.
  • All 88 "Rosetta v4.15 i686-pc-linux-gnu" tasks finished after 5 = 1+4 hours and generated (3x) 9, 8, (3x) 7, 6, (3x) 5, (8x) 4, (5x) 3, (9x) 2, (55x) 1 decoys.
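
The i686 distribution in the list above can be tallied with a few lines of arithmetic. The figures are copied from the list; the variable names are only illustrative, and the 5-hour runtime per task is the 1 h target plus 4 h watchdog overtime stated there.

```python
# (decoys per task, number of tasks), copied from the i686 list above
i686_counts = [(9, 3), (8, 1), (7, 3), (6, 1), (5, 3), (4, 8), (3, 5), (2, 9), (1, 55)]

tasks = sum(n for _, n in i686_counts)
decoys = sum(d * n for d, n in i686_counts)
hours_per_task = 5  # 1 h target CPU time + 4 h watchdog overtime

print(tasks)                                        # 88
print(decoys)                                       # 197
print(round(decoys / tasks, 2))                     # 2.24 decoys per task
print(round(decoys / (tasks * hours_per_task), 2))  # 0.45 decoys per CPU hour
```

That is roughly 0.45 decoys per CPU hour on i686, against the 20 to 40 decoys per hour spot-checked for the x86_64 tasks.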

So there is slight progress from v4.12 to v4.15 on my hosts, but not a breakthrough yet.

i686 tasks with more than 1 decoy, and a minority of 1-decoy tasks, received varying but of course low credit.

The majority of i686 1-decoy tasks received the usual fixed 20.00 credits. These ones had the "WARNING! cannot get file size for default.out.gz: could not open file." line in their stderr, while the other tasks with more or less than 20.00 credits did not.
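
For anyone triaging their own results, a small helper can flag the tasks whose stderr carries that warning. This is only a sketch: the function name is made up, and it assumes the stderr text has already been copied from the task page into a string.

```python
# Warning line quoted in the report above.
MISSING_OUTPUT_WARNING = (
    "WARNING! cannot get file size for default.out.gz: could not open file."
)


def has_missing_output_warning(stderr_text):
    """True if any stderr line contains the missing-output warning."""
    return any(MISSING_OUTPUT_WARNING in line for line in stderr_text.splitlines())
```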

host 44866: received 54 i686 tasks
host 44867: received 53 x86_64 tasks, same hardware and OS as 44866
host 44869: received 34 i686 tasks, almost same hardware, same OS
host 44870: received 116 x86_64 tasks, different hardware, similar OS
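
The per-host numbers are consistent with the totals given earlier in this post, as a quick cross-check shows (values copied from the four host lines above; the dictionary layout is just for illustration):

```python
# host id -> (application platform, tasks received), from the list above
hosts = {
    44866: ("i686", 54),
    44867: ("x86_64", 53),
    44869: ("i686", 34),
    44870: ("x86_64", 116),
}

totals = {}
for arch, count in hosts.values():
    totals[arch] = totals.get(arch, 0) + count

print(totals)                # {'i686': 88, 'x86_64': 169}
print(sum(totals.values()))  # 257
```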

hosts 44866...44869: dual Broadwell-EP, openSUSE 15.0
. . . . These hosts received both i686 and x86_64 jobs at Rosetta and at Ralph, with the described consistent results.

host 44870: dual Rome, openSUSE 15.1
. . . . This host received only x86_64 jobs at Rosetta and Ralph so far, hence only had good x86_64 results yet (no i686 jobs received).

I furthermore have a single-socket Haswell with Gentoo Linux, which received only x86_64 jobs at Rosetta but no jobs at Ralph yet.

My prior report on Rosetta v4.12 was while I ran Rosetta@home exclusively on the hosts. This report on v4.15 was while I ran TN-Grid with 0 % resource share + Ralph with 100 % resource share. But at least three of the four computers ended up with almost all threads running Ralph jobs that way, whereas the third computer worked at a mixed workload of ~1/3rd Ralph + ~2/3rds TN-Grid.

ID: 6714

robertmiles
Joined: 13 Jan 09
Posts: 103
Credit: 331,865
RAC: 0
Message 6716 - Posted: 10 Apr 2020, 23:01:55 UTC - in response to Message 6714.  
Last modified: 10 Apr 2020, 23:10:10 UTC

I just increased the resource share for Rosetta@Home on my computer. Too soon to see the results from that yet.

My Ralph account shows no 4.15 tasks yet. Can you tell which, if any, of these possible causes explains this?

1. They aren't testing 4.15 for Windows yet.

2. The tasks they show don't include any from the last few days, probably because the list was read from a rather obsolete copy of their database.

3. Their list of tasks for a user shows them only for a day or so.

For one-decoy tasks, note that the first decoy is usually only for testing how well your computer runs the software. That means that its output is seldom useful for any other purpose, and it might not even be sent back.

I looked at TN-Grid. They are currently not accepting new users. They are thinking of starting some COVID-19 work, which would probably start a flood of new users if they don't keep limiting them.


On another subject, can't the wrapper for 32-bit tasks be recompiled or rewritten so that it runs as a 32-bit program, at least under 32-bit operating systems? Or maybe a script that tries the 64-bit wrapper first, and if that fails quickly with certain errors, tries the 32-bit wrapper instead? Does this need extra testing to handle a 32-bit version of BOINC running under a 64-bit operating system?
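
A minimal sketch of the "try 64-bit first, then fall back" idea, under stated assumptions: the function and the wrapper file names are hypothetical, and nothing here reflects how the project actually launches its apps. It relies on the operating system refusing to start an incompatible binary, which Python reports as an `OSError`. It covers only failure to launch, not a wrapper that starts and then exits quickly with an error.

```python
import subprocess


def run_first_launchable(candidates):
    """Try each command in order; return (index, exit code) of the first that starts."""
    for index, command in enumerate(candidates):
        try:
            completed = subprocess.run(command)
        except OSError:
            # Raised when the OS cannot start the binary at all, for example
            # a 64-bit executable on a 32-bit system, or a missing file.
            continue
        return index, completed.returncode
    raise RuntimeError("no candidate could be started")
```

A call might look like `run_first_launchable([["wrapper_64.exe"], ["wrapper_32.exe"]])`, with both file names standing in for whatever the real wrappers are called.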
ID: 6716

xii5ku
Joined: 8 Apr 20
Posts: 2
Credit: 23,307
RAC: 0
Message 6717 - Posted: 11 Apr 2020, 4:48:47 UTC - in response to Message 6716.  
Last modified: 11 Apr 2020, 4:52:53 UTC

@robertmiles, I can't respond to your Ralph@home/ Rosetta@home related points, because I am new to Ralph and lack the insight. But a quick response to this unrelated item:
robertmiles wrote:

I looked at TN-Grid. They are currently not accepting new users.
This is not correct. New users can join any time. They only need to create the account via the web site and need to enter the invitation code from the main page. AFAIK this is a measure to reduce spam, not to keep new contributors from joining. That said, it is true that their work generator always had and still has a limited pace. But my experience during the last few days was that my hosts remained saturated.


robertmiles wrote:
They are thinking of starting some COVID-19 work, which would probably start a flood of new users if they don't keep limiting them.
They already started such work. They just don't communicate this widely to BOINC contributors because of the limited pace of the work generator.

/end-offtopic
ID: 6717

robertmiles
Joined: 13 Jan 09
Posts: 103
Credit: 331,865
RAC: 0
Message 6718 - Posted: 11 Apr 2020, 22:03:21 UTC - in response to Message 6717.  

@robertmiles, I can't respond to your Ralph@home/ Rosetta@home related points, because I am new to Ralph and lack the insight. But a quick response to this unrelated item:
robertmiles wrote:

I looked at TN-Grid. They are currently not accepting new users.
This is not correct. New users can join any time. They only need to create the account via the web site and need to enter the invitation code from the main page. AFAIK this is a measure to reduce spam, not to keep new contributors from joining. That said, it is true that their work generator always had and still has a limited pace. But my experience during the last few days was that my hosts remained saturated.
/end-offtopic

[snip]
I think I was able to create an account. I'll finally try to add the project in a few hours, after I upgrade BOINC to 7.16.5. Thank you.
ID: 6718

[VENETO] boboviz
Joined: 9 Apr 08
Posts: 913
Credit: 1,892,541
RAC: 294
Message 6736 - Posted: 24 Apr 2020, 5:45:40 UTC
Last modified: 24 Apr 2020, 5:45:56 UTC

160 valid, only 3 errors

<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f_32_92_0_r47019903_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
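
When collecting many of these reports, the file name and error code can be pulled out of the `<message>` block with a small parser. This is a sketch that assumes the block has the same shape as the one quoted above; the function name is made up, and a regular expression is used rather than an XML parser so that surrounding stderr text does not get in the way.

```python
import re

# Matches one <file_name>/<error_code> pair as it appears in the message above.
XFER_ERROR = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+) \((?P<text>.*?)\)</error_code>"
)


def parse_upload_failures(message):
    """Return a list of (file name, numeric error code, error text) tuples."""
    return [
        (m["name"], int(m["code"]), m["text"])
        for m in XFER_ERROR.finditer(message)
    ]
```

Applied to the block above, it yields one tuple with code `-240` and text `stat() failed`.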
ID: 6736

Rainer Baumeister
Joined: 7 Apr 20
Posts: 2
Credit: 437,267
RAC: 14
Message 6739 - Posted: 25 Apr 2020, 10:04:29 UTC - in response to Message 6736.  

Hello,

sorry, my English is very poor.


v4.15
I use a Ryzen3700x (default) with 32GB RAM: 30 tasks OK, 2 errors
A Ryzen 1700 (default) with 32GB causes massive problems: 4 OK, 66 errors!

Why? Both computers run VERY reliably in all other projects.
But with Rosetta I have to use Win10. :-(

Under Mint, the regular Rosetta produces errors anyway.

https://ralph.bakerlab.org/show_user.php?userid=58871

Greetings, Rainer

Translated with www.DeepL.com/Translator (free version)
ID: 6739

[VENETO] boboviz
Joined: 9 Apr 08
Posts: 913
Credit: 1,892,541
RAC: 294
Message 6740 - Posted: 25 Apr 2020, 13:44:03 UTC - in response to Message 6736.  

160 valid, only 3 errors

<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f_32_92_0_r47019903_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>


Another 9 with this error (after a few seconds).
ID: 6740

robertmiles
Joined: 13 Jan 09
Posts: 103
Credit: 331,865
RAC: 0
Message 6742 - Posted: 27 Apr 2020, 19:21:54 UTC - in response to Message 6718.  

[snip]

robertmiles wrote:

I looked at TN-Grid. They are currently not accepting new users.
This is not correct. New users can join any time. They only need to create the account via the web site and need to enter the invitation code from the main page. AFAIK this is a measure to reduce spam, not to keep new contributors from joining. That said, it is true that their work generator always had and still has a limited pace. But my experience during the last few days was that my hosts remained saturated.

[snip]

I created the account, and have started running tasks.

They have finished creating all of the workunits for their planned COVID-19 work, and expect to have the rest of them downloaded soon.
ID: 6742

[VENETO] boboviz
Joined: 9 Apr 08
Posts: 913
Credit: 1,892,541
RAC: 294
Message 6749 - Posted: 28 Apr 2020, 17:02:42 UTC

Even with 4.17, after a few seconds, I get these errors, as with 4.15 (only two WUs, however):
<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_5aj5gu8j_32_397_0_r1474112397_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
ID: 6749

Ivaylo Bonev
Joined: 30 Mar 20
Posts: 3
Credit: 3,702
RAC: 0
Message 6762 - Posted: 30 Apr 2020, 11:46:16 UTC - in response to Message 6749.  

Same on 4.18:
https://ralph.bakerlab.org/result.php?resultid=5034587

<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f_32_749_0_r1202734223_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
ID: 6762

WezH
Joined: 24 Apr 20
Posts: 6
Credit: 181,771
RAC: 0
Message 6773 - Posted: 30 Apr 2020, 19:19:06 UTC - in response to Message 6749.  

Even with 4.17, after a few seconds, I get these errors, as with 4.15 (only two WUs, however):
<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_5aj5gu8j_32_397_0_r1474112397_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>


Same here: 9 errors out of 184 tasks.
ID: 6773

Trotador
Joined: 7 May 10
Posts: 33
Credit: 14,751,677
RAC: 0
Message 6781 - Posted: 1 May 2020, 8:55:16 UTC

4.20 still failing this way
https://ralph.bakerlab.org/workunit.php?wuid=4518120

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/ralph.bakerlab.org/rosetta_4.20_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--il1r_design_boinc_v1_mod.xml @flags_il6r2 -in:file:silent Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f.zip @Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f.flags -nstruct 10000 -cpu_run_time 3600 -boinc:max_nstruct 5000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3976176
Using database: database_357d5d93529_n_methyl/minirosetta_database
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0
06:39:06 (90949): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>Mini_Protein_binds_IL6R_COVID-19_test3_SAVE_ALL_OUT_IGNORE_THE_REST_0cj9pv7f_32_771_1_r543859209_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>
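
The two summary lines in this stderr ("DONE :: ..." and "This process generated ...") are the quickest way to see how much work a task did before the upload failed. A rough parser, assuming the lines keep the wording shown above (the function name and the returned keys are illustrative):

```python
import re

DONE = re.compile(r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds")
GENERATED = re.compile(r"This process generated\s+(\d+) decoys from\s+(\d+) attempts")


def summarize(stderr_text):
    """Extract CPU seconds, decoys and attempts from a Rosetta stderr dump."""
    done = DONE.search(stderr_text)
    generated = GENERATED.search(stderr_text)
    if not (done and generated):
        return None  # task ended before printing its summary
    cpu_seconds = float(done.group(2))
    decoys = int(generated.group(1))
    return {
        "cpu_seconds": cpu_seconds,
        "decoys": decoys,
        "attempts": int(generated.group(2)),
        "seconds_per_decoy": cpu_seconds / decoys if decoys else None,
    }
```

For the task above this gives 1 decoy from 1 attempt in 1201 CPU seconds, so the computation itself completed and only the upload step failed.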
ID: 6781

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 6785 - Posted: 1 May 2020, 12:36:47 UTC

Tried running a couple of 4.20 work units, but they both failed after less than 1 1/2 minutes with the error "Process got Signal 11".
This also happened to my wingman; he got the same error.

Conan
ID: 6785

[VENETO] boboviz
Joined: 9 Apr 08
Posts: 913
Credit: 1,892,541
RAC: 294
Message 6825 - Posted: 14 May 2020, 5:52:39 UTC

All errors after a few seconds:
5154586
5154504
5154616

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF6E71C8316 read attempt to address 0xFFFFFFFF


- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x000001C8B3712B30


- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0000000900000008


etc...
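
To compare exception records like these across tasks, the addresses can be extracted mechanically. A sketch only: the function name is made up, and the optional "write attempt" alternative is an assumption, since only "read attempt" appears in the records above.

```python
import re

ACCESS_VIOLATION = re.compile(
    r"Access Violation \((0x[0-9A-Fa-f]+)\) at address (0x[0-9A-Fa-f]+)"
    r"(?: (read|write) attempt to address (0x[0-9A-Fa-f]+))?"
)


def parse_access_violations(text):
    """Return one dict per 'Access Violation' record found in the text."""
    records = []
    for m in ACCESS_VIOLATION.finditer(text):
        records.append({
            "code": int(m.group(1), 16),
            "fault_address": int(m.group(2), 16),
            "access": m.group(3),  # None when the record omits it
            "target_address": int(m.group(4), 16) if m.group(4) else None,
        })
    return records
```

On the three records above, the fault addresses all differ, while the exception code is 0xc0000005 each time.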
ID: 6825

PDW
Joined: 30 Aug 14
Posts: 6
Credit: 1,832,794
RAC: 0
Message 6826 - Posted: 14 May 2020, 5:54:57 UTC - in response to Message 6825.  

All of the ones just released, like these: test_ff_sym_c3_21res_c.127.43_0001_I_21_3_hit_CYS_GLU_4_5_4_cell033_0001_SAVE_ALL_OUT_47_105_1
Access Violation, even with just a single WU running on its own with plenty of memory.
ID: 6826

PDW
Joined: 30 Aug 14
Posts: 6
Credit: 1,832,794
RAC: 0
Message 6827 - Posted: 14 May 2020, 18:38:42 UTC - in response to Message 6826.  

More of the same: test_ff_sym_c3_21res_c.127.43_0001_I_21_3_hit_CYS_GLU_4_5_4_cell033_0001_SAVE_ALL_OUT_47_357_1
All Access Violation again.
ID: 6827

PDW
Joined: 30 Aug 14
Posts: 6
Credit: 1,832,794
RAC: 0
Message 6828 - Posted: 27 May 2020, 19:20:37 UTC - in response to Message 6827.  

Today's tasks fail with a file upload error, on Windows at least.
ID: 6828

CIA
Joined: 5 Apr 20
Posts: 13
Credit: 111,953
RAC: 0
Message 6829 - Posted: 28 May 2020, 15:45:56 UTC
Last modified: 28 May 2020, 15:46:35 UTC

I only got one recent Ralph WU but it didn't fare so well on my OSX machine, or on another Linux machine that also tried.

https://ralph.bakerlab.org/workunit.php?wuid=4615810
ID: 6829

Dr Who Fan
Joined: 2 Sep 06
Posts: 76
Credit: 107,857
RAC: 0
Message 6830 - Posted: 28 May 2020, 21:45:07 UTC

Failed for BOTH me and the wingman
Task: 5155194
Outcome: Computation error
Client state: Compute error
Exit status: 0 (0x00000000)


Stderr output
<core_client_version>7.16.3</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/ralph.bakerlab.org/rosetta_4.20_windows_x86_64.exe @jml20200526_358_A.flags -nstruct 10000 -cpu_run_time 3600 -boinc:max_nstruct 5000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3960217
Using database: database_357d5d93529_n_methylminirosetta_database
======================================================
DONE :: 66 starting structures 28483.3 cpu seconds
This process generated 66 decoys from 66 attempts
======================================================
BOINC :: WS_max 1.09564e+09
15:04:23 (4612): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>jml20200526_358_A_54_34_1_r1983490202_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>

</message>
]]>
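
The stderr above contains both the requested run time (`-cpu_run_time 3600`) and the CPU time actually reported (28483.3 seconds for 66 decoys). The snippet below pulls the flag out of a shortened copy of the command line and works out the ratio. The helper name is made up, and it only handles flags that are followed by a value.

```python
import re


def flag_value(command, flag):
    """Return the token following `-flag` in a command line, or None."""
    match = re.search(r"(?:^|\s)-" + re.escape(flag) + r"\s+(\S+)", command)
    return match.group(1) if match else None


# Shortened copy of the command line and summary figures from the stderr above.
command = (
    "rosetta_4.20_windows_x86_64.exe @jml20200526_358_A.flags -nstruct 10000 "
    "-cpu_run_time 3600 -boinc::watchdog -boinc::cpu_run_timeout 36000"
)
cpu_seconds, decoys = 28483.3, 66

target = int(flag_value(command, "cpu_run_time"))
print(target)                          # 3600
print(round(cpu_seconds / target, 2))  # 7.91
print(round(cpu_seconds / decoys, 1))  # 431.6
```

So this task reported about 7.9 times its target CPU time, at roughly 432 CPU seconds per decoy, before the upload failed.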


ID: 6830