RoseTTAFold All-Atom

Message boards : RALPH@home bug list : RoseTTAFold All-Atom

Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7550 - Posted: 12 Jun 2024, 17:10:12 UTC

Task 5454093
Name	RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_e_pred_159_16901_1_0
Workunit	4846073
Created	12 Jun 2024, 10:56:21 UTC
Sent	12 Jun 2024, 12:18:31 UTC
Report deadline	13 Jun 2024, 12:18:31 UTC
Received	12 Jun 2024, 14:17:23 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	12 (0x0000000C) Unknown error code
Computer ID	49920
Run time	
CPU time	
Validate state	Invalid
Credit	0.00
Device peak FLOPS	5.70 GFLOPS
Application version	Generalized biomolecular modeling and design with RoseTTAFold All-Atom v0.02
windows_x86_64

Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
'C:\ProgramData\BOINC/projects/ralph.bakerlab.org\ev0\Scripts\activate.bat' is not recognized as an internal or external command,
operable program or batch file.

</stderr_txt>
]]>
ID: 7550
Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7551 - Posted: 12 Jun 2024, 17:13:17 UTC

Task 5455911
Name RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_e_pred_41_16901_2_1
Workunit 4845355
Created 12 Jun 2024, 14:30:04 UTC
Sent 12 Jun 2024, 14:31:43 UTC
Report deadline 13 Jun 2024, 14:31:43 UTC
Received 12 Jun 2024, 14:45:06 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 12 (0x0000000C) Unknown error code
Computer ID 49920
Run time 17 sec
CPU time
Validate state Invalid
Credit 0.00
Device peak FLOPS 5.70 GFLOPS
Application version Generalized biomolecular modeling and design with RoseTTAFold All-Atom v0.02
windows_x86_64
Peak working set size 1,970.34 MB
Peak swap size 6,092.06 MB
Peak disk usage 0.01 MB

Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 699, in <module>
    with zipfile.ZipFile(b) as z:
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\zipfile.py", line 1268, in __init__
    self._RealGetContents()
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\zipfile.py", line 1335, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

</stderr_txt>
]]>
ID: 7551
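The traceback in this post shows `zipfile.ZipFile` being handed a file that is not a valid archive, which is what a truncated or incomplete download would look like. A minimal sketch of how a script could check its input first and fail with a clearer message; `open_archive` is a hypothetical helper and is not taken from the project's code.

```python
import zipfile
from pathlib import Path


def open_archive(path):
    """Return the member names of a zip archive, or raise a descriptive error.

    Hypothetical helper: not part of the RoseTTAFold All-Atom application.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"input archive missing: {p}")
    # is_zipfile() checks the file's zip signature, so a truncated or
    # partial download is rejected here instead of deep inside ZipFile().
    if not zipfile.is_zipfile(p):
        raise ValueError(f"{p} is {p.stat().st_size} bytes but is not a zip file")
    with zipfile.ZipFile(p) as z:
        return z.namelist()
```

Reporting the file size in the error makes it easier to tell an empty or partial download from a file of the wrong type.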
Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7552 - Posted: 13 Jun 2024, 1:41:53 UTC
Last modified: 13 Jun 2024, 2:29:39 UTC

I have ten tasks in progress that have run for more than twenty-five minutes without failing. At first there were failures, such as the one shown below, indicating that my system-managed paging file was inadequate. I quickly suspended tasks, restarted, and entered the BIOS to disable hyperthreading on my ten-core Intel i9-10850K CPU. After the tasks continued running without failures, I restarted and changed the paging file from the system-managed 42235 MB to a fixed 84470 MB. I have enabled hyperthreading again but have only 12 tasks available.
Task 5456469
Name	RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_62_16902_5_0
Workunit	4847018
Created	12 Jun 2024, 23:56:01 UTC
Sent	13 Jun 2024, 1:08:41 UTC
Report deadline	14 Jun 2024, 1:08:41 UTC
Received	13 Jun 2024, 1:13:38 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	12 (0x0000000C) Unknown error code
Computer ID	49920
Run time	1 sec
CPU time	
Validate state	Invalid
Credit	0.00
Device peak FLOPS	5.70 GFLOPS
Application version	Generalized biomolecular modeling and design with RoseTTAFold All-Atom v0.02
windows_x86_64
Peak working set size	33.86 MB
Peak swap size	2,778.58 MB
Peak disk usage	2.10 MB

Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 8, in <module>
    import torch
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\__init__.py", line 124, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.

</stderr_txt>
]]>
ID: 7552
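The paging-file change in this post can be put into rough numbers. A minimal sketch, assuming each task commits about as much memory as the 6,092.06 MB "Peak swap size" reported for task 5455911 earlier in the thread. It counts the paging file only, because the machine's physical RAM is not stated, so the result is a rough lower bound rather than a measured limit.

```python
import math


def max_tasks_by_commit(commit_limit_mb, per_task_peak_mb):
    """Upper bound on concurrent tasks before the commit limit is exhausted."""
    return math.floor(commit_limit_mb / per_task_peak_mb)


# Figures quoted in this thread; physical RAM is ignored because it is not stated.
PER_TASK_PEAK_SWAP_MB = 6092.06  # "Peak swap size" of task 5455911
for paging_file_mb in (42235, 84470):
    n = max_tasks_by_commit(paging_file_mb, PER_TASK_PEAK_SWAP_MB)
    print(f"{paging_file_mb} MB paging file -> about {n} tasks")
```

Under these assumptions the original 42235 MB covers about 6 such tasks and the fixed 84470 MB about 13, which fits the observation that failures stopped after the paging file was doubled.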
Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7553 - Posted: 13 Jun 2024, 2:37:12 UTC

Are there no checkpoints? I restarted after 30 minutes of crunching (over 20% complete) and the tasks restarted at zero progress.
ID: 7553
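For readers unfamiliar with what checkpointing would involve, here is a generic sketch of the technique: progress is saved periodically and picked up again on restart. It is an illustration only, not how the RoseTTAFold All-Atom app or the BOINC API does it. The function names, the JSON state file, and the `stop_after` parameter are all hypothetical.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written file


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0}


def run(path, total_steps, stop_after=None):
    """Resume from the last checkpoint; optionally stop early to mimic a restart."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # placeholder for one unit of real work
        save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break
    return state["step"]
```

Calling `run("ckpt.json", 10, stop_after=3)` and then `run("ckpt.json", 10)` continues from step 3 rather than from zero, which is the behaviour being asked for here.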
mikey
Joined: 28 Nov 20
Posts: 8
Credit: 114,593
RAC: 377
Message 7554 - Posted: 13 Jun 2024, 3:08:43 UTC - in response to Message 7553.  

Mine are using almost 6 GB of RAM for EACH task, so that could be why some are failing. I had to limit the 8 that were running to only 3 because of the RAM usage, but those 3 ARE running, past the 3-minute mark so far!!
ID: 7554
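One way to cap the number of running tasks without suspending them by hand is BOINC's optional `app_config.xml`, placed in the project's directory. A minimal sketch that generates such a file's contents; the limit of 3 mirrors the figure in this post, and `build_app_config` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent):
    """Build the XML text of a BOINC app_config.xml that caps running tasks."""
    root = ET.Element("app_config")
    # project_max_concurrent limits how many of this project's tasks run at once.
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


print(build_app_config(3))
```

The client picks the file up after a restart or after re-reading its config files. The right limit depends on how much RAM each task actually uses on a given machine.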
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7555 - Posted: 13 Jun 2024, 6:20:39 UTC

My default runtime is 4 hrs, but after 6 hrs the WUs are at 75%.
If you look at the Windows Task Manager, you see these are Python apps (on my PC it is 1 GB of RAM per WU).
The deadline is only 1 day :-(
ID: 7555
Grant (SSSF)
Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7556 - Posted: 13 Jun 2024, 9:26:42 UTC
Last modified: 13 Jun 2024, 9:47:41 UTC

I was getting Tasks erroring out; looking in my Event log, I saw complaints about not enough disk space.
Went from 20GB for BOINC to 30GB.
13/06/2024 18:29:33 | ralph@home | Message from server: Generalized biomolecular modeling and design with RoseTTAFold All-Atom needs 3216.72MB more disk space. You currently have 13167.28 MB available and it needs 16384.00 MB.

Don't know why it wasn't happy, but after giving it more the errors stopped (or I got lucky with the Tasks).

Or was it trying to run Tasks before the last of the downloads had completed? But then why complain about disk space?

Behaviour of these Tasks is very odd looking at Task Manager: the CPU usage varies between 2% and 12%.
Suspending all but 6 Tasks (12-thread CPU) results in no improvement. In fact it makes it worse, as suspending doesn't actually suspend them.
In the BOINC Manager they show as suspended, but in Task Manager they are still there using CPU time.

After several suspends and resumes, the end result in BOINC Manager:
12 are running,
6 are waiting to run,
7 are ready to start.

Yet in Task Manager there are 18 Python processes running, each using 700 MB to 1.6 GB of RAM (the amount varies second by second).

Developers: please fix the suspend function so that Tasks do suspend when told to.
Fix checkpointing so work can continue on, not start from scratch after restarting BOINC (or un-suspending, when suspending is fixed).

System is now sluggish, barely responsive at times. May need to kill some of those processes to keep the system up, but will try exiting & restarting BOINC first.

Edit- exited BOINC, restarted, and back to just 12 running Tasks in Task Manager.
ID: 7556
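The server message quoted in this post is plain arithmetic on the disk space BOINC reports as available. A minimal sketch reproducing the figure; `disk_shortfall_mb` is a hypothetical helper, and the numbers are the ones from the Event log line.

```python
def disk_shortfall_mb(needed_mb, available_mb):
    """Extra space the app still needs; 0.0 when the allowance is sufficient."""
    return round(max(0.0, needed_mb - available_mb), 2)


# Values from the Event log message: needs 16384.00 MB, 13167.28 MB available.
print(disk_shortfall_mb(16384.00, 13167.28))
```

The requirement of 16384 MB is exactly 16 × 1024 MB, so the app appears to ask for a flat 16 GB. The "available" figure presumably reflects BOINC's own disk allowance rather than free space on the drive, which would explain why raising the limit from 20GB to 30GB cleared the message.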
Grant (SSSF)
Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7557 - Posted: 13 Jun 2024, 9:40:52 UTC - in response to Message 7553.  
Last modified: 13 Jun 2024, 9:48:30 UTC

Are there no checkpoints? I restarted after 30 minutes of crunching (over 20% complete) and the tasks restarted at zero progress.
Looking at the properties of my running Tasks, and it shows as no CPU time elapsed, and no elapsed CPU time since last checkpoint.

Edit- exited & restarted BOINC- all work done lost, started from scratch.
ID: 7557
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7558 - Posted: 13 Jun 2024, 12:10:16 UTC - in response to Message 7557.  

Looking at the properties of my running Tasks, and it shows as no CPU time elapsed, and no elapsed CPU time since last checkpoint.
Edit- exited & restarted BOINC- all work done lost, started from scratch.


Yep, I had to restart my PC after 11 hrs (93% of the calculation) and it restarted from 0%
:-(
ID: 7558
kotenok2000
Joined: 26 Feb 21
Posts: 22
Credit: 1,893
RAC: 0
Message 7559 - Posted: 13 Jun 2024, 13:08:10 UTC
Last modified: 13 Jun 2024, 13:18:56 UTC

It uses GPU, but boinc manager doesn't reflect that.

Also stderr.txt is empty. I would like to see some output during computation.
ID: 7559
Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7560 - Posted: 13 Jun 2024, 14:51:41 UTC - in response to Message 7559.  
Last modified: 13 Jun 2024, 14:52:40 UTC

There is no entry in stderr because there are not any errors. You may find what interests you in file:///C:/ProgramData/BOINC/client_state.xml, kotenok2000.
ID: 7560
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7561 - Posted: 13 Jun 2024, 14:57:11 UTC - in response to Message 7559.  

It uses GPU, but boinc manager doesn't reflect that.


Nvidia, i suppose...
ID: 7561
kotenok2000
Joined: 26 Feb 21
Posts: 22
Credit: 1,893
RAC: 0
Message 7562 - Posted: 13 Jun 2024, 15:01:48 UTC
Last modified: 13 Jun 2024, 15:05:06 UTC

My task crashed with
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
16:47:05 (15252): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_342_16902_3_1_r301773838_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
ID: 7562
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7563 - Posted: 13 Jun 2024, 15:14:56 UTC

The behaviour on my computer is this:
the WUs start "fast", 1% of the calculation in approximately 2.5 minutes (so about 4 hrs to complete a WU), and then slow down.
Now, to crunch the 1% between 51% and 52% takes over 5 minutes. And they continue to slow down.

This version of the app is better than the previous ones, but there is still a lot of work to do....
ID: 7563
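The runtime estimate in this post can be checked with a short calculation. A minimal sketch, assuming progress is linear at whatever rate is currently observed, which is exactly the assumption the slowdown breaks; `remaining_minutes` is a hypothetical helper.

```python
def remaining_minutes(percent_done, minutes_per_percent):
    """Time left if every remaining percent takes minutes_per_percent."""
    return (100.0 - percent_done) * minutes_per_percent


# At the start: 2.5 minutes per percent for the whole task.
print(remaining_minutes(0, 2.5) / 60)
# At 52% done, now taking 5 minutes per percent.
print(remaining_minutes(52, 5.0) / 60)
```

At the initial rate the whole task comes to 250 minutes, a little over 4 hours. At 52% and 5 minutes per percent there are still 240 minutes left even if the rate stops degrading, which sits badly with a 1-day deadline.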
Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7564 - Posted: 13 Jun 2024, 15:20:37 UTC - in response to Message 7562.  
Last modified: 13 Jun 2024, 15:23:43 UTC

kotenok2000, that is not good news. I have 12 CPU tasks still running for over 12 hours.

A Brave AI search found:
Error_code -240 (stat() failed) /error_code
The error code -240 (stat() failed) typically indicates that the system was unable to access or retrieve information about a file or directory. This can occur due to various reasons such as:

File or directory not found: The file or directory specified does not exist or is inaccessible.
Permission issues: The system does not have the necessary permissions to access the file or directory.
File system errors: There may be issues with the file system, such as a corrupted file system or a disk error.
To resolve this issue, you can try the following:

Check the file or directory path: Verify that the file or directory exists and is accessible.
Check permissions: Ensure that the system has the necessary permissions to access the file or directory.
Run a disk check: Run a disk check to identify and fix any file system errors.
It is also important to note that this error code may also occur in other contexts, such as in the context of a database or a network connection. In these cases, the solution may be different and may require specific troubleshooting steps.

AI-generated answer. Please verify critical facts.
ID: 7564
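The checks listed in this post (does the file exist, can it be read) are easy to script. A minimal sketch, assuming the upload failed because the expected output file was never written; `check_output_file` is a hypothetical helper, and the path to test would be the output file named in the failing task's error message.

```python
import os


def check_output_file(path):
    """Return a list of problems that would make stat() or an upload fail."""
    problems = []
    if not os.path.exists(path):
        problems.append("missing")
        return problems
    if not os.access(path, os.R_OK):
        problems.append("not readable")
    if os.stat(path).st_size == 0:
        problems.append("empty")
    return problems
```

An empty list means the file is present, readable and non-empty. A result of `["missing"]` would point at the app finishing without producing its output rather than at a disk or permission fault.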
kotenok2000
Joined: 26 Feb 21
Posts: 22
Credit: 1,893
RAC: 0
Message 7565 - Posted: 13 Jun 2024, 17:55:16 UTC - in response to Message 7556.  

Edit- exited BOINC, restarted, and back to just 12 running Tasks in Task Manager.


You shouldn't run 12 tasks unless you have 12 GPUs.
ID: 7565
kotenok2000
Joined: 26 Feb 21
Posts: 22
Credit: 1,893
RAC: 0
Message 7566 - Posted: 13 Jun 2024, 21:31:31 UTC
Last modified: 13 Jun 2024, 21:45:02 UTC

All tasks fail on my system.
https://ralph.bakerlab.org/results.php?hostid=48013

And there is nothing useful in stderr.txt
ID: 7566
kotenok2000
Joined: 26 Feb 21
Posts: 22
Credit: 1,893
RAC: 0
Message 7567 - Posted: 13 Jun 2024, 23:22:46 UTC
Last modified: 13 Jun 2024, 23:23:01 UTC

Was anyone able to return a task successfully?
ID: 7567
Vester
Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7568 - Posted: 13 Jun 2024, 23:46:25 UTC - in response to Message 7567.  

I have twelve CPU tasks that are over 99% completed, but have not yet completed any tasks. I will know in the next few hours.
ID: 7568
kotenok2000
Joined: 26 Feb 21
Posts: 22
Credit: 1,893
RAC: 0
Message 7569 - Posted: 13 Jun 2024, 23:48:40 UTC

Does your Task Manager show any GPU activity?
ID: 7569
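Besides watching Task Manager, the question of whether a task can use the GPU at all could be checked from Python. The tracebacks earlier in the thread show that the app imports `torch` and loads a cuDNN DLL. A minimal sketch, assuming it is run with the same Python environment the app uses; `visible_cuda_devices` is a hypothetical helper.

```python
def visible_cuda_devices():
    """Return names of CUDA devices PyTorch can see; empty if none or no PyTorch."""
    try:
        import torch
    except (ImportError, OSError):
        # OSError covers DLL load failures such as the WinError 1455 reported earlier.
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


print(visible_cuda_devices())
```

An empty list means the tasks can only be running on the CPU. A non-empty list shows that PyTorch could use the GPU, not that the app actually does.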



©2024 University of Washington
http://www.bakerlab.org