RoseTTAFold All-Atom

Message boards : RALPH@home bug list : RoseTTAFold All-Atom


Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 . . . 11 · Next

Author / Message
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7590 - Posted: 14 Jun 2024, 8:04:56 UTC - in response to Message 7589.  
Last modified: 14 Jun 2024, 8:14:28 UTC

How do you know the rate is the same?
Because, as I said in my previous post, I had already processed 12 of those Tasks. The progress rate for the currently running single Task is the same as it was for those 12 other Tasks: it starts off fast, and continues to drop as the Task just keeps on going, well past the initial 4 hour estimate.

True, if a Task were ever to complete, then hopefully we could see whether it actually did any more work in that time. But at present I've got a single Task that is on the very same course as the previous ones: the same processing rate, the same slowing rate of fraction done, and heading for missing the deadline because it's going to take over 24 hours to process (if it ever does manage to finish).

Given that others have run into memory issues after letting it run for 24 hrs+, and I've now got 8 threads for the one Task, I'd have expected to start running into similar issues 8 times sooner, but that's yet to happen.

So I think it's a pretty reasonable assumption that it's not doing 8 times as much work as it was, which is what it would have to do to make using that many threads worthwhile.
ID: 7590
Mr P Hucker

Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 203
Message 7591 - Posted: 14 Jun 2024, 8:05:16 UTC - in response to Message 7588.  
Last modified: 14 Jun 2024, 8:06:23 UTC

They use 10 threads each
On my system I have limited them to only one TTAFold Task at a time.
It's using a maximum of 62% of my CPU time, which works out at just under 8 threads. So it's effectively using 8.
Yet the processing rate is the same as when 12 of them were fighting over 12 threads in total, so they really only need 1.
Actually, I have a different experience. I have a Ryzen 9 3900XT and a Ryzen 9 3900X (pretty much the same CPU). On one of them I told BOINC to allocate 9.6 threads (as it appeared to use about 9.6 on average when only one task was running). On the other I didn't get round to it, and didn't care so much, as it only runs BOINC and there's no GPU to slow down. The 9.6-thread one takes 2 hours to complete a task; the 1-thread one takes 23 hours. So from what I'm seeing, it does use the threads it's given.

Anyway, we can't trust the progress since we don't know whether it's reliable; it could just be a timer.
ID: 7591
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7592 - Posted: 14 Jun 2024, 8:31:41 UTC - in response to Message 7591.  

The 9.6 thread one takes 2 hours to complete a task, the 1 thread one takes 23 hours. So from what I'm seeing it does use the threads it's given.
All of your Tasks have errored out.
Some of them may have finished, but then failed when it came to return the result.
Yet another thing they need to fix if that is the case.


And the times between the 2 systems are all over the place.
The one with the 1 day 4 hour Tasks also has 5 min, 30 min and 7 hour Tasks.
The one that is consistently around 5 hr 30 min also has some 5 min Tasks.
If the extra threads are doing more work, it's not 8 times better; maybe 3 times. That's a waste of resources if so.
They need to fix all the other errors, then either optimise the multiprocessing or do away with it.
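The scaling argument in the last couple of posts can be put into numbers with the usual definitions: speedup is single-thread time divided by multi-thread time, and parallel efficiency is speedup divided by the number of threads used. A minimal sketch, using only figures quoted in this thread (Mr P Hucker's 23 h vs 2 h at 9.6 threads, and Grant's rough "maybe 3 times" on 8 threads); the function names are just illustrative:

```python
def speedup(t_one_thread: float, t_multi_thread: float) -> float:
    """How many times faster the multi-threaded run finished."""
    return t_one_thread / t_multi_thread


def efficiency(speedup_factor: float, threads: float) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = every thread fully used)."""
    return speedup_factor / threads


# Grant's rough estimate: maybe 3x faster while occupying 8 threads.
print(f"efficiency {efficiency(3.0, 8):.3f}")  # efficiency 0.375

# Mr P Hucker's figures: 23 h with 1 thread vs 2 h with 9.6 threads allocated.
s = speedup(23.0, 2.0)
print(f"speedup {s:.1f}x, efficiency {efficiency(s, 9.6):.2f}")  # speedup 11.5x, efficiency 1.20
```

An efficiency above 1.0, as Hucker's pair gives, would mean better-than-linear scaling; one possible reading is that those two tasks weren't doing the same amount of work, which fits the run times being all over the place.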



Anyway, we can't trust the progress since we don't know whether it's reliable; it could just be a timer.
Run time is the elapsed time, which is a timer. There should also be a CPU time figure, but that's broken. The progress figure does indicate how far along it is, which is why it slows down as the Task's run time drags on.
ID: 7592
Mr P Hucker

Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 203
Message 7593 - Posted: 14 Jun 2024, 8:55:18 UTC - in response to Message 7592.  
Last modified: 14 Jun 2024, 8:57:44 UTC

The progress figure simply indicates when it intends to finish. Rosetta 4.2 makes this about 8 hours, no matter how much actual work was done. So when you run 8 threads, you might actually be doing 8 times the work in the same task.

It depends on whether they want the best throughput or fast returns. Using 8 threads to do 3 threads' worth of work is good if each task is returned 3 times sooner. Folding@Home, for example, loves quick returns so they can calculate the next task from the results. Faster at the expense of lower efficiency is therefore OK.

Anyway, there are so many bugs that we can't draw much in the way of conclusions.

I'm just telling my computers to allocate the number of threads the tasks want to use, so everything runs smoothly.

It would be good if an admin replied. Are they even reading our posts and using the information?
ID: 7593
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7594 - Posted: 14 Jun 2024, 8:55:35 UTC

It seems that some computers can run the new app correctly, for example:
AuthenticAMD AMD Ryzen 9 5950X 16-Core Processor
NVIDIA GeForce RTX 3090 (24575MB) driver: 546.01


So, we need a little monster PC to crunch?
ID: 7594
Mr P Hucker

Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 203
Message 7595 - Posted: 14 Jun 2024, 8:58:14 UTC - in response to Message 7594.  

It seems that some computers can run the new app correctly, for example:
AuthenticAMD AMD Ryzen 9 5950X 16-Core Processor
NVIDIA GeForce RTX 3090 (24575MB) driver: 546.01
So, we need a little monster PC to crunch?
Is this app only supposed to be given to PCs with an Nvidia GPU?
ID: 7595
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7596 - Posted: 14 Jun 2024, 9:04:56 UTC - in response to Message 7595.  
Last modified: 14 Jun 2024, 9:05:22 UTC

NVIDIA GeForce RTX 3090 (24575MB) driver: 546.01

Is this app only supposed to be given to PCs with an Nvidia GPU?


Perhaps
(and not a little Nvidia)
ID: 7596
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7597 - Posted: 14 Jun 2024, 9:11:27 UTC - in response to Message 7596.  

(and not a little Nvidia)


For example, this guy has several PCs.
The WUs run correctly on a 3060, but not on a 4060 (e.g. WU 5460613).
This is strange.
ID: 7597
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7598 - Posted: 14 Jun 2024, 9:50:18 UTC - in response to Message 7595.  

Is this app only supposed to be given to PCs with an Nvidia GPU?
Not according to the home page:

We are initially targeting only Windows machines with or without Nvidia GPUs for this test.
Which is really odd wording; it's the same as saying "We are initially targeting only Windows machines with or without AMD GPUs for this test." or "We are initially targeting only Windows machines with or without Intel GPUs for this test." or "We are initially targeting only Windows machines with or without Apple GPUs for this test."
i.e. what they are saying is "We are initially targeting only Windows machines for this test."
ID: 7598
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7599 - Posted: 14 Jun 2024, 9:53:16 UTC - in response to Message 7597.  

(and not a little Nvidia)


For example, this guy has several PCs.
The WUs run correctly on a 3060, but not on a 4060 (e.g. WU 5460613).
This is strange.
Different video drivers.
ID: 7599
Vester

Joined: 29 Apr 20
Posts: 17
Credit: 1,176
RAC: 33
Message 7600 - Posted: 14 Jun 2024, 10:02:38 UTC

Upload failure.
Task 5456545
Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
00:01:02 (16016): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_88_16902_2_0_r1516067865_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
ID: 7600
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7601 - Posted: 14 Jun 2024, 10:07:25 UTC - in response to Message 7589.  

How do you know the rate is the same? We don't know how the task is counting progress. It could be timed like the standard Rosetta 4.2 tasks. Those take 8 hours on any speed of CPU.
Looks like those extra threads do have an impact: a Task finished in just under 4 hours (previously 20+ hrs with no sign of ending).

But then it failed like yours, with an upload issue.
It also shows signs of trying to use my video card but not being able to find CUDA support (it does have CUDA support, or did have).

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_340_16902_6_0

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\cuda\__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
19:19:49 (9548): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_340_16902_6_0_r1143327185_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
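The warning in this log comes from PyTorch's own CUDA check: in the torch version shown, `torch.cuda.is_available()` is where `torch._C._cuda_getDeviceCount() > 0` lives, and the warning is emitted when the CUDA driver can't be initialised. A card can report CUDA support in other tools while PyTorch still fails at this step. A minimal sketch for seeing what PyTorch itself reports, assuming you have a Python environment with PyTorch installed (the app's bundled interpreter and its exact torch build aren't known here, and `describe_cuda` is just an illustrative name):

```python
def describe_cuda() -> dict:
    """Report what PyTorch itself can see of CUDA on this machine."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this torch build was compiled against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # The check behind the log line above; False (with a warning) if the
        # CUDA driver can't be initialised.
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is set but `cuda_available` is False, the problem is more likely on the driver side than in the torch build.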
ID: 7601
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 890
Credit: 1,889,390
RAC: 1
Message 7602 - Posted: 14 Jun 2024, 10:08:10 UTC - in response to Message 7599.  

Different video drivers.


Uh, are we in such a state that a different driver can cause a WU to fail?
That would be a nightmare!
ID: 7602
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7603 - Posted: 14 Jun 2024, 10:11:08 UTC - in response to Message 7600.  
Last modified: 14 Jun 2024, 10:37:43 UTC

Upload failure.
Task 5456545
Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
00:01:02 (16016): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_88_16902_2_0_r1516067865_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>


Report deadline 14 Jun 2024, 1:08:41 UTC
Received 14 Jun 2024, 4:04:05 UTC
I'm wondering whether all these file transfer issues are due to the Ralph server not cancelling Tasks on the computer once they've been cancelled on the server.
You complete the task, but it can't be returned because it's been cancelled for missing the deadline on the server?
ID: 7603
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7604 - Posted: 14 Jun 2024, 10:13:11 UTC - in response to Message 7602.  

Different video drivers.
Uh, are we in such a state that a different driver can cause a WU to fail?
That would be a nightmare!
If there is a significant difference in the CUDA version supported by each driver version, and the TTAFold application makes calls that are only available in the most recent version, then it will fail.
ID: 7604
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7605 - Posted: 14 Jun 2024, 10:37:01 UTC - in response to Message 7603.  
Last modified: 14 Jun 2024, 10:38:31 UTC

Report deadline 14 Jun 2024, 1:08:41 UTC
Received 14 Jun 2024, 4:04:05 UTC
I'm wondering whether all these file transfer issues are due to the Ralph server not cancelling Tasks on the computer once they've been cancelled on the server.
You complete the task, but it can't be returned because it's been cancelled for missing the deadline on the server?


My failed upload Task details.
Report deadline  14 Jun 2024, 8:49:48 UTC
Received         14 Jun 2024, 9:51:17 UTC

P Hucker
Report deadline  14 Jun 2024, 8:25:06 UTC
Received         14 Jun 2024, 9:46:38 UTC


Anyone noticing a pattern here with the failed file transfers?
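Putting the three deadline/received pairs posted so far side by side (the one Grant posted in reply to Vester's upload failure, Grant's own, and P Hucker's) makes the pattern easy to check: subtract the report deadline from the received time. A minimal sketch; it only measures lateness and does not show that lateness is what causes the `-240 (stat() failed)` upload error, which is the open question here:

```python
from datetime import datetime, timedelta

# Format of the timestamps as posted in this thread (all UTC).
FMT = "%d %b %Y, %H:%M:%S"

# (report deadline, received) pairs exactly as posted above.
TASKS = {
    "Vester": ("14 Jun 2024, 1:08:41", "14 Jun 2024, 4:04:05"),
    "Grant": ("14 Jun 2024, 8:49:48", "14 Jun 2024, 9:51:17"),
    "P Hucker": ("14 Jun 2024, 8:25:06", "14 Jun 2024, 9:46:38"),
}


def lateness(deadline: str, received: str) -> timedelta:
    """Positive result = the server received the task after its deadline."""
    return datetime.strptime(received, FMT) - datetime.strptime(deadline, FMT)


for who, (deadline, received) in TASKS.items():
    late = lateness(deadline, received)
    status = f"late by {late}" if late > timedelta(0) else "on time"
    print(f"{who}: {status}")
```

Run as-is it prints "Vester: late by 2:55:24", "Grant: late by 1:01:29" and "P Hucker: late by 1:21:32", so every failed upload in these posts was received after its deadline.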


And every single Ralph Task on my system is past the deadline, and even the ones not yet started aren't being removed by the server.
If I can't return the result, there's not much point keeping them...
Grant
Darwin NT
ID: 7605
Mr P Hucker

Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 203
Message 7606 - Posted: 14 Jun 2024, 10:50:27 UTC - in response to Message 7598.  
Last modified: 14 Jun 2024, 10:51:02 UTC

Is this app only supposed to be given to PCs with an Nvidia GPU?
Not according to the home page:

We are initially targeting only Windows machines with or without Nvidia GPUs for this test.
Which is really odd wording; it's the same as saying "We are initially targeting only Windows machines with or without AMD GPUs for this test." or "We are initially targeting only Windows machines with or without Intel GPUs for this test." or "We are initially targeting only Windows machines with or without Apple GPUs for this test."
i.e. what they are saying is "We are initially targeting only Windows machines for this test."
Seems plain to me. You must have Windows. An Nvidia GPU will be used if you have one. Other GPUs will not be used.
ID: 7606
Mr P Hucker

Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 203
Message 7607 - Posted: 14 Jun 2024, 10:52:16 UTC - in response to Message 7602.  

Different video drivers.


Uh, are we in such a state that a different driver can cause a WU to fail?
That would be a nightmare!
This is often the case with many projects. PrimeGrid has a list of driver versions to avoid; Nvidia are well known for buggy drivers.
ID: 7607
Mr P Hucker

Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 203
Message 7608 - Posted: 14 Jun 2024, 10:55:09 UTC - in response to Message 7605.  

Report deadline 14 Jun 2024, 1:08:41 UTC
Received 14 Jun 2024, 4:04:05 UTC
I'm wondering whether all these file transfer issues are due to the Ralph server not cancelling Tasks on the computer once they've been cancelled on the server.
You complete the task, but it can't be returned because it's been cancelled for missing the deadline on the server?


My failed upload Task details.
Report deadline  14 Jun 2024, 8:49:48 UTC
Received         14 Jun 2024, 9:51:17 UTC

P Hucker
Report deadline  14 Jun 2024, 8:25:06 UTC
Received         14 Jun 2024, 9:46:38 UTC


Anyone noticing a pattern here with the failed file transfers?


And every single Ralph Task on my system is past the deadline, and even the ones not yet started aren't being removed by the server.
If I can't return the result, there's not much point keeping them...
The normal way to do things is to let you report (and get credit for) a late task. But at the deadline it's resent in case your computer is never going to do it (e.g. it broke). If yours is returned before the other guy starts it, his PC is told not to bother. If he's started it, he finishes it too, and it's used for comparison (and so as not to upset silly people who think they've wasted their time, when actually they're wasting more time by completing a task you've already done).
ID: 7608
Grant (SSSF)

Joined: 13 Jun 24
Posts: 83
Credit: 67,898
RAC: 3,459
Message 7609 - Posted: 14 Jun 2024, 11:02:35 UTC - in response to Message 7601.  

It also shows signs of trying to use my video card but not being able to find CUDA support (it does have CUDA support, or did have).
GPU-Z says I have CUDA.
Grant
Darwin NT
ID: 7609




©2024 University of Washington
http://www.bakerlab.org