RoseTTAFold All-Atom

Message boards : RALPH@home bug list : RoseTTAFold All-Atom

Mr P Hucker
Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 3
Message 7611 - Posted: 14 Jun 2024, 11:45:50 UTC - in response to Message 7610.  

Note: One can also limit the number of cores in Windows 11 by setting "number of processors" in Advanced Boot options (run msconfig).
I've opted to use max_concurrent to limit the number of cores/threads available to the TTAFold Tasks, leaving the others available for other processes.
As I have found, they are pigs. 1 Task = 8 threads.

I am an idiot! All we have to do is set the percentage of CPUs in our account preferences. I set my 10 cores hyperthreaded (20) to 10% and run 2 tasks.
No good if you are also running other projects. I limited it in app_config for this project by telling it each task needs 10 threads.
ID: 7611
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 913
Credit: 1,892,541
RAC: 294
Message 7612 - Posted: 14 Jun 2024, 12:09:10 UTC - in response to Message 7607.  

This is often the case with many projects. Primegrid has a list of which versions to avoid - Nvidia are well known for buggy drivers.


I know, I know.
PrimeGrid has long-standing GPU apps with well-known behaviour.
Here we know nothing about this app....
ID: 7612
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 913
Credit: 1,892,541
RAC: 294
Message 7613 - Posted: 14 Jun 2024, 13:56:06 UTC - in response to Message 7604.  

If there is a significant difference in the CUDA version between driver versions, and the TTAFold application makes use of calls that are only available in the most recent version, fail time.


Some messages above, you said that the app is CPU only.
I think that until admins say something, we can only guess...
ID: 7613
Bill F
Joined: 1 Jan 18
Posts: 21
Credit: 34,272
RAC: 52
Message 7615 - Posted: 14 Jun 2024, 17:22:23 UTC

I have gotten a couple of completions with credit on this system, which is not a powerful system. It appears that they ran for about 8 hours or so.

CPU: GenuineIntel Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz [Family 6 Model 60 Stepping 3] (4 processors)
GPU: INTEL Intel(R) HD Graphics 4600 (1629MB) OpenCL: 1.2
OS: Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00)
In October 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.

ID: 7615
Mr P Hucker
Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 3
Message 7616 - Posted: 14 Jun 2024, 18:31:05 UTC - in response to Message 7615.  
Last modified: 14 Jun 2024, 18:32:40 UTC

I have gotten a couple of completions with credit on this system, which is not a powerful system. It appears that they ran for about 8 hours or so.

CPU: GenuineIntel Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz [Family 6 Model 60 Stepping 3] (4 processors)
GPU: INTEL Intel(R) HD Graphics 4600 (1629MB) OpenCL: 1.2
OS: Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00)
I have 5 computers of varying power (4 to 24 threads), and they've all broken every single task. Oh well, at least they worked out how to send the right zip files....

If they get multithreaded tasks working well, this will be good for Rosetta.

I see over on the main server they're handing out a shitload of Beta v6. I wonder what happened to v5?
ID: 7616
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7617 - Posted: 14 Jun 2024, 21:37:55 UTC - in response to Message 7613.  
Last modified: 14 Jun 2024, 22:36:41 UTC

If there is a significant difference in the CUDA version between driver versions, and the TTAFold application makes use of calls that are only available in the most recent version, fail time.


Some messages above, you said that the app is CPU only.
I think that until admins say something, we can only guess...
Yep.
There are signs that if it finds an Nvidia driver, it tries to run on the GPU; otherwise it reverts to the CPU.
In which case the developers really need to inform us of the minimum CPU, OS, video card and video card driver requirements. And they need to hard code those requirements into the application so it doesn't try to run on unsupported hardware/software.

Here's an example of the problem.
What appears to be 6 hours of runtime on an RTX 2080 Super, and then it ran out of video RAM.
Not a lot of video cards have more than 8GB of RAM. Hell, not a lot of video cards would even have 8GB of RAM.
Most would be 4-6GB.
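
A minimal sketch, in Python, of the kind of up-front requirements check being asked for here. Everything in it is an assumption for illustration: the `GpuInfo` fields, the `choose_device` function and the threshold values are hypothetical, since the project has not published its real requirements. The idea is only to decide before starting whether the GPU is usable, and to fall back to the CPU otherwise, rather than finding out hours into a run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuInfo:
    """What the host reports about one GPU (hypothetical structure)."""
    name: str
    vram_gb: float
    driver_version: tuple  # e.g. (552, 22)


def choose_device(gpu, min_vram_gb, min_driver):
    """Return "cuda" only if the GPU meets every requirement, else "cpu".

    min_vram_gb and min_driver are placeholders supplied by the caller;
    they are not published project requirements.
    """
    if gpu is None:
        return "cpu"  # no usable GPU detected
    if gpu.vram_gb < min_vram_gb:
        return "cpu"  # would risk running out of video RAM mid-task
    if gpu.driver_version < min_driver:
        return "cpu"  # driver older than the assumed minimum
    return "cuda"


# Made-up thresholds: an 8 GB card on an old driver falls back to the CPU.
card = GpuInfo("example card", 8.0, (447, 0))
print(choose_device(card, min_vram_gb=8.0, min_driver=(500, 0)))  # cpu
```

With a check like this the scheduler-visible behaviour would at least be predictable: a host either qualifies for GPU work or it does not.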

A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing.


The other problem is with its multiprocessing: the initial Estimated processing time is 4 hours, when in actual fact it takes over 24 hours (longer than the deadline).
If you do give it the 8 threads it wants, it then takes around 4 hrs, but the initial Estimated processing time is only 53 min.
This needs to be fixed, urgently.

And I'm still not sure whether the multiprocessing really does more processing using 8 cores than a single core, or whether it's just that you no longer have Tasks wanting to use 96 threads when only 12 are available, all fighting each other for CPU time. That was a massively, massively over-committed system.
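
The over-commitment can be put into numbers with a short Python sketch. The function names are made up for illustration; the figures (12 Tasks, 8 threads each, 12 hardware threads) are the ones from this thread.

```python
def oversubscription(tasks, threads_per_task, hardware_threads):
    """Ratio of threads requested to threads the CPU actually has."""
    return (tasks * threads_per_task) / hardware_threads


def max_tasks_without_overcommit(threads_per_task, hardware_threads):
    """Largest number of tasks whose threads all fit on the CPU at once."""
    return hardware_threads // threads_per_task


# 12 Tasks x 8 threads = 96 threads requested on a 12-thread CPU.
print(oversubscription(12, 8, 12))          # 8.0
print(max_tasks_without_overcommit(8, 12))  # 1
```

So on a 12-thread machine only one 8-thread Task fits without the Tasks competing with each other for CPU time.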


The other thing that needs to be fixed, urgently, is not being able to return work once completed.
I aborted all my Tasks that were past the deadline, and got some more. The first one was returned well before the deadline:
Report deadline  15 Jun 2024, 10:39:24 UTC
Received        14 Jun 2024, 17:41:52 UTC
but it also failed to upload the result.
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\cuda\__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
03:10:21 (13388): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_369_16902_5_1_r880759058_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>


RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_369_16902_5_1


The developers have a mountain of serious work to do before this even makes it to a beta stage.
Grant
Darwin NT
ID: 7617
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7618 - Posted: 14 Jun 2024, 22:42:26 UTC - in response to Message 7617.  

The other thing that needs to be fixed, urgently, is not being able to return work once completed.
....

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_369_16902_5_1
And yet my next completed Task was able to be returned.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_g_pred_173_16903_4_1

although it did include a complaint about CUDA, which GPU-Z says is available.
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\cuda\__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
07:08:34 (15488): called boinc_finish(0)

</stderr_txt>
]]>

Grant
Darwin NT
ID: 7618
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7620 - Posted: 15 Jun 2024, 0:06:29 UTC
Last modified: 15 Jun 2024, 0:49:16 UTC

So, in no particular order, the list of things that need fixing so far.


1. Tasks need to be able to be suspended.
Having them still running in the background even though the BOINC Manager says they are suspended will lead to unresponsive & crashed systems as they run out of memory and available CPU time.

2. Checkpointing needs to be fixed: at present there are no checkpoints at all.
So if someone has to exit BOINC, or restart their computer, then all work to that point is lost.

3. CPU time needs to work properly.
Not just for scheduling reasons but also so that checkpointing can work.

4. The issue with file transfers failing needs to be fixed.
What good is completing a Task if it fails because you can't return the result?

5. The issue with multiprocessing needs to be addressed.
By default, TTAFold Tasks take up all CPU time, regardless of what other applications are running (e.g. Rosetta Beta Tasks that show in the BOINC Manager as running, but in Task Manager show as getting 0 CPU time).
Needing 8 threads to process 1 TTAFold Task is not viable. It needs to be limited to just 1 thread per Task, with an option in the Project settings of the user's account to allocate more if they wish. Or if 2 (or more) are used by default, then that needs to be handled in a way that the BOINC Manager is aware of.
This will allow the BOINC Manager to know how many threads are really available for other Tasks to use and stop systems from being over-committed, which results in missed deadlines & work scheduling issues, with almost everything ending up in Panic mode all the time. Trying to use more threads than the processor actually has just leads to grief.

6. Initial Estimated completion times need to be more realistic.
The default starts at 4 hours, yet with multiple TTAFold Tasks running their completion time is over 24 hours, which means they miss the Project's own deadlines.

If we do limit the number of Tasks so that each one gets the 8 threads it needs, then the initial Estimated completion time is around 50 min, yet the actual completion time is around 4 hours. Missed deadlines, Panic mode & general chaos will be the result of this as well.

7. Requirements for GPU processing need to be known.
People need to know just what is required in order for GPU processing to occur (i.e. minimum CPU, OS, video card type, video card memory size, driver version), in order to reduce confusion & concern about what is going on with their systems.

8. Having one application for both GPU & CPU work.
This will probably be the single biggest issue, as history has shown us that without separate tracking of actual processing times to produce separate initial estimated completion times, honouring Resource shares becomes near impossible. Systems download more work than they can finish, or less work than they can handle, all depending on how much work is done on which compute resource.

With Rosetta having a (generally) fixed Runtime, as long as that can be maintained for Tasks regardless of whether they are done on a low-end CPU or a high-end GPU then everything will be OK. But that is going to be a huge ask IMHO due to the massive difference in computing capability between those two extremes.

Otherwise having two different applications (even if they are exactly the same other than their names, e.g. TTAFold CPU and TTAFold GPU) like every other project would be the way to go.
It may well be the best option to allow the default number of threads for CPU Tasks to be one (selectable by the user as they wish), and the default number of threads for work on the GPU to be set to the number of Compute Units the GPU being used has (that is the OpenCL term; I think NVIDIA calls them SMs, Streaming Multiprocessors), again selectable by the user.
Either way the option in the Project settings for each user's account for using or not using the GPU (and ideally the CPU as well) will be necessary.
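
To put item 6 into numbers, here is a small Python sketch. The helper name is made up for illustration; the estimates and runtimes are the rough figures quoted in the list, so the ratios are approximate, and "over 24 hours" makes the first one a lower bound.

```python
def estimate_error(estimated_minutes, actual_minutes):
    """How many times longer a task ran than its initial estimate."""
    return actual_minutes / estimated_minutes


# (situation, initial estimate in minutes, observed runtime in minutes)
observations = [
    ("many Tasks, over-committed", 4 * 60, 24 * 60),
    ("one Task with 8 threads", 50, 4 * 60),
]

for label, estimated, actual in observations:
    ratio = estimate_error(estimated, actual)
    print(f"{label}: ran about {ratio:.1f}x longer than estimated")
```

In both situations the runtime is several times the initial estimate, which is the kind of gap that leads to the missed deadlines described above.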
Grant
Darwin NT
ID: 7620
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7621 - Posted: 15 Jun 2024, 2:45:59 UTC

OK, I did something incredibly rash: I upgraded my video driver even though there was nothing wrong.
Went from 447.xx to 552.22.

End result: on restart my GPU is now crunching TTAFold Tasks. CPU usage is 2 threads in order to support the one Task on the GPU.
I tried running a second Task using my second GPU, but it ended up moving to the first GPU as well, significantly impacting system responsiveness and making everything extremely sluggish.
So just 1 Task at a time it is.

Gone from 8 threads taking 4 hours, to 2 threads & the GPU taking 20 min or so (at least for the first Task).
Grant
Darwin NT
ID: 7621
mikey
Joined: 28 Nov 20
Posts: 9
Credit: 114,771
RAC: 17
Message 7622 - Posted: 15 Jun 2024, 4:21:13 UTC - in response to Message 7580.  
Last modified: 15 Jun 2024, 4:23:01 UTC

There needs to be a way to limit the number of threads a single Task can use.
And at the moment, the indications are that 1 Task using 8 threads performs no better than when 12 Tasks were trying to use 8 threads each, when there were only 12 threads available.
I'll keep an eye on things to see if they don't slow down later, but the initial signs are that the extra threads are providing not even the slightest improvement in processing time- they're just being wasted.

Try this:

<app_config>
<project_max_concurrent>1</project_max_concurrent>
</app_config>

Save it as 'app_config.xml' in each project's folder under C:\ProgramData\BOINC\projects\ and adjust the number to reflect how many tasks of each project you want to run, i.e. 1 of this project, 2 of that project and 7 of another.
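
For a per-application limit rather than a per-project one, a sketch like the following Python helper writes out a slightly fuller file. The helper name and the `example_app` value are placeholders, not the real application name (the short name has to be looked up in the client's own state file). The `<app>`, `<max_concurrent>`, `<app_version>` and `<avg_ncpus>` elements are standard app_config.xml options; if the app version has a plan class, a `<plan_class>` line would also be needed inside `<app_version>`. Note that `<avg_ncpus>` tells the BOINC client how many threads to budget for each task. Whether the application itself then uses fewer threads is a separate question.

```python
def build_app_config(app_name, max_concurrent, threads_per_task):
    """Return app_config.xml text limiting one app's tasks and CPU budget."""
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        f"    <max_concurrent>{max_concurrent}</max_concurrent>\n"
        "  </app>\n"
        "  <app_version>\n"
        f"    <app_name>{app_name}</app_name>\n"
        f"    <avg_ncpus>{threads_per_task}</avg_ncpus>\n"
        "  </app_version>\n"
        "</app_config>\n"
    )


# "example_app" is a placeholder; substitute the app's real short name.
print(build_app_config("example_app", max_concurrent=1, threads_per_task=8))
```

After saving the output, the client needs to re-read its config files (or be restarted) for the change to take effect.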

The other thing I'm seeing is that the Ralph tasks are taking about 10 GB of RAM for EACH task, so I had to limit my running tasks accordingly. I'm running 1 CPU core per task and they are taking over 2 days to finish. I'm still getting some errors but have not yet ruled out a PC problem.
ID: 7622
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7623 - Posted: 15 Jun 2024, 5:28:26 UTC - in response to Message 7622.  
Last modified: 15 Jun 2024, 5:30:20 UTC

There needs to be a way to limit the number of threads a single Task can use.
And at the moment, the indications are that 1 Task using 8 threads performs no better than when 12 Tasks were trying to use 8 threads each, when there were only 12 threads available.
I'll keep an eye on things to see if they don't slow down later, but the initial signs are that the extra threads are providing not even the slightest improvement in processing time- they're just being wasted.


Try this:

<app_config>
<project_max_concurrent>1</project_max_concurrent>
</app_config>
Already done that, but just using max_concurrent. Even so, that doesn't limit the number of threads per Task, just the number of Tasks.



The other thing I'm seeing is that the Ralph tasks are taking about 10 GB of RAM for EACH task, so I had to limit my running tasks accordingly. I'm running 1 CPU core per task and they are taking over 2 days to finish. I'm still getting some errors but have not yet ruled out a PC problem.
The most I've seen in use for a CPU-processed Task is a bit over 1.5GB.
For GPU-processed Tasks, they're using up to 2.5GB of system RAM & 6GB of VRAM.
However, the peak swap file size I've seen is as high as 15GB.

Most of your errors appear to be file transfer errors, nothing to do with RAM usage.
Grant
Darwin NT
ID: 7623
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7624 - Posted: 15 Jun 2024, 6:45:07 UTC - in response to Message 7621.  

End result- on restart my GPU is now crunching TTAFold tasks. CPU usage is 2 threads in order to support the one Task on the GPU.
After running a few more Tasks on the GPU, it appears that most of the time it's happy with just one thread. Right at the start when initialising it looks like it makes use of up to 2 threads. Then the rest of the time it's just the one thread with the occasional spike of making partial use of another thread.
Grant
Darwin NT
ID: 7624
Mr P Hucker
Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 3
Message 7625 - Posted: 15 Jun 2024, 7:26:26 UTC - in response to Message 7617.  
Last modified: 15 Jun 2024, 7:28:31 UTC

A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing.
Why? Folding@Home allocates 1 thread per GPU task. I can do the same (and do) with some BOINC projects, manually in app_config. This can also be preset at the server end: look at any GPU task from other projects and you'll see things like "0.7C + 1GPU", and the number of threads could be more than one if necessary.

The server knows what CPU and GPU you have, so it should be able to assess roughly how long a task will take and how many threads need to run alongside the GPU. The total time estimate is then adjusted with experience at the host end.
ID: 7625
Johnbodlis team
Joined: 5 May 07
Posts: 1
Credit: 4,854
RAC: 1
Message 7626 - Posted: 15 Jun 2024, 7:46:37 UTC
Last modified: 15 Jun 2024, 7:48:52 UTC

The work unit is not computed within the configured time (one hour, give or take).
The preferences are definitely missing an option to choose how many work units I want to, and will, compute (little RAM, lots of cores = heavy swapping = unusable PC).
ID: 7626
Sabroe_SMC
Joined: 10 Sep 10
Posts: 6
Credit: 1,067,564
RAC: 77
Message 7627 - Posted: 15 Jun 2024, 9:40:37 UTC
Last modified: 15 Jun 2024, 9:41:39 UTC

15.06.2024 10:53:10 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 10:53:10 | ralph@home | Requesting new tasks for CPU
15.06.2024 10:53:12 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 10:53:12 | ralph@home | No tasks sent
15.06.2024 10:53:12 | ralph@home | Project requested delay of 31 seconds
15.06.2024 10:53:47 | ralph@home | update requested by user
15.06.2024 10:53:47 | ralph@home | Sending scheduler request: Requested by user.
15.06.2024 10:53:47 | ralph@home | Requesting new tasks for CPU
15.06.2024 10:53:49 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 10:53:49 | ralph@home | No tasks sent
15.06.2024 10:53:49 | ralph@home | Project requested delay of 31 seconds
15.06.2024 10:54:25 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 10:54:25 | ralph@home | Requesting new tasks for CPU
15.06.2024 10:54:27 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 10:54:27 | ralph@home | No tasks sent
15.06.2024 10:54:27 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:08:08 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 11:08:08 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:08:10 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:08:10 | ralph@home | No tasks sent
15.06.2024 11:08:10 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:27:54 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 11:27:54 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:27:55 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:27:55 | ralph@home | No tasks sent
15.06.2024 11:27:55 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:32:10 | ralph@home | update requested by user
15.06.2024 11:32:13 | ralph@home | Sending scheduler request: Requested by user.
15.06.2024 11:32:13 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:32:14 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:32:14 | ralph@home | No tasks sent
15.06.2024 11:32:14 | ralph@home | This computer has finished a daily quota of 1 tasks
15.06.2024 11:32:14 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:32:49 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 11:32:49 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:32:51 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:32:51 | ralph@home | No tasks sent
15.06.2024 11:32:51 | ralph@home | This computer has finished a daily quota of 1 tasks
15.06.2024 11:32:51 | ralph@home | Project requested delay of 31 seconds

I updated BOINC manually at 11:32:10 to get new WUs, but the result was that Ralph@home won't send me WUs for the rest of the day. Curious!
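
For anyone wanting to pick the relevant lines out of a long event log like the one above, here is a small Python sketch. It assumes the "timestamp | project | message" layout shown in this post; the function name is made up for illustration.

```python
import re

# "DD.MM.YYYY HH:MM:SS | project | message", as in the log lines above.
LOG_LINE = re.compile(r"^(\S+ \S+) \| (.+?) \| (.*)$")


def quota_messages(log_text):
    """Return (timestamp, message) for every line that mentions a daily quota."""
    hits = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # not an event-log line in the expected layout
        timestamp, _project, message = match.groups()
        if "daily quota" in message:
            hits.append((timestamp, message))
    return hits


sample = """\
15.06.2024 11:27:55 | ralph@home | No tasks sent
15.06.2024 11:32:14 | ralph@home | This computer has finished a daily quota of 1 tasks
"""
print(quota_messages(sample))
```

Run against the full log above, this would return the two quota lines at 11:32:14 and 11:32:51.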
ID: 7627
Mr P Hucker
Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 3
Message 7628 - Posted: 15 Jun 2024, 9:47:43 UTC - in response to Message 7627.  

I've seen other projects do that: if you send back a failure, they won't send you many more until you complete one successfully. A bit strange on a project where most are going to fail through no fault of your own.
ID: 7628
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7629 - Posted: 15 Jun 2024, 10:07:17 UTC - in response to Message 7625.  

A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing.
Why? Folding@Home allocates 1 thread per GPU task. I can do the same (and do) with some Boinc projects, manually in app config, this can also be preset at the server end, look at any GPU task from other projects, you'll see things like "0.7C + 1GPU", and the number of threads could be more than one if necessary.
What you are talking about is completely different to what I was talking about.

What you are talking about is where there are different applications for CPU work and GPU work. The GPU work requires the support of a CPU thread, but the application it uses is not the same as the one used for the CPU work.
Here the CPU & GPU application are one & the same.
Grant
Darwin NT
ID: 7629
Mr P Hucker
Joined: 3 Mar 23
Posts: 31
Credit: 9,510
RAC: 3
Message 7630 - Posted: 15 Jun 2024, 10:10:39 UTC - in response to Message 7629.  

A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing.
Why? Folding@Home allocates 1 thread per GPU task. I can do the same (and do) with some Boinc projects, manually in app config, this can also be preset at the server end, look at any GPU task from other projects, you'll see things like "0.7C + 1GPU", and the number of threads could be more than one if necessary.
What you are talking about is completely different to what i was talking about.

What you are talking about is where there are different applications for CPU work and GPU work. The GPU work requires the support of a CPU thread, but the application it uses is not the same as the one used for the CPU work.
Here the CPU & GPU application are one & the same.
No, this is precisely the same as, say, Einstein GPU tasks, which need a CPU thread or a part of one. The task is given to me as 1GPU + 0.7 CPU threads.
ID: 7630
Grant (SSSF)
Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7631 - Posted: 15 Jun 2024, 10:11:36 UTC - in response to Message 7627.  

I updated BOINC manually at 11:32:10 to get new WUs, but the result was that Ralph@home won't send me WUs for the rest of the day. Curious!
It can't send you any work if there is no work to send.

From the Server Status page
Tasks ready to send  0

Grant
Darwin NT
ID: 7631
Vato
Joined: 30 Jun 10
Posts: 2
Credit: 117,751
RAC: 0
Message 7632 - Posted: 15 Jun 2024, 10:17:03 UTC

On my machines with a small Nvidia GPU, this app pushed its way onto the GPU, which was already running something from another project. Not good for what looks like a CPU app to the BOINC client. This needs to behave like other BOINC apps, with separate CPU and GPU app versions. If not, this is the end of me running this app on such machines.
ID: 7632

©2024 University of Washington
http://www.bakerlab.org