Message boards : RALPH@home bug list : RoseTTAFold All-Atom
Author | Message |
---|---|
Mr P Hucker Joined: 3 Mar 23 Posts: 31 Credit: 9,510 RAC: 3 |
No good if you are also running other projects. I limited it in app config for this project by telling it each task needs 10 threads. Note: One can also limit the number of cores in Windows 11 by setting "number of processors" in Advanced Boot options (run msconfig). I've opted to use max_concurrent to limit the number of cores/threads available to the TTAFold Tasks, leaving the others available for other processes. |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 913 Credit: 1,892,541 RAC: 294 |
"This is often the case with many projects. Primegrid has a list of which versions to avoid - Nvidia are well known for buggy drivers." I know, I know. PrimeGrid has a long-standing GPU app with well-known behaviour. Here we don't know anything about this app.... |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 913 Credit: 1,892,541 RAC: 294 |
If there is a significant difference in the CUDA version between driver versions, and the TTAFold application makes use of calls that are only available in the most recent version, fail time. Some messages above, you said that the app is cpu only. I think that until admins say something, we can only guess... |
Bill F Joined: 1 Jan 18 Posts: 21 Credit: 34,272 RAC: 52 |
I have gotten a couple of completions with credit on this system, which is not a power system. It appears that they ran about 8 hours or so. GenuineIntel Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz [Family 6 Model 60 Stepping 3] (4 processors); INTEL Intel(R) HD Graphics 4600 (1629MB) OpenCL: 1.2; Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00). In October 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic; There was no expiration date. |
Mr P Hucker Joined: 3 Mar 23 Posts: 31 Credit: 9,510 RAC: 3 |
"I have gotten a couple of completions with credit on this system which is not a power system. Appears that they ran about 8 hours or so" I have 5 computers of varying power (4 to 24 threads), and they've all broken every single task. Oh well, at least they worked out how to send the right zip files.... If they get multithreaded tasks working well, this will be good for Rosetta. I see over on the main server they're handing out a shitload of Beta v6. I wonder what happened to v5? |
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
"If there is a significant difference in the CUDA version between driver versions, and the TTAFold application makes use of calls that are only available in the most recent version, fail time." Yep. There are signs that if it finds an Nvidia driver, it tries to run on the GPU, otherwise it reverts to the CPU. In which case the developers really need to inform us of the minimum CPU, OS, video card and video card driver requirements. And they need to hard code those requirements into the application so it doesn't try to run on unsupported hardware/software. Here's an example of the problem. What appears to be 6hrs runtime on an RTX 2080 Super, and then it ran out of Video RAM. Not a lot of video cards have more than 8GB of RAM. Hell, not a lot of video cards would even have 8GB of RAM. Most would be 4-6GB. A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing. The other problem is with its multiprocessing- the initial Estimated processing time is 4 hours- when in actual fact it takes over 24hrs (longer than the deadline). If you do give it the 8 threads it wants, it then takes around 4 hrs, but the initial Estimated processing time is only 53 min. This needs to be fixed, urgently. And i'm still not sure if the multiprocessing really is doing more processing using 8 cores than a single core, or it's just the fact that you don't have Tasks that want to use 96 threads when only 12 are available & are just fighting each other for CPU time. A massively, massively over committed system. The other thing that needs to be fixed, urgently, is not being able to return work once completed. I aborted all my Tasks that were past the deadline, and got some more. The first one was returned well before the deadline (Report deadline 15 Jun 2024, 10:39:24 UTC; Received 14 Jun 2024, 17:41:52 UTC), but it also failed to upload the result. 
<core_client_version>8.0.2</core_client_version> <![CDATA[ <stderr_txt> C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchcuda__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ..c10cudaCUDAFunctions.cpp:115.) return torch._C._cuda_getDeviceCount() > 0 03:10:21 (13388): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_369_16902_5_1_r880759058_0</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error> </message> ]]> RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_369_16902_5_1 The developers have a mountain of serious work to do before this even makes it to a beta stage. Grant Darwin NT |
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
"The other thing that needs to be fixed, urgently, is not being able to return work once completed." And yet my next completed Task was able to be returned. RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_g_pred_173_16903_4_1 although it did include a complaint about CUDA, which GPU-Z says is available. <core_client_version>8.0.2</core_client_version> <![CDATA[ <stderr_txt> C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchcuda__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ..c10cudaCUDAFunctions.cpp:115.) return torch._C._cuda_getDeviceCount() > 0 07:08:34 (15488): called boinc_finish(0) </stderr_txt> ]]> Grant Darwin NT |
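The stderr output quoted in these posts comes from PyTorch failing to initialise CUDA and then carrying on. As a rough illustration of the "hard code the requirements" idea raised above, here is a minimal sketch of an explicit device check with a CPU fallback. It is not the project's actual code: `choose_device`, `probe_gpu` and the 8 GiB threshold are assumptions made up for this example. Only `torch.cuda.is_available()` and `torch.cuda.get_device_properties()` are real PyTorch calls.

```python
from typing import Optional, Tuple


def choose_device(cuda_available: bool, vram_bytes: Optional[int], min_vram_bytes: int) -> str:
    """Return "cuda" only when a working GPU with enough memory was found, else "cpu"."""
    if cuda_available and vram_bytes is not None and vram_bytes >= min_vram_bytes:
        return "cuda"
    return "cpu"


def probe_gpu() -> Tuple[bool, Optional[int]]:
    """Ask PyTorch (if it is installed) whether CUDA works and how much VRAM device 0 has."""
    try:
        import torch
    except ImportError:
        return False, None
    if not torch.cuda.is_available():
        return False, None
    return True, torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    # Hypothetical threshold for illustration; the project has not published a requirement.
    MIN_VRAM = 8 * 1024**3
    cuda_ok, vram = probe_gpu()
    print(choose_device(cuda_ok, vram, MIN_VRAM))
```

Keeping the decision in a pure function means the rule can be checked without a GPU present; whether 8 GiB is the right cut-off is exactly the kind of detail only the developers can confirm.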
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
So, in no particular order, the list of things that need fixing so far.
1. Tasks need to be able to be suspended. Having them still running in the background even though the BOINC Manager says they are suspended will lead to unresponsive & crashed systems as they run out of memory and available CPU time.
2. Checkpointing needs to be fixed- at present there is none at all. So if someone has to exit BOINC, or restart their computer, then all work to that point is lost.
3. CPU time needs to work properly. Not just for Scheduling reasons but also so that checkpointing can work.
4. The issue with file transfers failing needs to be fixed. What good is it completing a Task, if it fails because you can't return the result?
5. The issue with multiprocessing needs to be addressed. By default, TTAFold Tasks take up all CPU time, regardless of what other applications are running (eg Rosetta beta Tasks that show in the BOINC Manager as running, but in Task Manager show as getting 0 CPU time). Needing 8 threads to process 1 TTAFold Task is not viable. It needs to be limited to just 1 thread per Task, with the option in the Project settings of the User's account to allocate more if they wish. Or if 2 (or more) are used by default, then that needs to be handled in a way that the BOINC Manager is aware of. This will allow the BOINC Manager to know how many threads are really available for other Tasks to use and stop systems from being over committed, resulting in missed deadlines & work scheduling issues with almost everything ending up in Panic mode all the time. Trying to use more threads than the processor actually has just leads to grief.
6. Initial Estimated completion times need to be more realistic. The default starts at 4 hours, yet with multiple TTAFold tasks running their completion time is over 24 hours- which means they miss the Project's own set deadlines. If we do limit the number of Tasks so that each one gets the 8 threads it needs, then the initial Estimated completion time is around 50min, yet the actual completion time is now around 4 hours. Missed deadlines, Panic mode & general chaos will be the result of this as well.
7. Requirements for GPU processing need to be known. People need to know just what is required in order for GPU processing to occur- ie minimum CPU, OS, video card type, video card memory size, driver version- in order to reduce confusion & concern about what is going on with their systems.
8. Having one application for both GPU & CPU work. This will probably be the single biggest issue, as history has shown us that without separate tracking of actual processing times to produce separate initial estimated completion times, honouring Resource shares becomes near impossible, and systems download more work than they can finish, as well as download less work than they can handle, all depending on how much work is done on what compute resource. With Rosetta having a (generally) fixed Runtime, as long as that can be maintained for Tasks regardless of whether they are done on a low-end CPU or a high-end GPU then everything will be OK. But that is going to be a huge ask IMHO due to the massive difference in computing capability between those two extremes. Otherwise having two different applications (even if they are exactly the same other than their names, eg TTAFold CPU and TTAFold GPU) like every other project would be the way to go. It may well be the best option to allow for the default usage of threads for CPU Tasks to be one (selectable by the user as they wish), and the default number of threads for work on the GPU to be set to the number of Compute Units (in OpenCL terms; i think NVidia call them SMs, Streaming Multiprocessors) the GPU being used has (selectable by the user once again). Either way the option in the Project settings for each user's account for using or not using the GPU (and ideally the CPU as well) will be necessary.
Grant Darwin NT |
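The over-commitment described in point 5 can be put into numbers. This is a minimal sketch using the figures quoted in this thread (a 12-thread CPU, Tasks that each try to use 8 threads). The function names are invented for illustration and are not part of BOINC.

```python
def oversubscription(running_tasks: int, threads_per_task: int, hw_threads: int) -> float:
    """Ratio of threads the running Tasks ask for to threads the CPU actually has."""
    return (running_tasks * threads_per_task) / hw_threads


def max_tasks_without_overcommit(threads_per_task: int, hw_threads: int) -> int:
    """Largest number of Tasks that can each get every thread they ask for."""
    return hw_threads // threads_per_task


# Figures from the posts above: 12 Tasks x 8 threads each on a 12-thread CPU.
print(oversubscription(12, 8, 12))          # 96 wanted / 12 available = 8.0
print(max_tasks_without_overcommit(8, 12))  # 1
```

With those numbers only one Task fits without over-commitment, which matches the experience reported here of limiting the project to a single running Task.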
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
OK, i did something incredibly rash- i upgraded my video driver even though there was nothing wrong. Went from 447.xx and now on 552.22 End result- on restart my GPU is now crunching TTAFold tasks. CPU usage is 2 threads in order to support the one Task on the GPU. I tried running a second Task using my second GPU, but it ended up moving to the first GPU as well, significantly impacting on the system responsiveness, making everything extremely sluggish. So just 1 Task at a time it is. Gone from 8 threads taking 4 hours, to 2 threads & the GPU taking 20min or so (at least for the first task). Grant Darwin NT |
mikey Joined: 28 Nov 20 Posts: 9 Credit: 114,771 RAC: 17 |
"There needs to be a way to limit the number of threads a single Task can use. And at the moment, the indications are that 1 Task using 8 threads performs no better than when 12 Tasks were trying to use 8 threads each, when there were only 12 threads available. I'll keep an eye on things to see if they don't slow down later, but the initial signs are that the extra threads are providing not even the slightest improvement in processing time- they're just being wasted." Try this: <app_config> <project_max_concurrent>1</project_max_concurrent> </app_config> Save it as 'app_config.xml' in each project folder under C:\ProgramData\BOINC\projects\project name and adjust the number to reflect how many tasks of each project you want to run, ie 1 of this project, 2 of that project and 7 of that project. The other thing I'm seeing is that the Ralph tasks are taking about 10 GB of RAM for EACH task, so I had to limit my running tasks accordingly. I'm running 1 CPU core per task and they are taking over 2 days to finish. I'm still getting some errors but have not ruled out it being a PC problem as yet. |
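For anyone who would rather generate the file than type it, here is a minimal sketch that builds the same `app_config.xml` content with Python's standard library. The helper names are made up for this example, and the project directory has to be supplied by the user. `project_max_concurrent` is the only setting taken from the post above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_app_config(max_concurrent: int) -> str:
    """Return the text of an app_config.xml that caps running tasks for one project."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-printing helper, available from Python 3.9
    return ET.tostring(root, encoding="unicode")


def write_app_config(project_dir: Path, max_concurrent: int) -> Path:
    """Write app_config.xml into the given project folder and return its path."""
    target = project_dir / "app_config.xml"
    target.write_text(build_app_config(max_concurrent) + "\n", encoding="utf-8")
    return target


# Show the generated XML; call write_app_config() with your own project folder to save it.
print(build_app_config(1))
```

The client only picks up the file after it re-reads its configuration (for example via "Read config files" in the BOINC Manager's advanced view) or is restarted.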
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
Already done that, but just using max_concurrent. Even so, that doesn't limit the number of threads per Task, just the number of Tasks. "There needs to be a way to limit the number of threads a single Task can use." "The other thing I'm seeing is that the Ralph tasks are taking about 10gb of ram for EACH task so I had to limit my running tasks accordingly. I'm running 1 cpu core per task and they are taking over 2 days to finish. I'm still getting some errors but have not ruled out it being a pc problem as yet." The most i've seen for a CPU processed Task in use is a bit over 1.5GB. For GPU processed Tasks, they're using up to 2.5GB of system RAM & 6GB of VRAM. However the peak Swap file size i've seen is as high as 15GB. Most of your errors appear to be file transfer errors, nothing to do with RAM usage. Grant Darwin NT |
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
"End result- on restart my GPU is now crunching TTAFold tasks. CPU usage is 2 threads in order to support the one Task on the GPU." After running a few more Tasks on the GPU, it appears that most of the time it's happy with just one thread. Right at the start when initialising it looks like it makes use of up to 2 threads. Then the rest of the time it's just the one thread with the occasional spike of making partial use of another thread. Grant Darwin NT |
Mr P Hucker Joined: 3 Mar 23 Posts: 31 Credit: 9,510 RAC: 3 |
"A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing." Why? Folding@Home allocates 1 thread per GPU task. I can do the same (and do) with some BOINC projects, manually in app config. This can also be preset at the server end: look at any GPU task from other projects and you'll see things like "0.7C + 1GPU", and the number of threads could be more than one if necessary. The server knows what CPU and GPU you have, so it should be able to assess roughly how long a task will take and how many threads need to run alongside the GPU. The total time is then already adjusted with experience at the host end. |
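For reference, the manual "0.7C + 1GPU" style of allocation mentioned here maps onto the `gpu_versions` section of a BOINC `app_config.xml`. The sketch below generates such a file with the standard library. The application name `example_gpu_app` is a placeholder, since the real name for this project is not given in the thread, and the helper function is invented for illustration.

```python
import xml.etree.ElementTree as ET


def build_gpu_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return app_config.xml text budgeting GPU and CPU shares for one application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-printing helper, available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# "example_gpu_app" is a placeholder name, not the real application name.
print(build_gpu_app_config("example_gpu_app", 1.0, 0.7))
```

These values only tell the client how much of each resource to budget per task. They do not stop an application from spawning more threads, and they may have no effect on an application that the client sees as CPU-only.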
Johnbodlis team Joined: 5 May 07 Posts: 1 Credit: 4,854 RAC: 1 |
The work unit does not finish within the configured time (one hour, give or take). The preferences are definitely missing an option to choose how many work units I want to run (little RAM, lots of cores = heavy swapping = unusable PC). |
Sabroe_SMC Joined: 10 Sep 10 Posts: 6 Credit: 1,067,564 RAC: 77 |
15.06.2024 10:53:10 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 10:53:10 | ralph@home | Requesting new tasks for CPU
15.06.2024 10:53:12 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 10:53:12 | ralph@home | No tasks sent
15.06.2024 10:53:12 | ralph@home | Project requested delay of 31 seconds
15.06.2024 10:53:47 | ralph@home | update requested by user
15.06.2024 10:53:47 | ralph@home | Sending scheduler request: Requested by user.
15.06.2024 10:53:47 | ralph@home | Requesting new tasks for CPU
15.06.2024 10:53:49 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 10:53:49 | ralph@home | No tasks sent
15.06.2024 10:53:49 | ralph@home | Project requested delay of 31 seconds
15.06.2024 10:54:25 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 10:54:25 | ralph@home | Requesting new tasks for CPU
15.06.2024 10:54:27 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 10:54:27 | ralph@home | No tasks sent
15.06.2024 10:54:27 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:08:08 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 11:08:08 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:08:10 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:08:10 | ralph@home | No tasks sent
15.06.2024 11:08:10 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:27:54 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 11:27:54 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:27:55 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:27:55 | ralph@home | No tasks sent
15.06.2024 11:27:55 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:32:10 | ralph@home | update requested by user
15.06.2024 11:32:13 | ralph@home | Sending scheduler request: Requested by user.
15.06.2024 11:32:13 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:32:14 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:32:14 | ralph@home | No tasks sent
15.06.2024 11:32:14 | ralph@home | This computer has finished a daily quota of 1 tasks
15.06.2024 11:32:14 | ralph@home | Project requested delay of 31 seconds
15.06.2024 11:32:49 | ralph@home | Sending scheduler request: To fetch work.
15.06.2024 11:32:49 | ralph@home | Requesting new tasks for CPU
15.06.2024 11:32:51 | ralph@home | Scheduler request completed: got 0 new tasks
15.06.2024 11:32:51 | ralph@home | No tasks sent
15.06.2024 11:32:51 | ralph@home | This computer has finished a daily quota of 1 tasks
15.06.2024 11:32:51 | ralph@home | Project requested delay of 31 seconds
I updated BOINC manually at 11:32:10 to get new WUs, but the result was that Ralph@home won't send me WUs for the rest of the day. Curious! |
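Here is a small sketch of how event-log entries like these could be scanned for the daily-quota message. It assumes one entry per line in the `date time | project | message` layout. The sample entries are copied from the post above; the function name and regular expressions are made up for this example.

```python
import re
from typing import Iterable, Optional, Tuple

# One event-log entry: "DD.MM.YYYY HH:MM:SS | project | message"
LINE_RE = re.compile(r"^(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) \| ([^|]+?) \| (.*)$")
QUOTA_RE = re.compile(r"finished a daily quota of (\d+) tasks")


def first_quota_hit(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (timestamp, quota) for the first daily-quota message, or None if absent."""
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None:
            continue  # skip anything that is not a log entry
        timestamp, _project, message = entry.groups()
        quota = QUOTA_RE.search(message)
        if quota:
            return timestamp, int(quota.group(1))
    return None


SAMPLE = [
    "15.06.2024 11:32:13 | ralph@home | Requesting new tasks for CPU",
    "15.06.2024 11:32:14 | ralph@home | Scheduler request completed: got 0 new tasks",
    "15.06.2024 11:32:14 | ralph@home | No tasks sent",
    "15.06.2024 11:32:14 | ralph@home | This computer has finished a daily quota of 1 tasks",
    "15.06.2024 11:32:14 | ralph@home | Project requested delay of 31 seconds",
]

print(first_quota_hit(SAMPLE))  # ('15.06.2024 11:32:14', 1)
```

In this log the quota message first appears at 11:32:14; the earlier requests simply report that no tasks were sent.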
Mr P Hucker Joined: 3 Mar 23 Posts: 31 Credit: 9,510 RAC: 3 |
I've seen other projects do that, if you send back a failure, they won't send you many more until you do one successfully. A bit strange on a project where most are going to fail through no fault of your own. |
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
What you are talking about is completely different to what i was talking about. "A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing." "Why? Folding@Home allocates 1 thread per GPU task. I can do the same (and do) with some Boinc projects, manually in app config, this can also be preset at the server end, look at any GPU task from other projects, you'll see things like "0.7C + 1GPU", and the number of threads could be more than one if necessary." What you are talking about is where there are different applications for CPU work and GPU work. The GPU work requires the support of a CPU thread, but the application it uses is not the same as the one used for the CPU work. Here the CPU & GPU application are one & the same. Grant Darwin NT |
Mr P Hucker Joined: 3 Mar 23 Posts: 31 Credit: 9,510 RAC: 3 |
No, this is precisely the same as, say, Einstein GPU tasks, which need a CPU thread or a part of one. The task is given to me as 1GPU + 0.7 CPU threads. "What you are talking about is completely different to what i was talking about." "A single application that can do work on both devices is good. It's also going to massively screw up work scheduling, deadlines & resource share balancing." "Why? Folding@Home allocates 1 thread per GPU task. I can do the same (and do) with some Boinc projects, manually in app config, this can also be preset at the server end, look at any GPU task from other projects, you'll see things like "0.7C + 1GPU", and the number of threads could be more than one if necessary." |
Grant (SSSF) Joined: 13 Jun 24 Posts: 126 Credit: 193,939 RAC: 2,635 |
"I updated BOINC manually at 11:32:10 to get new WUs, but the result was that Ralph@home won't send me WUs for the rest of the day. Curious!" It can't send you any work if there is no work to send. From the Server Status page: Tasks ready to send 0. Grant Darwin NT |
Vato Joined: 30 Jun 10 Posts: 2 Credit: 117,751 RAC: 0 |
on my machines with a small nvidia GPU, this app pushed its way onto the GPU which was already running something from another project - not good for what looks like a CPU app to the boinc-client. this needs to behave like other BOINC apps with separate CPU and GPU app versions. if not, this is the end of me running this app on such machines |
©2024 University of Washington
http://www.bakerlab.org