RoseTTAFold All-Atom 0.03 (nvidia_alpha)

rilian

Joined: 7 Sep 07
Posts: 35
Credit: 107,666
RAC: 725
Message 7790 - Posted: 6 Jul 2024, 23:35:27 UTC - in response to Message 7789.  

I got 6 units of the latest batch and all errored out after 4 seconds because of an access violation. Same fate for my wingmen. Windows 10, RTX 4080, 32 GB RAM. Anybody who managed to finish off a bunch of them care to share your system details?


I completed all 20 tasks from the last batch.
Win 10 64-bit; RTX 3060.
I had Folding@home running in parallel.
--
I crunch for Ukraine

Sabroe_SMC

Joined: 10 Sep 10
Posts: 6
Credit: 1,067,564
RAC: 77
Message 7791 - Posted: 6 Jul 2024, 23:45:56 UTC - in response to Message 7788.  

Hello guys
Are you foolish or what?
Not as foolish as you for running two at a time.
Per Task runtime doesn't matter - the number per hour is what matters.

On my RTX 2060 & RTX 2060 Super the GPU load is around 75% to 80% for these Tasks.
Given that the GPU load is already over 50%, that it takes about 1.3 CPU cores to support one Task (even if you give them 3 CPU cores to support 2 GPU Tasks when running two at a time), and that the application has not been optimised much (if at all) - as shown by the frequent dips in GPU load and the less frequent but very regular large dips - it doesn't make much sense to run 2 at a time.

Run one at a time & see how it goes.


As for the Credit per Task - it's Alpha work, things will change as they progress. At least for the last batch the Credit per Task was much more consistent than in the earlier batches & applications.
And as for their runtime - while they take longer than the last batch, they are still taking less time than the first batch I was able to process on the GPU.


On my RTX 4090, one Task from the WUs of 3 Jul 2024 takes about 440 sec; two Tasks in parallel take about 650 sec. With 1 Task the GPU utilisation was about 50-60%; with 2 Tasks it was about 98%.
1 Task:  440-450 sec = 100%
2 Tasks: 610-640 sec = 142.8%
3 Tasks: 904-910 sec = 149%
4 Tasks: 1199-1202 sec = 149.8%

But now 2 Tasks take about 1.5 times as long.
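Those percentages are just relative throughput - n Tasks running together and finishing in t_n seconds versus one Task in t_1 seconds, i.e. n * t1 / tn. A quick Python sketch with rough midpoints of the runtimes above (approximate figures, so it only roughly reproduces the percentages):

# Relative throughput of n concurrent Tasks vs. one at a time: (n / t_n) / (1 / t_1)
t1 = 445  # rough midpoint of the 440-450 sec single-Task runtime

for n, tn in [(1, 445), (2, 625), (3, 907), (4, 1200)]:
    rel = 100 * n * t1 / tn
    print(f"{n} Task(s) finishing in {tn} sec -> {rel:.0f}% of single-Task throughput")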
Grant (SSSF)

Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7792 - Posted: 6 Jul 2024, 23:50:55 UTC - in response to Message 7789.  

I got 6 units of the latest batch and all errored out after 4 seconds because of an access violation. Same fate for my wingmen. Windows 10, RTX 4080, 32 GB RAM. Anybody who managed to finish off a bunch of them care to share your system details?
I had the same thing happening with my systems initially.
i7-8700K, 32 GB RAM, Win10 Pro, 552.22 video driver, RTX 2060; the RTX 2060 Super is in another system with the same other specs.

I reset the project, it re-downloaded all the files, and then they worked OK (no idea why downloading the same things all over again made any difference, but it did).
Grant
Darwin NT
Grant (SSSF)

Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7793 - Posted: 7 Jul 2024, 0:09:57 UTC - in response to Message 7791.  

But now 2 tasks are 1,5 longer
As I mentioned in my other post - while these Tasks take longer to process than the last batch, they are still taking less time than the initial batch that I was able to process.
For my RTX 2060
First batch   27min 50 sec
Last batch    16min 50 sec
Current batch 25min 45 sec


How many CPU cores/threads of your system are in use?
Either try 1 Task at a time again, or limit the number of running CPU Tasks to free up a CPU core/thread, or reserve 3 cores/threads for your 2 GPU Ralph Tasks.
The default for Ralph is to reserve only 0.997 cores/threads per GPU Task, but the actual usage is more like 1.3, with frequent bumps of 2.5 every 20 seconds or so (corresponding with the drop in GPU load & power draw).
See whether reserving 3 cores/threads helps enough to be worth it.

I tried reserving 2 cores/threads, but the improvement in Ralph processing time was too small to come close to offsetting the lost output of CPU work from that core/thread. For your high-end GPU running 2 Tasks at a time, it might be worth it.
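If you want to try that, the usual way is an app_config.xml in the ralph.bakerlab.org project folder, then Options > Read config files in BOINC Manager. A minimal sketch - the app name below is only a guess, so check client_state.xml (or the name shown in your Tasks list) for the real one:

<app_config>
   <app>
      <name>rosettafold_aa</name>       <!-- guessed name - check client_state.xml -->
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>     <!-- set to 0.5 to run 2 Tasks per GPU -->
         <cpu_usage>1.5</cpu_usage>     <!-- CPU threads reserved per GPU Task (3 total for 2 Tasks) -->
      </gpu_versions>
   </app>
</app_config>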
Grant
Darwin NT
Sabroe_SMC

Joined: 10 Sep 10
Posts: 6
Credit: 1,067,564
RAC: 77
Message 7795 - Posted: 7 Jul 2024, 9:27:42 UTC - in response to Message 7793.  

At the moment I have NO CPU tasks running.
Drago75

Joined: 29 Jul 22
Posts: 3
Credit: 70,604
RAC: 227
Message 7797 - Posted: 7 Jul 2024, 19:58:08 UTC - in response to Message 7795.  

OK, so now my three Windows hosts run these WUs successfully. It was necessary to detach from Ralph on all of them and rejoin the project after a reboot, as was mentioned in a previous post. Evidently the data in the project folder had got corrupted. Now they all work flawlessly.

Another thing: my RTX 3070 Ti was initially rejected for not having enough VRAM, which is nonsense because it has 8 GB. The solution was to install the latest version of the BOINC manager.
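For what it's worth, the card itself probably wasn't the problem: older clients have been known to under-report usable VRAM on some GPUs, and the scheduler just compares a minimum requirement against whatever number the client sends. A rough sketch of that guess (hypothetical values, not the actual scheduler code):

REQUIRED_MB = 5120   # example minimum; the scheduler message quoted later in this thread uses this figure

def scheduler_accepts(reported_vram_mb: int) -> bool:
    # The scheduler can only check against what the client reports, not what GPU-Z shows.
    return reported_vram_mb >= REQUIRED_MB

print(scheduler_accepts(8192))   # True  - 8 GB card, correctly reported
print(scheduler_accepts(4095))   # False - the same card if an old client under-reports it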
rilian

Joined: 7 Sep 07
Posts: 35
Credit: 107,666
RAC: 725
Message 7802 - Posted: 9 Jul 2024, 3:38:42 UTC

Something strange is going on with the tasks.

My tasks page shows that I have 3 tasks in progress, but BOINC on the computer shows only one. Nothing in Transfers.

I definitely have all tasks visible, not only active ones...

I will go through the logs carefully tomorrow. Did anyone else see this issue?
--
I crunch for Ukraine

Grant (SSSF)

Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7803 - Posted: 9 Jul 2024, 5:25:00 UTC - in response to Message 7802.  
Last modified: 9 Jul 2024, 5:32:52 UTC

My tasks page shows that I have 3 tasks in progress, but BOINC on the computer shows only one. Nothing in Transfers.
Sounds like what's known as Ghost Tasks: some sort of network issue while contacting the Scheduler & getting the work. The Scheduler thinks you got it, but you didn't.


Looking at my Tasks, I've had some weirdness as well.
I had a Task error out with the reason "Timed out - no response" only 8 minutes after getting it...
Grant
Darwin NT
rilian

Joined: 7 Sep 07
Posts: 35
Credit: 107,666
RAC: 725
Message 7804 - Posted: 9 Jul 2024, 14:26:43 UTC - in response to Message 7803.  

My tasks page shows that I have 3 tasks in progress, but BOINC on the computer shows only one. Nothing in Transfers.
Sounds like what's known as Ghost Tasks: some sort of network issue while contacting the Scheduler & getting the work. The Scheduler thinks you got it, but you didn't.


Looking at my Tasks, I've had some weirdness as well.
I had a Task error out with the reason "Timed out - no response" only 8 minutes after getting it...

Thank you

I think you are right - I had some networking issues yesterday due to work on another project.

Those 2 extra tasks are still hanging In Progress and will time out in a few days.
--
I crunch for Ukraine

rilian

Joined: 7 Sep 07
Posts: 35
Credit: 107,666
RAC: 725
Message 7836 - Posted: 2 Sep 2024, 14:36:57 UTC
Last modified: 2 Sep 2024, 14:37:59 UTC

I've got a few tasks and they all fail after about 4000 sec with

C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv1\rf2aa\util.py:450: UserWarning: Using torch.cross without specifying the dim arg is deprecated.
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\Cross.cpp:66.)
  Z = torch.cross(Xn,Yn)

--
I crunch for Ukraine

Grant (SSSF)

Joined: 13 Jun 24
Posts: 126
Credit: 193,939
RAC: 2,635
Message 7837 - Posted: 2 Sep 2024, 19:58:35 UTC - in response to Message 7836.  
Last modified: 2 Sep 2024, 19:59:44 UTC

I've got a few tasks and they all fail after about 4000 sec with

C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv1\rf2aa\util.py:450: UserWarning: Using torch.cross without specifying the dim arg is deprecated.
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\Cross.cpp:66.)
  Z = torch.cross(Xn,Yn)
They didn't necessarily fail - that warning was also occurring with Tasks that completed and validated in the previous runs.
It looks like they completed processing OK, but there was a problem with Validation. I'm getting the same issue.

Either they didn't process correctly, or there is an issue with the Validators.
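For reference, the warning itself is harmless PyTorch deprecation noise rather than an error; the fix it asks for is just an explicit dim argument. A minimal sketch of the pattern from the message (not the project's actual code):

import torch

Xn = torch.randn(10, 3)
Yn = torch.randn(10, 3)

# Deprecated: torch.cross without dim picks the first dimension of size 3
Z_old = torch.cross(Xn, Yn)

# Equivalent calls that silence the warning:
Z_new = torch.linalg.cross(Xn, Yn, dim=-1)   # dim is keyword-only and defaults to -1
# or: Z_new = torch.cross(Xn, Yn, dim=-1)

print(torch.allclose(Z_old, Z_new))   # True - same result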
Grant
Darwin NT
dcdc

Joined: 15 Aug 06
Posts: 27
Credit: 90,652
RAC: 0
Message 7838 - Posted: 2 Sep 2024, 22:23:52 UTC
Last modified: 2 Sep 2024, 22:35:49 UTC

I'm getting this error:

655	ralph@home	02/09/2024 23:21:46	Requesting new tasks for NVIDIA GPU	
656	ralph@home	02/09/2024 23:21:47	Scheduler request completed: got 0 new tasks	
657	ralph@home	02/09/2024 23:21:47	A minimum of 5120 MB (preferably 5120 MB) of video RAM is needed to process tasks using your computer's NVIDIA GPU	
658	ralph@home	02/09/2024 23:21:47	Project requested delay of 31 seconds



I've got a Quadro P2200 with exactly that (5120MB) according to GPU-Z. Anyone else getting that error?

D
rjs5

Joined: 5 Jul 15
Posts: 22
Credit: 135,787
RAC: 2,494
Message 7839 - Posted: 2 Sep 2024, 23:25:49 UTC - in response to Message 7837.  

I've got a few tasks and they all fail after about 4000 sec with

C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv1\rf2aa\util.py:450: UserWarning: Using torch.cross without specifying the dim arg is deprecated.
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\Cross.cpp:66.)
  Z = torch.cross(Xn,Yn)
They didn't necessarily fail - that warning was also occurring with Tasks that completed and validated in the previous runs.
It looks like they completed processing OK, but there was a problem with Validation. I'm getting the same issue.

Either they didn't process correctly, or there is an issue with the Validators.



Mine seem to be finishing with the WARNING message and then failing to VALIDATE. They are using 7 GB of the 24 GB of dedicated GPU memory.



Stderr output

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv1\rf2aa\util.py:450: UserWarning: Using torch.cross without specifying the dim arg is deprecated.
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\Cross.cpp:66.)
Z = torch.cross(Xn,Yn)
16:09:10 (7564): called boinc_finish(0)

</stderr_txt>
]]>
rilian

Joined: 7 Sep 07
Posts: 35
Credit: 107,666
RAC: 725
Message 7857 - Posted: 6 Sep 2024, 12:45:09 UTC

I've got a few of these tasks today and they all validated successfully!
--
I crunch for Ukraine
