Test spotted

Message boards : Current tests : Test spotted

Mr P Hucker
Joined: 3 Mar 23
Posts: 6
Credit: 2,949
RAC: 0
Message 7428 - Posted: 31 Oct 2023, 5:36:51 UTC

Pity this place is full of spam. Here's a real post.

I got 1 task on each of 2 machines. A 5 GB VDI image was downloaded for Python.

The 1st machine reported a download failure, not sure why. The computer may have crashed of its own accord; it has a dodgy old GPU.
https://ralph.bakerlab.org/results.php?hostid=49339

The 2nd machine downloaded it OK, but the task hit a computation error immediately.
https://ralph.bakerlab.org/results.php?hostid=48821

aendgraend
Joined: 24 Oct 08
Posts: 5
Credit: 330,028
RAC: 0
Message 7432 - Posted: 10 Nov 2023, 10:18:04 UTC - in response to Message 7428.  
Last modified: 10 Nov 2023, 10:20:50 UTC

My host 37110 got several WUs, one of which has been running for 21+ hrs so far. AFAIK one workunit always completes at least one model, so my question is: are those running times expected? If yes, maybe the deadline should be adjusted, as the other WUs waiting on my host won't even have started by the time the deadline passes (so I will sadly need to cancel them, which I normally wouldn't do).

Here are the details of the current WU (which has been at 100% progress since yesterday evening):


Application => rosetta python projects 0.21 (vbox64)
Name => Diff_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_2_5_16878_11
Status => Active
Received => 08.11.2023 13:45:29
Expiration date => 11.11.2023 13:45:15
Estimated calculation effort => 80,000 GFLOPs
Processor time => 1d 05:24:20
Processor time since last checkpoint => 00:07:08
Runtime to date => 21:23:25
Estimated time remaining => 00:00:00
Progress => 100.000%
Memory required => 62.15 MB
Size of the work package => 3.26 GB
Directory => slots/21
Process no. => 10440
Progress rate => 4.680% per hour
Executable file => vboxwrapper_26203_windows_x86_64.exe

Fardringle
Joined: 22 Feb 06
Posts: 6
Credit: 257,195
RAC: 0
Message 7433 - Posted: 10 Nov 2023, 14:25:52 UTC
Last modified: 10 Nov 2023, 14:33:31 UTC

I received a total of 12 tasks from this test batch. Five completed successfully. Seven failed because the virtual machine became "stale" once the application decided it was stuck. In each of the failed tasks, it looks like the VM was put into a paused state instead of being shut down, and because it was then in an unexpected state, the task marked itself as invalid.

And it does not seem to be related to the actual run time. Some of the successful tasks completed in about 15-16 minutes. Others took several hours to finish. The failed tasks have a pretty wide variety of run times as well.

https://ralph.bakerlab.org/results.php?userid=717&offset=0&show_names=0&state=6&appid=

Mr P Hucker
Joined: 3 Mar 23
Posts: 6
Credit: 2,949
RAC: 0
Message 7434 - Posted: 11 Nov 2023, 14:46:09 UTC
Last modified: 11 Nov 2023, 14:49:48 UTC

I'm seeing these take up to a day (and still not finish), but on an old, slow computer (unfortunately only that one seems to get Ralph work). I guess I should let them run to the end? I had been cancelling anything lasting a lot longer than the standard 4/8/12 hours we see on Rosetta. They are still using a full CPU core each. The guidelines do also say "Rule 1. Please do not abort work units.", although those are from 2006.

Fardringle
Joined: 22 Feb 06
Posts: 6
Credit: 257,195
RAC: 0
Message 7435 - Posted: 12 Nov 2023, 2:02:26 UTC - in response to Message 7434.  

Yes, let them try to run to completion, especially on an older computer that might take a while. I had a few take 4-6 hours on my pretty fast Ryzen 9 3900X, so a slow computer could easily take several times that long to finish.

Mr P Hucker
Joined: 3 Mar 23
Posts: 6
Credit: 2,949
RAC: 0
Message 7436 - Posted: 12 Nov 2023, 2:27:40 UTC - in response to Message 7435.  

Some will presumably be auto-cancelled by BOINC because they won't start in time. BOINC is doing its usual trick of downloading enough work for 4 cores when only one is allowed in the preferences (25% usage). I'm sick to death of reporting schoolboy errors to the BOINC programmers.
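
To put rough numbers on the deadline worry, here is a minimal Python sketch. It assumes every queued task arrives at the same moment, takes the same time, and shares one deadline. The function name and the example figures (4 queued tasks, 24 hours each, a 72-hour deadline like the one in the task details posted earlier in the thread) are illustrative assumptions, not values reported by BOINC.

```python
import math


def tasks_finishing_in_time(queued, cores_allowed, hours_per_task, deadline_hours):
    """Count the queued tasks that can finish before a shared deadline.

    Simplified model: tasks run in back-to-back waves of `cores_allowed`,
    and every task takes exactly `hours_per_task`.
    """
    waves_that_fit = math.floor(deadline_hours / hours_per_task)
    return min(queued, waves_that_fit * cores_allowed)


# Hypothetical example: work fetched for 4 cores, but only 1 core may run.
on_time = tasks_finishing_in_time(
    queued=4, cores_allowed=1, hours_per_task=24, deadline_hours=72
)
print(f"{on_time} of 4 queued tasks would finish before the deadline")
```

Under these assumptions one of the four tasks cannot finish in time on a single core, while all four would if four cores were allowed to run.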

aendgraend
Joined: 24 Oct 08
Posts: 5
Credit: 330,028
RAC: 0
Message 7437 - Posted: 13 Nov 2023, 11:24:09 UTC - in response to Message 7432.  
Last modified: 13 Nov 2023, 11:24:56 UTC

The WU I mentioned is still running and still at 100% progress:


Application => rosetta python projects 0.21 (vbox64)
Name => Diff_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_2_5_16878_11
Status => Active
Received => 08.11.2023 13:45:29
Expiration date => 11.11.2023 13:45:15
Estimated calculation effort => 80,000 GFLOPs
Processor time => 3d 17:10:49
Processor time since last checkpoint => 00:06:32
Runtime to date => 2d 21:15:29
Estimated time remaining => ---
Progress => 100.000%
Memory required => 63.26 MB
Size of the work package => 3.26 GB
Directory => slots/21
Process no. => 4464
Progress rate => 1.440% per hour
Executable file => vboxwrapper_26203_windows_x86_64.exe


As my profile doesn't show any workunit in progress (all are shown as 'Timeout - no response'), I'm sadly going to cancel it.
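
A quick sanity check on the two snapshots of this task posted in the thread: the reported "Progress rate" values are consistent with simply dividing progress by runtime, which would explain why the rate keeps falling while the task sits at 100%. The Python sketch below only redoes that arithmetic. That the BOINC Manager derives the figure this way is an assumption based on these numbers alone, and the helper names are made up for illustration.

```python
def to_hours(days=0, hours=0, minutes=0, seconds=0):
    """Convert a 'Xd HH:MM:SS' duration into fractional hours."""
    return days * 24 + hours + minutes / 60 + seconds / 3600


# Values copied from the two task-property listings in this thread.
SNAPSHOTS = [
    {"label": "first post", "progress_pct": 100.0,
     "runtime_h": to_hours(hours=21, minutes=23, seconds=25),
     "reported_rate": 4.680},
    {"label": "follow-up", "progress_pct": 100.0,
     "runtime_h": to_hours(days=2, hours=21, minutes=15, seconds=29),
     "reported_rate": 1.440},
]


def average_rate(progress_pct, runtime_h):
    """Average progress per hour over the whole run so far."""
    return progress_pct / runtime_h


for snap in SNAPSHOTS:
    rate = average_rate(snap["progress_pct"], snap["runtime_h"])
    print(f"{snap['label']}: computed {rate:.3f}%/h, "
          f"reported {snap['reported_rate']:.3f}%/h")
```

Both computed values land within 0.01% per hour of what was displayed, so the shrinking rate looks like what a task stalled at 100% would produce rather than a sign of continued progress.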

Mr P Hucker
Joined: 3 Mar 23
Posts: 6
Credit: 2,949
RAC: 0
Message 7438 - Posted: 13 Nov 2023, 20:19:35 UTC - in response to Message 7437.  

"Timeout - no response" means you didn't send it back on time. It will still be accepted if you send it in late; the timeout just means the server will have sent it to someone else, who probably has the same problem. So I think you should leave it running. I'm leaving mine on.

marmot
Joined: 24 May 19
Posts: 1
Credit: 2,183
RAC: 0
Message 7450 - Posted: 5 Dec 2023, 19:21:21 UTC
Last modified: 5 Dec 2023, 19:23:38 UTC

I've been trying to get one of these on my laptop, which is attached to all my projects.
It's a 10th-gen Intel that successfully runs my own VMs (Win 11 and Win 10 with minimal services, and a tiny Linux install).
I haven't been able to get a single Python WU.
One project's WUs don't work with VBox 7+ yet, so it's on VBox 6.1.42.

Is this the most likely reason the server isn't sending the WU?
Of course, it might just be that this client missed every single batch and rerun that was put out.

I've noticed the beta servers tend to send work to clients that have been attached longest and have prior credit.
This one has no credit.

<scheduler_reply>
<scheduler_version>707</scheduler_version>
<dont_use_dcf/>
<master_url>https://ralph.bakerlab.org/</master_url>
<request_delay>31.000000</request_delay>
<message priority="low">No tasks sent</message>
<project_name>ralph@home</project_name>
<symstore>https://boinc.bakerlab.org/rosetta/symstore</symstore>

<user_name>marmot</user_name>
<user_total_credit>2182.807721</user_total_credit>
<user_expavg_credit>0.099605</user_expavg_credit>

<project_preferences>
<resource_share>100</resource_share>
<no_cpu>0</no_cpu>
<project_specific>
<max_fps>20</max_fps>
<max_cpu>20</max_cpu>
<cpu_run_time>86400</cpu_run_time>
<bg>2</bg>
</project_specific>
</project_preferences>
<host_total_credit>0.000000</host_total_credit>
<host_expavg_credit>0.000000</host_expavg_credit>
<host_venue></host_venue>
<host_create_time>1666233677</host_create_time>
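
The reply above is cut off before its closing tag, so a strict XML parser would reject it. As a rough sketch, the fields relevant to the question can still be pulled out with a regular expression. The `field` helper, the `summary` keys, and the shortened fragment are illustrative choices, not part of BOINC.

```python
import re
from datetime import datetime, timezone

# Lines copied from the scheduler reply above. The closing tag is missing
# there too, which is why a regex is used instead of a strict XML parser.
REPLY_FRAGMENT = """
<scheduler_reply>
<request_delay>31.000000</request_delay>
<message priority="low">No tasks sent</message>
<user_total_credit>2182.807721</user_total_credit>
<host_total_credit>0.000000</host_total_credit>
<host_create_time>1666233677</host_create_time>
"""


def field(name, text):
    """Return the text of the first <name>...</name> element, or None."""
    match = re.search(rf"<{name}\b[^>]*>(.*?)</{name}>", text, re.S)
    return match.group(1).strip() if match else None


summary = {
    "message": field("message", REPLY_FRAGMENT),
    "request_delay_s": float(field("request_delay", REPLY_FRAGMENT)),
    "user_credit": float(field("user_total_credit", REPLY_FRAGMENT)),
    "host_credit": float(field("host_total_credit", REPLY_FRAGMENT)),
    # host_create_time is a Unix timestamp; shown here in UTC.
    "host_created": datetime.fromtimestamp(
        int(field("host_create_time", REPLY_FRAGMENT)), tz=timezone.utc
    ),
}

for key, value in summary.items():
    print(f"{key}: {value}")
```

Run as-is, this shows that the user account has credit while the host record has none and was created on 20 Oct 2022 (UTC), which matches the "no credit" observation. It does not by itself say why no tasks were sent.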

©2024 University of Washington
http://www.bakerlab.org