Checkpointing, more credits? Or more models?


feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1455 - Posted: 2 May 2006, 2:48:12 UTC

At one point it was mentioned that we were seeing 3x productivity on clients with the new checkpointing. I haven't tracked things closely enough... when I lose work due to preemption, does the time spent reset back to the checkpoint? And credit is based on time spent, right?

Or if time spent always rolls forward, then we'd just see more model completions per hour of time? (because less time is spent retracing the steps we had made prior to preemption).
ID: 1455
Moderator9
Volunteer moderator

Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 1456 - Posted: 2 May 2006, 3:34:54 UTC - in response to Message 1455.  
Last modified: 2 May 2006, 3:35:13 UTC

If the Work Unit is removed from memory, it will always roll back to the last checkpoint. When it restarts on my systems, this usually results in lost time as well. The clock does not keep rolling forward if the percent resets. This is why it is still a good idea to set "keep in memory" to yes.

All projects lose some time because of this. CPDN and Rosetta are two of the more lossy in this regard, but every project loses some time this way.
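
To make the rollback concrete, here is a minimal sketch in Python of how much CPU time gets redone when a work unit is removed from memory. It is only an illustration: the function name, the fixed checkpoint spacing, and the sample numbers are assumptions made up for the example, not how BOINC or Rosetta actually schedule checkpoints.

```python
def redone_cpu_seconds(preempt_points, checkpoint_interval, keep_in_memory=False):
    """Estimate CPU seconds repeated after preemptions (hypothetical model).

    preempt_points: progress, in CPU seconds, at which the work unit was preempted.
    checkpoint_interval: assumed fixed spacing between checkpoints, in CPU seconds.
    keep_in_memory: if True, the work unit stays in memory and nothing is redone.
    """
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    if keep_in_memory:
        return 0
    # Removed from memory: work done since the last checkpoint is lost and redone.
    return sum(point % checkpoint_interval for point in preempt_points)


preemptions = [1700, 4000, 9500]  # made-up sample values, in CPU seconds

hourly = redone_cpu_seconds(preemptions, checkpoint_interval=3600)
frequent = redone_cpu_seconds(preemptions, checkpoint_interval=600)
in_memory = redone_cpu_seconds(preemptions, checkpoint_interval=3600, keep_in_memory=True)
print(hourly, frequent, in_memory)  # 4400 1400 0
```

Under these assumed numbers, closer checkpoints cut the redone time from 4400 to 1400 CPU seconds, and keeping the work unit in memory avoids the loss entirely.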

Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
ID: 1456
feet1st

Message 1458 - Posted: 2 May 2006, 14:26:14 UTC - in response to Message 1456.  



...so, on average, with the enhanced checkpointing, we should expect to see a credit increase throughout the project, along with increased project TFLOPS (which, as you've pointed out elsewhere, appear to be calculated directly from credits issued).
ID: 1458
Moderator9
Volunteer moderator

Message 1459 - Posted: 2 May 2006, 15:07:16 UTC - in response to Message 1458.  



And, not surprisingly, that is precisely what has happened. If you look at the Teraflops graphs on BOINCStats, and you have been watching the Rosetta homepage, you can see the effect.

You have to ignore Friday, because failed credit awards caused a spike that day, but the project is showing about 27 TF and the general trend is upward. It rises and falls a little, but the trend is still up.

The important thing is that only a week ago the project was stalled at about 24 TF. That 3 TF gain comes from fixing errors and from reducing the time lost to checkpointing issues. By my estimates, about another 1 TF will come from additional error fixing. There could be another 2-4 TF still being lost due to long gaps between checkpoints. There is also about 2-3 TF available if the Mac version of the application is fixed and optimized using AltiVec coding. So there is still about 5 TF that could be squeezed out of the existing attached base of the project, all without adding a single system. Now, to be fair, systems have been joining and returning every day, so some part of the improvement comes from that as well.
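
As a rough tally of the estimates above, the short Python sketch below simply adds up the quoted ranges. It is arithmetic on the figures in this post, not a measurement, and the variable names are made up for the example.

```python
# Figures quoted in the post, in teraflops (TF), as (low, high) ranges.
estimates_tf = {
    "additional error fixing": (1, 1),
    "long checkpointing": (2, 4),
    "Mac version fixed and AltiVec-optimized": (2, 3),
}

# Sum the low ends and the high ends separately.
low_tf = sum(low for low, _ in estimates_tf.values())
high_tf = sum(high for _, high in estimates_tf.values())

week_ago_tf = 24   # "stalled at about 24TF"
current_tf = 27    # "showing about 27TF"
recent_gain_tf = current_tf - week_ago_tf

print(f"gain over the last week: {recent_gain_tf} TF")
print(f"further potential: {low_tf}-{high_tf} TF")
```

The "about 5 TF" figure matches the low end of these ranges; if the high ends all held, the same tally would give 8 TF.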


ID: 1459




©2024 University of Washington
http://www.bakerlab.org