Message boards : Number crunching : Checkpointing, more credits? Or more models?
Author | Message |
---|---|
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
At one point it was mentioned that we were seeing 3x productivity on clients with the new checkpointing. I haven't tracked things closely enough... when I lose work due to preemption, does the time spent reset back to the checkpoint? And credit is based on time spent, right? Or, if time spent always rolls forward, would we just see more model completions per hour of time (because less time is spent retracing the steps we had made prior to preemption)? |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
"At one point it was mentioned that we were seeing 3x productivity on clients with the new checkpointing. I haven't tracked things closely enough... when I lose work due to preemption, does the time spent reset back to the checkpoint? And credit is based on time spent, right?" If the Work Unit is removed from memory, it will always roll back to the last checkpoint. When it restarts on my systems, this usually results in lost time as well. The clock does not keep rolling forward if the percent resets. This is why it is still a good idea to set keep in memory to yes. All projects lose some time this way; CPDN and Rosetta are two of the more lossy in this regard. Moderator9 RALPH@home FAQs RALPH@home Guidelines Moderator Contact |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
...so, on average, with the enhanced checkpointing, we should expect to see a credit increase throughout the project, along with increased project TFLOPS (which, as you've pointed out elsewhere, appear to be calculated directly from credits issued). |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
And well, not surprisingly, that is precisely what has happened. If you look at the graphs on BOINCStats for Teraflops, and you have been watching the homepage of Rosetta, you can see the effect. You have to ignore Friday, because failed credit awards caused a spike that day, but the project is showing about 27 TF and there is a general trend upward. It rises and falls a little, but still the trend is up. The important thing is that only a week ago the project was stalled at about 24 TF. That 3 TF gain is all about fixing the errors and reducing the time lost to checkpointing issues. By my estimates there is about another 1 TF that will come from additional error fixing. There could be another 2-4 TF still being lost due to long checkpointing. There is also about 2-3 TF available if the Mac version of the application is fixed and optimized using Altivec coding. So there is still at least about 5 TF (the low end of those estimates added together) that could be squeezed out of the existing attached base of the project. This is all without adding a single system. Now, to be fair, there have been systems joining and returning every day, so some part of the improvement comes from that as well. Moderator9 RALPH@home FAQs RALPH@home Guidelines Moderator Contact |
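For anyone who wants to see the rollback effect discussed in this thread in numbers, here is a minimal sketch. It is a toy model, not BOINC or Rosetta code: the function name `cpu_time_to_finish` and all of its parameters are made up for illustration. It assumes the task runs in fixed-length slices, is removed from memory at the end of each slice, and restarts from the last checkpoint unless "keep in memory" is on.

```python
def cpu_time_to_finish(work_needed, checkpoint_interval, slice_len,
                       keep_in_memory=False):
    """Return total CPU time spent to complete `work_needed` units of work.

    Hypothetical toy model (not BOINC's API): the task runs for `slice_len`
    units, is then preempted, and (unless kept in memory) resumes from the
    last checkpoint, which is written every `checkpoint_interval` units of
    progress. All arguments are in the same time unit, e.g. minutes.
    """
    if min(work_needed, checkpoint_interval, slice_len) <= 0:
        raise ValueError("all durations must be positive")

    progress = 0  # work that has been completed and not rolled back
    spent = 0     # CPU time consumed, including work that is later redone
    while progress < work_needed:
        start = progress
        run = min(slice_len, work_needed - progress)
        spent += run
        progress += run
        if progress >= work_needed:
            break  # finished before the next preemption
        if not keep_in_memory:
            # Preempted and removed from memory: roll back to last checkpoint.
            progress = (progress // checkpoint_interval) * checkpoint_interval
            if progress <= start:
                # The slice never reaches a checkpoint, so nothing survives.
                raise RuntimeError(
                    "slice shorter than checkpoint interval: no progress")
    return spent
```

With 180 minutes of work, 60-minute slices, and a checkpoint every 25 minutes, this model spends 210 minutes of CPU time; with keep in memory set to yes, or with a checkpoint every 5 minutes, it spends 180. The difference is the time spent retracing steps after each preemption, which is the loss that shorter checkpoint intervals reduce.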
©2024 University of Washington
http://www.bakerlab.org