Posts by tralala

21) Message boards : Current tests : New crediting system (Message 1924)
Posted 8 Aug 2006 by tralala
Post:
tralala - the aim is to avoid and abandon, once and for all, ill-numbered benchmarks.
(or at least I hope and pray).

You can simply choose the CPU type and divide it by CPU frequency - or use a golden machine, as tony suggested. I know it's not perfect; RAM speed plays a role, etc.
But you should land within an acceptable ±10%, not be off by something like 500% as with benchmarks.

Just try to maintain inter-project parity, that will do...


Well, so far they have revealed no details about how they plan to establish the credit/model ratio. As I see it, there are two approaches. What you describe would require a bunch of trustworthy computers, preferably locally available, with different processors and OSs. 10-20 different machines would probably be sufficient for that (one golden machine will not work, since it will favor one CPU type over others and one OS over others). That is not a task for Ralph, I'd say, since the computers on RALPH are not trustworthy.

The second approach is different and just relies on the force of numbers, where anything will balance out with more sampling. This would mean sending out WUs on Ralph to "untrustable" hosts but collecting a lot of results, so that on average you will still get correct values.

The bottom line is, it is a tricky task which requires some testing and tuning until it works flawlessly. I hope they will discuss it with us before they use it over at Rosetta to avoid fixing it while it is being used at Rosetta.

22) Message boards : Current tests : New crediting system (Message 1922)
Posted 8 Aug 2006 by tralala
Post:
How do you want to determine credit/model from Ralph runs? As an average of claimed credit based on X runs? As a median? Keep in mind that the reported specs and benchmarks on RALPH can be as manipulated for some hosts as on Rosetta. Some use special BOINC versions which claim about 3x the credit the standard client claims (including myself).

Possibly an average of claimed credit/model over 10 or more results is reliable enough to be used for Rosetta. If you take fewer results, distorted credit reports have a higher impact on some WUs, so some WUs will give more credit and others less (which is not good, since people will start cherry-picking "good" WUs by aborting "bad" WUs).
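
To make the average-versus-median question concrete, here is a small Python sketch. The claim values, the function name credit_per_model and the 10-result threshold as a hard cutoff are my own assumptions for illustration, not anything the project has announced. It just shows how a couple of clients claiming about 3x pull the average up while the median stays put.

```python
from statistics import mean, median

def credit_per_model(claims, min_results=10):
    """Return (average, median) of claimed credit per model.

    Returns None when fewer than min_results claims are available,
    since a small sample is too easily distorted.
    """
    if len(claims) < min_results:
        return None
    return mean(claims), median(claims)

# Hypothetical claims: eight standard clients claiming 10 credits/model,
# two optimized clients claiming about 3x as much.
claims = [10.0] * 8 + [30.0] * 2

avg, med = credit_per_model(claims)
print("average:", avg)  # pulled up by the two inflated claims
print("median:", med)   # unaffected by the two inflated claims
```

With these made-up numbers the median looks like the safer choice, but that only holds as long as the inflated claims stay a minority of the sample.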
23) Message boards : RALPH@home bug list : Bug reports for Ralph 5.21 (Message 1783)
Posted 6 Jun 2006 by tralala
Post:
I had four good results with 5.21.
However, I noticed that no checkpointing was done between the models. On my fast computer a model completed in between 10 and 25 minutes. For this WU, for example, it took 25 minutes between the checkpoints (models), which can translate into over an hour on a slow Mac.
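
As a rough back-of-the-envelope check in Python (the function names and the example slowdown factor of 3 are my own assumptions, I have no measurements from an actual slow Mac): the gap between checkpoints simply scales with how much slower the host is, so 25 minutes passes the one-hour mark as soon as a host is more than 2.4 times slower than mine.

```python
def checkpoint_gap_minutes(fast_gap_minutes, slowdown):
    """Time between checkpoints on a host that is `slowdown` times slower."""
    return fast_gap_minutes * slowdown

def slowdown_needed(fast_gap_minutes, target_minutes):
    """Slowdown factor at which the gap reaches target_minutes."""
    return target_minutes / fast_gap_minutes

fast_gap = 25  # minutes between checkpoints (models) on the fast computer
threshold = slowdown_needed(fast_gap, 60)
print("slowdown needed for a one-hour gap:", threshold)

# Hypothetical host three times slower than the fast computer.
print("gap on a 3x slower host (minutes):", checkpoint_gap_minutes(fast_gap, 3))
```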

Over at Rosetta people are "complaining" that it may take between 90-120 minutes for a WU to reach its first checkpoint.

What happened to more frequent checkpointing?
24) Message boards : RALPH@home bug list : Bug reports for Ralph 5.20 (Message 1764)
Posted 4 Jun 2006 by tralala
Post:
Rom tells me it is waiting for the watchdog to finish for debugging.

Here is his response:

"When I added code .... to wait until
the thread is finished, it stalls for up to 30 minutes waiting until
watchdog makes its next check."

I think the watchdog can take up to 2x the cpu run time pref, which may explain the longer stalls.


Does this mean it was intentionally implemented for debugging purposes? You could have saved us some investigation if you had told us. Anyway, it's good to know that the cause is known and won't delay further development.
25) Message boards : RALPH@home bug list : Bug reports for Ralph 5.20 (Message 1758)
Posted 3 Jun 2006 by tralala
Post:

EDIT: Can't you make a watchdog to activate the WU again after it has been idle for, let's say, 3 minutes? Or 5 minutes? When it isn't crunching, my computer goes into sleep mode after 15 minutes.


This is a bug which was introduced after 5.16, so I hope they can spot it and fix it completely rather than adding another safety mechanism.
26) Message boards : RALPH@home bug list : Bug reports for Ralph 5.17-5.19 (Message 1733)
Posted 1 Jun 2006 by tralala
Post:
While the progress of a unit reached 100% on my BOINC client, strangely it continues to work, and the % complete on the graphic stays below 100%, at approximately 57%.
(here you can see a screen shot of it.) And I have kept working on the unit for several minutes, but it remains at the same percentage.

Is this a bug or not? thanks,



I think this is related to the phenomenon Mike Gelvin was reporting that some WU remain dormant after showing 100%. Perhaps you can check the task manager for the next WU.


I can confirm this bug reported by suguruhirahara and Mike Gelvin. This WU:
http://ralph.bakerlab.org/result.php?resultid=145529

finished after 54 minutes and BOINC reported 100%, but it was still running and the task manager reported 0% CPU utilization. This lasted for about 5 minutes until it finished. I checked the screensaver: the progress was 89% and it showed that it had started a 4th model, which in fact did not seem to be calculated. Here is the screenshot
27) Message boards : RALPH@home bug list : Bug reports for Ralph 5.17-5.19 (Message 1732)
Posted 1 Jun 2006 by tralala
Post:
Instant failure, no graphics displayed but two preempted WU from Rosetta in mem:
http://ralph.bakerlab.org/result.php?resultid=145667

Have two more of those WU, will watch them closely.

How can I turn debug-information on?


The next two WUs failed as well with the same error code after the same time (6sec):

http://ralph.bakerlab.org/result.php?resultid=145641
http://ralph.bakerlab.org/result.php?resultid=145639
28) Message boards : RALPH@home bug list : Bug reports for Ralph 5.17-5.19 (Message 1731)
Posted 1 Jun 2006 by tralala
Post:
Instant failure, no graphics displayed but two preempted WU from Rosetta in mem:
http://ralph.bakerlab.org/result.php?resultid=145667

Have two more of those WU, will watch them closely.

How can I turn debug-information on?
29) Message boards : RALPH@home bug list : Bug reports for Ralph 5.17-5.19 (Message 1730)
Posted 1 Jun 2006 by tralala
Post:
While the progress of a unit reached 100% on my BOINC client, strangely it continues to work, and the % complete on the graphic stays below 100%, at approximately 57%.
(here you can see a screen shot of it.) And I have kept working on the unit for several minutes, but it remains at the same percentage.

Is this a bug or not? thanks,



I think this is related to the phenomenon Mike Gelvin was reporting that some WU remain dormant after showing 100%. Perhaps you can check the task manager for the next WU.
30) Message boards : Feedback : The ultimate guide to work unit distribution on RALPH (Message 1492)
Posted 5 May 2006 by tralala
Post:
Hi guys,

I read a bit about the BOINC system and found some interesting information about possible work unit distribution. The two most terrific features for RALPH are:

max_wus_to_send:
Maximum results sent per scheduler RPC.

min_sendwork_interval:
Minimum number of seconds to wait after sending results to a given host, before new results are sent to the same host.

I think with these two parameters one can stop the stockpiling effectively and reduce the avg. return time dramatically:

max_wus_to_send: 1
min_sendwork_interval: 1h

This means every host gets only one WU per request and can ask for an additional WU only after an hour. This will lead to a more equal distribution of WUs and prevent the first lucky hosts from grabbing 20 WUs each. It will still allow them to ask for another WU every hour, though.

So if I were the work unit distributor of RALPH I would use the following parameters:

deadline: 5 days
daily quota: 10
max_wus_to_send: 1
min_sendwork_interval: 1h
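
Here is a toy Python model of how I understand the two parameters to interact. It is not the real BOINC scheduler: the Scheduler class, request_work and the host and WU names are all made up for illustration. Since the interval is defined in seconds, I write the 1h as 3600.

```python
from dataclasses import dataclass, field

@dataclass
class Scheduler:
    """Toy work distributor, not the real BOINC scheduler."""
    max_wus_to_send: int = 1            # results sent per scheduler request
    min_sendwork_interval: int = 3600   # seconds between sends to one host (1h)
    queue: list = field(default_factory=list)
    last_sent: dict = field(default_factory=dict)  # host id -> time of last send

    def request_work(self, host_id, now):
        """Return the WUs granted to host_id at time `now` (in seconds)."""
        last = self.last_sent.get(host_id)
        if last is not None and now - last < self.min_sendwork_interval:
            return []  # this host asked again too soon
        granted = self.queue[:self.max_wus_to_send]
        del self.queue[:self.max_wus_to_send]
        if granted:
            self.last_sent[host_id] = now
        return granted

sched = Scheduler(queue=[f"wu_{i}" for i in range(5)])
first = sched.request_work("host_a", now=0)       # gets one WU
too_soon = sched.request_work("host_a", now=600)  # refused, only 10 minutes later
other = sched.request_work("host_b", now=600)     # a different host is served
later = sched.request_work("host_a", now=3600)    # an hour later, one more WU
print(first, too_soon, other, later)
```

In this model host_a cannot stockpile: it gets one WU, is refused ten minutes later, and gets its next WU only after the hour has passed, while host_b is served in between.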

What do you think? Let's discuss the best settings here and propose them to Rhiju once we agree.
31) Message boards : Cafe RALPH : SOMEone PLEASE vote SOMEONE for user of the day!! (Message 1463)
Posted 3 May 2006 by tralala
Post:
As you may be aware RALPH is a test project. The purpose of the User of the Day on RALPH is to test the tolerance of the user community for consistency in this area of the homepage....

A real stress test for the community. ;-)
32) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1442)
Posted 30 Apr 2006 by tralala
Post:
I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.


The daily quota is per CPU. So if you have a dual-core or a Hyperthreading-enabled P4 you get 6 WU/day if the daily quota is 3 WU/day.
33) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1439)
Posted 30 Apr 2006 by tralala
Post:
As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again.


That's easy to solve: limit the daily quota to five or less. That means clients grab new jobs instantly but can't pile up big caches.
At the moment it works as follows: the first 20 clients pile up 20 WUs each and no more work is available. These hosts are busy with them for several days, so you get your work returned late. With 5 WU/day the first 80 clients grab 5 WUs each and are busy with them for only a day or less. I'd even say 3 WU/day is a good quota.
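
The arithmetic behind this, as a quick Python sketch. The batch size of 400 WUs follows from 20 clients taking 20 WUs each; the 3 hours of crunching per WU and the function names are assumptions of mine, just to put numbers on "several days" versus "a day or less".

```python
def hosts_served(total_wus, per_host_limit):
    """Number of hosts that get work when each one takes its full limit."""
    return total_wus // per_host_limit

def days_busy(per_host_limit, hours_per_wu):
    """Days a single-CPU host needs to crunch through its full allotment."""
    return per_host_limit * hours_per_wu / 24

batch = 20 * 20    # 400 WUs: 20 clients piling up 20 WUs each
hours_per_wu = 3   # assumed runtime per WU, depends on the runtime preference

for limit in (20, 5, 3):
    print(limit, "WU/host:", hosts_served(batch, limit), "hosts,",
          days_busy(limit, hours_per_wu), "days busy")
```

Under these assumptions the same batch reaches four times as many hosts with a quota of 5, and each host is done well within a day instead of after two and a half.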

Short deadlines have a similar effect but it seems you reset them to match those of Rosetta.
34) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1418)
Posted 28 Apr 2006 by tralala
Post:
Back to possible bugs:

Rosetta 5.06

using 161 MB of memory, 542 MB of virtual memory

The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

I guess, it will never finish :-(

Oh, my setting for RALPH Target CPU time is 4 hours ...

This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911

This is the result: http://ralph.bakerlab.org/result.php?resultid=98748

Abort or stay a little bit longer ?


This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.
35) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1417)
Posted 28 Apr 2006 by tralala
Post:
I think there is little value in crunching Ralph WUs for more than 8 hours. I would suggest deactivating this feature in Ralph, sending out WUs with fixed runtimes, and sending out the mix most appropriate for the tested app/WU. But maybe Rhiju can give his opinion on this. Nevertheless, if one can only crunch one WU in 4 days due to the resource share of Ralph and the runtime preference, that is okay. I think the goal of Ralph is not throughput but diversity. It is better to have 10 hosts trying 1 WU than 1 host trying 10. But perhaps Rhiju can give his opinion on that and post some advice in the news section (at least not to download 20 WUs at once).

"Max outstanding WU/CPU"

This would be a cool feature, but that is something BOINC has to implement. It would certainly enable much better distribution of WUs without restricting hosts to a maximum number of WUs per day.
36) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1409)
Posted 27 Apr 2006 by tralala
Post:

At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference?


I think I saw that over at CPDN, but it's no longer settable there either. Perhaps I remembered it wrong, or perhaps it has been disabled in more recent BOINC releases.

I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches.
37) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1405)
Posted 27 Apr 2006 by tralala
Post:
Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.


So you are going to release it on Rosetta today? Good luck! ;-)


A few quick replies:

I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.

I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release.


Asking is one thing; making sure jobs are distributed in the most useful manner is another. I really don't think one needs to rely on aware testers for that. Just lower the quota and shorten the deadlines and you get what you want. A one-week deadline and a quota of 10 WUs is probably a first step and a compromise.

You can even make the WU/day quota editable by the participants. At least I saw it editable in one project; I'm not sure if this is still possible with the latest BOINC version. If you can, I'd recommend setting the quota to 3/day and making it editable for those who want to continue testing with more than three WUs per day. That will prevent ignorant users from hijacking the WUs: those who just join the project with their usual 3-day cache and load 20 WUs at once (returning them after 10 days or so).
38) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1398)
Posted 27 Apr 2006 by tralala
Post:
I just finished one WU with several deliberate switchings in between (long and short) and it seems 5.06 has solved the issue. The warning message about the file deletion error is also gone. :-)

http://ralph.bakerlab.org/result.php?resultid=98261

Now Rhiju please have a look at this and this.
39) Message boards : Current tests : Weird.. I kept the 60 minutes switch between Applications (Message 1396)
Posted 27 Apr 2006 by tralala
Post:
That's the scheduler of BOINC and not related to Rosetta/Ralph. The scheduler is supposed to be super-smart and balance it all out nicely; however, I'm too dumb to understand its brilliant reasoning and therefore micromanage my projects by suspending/resuming the WUs I want BOINC to crunch.

However, in the long run BOINC will operate fine in auto-mode if unattended. You just can't influence its decisions reliably without suspending/resuming the WUs/projects you want.
40) Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher (Message 1382)
Posted 26 Apr 2006 by tralala
Post:

The watchdog aborted this although the overall runtime was only a couple of minutes. After a few minutes of runtime it was preempted for a few hours by another WU, and after resuming the watchdog probably assumed it had run for over an hour with no progress. It seems the watchdog is only comparing two points in time without checking what happened in between.

04/26/06 18:59:19||Rescheduling CPU: application exited
04/26/06 18:59:19|ralph@home|Computation for task AB_CASP6_u272__444_4_0 finished


335.453125
stderr out

<core_client_version>5.4.6</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3882530
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is killing the run!
Stuck at score 33.7964 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .xxu272.out
WARNING! error deleting file .xxu272.out

</stderr_txt>
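
To illustrate what I suspect is going on, here is a minimal Python sketch. I have not seen the actual watchdog code, so the Watchdog class and the timings below are my own guess at the logic; only the 3600-second limit and the score come from the stderr output above. The same check is run once against wall-clock time and once against CPU time actually used.

```python
STUCK_LIMIT_S = 3600  # "Stuck at score ... for 3600 seconds" in the stderr output

class Watchdog:
    """Flags a run as stuck when the score has not changed for
    STUCK_LIMIT_S seconds on whatever clock the caller passes in."""

    def __init__(self, score, clock_now):
        self.score = score
        self.last_change = clock_now

    def check(self, score, clock_now):
        if score != self.score:
            self.score = score
            self.last_change = clock_now
            return False
        return clock_now - self.last_change >= STUCK_LIMIT_S

# Hypothetical timeline: the score last changed after 300 s of crunching,
# then the WU was preempted for 3 hours and crunched 35 s more after resuming.
score = 33.7964
wall_watchdog = Watchdog(score, clock_now=300)  # fed wall-clock seconds
cpu_watchdog = Watchdog(score, clock_now=300)   # fed CPU seconds actually used

wall_now = 300 + 3 * 3600 + 35  # wall clock kept running during preemption
cpu_now = 300 + 35              # CPU time did not advance while preempted

killed_by_wall = wall_watchdog.check(score, wall_now)
killed_by_cpu = cpu_watchdog.check(score, cpu_now)
print("wall-clock watchdog kills the run:", killed_by_wall)
print("CPU-time watchdog kills the run:", killed_by_cpu)
```

If the watchdog measured against CPU time, or at least reset its timer whenever the WU is preempted, a run like this one would not be killed.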





©2024 University of Washington
http://www.bakerlab.org