Message boards : RALPH@home bug list : Bug reports for Ralph 5.05 and higher
Author | Message |
---|---|
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop. So you are going to release it on Rosetta today? Good luck! ;-)
Asking is one thing; making sure jobs will be distributed in the most useful manner is another. I really don't think one needs to rely on aware testers for that. Just lower the quota and shorten the deadlines and you get what you want. Probably a one-week deadline and a quota of 10 WUs is a first step and a compromise. You can even make the WU/day quota editable by the participants. At least I saw it editable in one project; not sure if this is still possible with the latest BOINC version. If you can, I'd recommend setting the quota to 3/day and making it editable for those who want to continue testing with more than three WUs per day. That will prevent unaware users, who just join the project with their usual 3-day cache, from hijacking the WUs by loading 20 WUs at once (and returning them after 10 days or so). |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Tralala, nice advice. I'm lowering the RALPH deadline from 14 days to 4 days. We still value results that come back after two or three days, but you're right that it's ridiculous to get back a job as ancient as two weeks old. At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference? |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
I don't think 5.06 is good for Linux; however, for Windows 5.06 is OK. Maybe you can trap that signal 11 to make it exit with 0 but with no finished file? Then BOINC will restart those WUs again ... and possibly finish OK. These signal 11 errors are caused by a timing problem ... not by an unallocated array.
No heartbeat from core client for 31 sec - exiting
*Too much network traffic! 127.0.0.1 unserviced!
2006-04-27 11:58:14 [ralph@home] Finished download of 1tul__alltopologycodes.bar
2006-04-27 11:58:14 [ralph@home] Throughput 21465 bytes/sec
2006-04-27 11:58:15 [ralph@home] Starting result FACONTACTS_NOFILTERS_1tul__381_3_1 using rosetta_beta version 506
2006-04-27 12:02:48 [ralph@home] Pausing result FACONTACTS_NOFILTERS_1tul__381_3_1 (left in memory)
2006-04-27 12:02:49 [ralph@home] Unrecoverable error for result FACONTACTS_NOFILTERS_1tul__381_3_1 (process exited with code 131 (0x83))
2006-04-27 12:02:49 [ralph@home] Computation for result FACONTACTS_NOFILTERS_1tul__381_3_1 finished
2006-04-27 12:03:50 [ralph@home] Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
2006-04-27 12:03:50 [ralph@home] Reason: To report results
2006-04-27 12:03:50 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
2006-04-27 12:04:00 [ralph@home] Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_03_05.200_v1_3.gz
2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_09_05.200_v1_3.gz
2006-04-27 12:07:47 [ralph@home] Finished download of casp6_aat216_03_05.200_v1_3.gz
2006-04-27 12:07:47 [ralph@home] Throughput 17163 bytes/sec
2006-04-27 12:07:47 [ralph@home] Started download of casp6_t216_.fasta.gz
2006-04-27 12:07:48 [ralph@home] Finished download of casp6_t216_.fasta.gz
2006-04-27 12:07:48 [ralph@home] Throughput 548 bytes/sec
2006-04-27 12:07:48 [ralph@home] Started download of casp6_t216.pdb.gz
2006-04-27 12:07:51 [ralph@home] Finished download of casp6_t216.pdb.gz
2006-04-27 12:07:51 [ralph@home] Throughput 21820 bytes/sec
2006-04-27 12:07:51 [ralph@home] Started download of casp6_t216_.psipred_ss2.gz
2006-04-27 12:07:52 [ralph@home] Finished download of casp6_t216_.psipred_ss2.gz
2006-04-27 12:07:52 [ralph@home] Throughput 8188 bytes/sec
2006-04-27 12:11:04 [ralph@home] Finished download of casp6_aat216_09_05.200_v1_3.gz
2006-04-27 12:11:04 [ralph@home] Throughput 26233 bytes/sec
2006-04-27 12:11:06 [ralph@home] Starting result FA_CASP6_t216__451_30_0 using rosetta_beta version 506
2006-04-27 12:12:41 [ralph@home] Pausing result FA_CASP6_t216__451_30_0 (left in memory)
2006-04-27 12:12:42 [ralph@home] Unrecoverable error for result FA_CASP6_t216__451_30_0 (process exited with code 131 (0x83))
2006-04-27 12:12:42 [ralph@home] Computation for result FA_CASP6_t216__451_30_0 finished
2006-04-27 12:13:42 [ralph@home] Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
2006-04-27 12:13:42 [ralph@home] Reason: To report results
2006-04-27 12:13:42 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
2006-04-27 12:13:48 [ralph@home] Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
2006-04-27 12:13:48 [ralph@home] Message from server: No work sent
2006-04-27 12:13:48 [ralph@home] Message from server: (reached daily quota of 6 results)
https://ralph.bakerlab.org/result.php?resultid=98808
https://ralph.bakerlab.org/result.php?resultid=98790
https://ralph.bakerlab.org/result.php?resultid=98787
https://ralph.bakerlab.org/result.php?resultid=98747
https://ralph.bakerlab.org/result.php?resultid=98658
https://ralph.bakerlab.org/result.php?resultid=98658
https://ralph.bakerlab.org/result.php?resultid=98613
And there is still the problem of WUs freezing at 100% done, and at other percentages too ... without using CPU. I asked here what to do to help fix the problem but got no answer ... so I aborted these WUs.
Click signature for global team stats |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
I think I saw that over at CPDN, but it's no longer settable there either. Perhaps I remembered it wrong, or perhaps it has been disabled in more recent BOINC releases. I'd still think about 10 WU/day is sufficient, and this will further prevent people from building up big caches. |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches. I usually abort all WUs of the previous version when I notice a new version; I know not everyone does this ... However, what is wrong is the BOINC concept of limiting WUs per day. What should be limited is the "cache" of unreturned WUs ... maybe to 2. Once a client exceeds the quota of 2, it does not get more WUs; however, if it returns 1 it can download 1 more, even after the quota is exceeded. *Oops, I forgot that a project reset does not return any WUs ... So whoever does a project reset without aborting WUs first will have to wait until the next day. Click signature for global team stats |
[B^S] sTrey Send message Joined: 15 Feb 06 Posts: 58 Credit: 15,430 RAC: 0 |
4 days? Ouch, I'd hoped for 6 or 7, and definitely with smaller quotas. seti beta has a painfully small return rate due to huge quotas. Shorter deadlines aren't as direct; well, I guess they are here, but if you're running a quorum of more than 1, short deadlines drag things out by having to resend after earlier results time out... Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects. With my current mix I can probably just make 4 days. Of course, when you want really fast returns, those hit-and-quit WUs you've been sending do the job. So I'm wondering: is there little value in testing longer time settings here? Easy enough to drop back to 2 or 4 hour runtimes. p.s. If this discussion continues, maybe it's better moved out of the bug-report thread? |
rbpeake Send message Joined: 16 Feb 06 Posts: 19 Credit: 3,370 RAC: 0 |
Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects.... I wonder this, too. Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc. In this way we can more directly assist you in achieving your testing objectives. Thanks! |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Maximum disk usage exceeded (Linux): https://ralph.bakerlab.org/result.php?resultid=98187 Maybe it is difficult to wipe the files of the previous version from disk before sending out a new version to test? Thanks Click signature for global team stats |
Yeti Send message Joined: 19 Feb 06 Posts: 32 Credit: 316,371 RAC: 853 |
Back to possible bugs: Rosetta 5.06 is using 161 MB of memory and 542 MB of virtual memory. The box is a very old one; the WU has run for 11 hours now and is sitting at 1.04%. I guess it will never finish :-( Oh, my setting for RALPH Target CPU time is 4 hours ... This is the box: https://ralph.bakerlab.org/show_host_detail.php?hostid=1911 This is the result: https://ralph.bakerlab.org/result.php?resultid=98748 Abort or stay a little bit longer? Supporting BOINC, a great concept ! |
Robert Everly Send message Joined: 16 Feb 06 Posts: 10 Credit: 2,333 RAC: 0 |
Just my 2 cents' worth here. I've said this on other project boards as well. There should be two cache settings in each project: 1) Max WU/CPU/day (current cache) 2) Max outstanding WU/CPU. I'd love to see #2 added. Personally I find it silly that some people and systems download hundreds of WUs only to return a portion of them. Just look at host 3755 on seti beta. Yes, the daily quota is down to 1 per day, but there were over 1000 outstanding WUs on the machine. My thought for #2 would be this. The project defines how many outstanding WUs/CPU is acceptable. You can download up to this amount over any number of days with #1. Once you hit the limit in #2, the server refuses to send you more work until you return work. Why keep sending work to hosts that are not returning work? |
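A minimal sketch of the two-limit idea described in this post, assuming a hypothetical scheduler that tracks two counters per host. The names `HostState`, `send_work`, and `report_result` are invented for illustration and are not part of BOINC.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical per-host counters kept by the project server."""
    sent_today: int = 0    # WUs sent since the last daily reset (limit #1)
    outstanding: int = 0   # WUs sent but not yet returned (limit #2)


def send_work(host: HostState, max_per_day: int, max_outstanding: int) -> bool:
    """Send one WU only if both limits allow it; return True if sent."""
    if host.sent_today >= max_per_day:
        return False  # daily quota reached
    if host.outstanding >= max_outstanding:
        return False  # host must return work before getting more
    host.sent_today += 1
    host.outstanding += 1
    return True


def report_result(host: HostState) -> None:
    """A returned result frees one slot under the outstanding limit."""
    if host.outstanding > 0:
        host.outstanding -= 1
```

With `max_outstanding=2`, a host that holds two unreturned WUs is refused a third, and gets one more as soon as it reports a result, which is the behavior proposed for setting #2.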
casio7131 Send message Joined: 20 Mar 06 Posts: 15 Credit: 12,660 RAC: 0 |
28/04/2006 10:47:13 AM|ralph@home|Resuming task FA_CASP6_t198__435_26_1 using rosetta_beta version 505 https://ralph.bakerlab.org/result.php?resultid=97816 Last night I quit BOINC, and this workunit was at about 1.0427% after 2h45m15s (model=1, step=340905, full atom relax) when I quit. I've now restarted BOINC today, and it's at 1.0424% after 2h10m10s (model=1, step=340558, full atom relax) and still running. So it has started redoing the same work that it had already done last night. It seems that the new checkpointing didn't work (since the work was redone today). Or did it just not reach a "checkpointable stage" last night (since this seems like a rather large structure)? |
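To illustrate the "checkpointable stage" question in this post, here is a minimal sketch of periodic checkpointing. It is a toy model and not Rosetta's actual checkpoint code: the JSON file format, the function names, and the `quit_at` parameter are all assumptions made for the example.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write leaves the old checkpoint intact


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, quit_at=None):
    """Resume from the last checkpoint; optionally stop early at quit_at."""
    step = load_checkpoint(path)["step"]
    while step < total_steps:
        if quit_at is not None and step == quit_at:
            return step  # simulates quitting the client mid-run
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, {"step": step})
    return step
```

In this model, quitting at step 340 with a checkpoint every 100 steps means the next run resumes from step 300, so any work done after the last checkpoint is repeated. If no checkpoint was ever reached, the run starts over from step 0.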
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
I think there is little value in crunching a Ralph WU for more than 8 hours. I would suggest deactivating this feature in Ralph and sending out WUs with fixed runtimes, in a mix most appropriate for the tested app/WU. But maybe Rhiju can give his opinion on this. Nevertheless, if one can only crunch one WU in 4 days due to the resource share of Ralph and the runtime preference, that is okay. I think the goal of Ralph is not throughput but diversity. It is better to have 10 hosts trying 1 WU than 1 host trying 10. But perhaps Rhiju can give his opinion on that and post some advice in the news section (at least not to download 20 WUs at once). "Max outstanding WU/CPU" This would be a cool feature, but that is something BOINC has to implement. It would certainly enable much better distribution of WUs without restricting hosts on the maximum WUs per day. |
tralala Send message Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Back to possible bugs: This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine. |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Rosetta beta 5.06 Linux Success https://ralph.bakerlab.org/result.php?resultid=98212 I had success completing the above job on a Linux PC with 256 MB RAM. *All other jobs on the above PC got some sort of error! What I did ... 1) suspended all other projects running on this PC, left only Ralph 2) opened some disk space by deleting some old stuff 3) shut down one of my 10 Mbps Internet links, and the load balancing stuff 4) crunched after midnight, while the majority of my users are asleep. So 5.06 must be OK for Linux too. However, it is weak ... any disturbance ... such as big network traffic, or running multiple projects (even keeping them in RAM) causes the job (oops, WU) to fail. Suggestion: *Signal 11 needs to be trapped to exit with 0 instead of with 183. So the job will exit with 0, but with no finished file, and next BOINC restarts it, until it finishes ... Click signature for global team stats |
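The restart behavior this suggestion relies on can be sketched as a small decision rule. This is only a model of what the post describes (exit status 0 without a finished file leads to a restart); the function name and the returned labels are hypothetical and are not taken from the BOINC client.

```python
def classify_exit(exit_code: int, finished_file_exists: bool) -> str:
    """Model how a client might treat an application exit, per the post above."""
    if exit_code != 0:
        return "error"    # e.g. the 131 (0x83) exits in the logs earlier in the thread
    if finished_file_exists:
        return "done"     # clean exit and the app marked itself finished
    return "restart"      # clean exit but no finished file: run the WU again
```

Under this rule, an application that trapped the signal and exited with 0 before writing its finished file would land in the "restart" branch instead of being reported as an unrecoverable error.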
Yeti Send message Joined: 19 Feb 06 Posts: 32 Credit: 316,371 RAC: 853 |
Back to possible bugs: okay, it seems, as if it finished without error :-) Supporting BOINC, a great concept ! |
suguruhirahara Send message Joined: 5 Mar 06 Posts: 40 Credit: 11,320 RAC: 0 |
Workunits also complete fine on Windows XP x64 Edition, Pentium D 2.8 GHz and 1 GB RAM, using 129 MB and 75 MB of it. With this version, my computer no longer experiences the error of workunits crashing when graphics are shown on screen. Very great. But the completion time is not estimated well. For example, before a workunit started, the time to completion was "01:51:20". But it was "01:57:00" even after 35% of the work was already done. I did not notice such a large difference with the former version. |
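For comparison with the numbers in this post, here is a sketch of one naive way to estimate time to completion, by extrapolating from elapsed time and fraction done. This is an assumption made for illustration, not necessarily how the BOINC client computes its estimate, and the function names are hypothetical.

```python
def estimate_remaining(elapsed_seconds: float, fraction_done: float) -> float:
    """Extrapolate remaining seconds, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds * (1 - fraction_done) / fraction_done


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

Under this linear assumption the remaining time shrinks as the fraction done grows at a steady rate, so an estimate that stays near the starting value at 35% done suggests that either the reported fraction or the estimator is off.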
Mike Gelvin Send message Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 (<file_xfer_error> <file_name>WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)
result: https://ralph.bakerlab.org/result.php?resultid=97709
Win 2000 SP4, Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM
There was an additional message in the result about a nonexistent file:
GZIP SILENT FILE: .xx1enh.out
WARNING! attempt to gzip file .xx1enh.out failed: file does not exist. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc. If they are adding checkpointing and want more frequent switching between jobs, then that makes sense... once we're over these hurdles, I think the best test would be for everyone to have their Ralph preferences match their R@H preferences... and the randomness of how we all have these set is the best beta test, the most similar to the user base of Rosetta. I guess what I'm saying is, if necessary, instruct us on preference changes you'd like to see... but then let's test the same version for another couple (several) days back on our normal settings. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06? I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort 5.05 WUs. This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work, I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU. |
[AF>France>Est>Lorraine]Le Zam Send message Joined: 2 Mar 06 Posts: 9 Credit: 3,278 RAC: 0 |
Hello, I have a problem with this WU: 5.06 FA_CASP6_t216__444_2_2, 50' for 1.02%. So I aborted it. Bye and go on... |
©2024 University of Washington
http://www.bakerlab.org