Bug reports for Ralph 5.05 and higher

tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1405 - Posted: 27 Apr 2006, 17:59:06 UTC - in response to Message 1403.  
Last modified: 27 Apr 2006, 18:05:14 UTC

Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.


So you are going to release it on Rosetta today? Good luck! ;-)


A few quick replies:

I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.

I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release.


Asking is one thing; making sure jobs are distributed in the most useful manner is another. I really don't think one needs to rely on aware testers for that. Just lower the quota and shorten the deadlines and you get what you want. A one-week deadline and a quota of 10 WUs is probably a good first step and a compromise.

You can even make the WU/day quota editable by the participants. At least, I saw it editable in one project; I'm not sure whether this is still possible with the latest BOINC version. If you can, I'd recommend setting the quota to 3/day and making it editable for those who want to test more than three WUs per day. That will prevent ignorant users who have just joined the project from hijacking the WUs by loading 20 at once with their usual 3-day cache (and returning them after 10 days or so).

ID: 1405
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 1407 - Posted: 27 Apr 2006, 19:32:58 UTC - in response to Message 1405.  

Tralala, nice advice. I'm lowering the RALPH deadline from 14 days to 4 days. We still value results that come back after two or three days, but you're right that it's ridiculous to get back a job as ancient as two weeks old.

At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference?



Asking is one thing; making sure jobs are distributed in the most useful manner is another. I really don't think one needs to rely on aware testers for that. Just lower the quota and shorten the deadlines and you get what you want. A one-week deadline and a quota of 10 WUs is probably a good first step and a compromise.

You can even make the WU/day quota editable by the participants. At least, I saw it editable in one project; I'm not sure whether this is still possible with the latest BOINC version. If you can, I'd recommend setting the quota to 3/day and making it editable for those who want to test more than three WUs per day. That will prevent ignorant users who have just joined the project from hijacking the WUs by loading 20 at once with their usual 3-day cache (and returning them after 10 days or so).


ID: 1407
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 1408 - Posted: 27 Apr 2006, 19:53:22 UTC
Last modified: 27 Apr 2006, 20:04:35 UTC

I don't think 5.06 is good for Linux; for Windows, however, 5.06 is OK.

Maybe you can trap that signal 11 to make it exit with 0 but with no finished file?
Then BOINC will restart those WUs ... and they will possibly finish OK.

These signal 11s are caused by a timing problem ... not by an unallocated array.
No heartbeat from core client for 31 sec - exiting
*too much network traffic ! 127.0.0.1 unserviced!

2006-04-27 11:58:14 [ralph@home] Finished download of 1tul__alltopologycodes.bar
2006-04-27 11:58:14 [ralph@home] Throughput 21465 bytes/sec
2006-04-27 11:58:15 [ralph@home] Starting result FACONTACTS_NOFILTERS_1tul__381_3_1 using rosetta_beta version 506
2006-04-27 12:02:48 [ralph@home] Pausing result FACONTACTS_NOFILTERS_1tul__381_3_1 (left in memory)
2006-04-27 12:02:49 [ralph@home] Unrecoverable error for result FACONTACTS_NOFILTERS_1tul__381_3_1 (process exited with code 131 (0x83))
2006-04-27 12:02:49 [ralph@home] Computation for result FACONTACTS_NOFILTERS_1tul__381_3_1 finished
2006-04-27 12:03:50 [ralph@home] Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
2006-04-27 12:03:50 [ralph@home] Reason: To report results
2006-04-27 12:03:50 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
2006-04-27 12:04:00 [ralph@home] Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_03_05.200_v1_3.gz
2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_09_05.200_v1_3.gz
2006-04-27 12:07:47 [ralph@home] Finished download of casp6_aat216_03_05.200_v1_3.gz
2006-04-27 12:07:47 [ralph@home] Throughput 17163 bytes/sec
2006-04-27 12:07:47 [ralph@home] Started download of casp6_t216_.fasta.gz
2006-04-27 12:07:48 [ralph@home] Finished download of casp6_t216_.fasta.gz
2006-04-27 12:07:48 [ralph@home] Throughput 548 bytes/sec
2006-04-27 12:07:48 [ralph@home] Started download of casp6_t216.pdb.gz
2006-04-27 12:07:51 [ralph@home] Finished download of casp6_t216.pdb.gz
2006-04-27 12:07:51 [ralph@home] Throughput 21820 bytes/sec
2006-04-27 12:07:51 [ralph@home] Started download of casp6_t216_.psipred_ss2.gz
2006-04-27 12:07:52 [ralph@home] Finished download of casp6_t216_.psipred_ss2.gz
2006-04-27 12:07:52 [ralph@home] Throughput 8188 bytes/sec
2006-04-27 12:11:04 [ralph@home] Finished download of casp6_aat216_09_05.200_v1_3.gz
2006-04-27 12:11:04 [ralph@home] Throughput 26233 bytes/sec
2006-04-27 12:11:06 [ralph@home] Starting result FA_CASP6_t216__451_30_0 using rosetta_beta version 506
2006-04-27 12:12:41 [ralph@home] Pausing result FA_CASP6_t216__451_30_0 (left in memory)
2006-04-27 12:12:42 [ralph@home] Unrecoverable error for result FA_CASP6_t216__451_30_0 (process exited with code 131 (0x83))
2006-04-27 12:12:42 [ralph@home] Computation for result FA_CASP6_t216__451_30_0 finished
2006-04-27 12:13:42 [ralph@home] Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
2006-04-27 12:13:42 [ralph@home] Reason: To report results
2006-04-27 12:13:42 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
2006-04-27 12:13:48 [ralph@home] Scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi succeeded
2006-04-27 12:13:48 [ralph@home] Message from server: No work sent
2006-04-27 12:13:48 [ralph@home] Message from server: (reached daily quota of 6 results)
https://ralph.bakerlab.org/result.php?resultid=98808
https://ralph.bakerlab.org/result.php?resultid=98790
https://ralph.bakerlab.org/result.php?resultid=98787
https://ralph.bakerlab.org/result.php?resultid=98747
https://ralph.bakerlab.org/result.php?resultid=98658
https://ralph.bakerlab.org/result.php?resultid=98613
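
For anyone sifting a longer client log for failures like the two shown above, a small parser can pull out the result name and exit code from each "Unrecoverable error" line. This is a minimal sketch in Python; the line layout is inferred only from the excerpt in this post, and `ERROR_LINE` and `failed_results` are hypothetical names, not part of BOINC.

```python
import re

# Pattern for the client log lines quoted above. The field layout is inferred
# from this one excerpt, not from any BOINC specification.
ERROR_LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(process exited with code (?P<code>\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)$"
)


def failed_results(log_text):
    """Return (timestamp, result name, exit code) for each failed result."""
    failures = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue
        code = int(match.group("code"))
        # Sanity check: the decimal and hex forms on the line should agree.
        if code != int(match.group("hex"), 16):
            continue
        failures.append((match.group("stamp"), match.group("result"), code))
    return failures


SAMPLE = """\
2006-04-27 12:02:48 [ralph@home] Pausing result FACONTACTS_NOFILTERS_1tul__381_3_1 (left in memory)
2006-04-27 12:02:49 [ralph@home] Unrecoverable error for result FACONTACTS_NOFILTERS_1tul__381_3_1 (process exited with code 131 (0x83))
2006-04-27 12:12:42 [ralph@home] Unrecoverable error for result FA_CASP6_t216__451_30_0 (process exited with code 131 (0x83))
"""

for stamp, result, code in failed_results(SAMPLE):
    print(stamp, result, code)
```

In the excerpt above, both failures are logged one second after the matching "Pausing result ... (left in memory)" line, which may be worth checking against a longer log.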


There is also still the problem of WUs freezing at 100% done,
and at other percentages too, without using any CPU. I asked here what to do to help fix the problem

but got no answer ... so I aborted these WUs


ID: 1408
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1409 - Posted: 27 Apr 2006, 19:58:56 UTC - in response to Message 1407.  
Last modified: 27 Apr 2006, 20:03:10 UTC


At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference?


I think I saw that over at CPDN, but it's no longer settable there either. Perhaps I remembered it wrong, or perhaps it has been disabled in more recent BOINC releases.

I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches.
ID: 1409
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 1410 - Posted: 27 Apr 2006, 20:16:47 UTC
Last modified: 27 Apr 2006, 20:22:37 UTC

I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches.


I usually abort all WUs of the previous version when I notice a new version.
I know not everyone does this ...

However, what is wrong is the BOINC concept of limiting WUs per day.

What should be limited is the "cache" of unreturned WUs ... maybe to 2.

Once a client reaches the quota of 2, it does not get more WUs;
however, if it returns 1 it can download 1 more, even after the quota was reached.
*Oops, I forgot that a project reset does not return any WUs ...
So whoever does a project reset without aborting WUs first will have to wait until the next day.

ID: 1410
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 1411 - Posted: 27 Apr 2006, 20:29:43 UTC

4 days? Ouch, I'd hoped for 6 or 7, and definitely with smaller quotas.

seti beta has a painfully small return rate due to huge quotas. Shorter deadlines aren't as direct, well I guess they are here but if you're running a quorum of more than 1, short deadlines drag things out having to resend after earlier results time out...

Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects. With my current mix I can probably just make 4 days. Of course, when you want really fast returns, those hit-and-quit WUs you've been sending do the job.

So I'm wondering: is there little value in testing longer runtime settings here? It's easy enough to drop back to 2 or 4 hour runtimes.

p.s.
If this discussion continues, maybe it's better moved out of the bug-report thread?
ID: 1411
rbpeake
Joined: 16 Feb 06
Posts: 19
Credit: 3,370
RAC: 0
Message 1412 - Posted: 27 Apr 2006, 20:36:14 UTC - in response to Message 1411.  
Last modified: 27 Apr 2006, 20:37:05 UTC

Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects....

So I'm wondering is there little value for testing longer time settings here? Easy enough to drop back to 2 or 4 hour runtimes.


I wonder this, too. Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc.

In this way we can more directly assist you in achieving your testing objectives.

Thanks!

ID: 1412
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 1413 - Posted: 27 Apr 2006, 22:22:31 UTC

Maximum disk usage exceeded (Linux)
https://ralph.bakerlab.org/result.php?resultid=98187

Would it be difficult to wipe the files of the previous version from disk
before sending out a new version to test?

Thanks
ID: 1413
Yeti
Joined: 19 Feb 06
Posts: 32
Credit: 316,371
RAC: 853
Message 1414 - Posted: 28 Apr 2006, 1:00:46 UTC
Last modified: 28 Apr 2006, 1:05:51 UTC

Back to possible bugs:

Rosetta 5.06

using 161 MB of memory, 542 MB of virtual memory

The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

I guess, it will never finish :-(

Oh, my setting for RALPH Target CPU time is 4 hours ...

This is the box: https://ralph.bakerlab.org/show_host_detail.php?hostid=1911

This is the result: https://ralph.bakerlab.org/result.php?resultid=98748

Abort or stay a little bit longer ?


Supporting BOINC, a great concept !
ID: 1414
Robert Everly
Joined: 16 Feb 06
Posts: 10
Credit: 2,333
RAC: 0
Message 1415 - Posted: 28 Apr 2006, 1:28:05 UTC

Just my 2 cents worth here. I've said this on other project boards as well.

There should be two cache settings in each project.

1) Max Wu/cpu/day (current cache)

2) Max outstanding WU/CPU.

I'd love to see #2 added. Personally I find it silly that some people and systems download hundreds of WUs to only return a portion of them. Just look at host 3755 on seti beta. Yes, the daily quota is down to 1 per day, but there were over 1000 outstanding WUs on the machine.

My thought for #2 would be this. Project defines how many outstanding WUs/cpu is acceptable. You can download up to this amount over any number of days with #1. Once you hit the limit in #2, the server refuses to send you more work until you return work.
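
The two limits described above could be combined along these lines. This is only a sketch of the proposed policy in Python, not how the BOINC scheduler actually works; `HostState`, `work_units_allowed`, and both limit parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    cpus: int          # number of CPUs the host reports
    sent_today: int    # work units sent to this host so far today
    outstanding: int   # work units sent but not yet returned


def work_units_allowed(host, max_per_cpu_per_day, max_outstanding_per_cpu):
    """How many more work units the server may send this host right now."""
    daily_room = max_per_cpu_per_day * host.cpus - host.sent_today
    outstanding_room = max_outstanding_per_cpu * host.cpus - host.outstanding
    # Both limits must hold, so the tighter one decides; never go negative.
    return max(0, min(daily_room, outstanding_room))


# A host sitting on 1000 unreturned work units gets nothing more,
# even though it has not used any of today's quota.
hoarder = HostState(cpus=1, sent_today=0, outstanding=1000)
print(work_units_allowed(hoarder, max_per_cpu_per_day=10, max_outstanding_per_cpu=2))
```

With these made-up limits the hoarding host is offered 0 work units, while a host that has returned everything is bounded only by the daily quota.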

Why keep sending work to hosts that are not returning work?
ID: 1415
casio7131
Joined: 20 Mar 06
Posts: 15
Credit: 12,660
RAC: 0
Message 1416 - Posted: 28 Apr 2006, 2:47:34 UTC

28/04/2006 10:47:13 AM|ralph@home|Resuming task FA_CASP6_t198__435_26_1 using rosetta_beta version 505
https://ralph.bakerlab.org/result.php?resultid=97816

last night i quit boinc when this workunit was at about 1.0427% after 2h45m15s (model=1, step=340905, full atom relax). i've restarted boinc today, and it's now at 1.0424% after 2h10m10s (model=1, step=340558, full atom relax) and still running. so it has started redoing the same work it had already done last night.

it seems that the new checkpointing didn't work (since it was redone today). or, did it just not reach a "checkpointable stage" last night (since this seems like a rather large structure)?
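
To illustrate the second possibility, here is a generic checkpoint and resume loop in Python. It is a sketch under stated assumptions and not Rosetta's actual checkpointing: `save_checkpoint`, `load_checkpoint`, `run`, and the step-based checkpoint interval are all hypothetical. The point is that work done after the last checkpoint is lost when the client quits, so a job that has not yet reached a checkpoint restarts from the beginning.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # rename over the old checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"step": 0}


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Advance from the last checkpoint; stop_after simulates quitting the client."""
    state = load_checkpoint(path)
    done_this_session = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_session >= stop_after:
            break
        state["step"] += 1
        done_this_session += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["step"]
```

For example, with `checkpoint_every=100`, quitting after 250 steps leaves a checkpoint at step 200, so the next session repeats steps 201 to 250. Quitting after only 50 steps leaves no checkpoint at all.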
ID: 1416
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1417 - Posted: 28 Apr 2006, 7:50:57 UTC

I think there is little value in crunching a Ralph WU for more than 8 hours. I would suggest deactivating this feature in Ralph and sending out WUs with fixed runtimes, in whatever mix is most appropriate for the tested app/WU. Nevertheless, if one can only crunch one WU in 4 days due to the resource share of Ralph and the runtime preference, that is okay. I think the goal of Ralph is not throughput but diversity. It is better to have 10 hosts trying 1 WU than 1 host trying 10. But perhaps Rhiju can give his opinion on that and post some advice in the news section (at least not to download 20 WUs at once).

"Max outstanding WU/CPU"

This would be a cool feature but that is something BOINC has to implement. It would certainly enable much better distribution of WU without restricting hosts on the maximum wu per day.
ID: 1417
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 1418 - Posted: 28 Apr 2006, 7:54:31 UTC - in response to Message 1414.  
Last modified: 28 Apr 2006, 8:01:27 UTC

Back to possible bugs:

Rosetta 5.06

using 161 MB of memory, 542 MB of virtual memory

The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

I guess, it will never finish :-(

Oh, my setting for RALPH Target CPU time is 4 hours ...

This is the box: https://ralph.bakerlab.org/show_host_detail.php?hostid=1911

This is the result: https://ralph.bakerlab.org/result.php?resultid=98748

Abort or stay a little bit longer ?


This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.

ID: 1418
Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 1419 - Posted: 28 Apr 2006, 11:30:32 UTC
Last modified: 28 Apr 2006, 11:35:08 UTC

Rosetta beta 5.06 Linux Success
https://ralph.bakerlab.org/result.php?resultid=98212

I had success completing above job on a Linux PC with 256 MB ram.

*All other jobs on the above PC got some sort of error!

What I did ...

1) suspended all other projects running on this pc, left only ralph
2) opened some disk space by deleting some old stuff
3) shutdown one of my 10 mbps Internet links, and the load balancing stuff
4) crunched after midnight, while the majority of my users are asleep

So 5.06 must be OK for Linux too.
However, it is fragile ... any disturbance, such as heavy network traffic
or running multiple projects (even when kept in RAM), causes the job (oops, WU) to fail.

suggestion:
*Signal 11 needs to be trapped to exit with 0 instead of with 183
So the job will exit with 0, but with no finished file,
and BOINC then restarts it, until it finishes ...
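
The restart idea can be sketched from the outside. Rather than trapping the signal inside the application as suggested, this Python example uses a supervising wrapper that looks at how the worker ended and treats death by signal 11 as "restart" instead of "failed". It is an illustration only: `classify_exit`, `run_worker`, and `RESTARTABLE_SIGNALS` are hypothetical names, and nothing here reflects how BOINC or Rosetta actually handle exit codes. It assumes a POSIX system.

```python
import signal
import subprocess

# Signals treated as transient in this sketch (signal 11 is SIGSEGV on Linux).
RESTARTABLE_SIGNALS = {signal.SIGSEGV}


def classify_exit(returncode):
    """Map a child's return code to 'finished', 'restart', or 'failed'."""
    if returncode == 0:
        return "finished"
    # On POSIX, subprocess reports death by signal N as return code -N.
    if returncode < 0 and -returncode in RESTARTABLE_SIGNALS:
        return "restart"
    return "failed"


def run_worker(argv):
    """Run the worker command once and classify how it ended."""
    completed = subprocess.run(argv)
    return classify_exit(completed.returncode)
```

One design caveat: if the crash is deterministic rather than a timing problem, restarting would loop forever, so a real wrapper would probably want a cap on the number of restarts.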
ID: 1419
Yeti
Joined: 19 Feb 06
Posts: 32
Credit: 316,371
RAC: 853
Message 1420 - Posted: 28 Apr 2006, 12:01:04 UTC - in response to Message 1418.  

Back to possible bugs:

Rosetta 5.06

using 161 MB of memory, 542 MB of virtual memory

The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

I guess, it will never finish :-(

Oh, my setting for RALPH Target CPU time is 4 hours ...

This is the box: https://ralph.bakerlab.org/show_host_detail.php?hostid=1911

This is the result: https://ralph.bakerlab.org/result.php?resultid=98748

Abort or stay a little bit longer ?


This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.


okay, it seems, as if it finished without error :-)


Supporting BOINC, a great concept !
ID: 1420
suguruhirahara
Joined: 5 Mar 06
Posts: 40
Credit: 11,320
RAC: 0
Message 1421 - Posted: 28 Apr 2006, 13:22:18 UTC
Last modified: 28 Apr 2006, 13:22:51 UTC

Workunits are done well also on WindowsXP x64 Edition, Pentium D 2.8Ghz and 1GB RAM, using 129MB and 75MB of it.

With this version, my computer no longer experiences the error where workunits crash when graphics are shown on screen. Very nice.

But the completion time is not estimated well. For example, before a workunit started, the time to completion was "01:51:20", but it was "01:57:00" even after 35% of the work was done. I had not noticed such a large difference with the former version.
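
As a rough check on how far the estimate moved, the two figures quoted above can be turned into an implied total runtime. This is a small Python sketch that assumes progress is linear in time, which may not hold for these work units; `to_seconds` and `implied_total` are hypothetical helper names.

```python
def to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def implied_total(remaining_seconds, fraction_done):
    """Total runtime implied by a remaining-time estimate, assuming linear progress."""
    return remaining_seconds / (1.0 - fraction_done)


initial_estimate = to_seconds("01:51:20")   # shown before the work unit started
remaining_at_35 = to_seconds("01:57:00")    # shown with 35% of the work done
total = implied_total(remaining_at_35, 0.35)

print(initial_estimate, remaining_at_35, round(total))
```

Under that assumption, 1h57m remaining at 35% done implies a total of about 3 hours, roughly 60% longer than the 1h51m20s shown before the start.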
ID: 1421
Mike Gelvin
Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 1422 - Posted: 28 Apr 2006, 14:49:47 UTC
Last modified: 28 Apr 2006, 14:51:31 UTC

4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 (<file_xfer_error> <file_name>WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)


result: https://ralph.bakerlab.org/result.php?resultid=97709

Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM


There is an additional message in the result about a non-existent file:
GZIP SILENT FILE: .xx1enh.out
WARNING! attempt to gzip file .xx1enh.out failed: file does not exist.

ID: 1422
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1424 - Posted: 28 Apr 2006, 17:19:02 UTC - in response to Message 1412.  

Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc.

In this way we can more directly assist you in achieving your testing objectives.


If they are adding checkpointing and want more frequent switching between jobs, then that makes sense... once we're over these hurdles, I think the best test would be for everyone to have their Ralph preferences match their R@H preferences... the randomness of how we all have these set is the best beta test, the most similar to the user base of Rosetta.

I guess what I'm saying is, if necessary, instruct us on preference changes you'd like to see... but then let's test the same version for another couple (several) days back on our normal settings.
ID: 1424
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 1425 - Posted: 28 Apr 2006, 17:23:25 UTC - in response to Message 1403.  

Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?

I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort 5.05 WUs.

This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work, I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU.

ID: 1425
[AF>France>Est>Lorraine]Le Zam
Joined: 2 Mar 06
Posts: 9
Credit: 3,278
RAC: 0
Message 1430 - Posted: 29 Apr 2006, 10:49:09 UTC

Hello, I have a problem with this WU on 5.06:
FA_CASP6_t216__444_2_2
50 minutes for 1.02%
So I aborted it.
Bye, and keep going...

ID: 1430



©2024 University of Washington
http://www.bakerlab.org