Bug reports for 5.56-5.59

Message boards : RALPH@home bug list : Bug reports for 5.56-5.59

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2929 - Posted: 29 Mar 2007, 5:43:19 UTC

There are several issues that we're trying to resolve with this update:

(1) The new FOLD_AND_DOCK workunits were crashing on Windows machines for a pretty subtle reason, hopefully fixed in this update.

(2) Some Macs have been having consistent problems running Rosetta after recent updates. This update attempts to fix a potential stack overflow on Macs, and hopefully the problem computers (we now have a couple attached to ralph) will be happier.

(3) Pct complete should remain sane (i.e. never infinity or some huge number).

If you have problems with preempting and resuming apps on Macs, also please post here!

We'll probably do another RALPH update Thursday or Friday, based on how things go with this one.


ID: 2929
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2932 - Posted: 29 Mar 2007, 9:36:34 UTC
Last modified: 29 Mar 2007, 9:36:55 UTC

The issue of hanging after preempting did start in the middle of a version.
I think there was some kind of security update on the OS about that time.
Just a thought.
Anders n
ID: 2932
ashriel
Joined: 3 Mar 07
Posts: 11
Credit: 648
RAC: 0
Message 2933 - Posted: 29 Mar 2007, 9:43:49 UTC
Last modified: 29 Mar 2007, 10:36:23 UTC

The percentage of the WU 1kka__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1kka_-_1877_20 increased normally until 84%. Then it jumped to 100% and finished (after 57 minutes).

At WU 1xjr__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1xjr_-_1877_30 I realised that the percentage shows 1% for the first 3 minutes, then increases continuously, but with a remaining time that implied a total of about 2 hours.
(Maybe this happened to the WU mentioned above, too; I didn't watch that one from the beginning.)
After 34 minutes it showed about 16% - then jumped to 100% and finished.
ID: 2933
ramostol
Joined: 29 Mar 07
Posts: 24
Credit: 31,121
RAC: 0
Message 2935 - Posted: 29 Mar 2007, 12:12:57 UTC

Still problems on Macs, see report

-- R. A. Mostol
ID: 2935
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2939 - Posted: 29 Mar 2007, 15:29:17 UTC

If you could let us know what approach you have taken to doing the estimated time to completion, that would be helpful in describing our observations and making suggestions on further improvement. Which number(s) do you actually have control over? And which are computed by the BOINC Manager? You just provide the % complete to BOINC?
ID: 2939
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 2941 - Posted: 29 Mar 2007, 16:10:28 UTC

Got a strange "Maximum Disk usage exceeded" error. Runtime was 4 hours and the disk has 70 GB free; BOINC is allowed up to 50 GB and 99% disk usage. I guess the output file was too big or some other WU-specific settings were not correct.

https://ralph.bakerlab.org/result.php?resultid=475000
ID: 2941
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2942 - Posted: 29 Mar 2007, 16:43:05 UTC

There is a maximum that the project sets somewhere that prevents things from consuming all your disk if a loop should occur or something. So this may be a problem with the WU.
ID: 2942
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2943 - Posted: 29 Mar 2007, 16:49:06 UTC

With my 24hr runtime pref. it seems the time to completion still ticks up (until the end of a model). The "progress" does increase, though, so that should give users a warm fuzzy that they are making "progress".

In my case, I don't see the time to completion actually tick down at all. And based on the jump in % completed between models, it appears that if you could tick up the % completed at about double the current rate, you'd be right on. And, I'm not sure how BOINC works exactly, but perhaps that would result in my completion time actually ticking down instead. At present, completion time seems to tick up roughly 1 second for every 2 seconds of CPU used. I presume this ratio will differ as I get deeper into my 24hr runtime though.

If doubling the increases in % completed isn't how things work... then if you could recalibrate the number you DO control, based on model 1, then at least remaining models would be much closer.
ID: 2943
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 2944 - Posted: 29 Mar 2007, 17:40:31 UTC - in response to Message 2941.  

I guess stdout.txt reached the 100MB limit. I have a similar WU now and stdout is at 50MB after two hours.
ID: 2944
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2945 - Posted: 29 Mar 2007, 18:54:58 UTC - in response to Message 2943.  

Checking now on the Max Disk usage.

Feet1st -- in the past, did the "Time to completion" also increase during the run? Or did this start happening with the % complete "fix"? I'm a little puzzled by this behavior.


ID: 2945
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2946 - Posted: 29 Mar 2007, 19:03:46 UTC - in response to Message 2945.  
Last modified: 29 Mar 2007, 19:05:39 UTC

Feet1st -- in the past, did the "Time to completion" also increase during the run? Or did this start happening with the % complete "fix"? I'm a little puzzled by this behavior.


...of course, has always increased. I had just assumed that with the increasing % completed that the result of a decreasing time to completion could be achieved. See question here.
ID: 2946
UBT - Janea
Joined: 17 Dec 06
Posts: 1
Credit: 1,673
RAC: 0
Message 2947 - Posted: 29 Mar 2007, 19:20:01 UTC

I've got rather a strange bug. The WU I'm running appears to be continuing to run, even though it is showing as "Waiting to run". It is also showing that it's 208% complete and counting. I tried doing an update to the server but that hasn't stopped it. I would cut and paste the WU details, but it won't let me. I'm running it on BOINC 5.8.15 on Windows XP.




ID: 2947
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2949 - Posted: 29 Mar 2007, 21:04:54 UTC

Unable to rotate v5.56 graphic, unless full screen... or if you move window to upper left, just as reported by Teppo here.
ID: 2949
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2950 - Posted: 30 Mar 2007, 1:30:43 UTC - in response to Message 2949.  

I had this issue too on Windows. Still don't know what causes it -- if there are any GLUT experts out there who use Visual Studio, please let me know if you have any insight. I need to know where the mouse starts clicking, and it's like those coordinates (which actually come from a BOINC API) are messed up. This might actually be an issue with the BOINC/GLUT interface.

Graphics do work perfectly on Macs...


ID: 2950
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2951 - Posted: 30 Mar 2007, 1:41:43 UTC
Last modified: 30 Mar 2007, 1:41:59 UTC

New stuff in 5.57
In principle this should be a new thread, but I started getting confused by all the simultaneous discussions!

1. I've rebuilt the apps with the latest BOINC API. Cross your fingers -- let's see if this takes care of the "Process not found" problems upon preempting, and the fraction of Macs that can't seem to run anything properly.

Small note for aficionados: in previous apps, I was putting a call into the BOINC API code to increase the default stack size for Rosetta on Macs because they kept giving overflows on workunits with large RNAs. I removed this stack-size "fix" to check whether it might be causing any of the issues we've been seeing; however, as a result, some of the RNA workunits will error out on Macs (both PowerPC and Intel). This is temporary; I'll put the fix back in after this test with 5.57.

2. Cured "maximum disk space exceeded" problem for RNA workunits.

3. Fixed a pretty subtle bug that was crashing symmetric FOLD_AND_DOCK workunits once in a while; the fix may also help further reduce the error rates on "normal" workunits too.

4. Percentage complete is updated differently. Thanks for all your input on this:

Historically, when a decoy has been completed, the % complete jumps up to

fraction complete = current_cpu_time/user_preferred_cpu_time.

Now the same simple formula is used every five seconds. If this works properly (I have high hopes!), the estimated time to completion should make sense, dropping every five seconds by five seconds. There may be some issues with BOINC trying to make this estimation in a "smart" way.

One more thing: once the estimated time to completion drops to 10 minutes, it won't go below 10 minutes. From that point on the formula becomes:

fraction complete = current_cpu_time/(current_cpu_time + 10 minutes)

The idea is that there are certain runs that go a little overtime due to variance in how long it takes to make a decoy. In those cases, we don't want to artificially run up against 100%. What you'll see instead is that % complete will asymptotically approach 100% (but not get there until a decoy is completed), and estimated time to completion should stay around 10 minutes. Not perfect behavior, but it's better than before!
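For readers who want to play with the numbers, here is a minimal Python sketch of the two formulas described above. It is not the Rosetta source: the function names, the `floor` parameter, and the rule used to decide when to switch formulas (preferred time minus CPU time reaching 10 minutes) are assumptions made for illustration. The `naive_remaining` helper also assumes the simplest possible estimator (extrapolating total time as CPU time divided by fraction complete); as noted above, BOINC may estimate this in a "smarter" way.

```python
FLOOR_SECONDS = 600.0  # the 10-minute floor mentioned in the post


def fraction_complete(cpu_time, preferred_time, floor=FLOOR_SECONDS):
    """Fraction complete following the two formulas quoted in the post.

    Assumption: the switch happens once (preferred_time - cpu_time)
    is no longer greater than the floor.
    """
    if cpu_time <= 0:
        return 0.0
    remaining = preferred_time - cpu_time
    if remaining > floor:
        # Normal case: current_cpu_time / user_preferred_cpu_time
        return cpu_time / preferred_time
    # Near or past the preferred time: current_cpu_time / (current_cpu_time + 10 min)
    return cpu_time / (cpu_time + floor)


def naive_remaining(cpu_time, fraction):
    """Hypothetical estimator: total time is extrapolated as cpu_time / fraction."""
    if fraction <= 0:
        return float("inf")
    return cpu_time / fraction - cpu_time


if __name__ == "__main__":
    preferred = 4 * 3600.0  # a 4-hour run time preference, in seconds
    for t in (1800.0, 7200.0, 13800.0, 14400.0, 18000.0):
        f = fraction_complete(t, preferred)
        print(f"{t:7.0f}s  {100 * f:6.2f}%  ~{naive_remaining(t, f):6.0f}s left")
```

Under these assumptions the fraction never reaches 1.0 on its own, and the naive estimate of remaining time settles at 600 seconds once the second formula takes over, which matches the behavior described in the post.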




ID: 2951
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2952 - Posted: 30 Mar 2007, 1:46:56 UTC - in response to Message 2932.  

Anders n, I've been able to reproduce this preemption problem on my machine. It's completely puzzling to me. I'm very intrigued by your idea that the problem is correlated with the OS X security update -- the new OS might get rid of processes that haven't been active for a while. We'll look into it!


ID: 2952
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2953 - Posted: 30 Mar 2007, 4:00:22 UTC
Last modified: 30 Mar 2007, 4:21:26 UTC

You mean:
fraction complete = current_cpu_time/(user_preferred_cpu_time + 10 minutes)

don't you?

No... you said *IF* the estimated time to completion becomes 10 minutes... so
if:
F = fraction complete = current_cpu_time/user_preferred_cpu_time
if (1 - F) * user_preferred_cpu_time < 600 seconds
then F = current_cpu_time/(current_cpu_time + 600 seconds)
so once we hit that magic 600 seconds we might see a gap in % complete there.
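A quick check of the switchover point, taking the two formulas exactly as written above and assuming the switch happens when the remaining time reaches 600 seconds. The symbols are shorthand introduced here: $t$ is current_cpu_time and $T$ is user_preferred_cpu_time, both in seconds.

```latex
% The switch happens when T - t = 600, i.e. at t = T - 600.
\[
F_1 = \frac{t}{T} = \frac{T-600}{T},
\qquad
F_2 = \frac{t}{t+600} = \frac{T-600}{(T-600)+600} = \frac{T-600}{T}.
\]
% Slopes on either side of the switch:
\[
\frac{dF_1}{dt} = \frac{1}{T},
\qquad
\left.\frac{dF_2}{dt}\right|_{t=T-600} = \frac{600}{(t+600)^2} = \frac{600}{T^2}.
\]
```

So, under these assumptions, the two expressions give the same value at the switch and the percentage itself should not jump there. What changes is the rate of increase, which drops by a factor of 600/T.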

If that's the formula, then why does the %complete gap up at the end of a model? It should just be another 5 seconds of runtime.

And why does % complete begin at 1%? This doesn't follow the formula.
[edit] I now see the 5.57 version does NOT start at 1% anymore.

And beyond that % complete figure, BOINC does some twiddling with my historical time per task? Perhaps this is why my estimated time to completion didn't count down? I intentionally left my RT pref at my normal 24hr figure that BOINC is accustomed to.

So, if I have a rather slow machine, let's say it will take 6hrs to complete a single model (in fact, I had one the other day taking ~5hrs per model on a 3 GHz machine), and a runtime pref. of just 1hr (I don't for the life of me know WHY people do that), such a task would show 90+% completed for like 5hrs? As you say, at least it's moving, and counting UP... but still has room to improve.

[edit] I see with this new formula that my estimated runtime now only increases for 5 seconds at a time. So, it appears you've successfully addressed my earlier question about the estimate increasing clear through to the end of the model... and I take it that once I complete a model here I will NOT see the gap up that I saw in the prior release.

Very nice. Now I will soon be ready to begin phase II of progress % testing (...sinister laugh here). Thanks for all the effort on improving the user experience.
ID: 2953
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2954 - Posted: 30 Mar 2007, 8:34:31 UTC - in response to Message 2953.  

Feet1st, thanks much for your help so far with this and other bugs (or shall we call them "features"?). I've been trying to put out about a dozen fires over the last three days, with the hopes of sending out some new science on Rosetta@home soon, and it would be impossible without user feedback. I think for the first time we have an app that has >98% success rate on all platforms. Sweet!

Regarding your extreme example of the 6 hour per decoy workunit, actually the watchdog would kill it at 4 hours (if that's the CPU run time preference). So for the first hour everything would look fine, and then the users would probably get annoyed for the next three hours... of course, our solution to this problem is to be careful -- we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!


ID: 2954
ramostol
Joined: 29 Mar 07
Posts: 24
Credit: 31,121
RAC: 0
Message 2955 - Posted: 30 Mar 2007, 9:05:49 UTC

Congratulations (Rosetta 5.57), for the first time in a week I have had Rosetta keeping a wu for 17 minutes, still working.

-- R. A. Mostol
ID: 2955
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 2956 - Posted: 30 Mar 2007, 9:17:30 UTC
Last modified: 30 Mar 2007, 9:18:04 UTC

Ditto, I have two 5.57 WUs whose reported progress looks right; they are halfway through their 4 hour preferred time without aborting and altogether look much better than .56 and .55. Thanks!
ID: 2956



©2024 University of Washington
http://www.bakerlab.org