Bug reports for 5.56-5.59

Profile ashriel

Joined: 3 Mar 07
Posts: 11
Credit: 648
RAC: 0
Message 2957 - Posted: 30 Mar 2007, 10:01:37 UTC
Last modified: 30 Mar 2007, 10:01:58 UTC

Running 5.57, default: 1 hour, WU 1zih__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1zih_-_1882_35:

Time: 30 Minutes - Percentage: 50 - Time left: 35 Minutes
Time: 45 Minutes - Percentage: 75 - Time left: 16 Minutes
Time: 59 Minutes - Percentage:100 - Time left: -

Nice :D
Profile Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 2958 - Posted: 30 Mar 2007, 10:36:46 UTC

Work unit https://ralph.bakerlab.org/result.php?resultid=474413
gives ERROR EXIT CODE 131. SIGSEGV ERROR.

Was also posting the following Maximum Disk space usage ERROR from 5.55 and 5.56, but it may now be fixed in 5.57?

https://ralph.bakerlab.org/result.php?resultid=474639
https://ralph.bakerlab.org/result.php?resultid=474622
https://ralph.bakerlab.org/result.php?resultid=475356
https://ralph.bakerlab.org/result.php?resultid=475357


Profile anders n

Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2961 - Posted: 30 Mar 2007, 14:21:06 UTC

% issue
I have a Wu that was at 40% and had started model no 4.

I restarted Boinc and the Wu restarted at model no 4 but with 0% and
started counting up.

Anders n
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2962 - Posted: 30 Mar 2007, 14:31:01 UTC - in response to Message 2954.  

...we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!


...well, THAT would certainly be one approach to solving the problem :)

...but... um... "an hour" on how fast of a machine? On the minimum required for the project, 500MHz machine?

I just note that while the science is obviously improving and the well-seasoned types of runs are generally 10-15 min. per model... there always seem to be NEW types of runs as well (Docking, RNA, now FOLD_AND_DOCK), which always seem to take significantly longer than normal. Assuming that trend continues, you will always have some new type of work that has a very long crunch time per model.

One possible way to address that would be if you could pick a mid-model point at which you define yourself to be x% done. Say pick three points, one near 25%, one near 50% and one near 75%. Then you could "know" in advance that a given single model will exceed the RT pref. and show a more linear progression on % completed. ...but, as you say, you've got lots of other fish to fry. I think the current progress indication is a VAST improvement and will avoid a lot of confusion with new users.
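To make the milestone idea concrete, here is a minimal sketch. It is not Rosetta's actual progress code; the function names and the extrapolation rule are assumptions for illustration. The time taken to reach a known fraction of a model is used to project the whole model's length, and the percentage is then measured against the longer of the runtime preference and that projection.

```python
def projected_model_seconds(seconds_at_milestone, milestone):
    """Extrapolate a model's total CPU time from the time taken to reach a milestone.

    `milestone` is the fraction of the model considered done (e.g. 0.25).
    Hypothetical helper, not part of Rosetta or BOINC.
    """
    if not 0.0 < milestone <= 1.0:
        raise ValueError("milestone must be in (0, 1]")
    return seconds_at_milestone / milestone


def percent_done(completed_seconds, seconds_in_model, projected_seconds,
                 runtime_pref_seconds):
    """Percent complete that stays roughly linear even if one model overruns the preference.

    completed_seconds: CPU time spent on models that are already finished.
    seconds_in_model:  CPU time spent so far on the current model.
    projected_seconds: projected total CPU time for the current model.
    """
    # The task cannot end before the current model does, so the expected
    # total is at least the time needed to finish that model.
    expected_total = max(runtime_pref_seconds,
                         completed_seconds + projected_seconds)
    used = completed_seconds + seconds_in_model
    return min(100.0, 100.0 * used / expected_total)
```

For example, with a 1 hour preference, a first model that reaches its 25% point after 1500 seconds is projected at 6000 seconds, so the task would show 25% at that moment instead of about 42%.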

...now... about that checkpointing?? (we always gotta ask for more, it's our job!). Of course, if you change the models to average <1hr, this minimizes the need for more checkpoints as well.
Profile ashriel

Joined: 3 Mar 07
Posts: 11
Credit: 648
RAC: 0
Message 2963 - Posted: 30 Mar 2007, 15:09:49 UTC - in response to Message 2957.  
Last modified: 30 Mar 2007, 15:12:22 UTC

Running 5.57, default: 1 hour, WU 1fna__BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1881_7:

Time: 30 Minutes - Percentage: 50
Time: 42 Minutes - Percentage:100

Why do some WUs run shorter than they should (1 hour) - percentage then doesn't work, of course.

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2964 - Posted: 30 Mar 2007, 15:16:19 UTC - in response to Message 2961.  
Last modified: 30 Mar 2007, 15:54:18 UTC

% issue
I have a Wu that was at 40% and had started model no 4.

I restarted Boinc and the Wu restarted at model no 4 but with 0% and
started counting up.

Anders n


So, I decided to try that as well, end BOINC, restart, had two WUs running.

Upon restart this one went to 100% immediately.
Says: Completed 30 RNA decoys above the report that 62 decoys were generated.

This one is a "FOLD_AND_DOCK" and it went to zero % after a second or two, and then according to the msgs, 20 seconds later, it went to 100% as well. No indication of why it didn't continue to crunch, it completed 63 decoys and 159nstructs. (graphic showed the 63 as the "model").
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2965 - Posted: 30 Mar 2007, 15:18:29 UTC - in response to Message 2963.  
Last modified: 30 Mar 2007, 15:20:50 UTC

Why do some WUs run shorter than they should (1 hour) - percentage then doesn't work, of course.

You have to look at how many models you completed. Ralph estimated that it would take longer than an hour if it began another model, so it had to end a little early rather than keep you later than your preference. So, to the nearest model, your preference is met. When models take significant time, the difference from your expectation can become even more noticeable.

In your case, you crunched two models in 2546 seconds. And so the client should estimate that a third model would take about 1275 seconds more. Which would exceed your hour preference.
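The arithmetic above can be sketched as a small check. This is only an illustration of the rule as described in this post; the real decision logic in the Rosetta app is not shown in this thread, and the function name is made up.

```python
def should_start_another_model(cpu_seconds_used, models_completed,
                               runtime_pref_seconds):
    """Decide whether another model fits within the runtime preference.

    Assumes the next model will take as long as the average of the models
    completed so far (an assumption, matching the explanation above).
    """
    if models_completed == 0:
        # Nothing to average yet, so the first model always runs.
        return True
    average = cpu_seconds_used / models_completed
    return cpu_seconds_used + average <= runtime_pref_seconds


# Two models in 2546 s average 1273 s each; a third would end near 3819 s,
# which is past the 3600 s (1 hour) preference, so the task stops early.
print(should_start_another_model(2546, 2, 3600))
```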
Profile ashriel

Joined: 3 Mar 07
Posts: 11
Credit: 648
RAC: 0
Message 2966 - Posted: 30 Mar 2007, 15:25:05 UTC - in response to Message 2965.  

thx for that explanation :)
Profile anders n

Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2967 - Posted: 30 Mar 2007, 15:48:44 UTC

MAC
I tried to get Ralph to "hang" again by pausing and resuming, then
by manually switching between Einstein and Ralph... no success. :)
I'll let it run by itself, hopefully switching on its own, to see if
it still works as it should.
Anders n
Rhiju
Volunteer moderator
Project developer
Project scientist

Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2968 - Posted: 30 Mar 2007, 18:10:41 UTC - in response to Message 2962.  
Last modified: 30 Mar 2007, 18:12:51 UTC

Hi everybody:

For the first time in a long time, we have outstanding rates of success:

Version  OS       Total Results  Pass Rate (%)  Fail Rate (%)
557      Darwin             282          98.58           1.42
557      Linux               72          98.61           1.39
557      Unknown             17         100.00           0.00
557      Windows           1517          96.97           2.70

(Sorry for the formatting.)

Many of the failures are due to download errors, so the true error rate is quite low.

There's still an issue on Mac of preempted workunits not returning to memory -- thanks Anders n for pointing this out. Another developer (David K) and I can both reproduce this on our Mac laptops. We wonder if it's an OS X issue; it may also be a BOINC client issue. We'll keep you posted. The good news is that the Macs that were having consistent problems with all workunits are running again!

Feet1st, I agree that checkpointing is the best solution for potentially long workunits. I've figured out a way to easily put in checkpoints for most of our code, but it's going to take a couple of weeks of development to write the appropriate helper code and test. Stay tuned!

Finally, you can expect a couple more ralph updates today and this weekend. I would like to try out a stack overflow fix for Macs, which allows them to carry out work on larger RNAs (interestingly Windows works fine, as do Macs when compiled without graphics). So I'll give that a shot today, and if it doesn't work, take out the code tomorrow. Anyway looks like we're on our way to a Rosetta@home update with some cool new science and several useful bug fixes around Sunday or Monday.

Thanks to everybody!


Rhiju
Volunteer moderator
Project developer
Project scientist

Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2969 - Posted: 30 Mar 2007, 21:08:29 UTC

Update to 5.58
This is basically the same app as 5.57, except for a small science fix to symmetric and docking (those workunits have been running beautifully otherwise) and a change in the Macs.

I'm now trying to set stack sizes based on the maximum allowed by your system (typically 32-64 MB). This is necessary for the large RNA jobs, as well as for future work that involves, e.g., designs of transcription factors that bind DNA and could be used for gene therapy. I'm also reporting the stack sizes in stderr.txt, which is returned from your clients to our server, so I can get some info. This may crash some Macs, in which case I'll revert the change and test again.
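For readers curious what "set the stack size to the system maximum and report it" can look like, here is a minimal Unix-only sketch using Python's standard `resource` module. It is not the code in the Ralph app; the 64 MB cap and the function names are assumptions chosen for illustration.

```python
import resource
import sys

MB = 1024 * 1024


def pick_stack_limit(soft, hard, cap_bytes):
    """Choose the soft stack limit to request: the hard limit, capped at cap_bytes.

    Never lowers an existing soft limit. Hypothetical helper for illustration.
    """
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard == resource.RLIM_INFINITY:
        target = cap_bytes
    else:
        target = min(hard, cap_bytes)
    return max(soft, target)


def raise_stack_limit(cap_bytes=64 * MB):
    """Raise the soft stack limit toward the system maximum and log the result."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    target = pick_stack_limit(soft, hard, cap_bytes)
    if target != soft:
        try:
            resource.setrlimit(resource.RLIMIT_STACK, (target, hard))
        except (ValueError, OSError) as err:
            # Some systems refuse the change; keep running with the old limit.
            print(f"stack limit unchanged: {err}", file=sys.stderr)
            target = soft
    # Reported on stderr, in the spirit of the stderr.txt note above.
    print(f"stack limit: soft={target} hard={hard}", file=sys.stderr)
    return target
```

Calling `raise_stack_limit()` once at startup would be the intended use; whether the new limit applies to the already-running main thread depends on the operating system.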

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2970 - Posted: 30 Mar 2007, 22:15:43 UTC - in response to Message 2954.  
Last modified: 30 Mar 2007, 22:16:21 UTC

...our solution to this problem is to be careful -- we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!

It's funny you should say that RIGHT when I've got 4 Ralph WUs and all 4 are taking more than an hour for their first model. (but TRUE, you said "on Rosetta" not "on Ralph").

I ended and started BOINC again (I've got 5.57), this time they didn't end and report in the way they did before... but the completion % of the two tasks is the same, even though one had completed two models, and the other lost all of its work on model 1. So, right now one has 2:20:xx of CPU, and is showing the same % completed as the one that just restarted 5 min ago. So it isn't taking total CPU time for the WU into account, just CPU since the last BOINC startup.

See also anders n's post where they observed similar behavior.
Rhiju
Volunteer moderator
Project developer
Project scientist

Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2971 - Posted: 31 Mar 2007, 3:20:54 UTC - in response to Message 2970.  

Yes, I did send out some massively long workunits -- just testing out the system!

Hmm, I hadn't carefully thought about what would happen if two models were completed on the first pass. Let me see if I can figure out a fix...

Rhiju
Volunteer moderator
Project developer
Project scientist

Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2972 - Posted: 31 Mar 2007, 5:18:58 UTC - in response to Message 2970.  

OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent.

But if you shut BOINC down and restart, that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of CPU time spent so far, but that might be a can of worms. Worth the time? I think it's a better use of our time to figure out what's going wrong with Mac's preempt/resume so that most users will not need to shut down BOINC and restart like Anders n has been doing! And we'll spend time getting in those checkpoints...
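As a rough picture of the "self-sufficient" option, the sketch below keeps a running CPU total in a small state file and adds it to the current session's CPU time when computing percent complete. It is not the Rosetta implementation and does not use the BOINC API; the file format and every name here are assumptions.

```python
import json
import os


def load_prior_cpu(path):
    """Return CPU seconds recorded by earlier sessions, or 0.0 if none are readable."""
    try:
        with open(path) as f:
            return float(json.load(f)["cpu_seconds"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or damaged state file: fall back to a fresh count.
        return 0.0


def save_cpu(path, total_cpu_seconds):
    """Write the running CPU total, replacing the old file in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"cpu_seconds": total_cpu_seconds}, f)
    os.replace(tmp, path)


def percent_complete(prior_cpu, session_cpu, runtime_pref_seconds):
    """Percent complete based on all CPU time spent, not just this session's."""
    total = prior_cpu + session_cpu
    return min(100.0, 100.0 * total / runtime_pref_seconds)
```

With a 2 hour preference and 1 hour already recorded, a freshly restarted task would report 50% rather than dropping back to 0%. The "can of worms" is everything this sketch leaves out, such as when to save and how to keep the file in step with checkpoints.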


alexpoon

Joined: 9 Sep 06
Posts: 4
Credit: 87
RAC: 0
Message 2973 - Posted: 31 Mar 2007, 9:11:15 UTC

I found out that after suspending the WU (without leaving it in memory), if I start it again,
it recounts the % finished, but the work still continues (starting at model 5, for example).
Profile anders n

Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2974 - Posted: 31 Mar 2007, 10:52:30 UTC

Update on my MAC
Ralph 5.57 and Einstein have been switching all night without any errors.

On to 5.58 :)
Anders n
Profile anders n

Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2975 - Posted: 31 Mar 2007, 14:24:23 UTC - in response to Message 2972.  

OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent.

But if you shut BOINC down and restart that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of cpu time spent so far, but that might be a can of worms. Worth the time?


Just so we have all the facts right: when a Ralph WU on 5.58 is resumed after being preempted, the % done goes back to 0 and time to complete goes very high.
I just had one that was preempted at 2 H, and when it restarted the time to complete was
at nearly 6 H (rapidly going down as % was going up). I have a 4 H setting for
Ralph on that computer.

Anders n

Rhiju
Volunteer moderator
Project developer
Project scientist

Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2976 - Posted: 1 Apr 2007, 0:04:18 UTC - in response to Message 2975.  
Last modified: 1 Apr 2007, 0:06:29 UTC

Anders n, I think the behavior you observe is partly due to an additional "correction" that the BOINC API applies when estimating time to completion -- it should never really be over 4 hours, right? We really don't have any control over that extra "correction".

But we do have control over percent complete, and that shouldn't go to zero upon resuming ralph! So I'm still worried. On my mac intel machine, I just tried to suspend a ralph WU, and ran einstein@home for a few minutes; then suspended the einstein@home workunit, and resumed the ralph WU. Everything was fine (pct complete never dropped to zero)... when you try this, does pct complete drop to zero?

[edit]
Another question: you posted that 5.57 was fine; are you seeing an issue only with 5.58? If so, this is totally puzzling, since the small change I made to the Mac app shouldn't affect behavior of pct complete.


[PST]Howard

Joined: 16 Feb 06
Posts: 1
Credit: 200,001
RAC: 0
Message 2977 - Posted: 1 Apr 2007, 5:17:00 UTC
Last modified: 1 Apr 2007, 5:17:39 UTC

On an AMD 2000XP running WinXP, symm_fold WUs (Rosetta 5.57) stick at 97.672%; I aborted them after 7+ hrs of run time. Other types of WUs have run OK on this box. Run time is set in preferences to 2 hrs. Example:

https://ralph.bakerlab.org/result.php?resultid=479508
Profile anders n

Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2978 - Posted: 1 Apr 2007, 6:01:42 UTC - in response to Message 2976.  
Last modified: 1 Apr 2007, 6:07:42 UTC

Anders n, I think the behavior you observe is partly due to an additional "correction" that the BOINC API applies when estimating time to completion -- it should never really be over 4 hours, right? We really don't have any control over that extra "correction".

But we do have control over percent complete, and that shouldn't go to zero upon resuming ralph! So I'm still worried. On my mac intel machine, I just tried to suspend a ralph WU, and ran einstein@home for a few minutes; then suspended the einstein@home workunit, and resumed the ralph WU. Everything was fine (pct complete never dropped to zero)... when you try this, does pct complete drop to zero?

[edit]
Another question: you posted that 5.57 was fine; are you seeing an issue only with 5.58? If so, this is totally puzzling, since the small change I made to the Mac app shouldn't affect behavior of pct complete.






Oops, sorry, I should have said that it was on a Windows XP computer
that the % went to 0. It works OK on the MAC.
Also, it does not happen when I suspend and resume in the middle of a model; it only happens at a model switch by BOINC itself.
(I have "Leave applications in memory while suspended" set to yes)

[edit] The MAC has done one more night of switching with Einstein without any trouble, now with 5.58 [/edit]

Anders n



©2024 University of Washington
http://www.bakerlab.org