Bug reports for 5.56-5.59

ashriel Send message Joined: 3 Mar 07 Posts: 11 Credit: 648 RAC: 0	Message 2957 - Posted: 30 Mar 2007, 10:01:37 UTC Last modified: 30 Mar 2007, 10:01:58 UTC
Running 5.57, default: 1 hour, WU 1zih__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1zih_-_1882_35:
Time: 30 Minutes - Percentage: 50 - Time left: 35 Minutes
Time: 45 Minutes - Percentage: 75 - Time left: 16 Minutes
Time: 59 Minutes - Percentage: 100 - Time left: -
Nice :D
ID: 2957 · Reply Quote

Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0	Message 2958 - Posted: 30 Mar 2007, 10:36:46 UTC Work unit https://ralph.bakerlab.org/result.php?resultid=474413 gives ERROR EXIT CODE 131. SIGSEGV ERROR. Was also posting the following Maximum Disk space usage ERROR from 5.55 and 5.56, but it may now be fixed in 5.57? https://ralph.bakerlab.org/result.php?resultid=474639 https://ralph.bakerlab.org/result.php?resultid=474622 https://ralph.bakerlab.org/result.php?resultid=475356 https://ralph.bakerlab.org/result.php?resultid=475357. ID: 2958 · Reply Quote

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 2961 - Posted: 30 Mar 2007, 14:21:06 UTC % issue: I have a WU that was at 40% and had started model no. 4. I restarted BOINC and the WU restarted at model no. 4, but at 0%, and started counting up. Anders n ID: 2961 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2962 - Posted: 30 Mar 2007, 14:31:01 UTC - in response to Message 2954. "...we do not plan to send workunits to Rosetta@home that take more than an hour per decoy!" ...well, THAT would certainly be one approach to solving the problem :) ...but... um... "an hour" on how fast of a machine? On the minimum required for the project, a 500MHz machine? I just note that while the science is obviously improving and the well-seasoned types of runs are generally 10-15 min. per model... there always seem to be NEW types of runs as well (Docking, RNA, now FOLD_AND_DOCK), which always seem to take significantly longer than normal. Assuming that trend continues, you will always have some new type of work that has a very long crunch time per model. One possible way to address that would be to pick mid-model points at which you define yourself to be x% done. Say pick three points, one near 25%, one near 50% and one near 75%. Then you could "know" in advance that a given single model will exceed the run time (RT) preference and show a more linear progression of % completed. ...but, as you say, you've got lots of other fish to fry. I think the current progress indication is a VAST improvement and will avoid a lot of confusion with new users. ...now... about that checkpointing?? (we always gotta ask for more, it's our job!). Of course, if you change the models to average <1hr, this minimizes the need for more checkpoints as well. ID: 2962 · Reply Quote
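
The mid-model milestone idea in Message 2962 can be sketched roughly as follows. This is only an illustration in Python, not the Rosetta code: the function names and the rule for combining the milestone with the run time preference are assumptions, since the thread does not show how the progress bar is actually computed. Progress is time-based by default (30 minutes of a 1-hour preference shows 50%); once a milestone reveals that the current model will run past the preference, the projected finish time is used as the denominator instead.

```python
def projected_model_seconds(elapsed_in_model, milestone_fraction):
    """Project the full duration of the current model from a mid-model milestone.

    milestone_fraction is the share of the model assumed done at this point,
    e.g. 0.25, 0.5 or 0.75 (hypothetical checkpoints chosen by the developers).
    """
    if not 0.0 < milestone_fraction <= 1.0:
        raise ValueError("milestone_fraction must be in (0, 1]")
    return elapsed_in_model / milestone_fraction


def percent_done(cpu_seconds, target_seconds,
                 elapsed_in_model=0.0, milestone_fraction=None):
    """Return a progress percentage for the whole workunit.

    Without a milestone this is plain time-based progress against the
    run time preference. With a milestone, the denominator is stretched
    to the projected end of the current model if that lies past the target.
    """
    denominator = target_seconds
    if milestone_fraction is not None:
        model_start = cpu_seconds - elapsed_in_model
        projected_end = model_start + projected_model_seconds(
            elapsed_in_model, milestone_fraction)
        denominator = max(target_seconds, projected_end)
    return min(100.0, 100.0 * cpu_seconds / denominator)
```

With a 1-hour preference, a first model that reaches its 25% milestone after 30 minutes would be shown at 25% rather than 50%, because the model alone is projected to take two hours.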

ashriel Send message Joined: 3 Mar 07 Posts: 11 Credit: 648 RAC: 0	Message 2963 - Posted: 30 Mar 2007, 15:09:49 UTC - in response to Message 2957. Last modified: 30 Mar 2007, 15:12:22 UTC
Running 5.57, default: 1 hour, WU 1fna__BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1881_7:
Time: 30 Minutes - Percentage: 50
Time: 42 Minutes - Percentage: 100
Why do some WUs run shorter than they should (1 hour)? The percentage then doesn't work, of course.
ID: 2963 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2964 - Posted: 30 Mar 2007, 15:16:19 UTC - in response to Message 2961. Last modified: 30 Mar 2007, 15:54:18 UTC "% issue I have a Wu that was at 40% and had started model no 4. I restarted Boinc and the Wu restarted at model no 4 but with 0% and started counting up. Anders n" So, I decided to try that as well: end BOINC, restart, with two WUs running. Upon restart this one went to 100% immediately. It says: Completed 30 RNA decoys, above the report that 62 decoys were generated. This one is a "FOLD_AND_DOCK", and it went to zero % after a second or two, and then, according to the msgs, 20 seconds later it went to 100% as well. No indication of why it didn't continue to crunch; it completed 63 decoys and 159 nstructs. (The graphic showed the 63 as the "model".) ID: 2964 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2965 - Posted: 30 Mar 2007, 15:18:29 UTC - in response to Message 2963. Last modified: 30 Mar 2007, 15:20:50 UTC "Why do some WUs run shorter than they should (1 hour) - percentage then doesn't work, of course." You have to look at how many models you completed. Ralph estimated that it would take longer than an hour if it began another model, so it had to end a little early rather than keep you later than your preference. So, to the nearest model, your preference is met. When models take significant time, this can become an even more noticeable difference versus your expectation. In your case, you crunched two models in 2546 seconds. And so the client should estimate that a third model would take about 1275 seconds more, for a total of roughly 3820 seconds, which would exceed your one-hour (3600-second) preference. ID: 2965 · Reply Quote
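
The stopping rule feet1st describes in Message 2965 can be sketched in a few lines of Python. This is a hedged illustration only: the function name is made up, and it assumes the estimate for the next model is simply the average time of the models completed so far, which matches the arithmetic in the post but is not confirmed as the client's actual rule.

```python
def should_start_next_model(cpu_seconds, models_done, target_seconds):
    """Decide whether another model fits inside the run time preference.

    Assumption: the next model is expected to take the average time of
    the models finished so far. With nothing finished yet there is no
    estimate, so the first model is always started.
    """
    if models_done == 0:
        return True
    average_per_model = cpu_seconds / models_done
    return cpu_seconds + average_per_model <= target_seconds
```

For ashriel's workunit, `should_start_next_model(2546, 2, 3600)` is false: the average is 1273 seconds per model, and 2546 + 1273 = 3819 seconds is past the 3600-second preference, so the task ends after two models.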

ashriel Send message Joined: 3 Mar 07 Posts: 11 Credit: 648 RAC: 0	Message 2966 - Posted: 30 Mar 2007, 15:25:05 UTC - in response to Message 2965. thx for that explanation :) ID: 2966 · Reply Quote

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 2967 - Posted: 30 Mar 2007, 15:48:44 UTC MAC: I tried to get Ralph to "hang" again by pausing and resuming, then by manually getting it to switch between Einstein and Ralph... no success. :) I'll let it run by itself, hopefully starting to switch by itself, to see if it still works as it should. Anders n ID: 2967 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 2968 - Posted: 30 Mar 2007, 18:10:41 UTC - in response to Message 2962. Last modified: 30 Mar 2007, 18:12:51 UTC
Hi everybody: For the first time in a long time, we have outstanding rates of success:
Version  OS       Total Results  Pass Rate (%)  Fail Rate (%)
557      Darwin             282          98.58           1.42
557      Linux               72          98.61           1.39
557      Unknown             17         100.00           0.00
557      Windows           1517          96.97           2.70
(Sorry for the formatting.) Many of the failures are due to download errors, so the true error rate is quite low.
There's still an issue on Mac of preempted workunits not returning to memory -- thanks Anders n for pointing this out. Another developer (David K) and I can both reproduce this on our Mac laptops. We wonder if it's an OS X issue; it may also be a BOINC client issue. We'll keep you posted. The good news is that the Macs that were having consistent problems with all workunits are running again!
Feet1st, I agree that checkpointing is the best solution for potentially long workunits. I've figured out a way to easily put in checkpoints for most of our code, but it's going to take a couple of weeks of development to write the appropriate helper code and test it. Stay tuned!
Finally, you can expect a couple more Ralph updates today and this weekend. I would like to try out a stack overflow fix for Macs, which allows them to carry out work on larger RNAs (interestingly, Windows works fine, as do Macs when compiled without graphics). So I'll give that a shot today, and if it doesn't work, take out the code tomorrow. Anyway, it looks like we're on our way to a Rosetta@home update with some cool new science and several useful bug fixes around Sunday or Monday. Thanks to everybody!
ID: 2968 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 2969 - Posted: 30 Mar 2007, 21:08:29 UTC Update to 5.58: This is basically the same app as 5.57, except for a small science fix to symmetric and docking runs (those workunits have been running beautifully otherwise) and a change for Macs. I'm now trying to set stack sizes based on the maximum allowed by your system (typically 32-64 MB). This is necessary for the large RNA jobs, as well as for future work that involves, e.g., designs of transcription factors that bind DNA and could be used for gene therapy. I'm also reporting the stack sizes in stderr.txt, which is returned from your clients to our server, so I can get some info. This may crash some Macs, in which case I'll revert the change and test again. ID: 2969 · Reply Quote
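
For readers curious what "set stack sizes based on the maximum allowed by your system" might look like, here is a minimal sketch using Python's standard `resource` module on a POSIX system. It is an illustration under stated assumptions, not the 5.58 code (the thread does not show it): the function names are hypothetical, and the sketch simply raises the soft stack limit to the hard limit and reports both values on stderr, where a BOINC task's stderr.txt would pick them up.

```python
import resource
import sys


def describe_limit(limit):
    """Format an rlimit value in MB, or 'unlimited' for RLIM_INFINITY."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit / (1024 * 1024):.1f} MB"


def raise_stack_limit():
    """Raise the soft stack limit to the hard limit, if the OS allows it.

    Returns (old_soft, new_soft, hard). If the request is refused, the
    old soft limit stays in place and is returned as new_soft.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    try:
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    except (ValueError, OSError):
        pass  # some systems reject the request; keep the existing limit
    new_soft, _ = resource.getrlimit(resource.RLIMIT_STACK)
    # Report on stderr so the values end up in the task's error log.
    print(f"stack limit: was {describe_limit(old_soft)}, "
          f"now {describe_limit(new_soft)}, "
          f"hard max {describe_limit(hard)}", file=sys.stderr)
    return old_soft, new_soft, hard
```

Whether a running process actually benefits from a limit raised this way depends on the operating system, which may be one reason the change was rolled out with a warning that it could crash some Macs.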

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2970 - Posted: 30 Mar 2007, 22:15:43 UTC - in response to Message 2954. Last modified: 30 Mar 2007, 22:16:21 UTC "...our solution to this problem is to be careful -- we do not plan to send workunits to Rosetta@home that take more than an hour per decoy!" It's funny you should say that RIGHT when I've got 4 Ralph WUs and all 4 are taking more than an hour for their first model. (But TRUE, you said "on Rosetta", not "on Ralph".) I ended and started BOINC again (I've got 5.57); this time they didn't end and report in the way they did before... but the completion % of the two tasks is the same, even though one had completed two models, and the other lost all of its work on model 1. So, right now one has 2:20:xx of CPU, and is showing the same % completed as the one that just restarted 5 min ago. So it isn't taking total CPU time for the WU into account, just CPU time since the last BOINC startup. See also anders n's post, where they observed similar behavior. ID: 2970 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 2971 - Posted: 31 Mar 2007, 3:20:54 UTC - in response to Message 2970. Yes, I did send out some massively long workunits -- just testing out the system! Hmm, I hadn't carefully thought about what would happen if two models were completed on the first pass. Let me see if I can figure out a fix... ID: 2971 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 2972 - Posted: 31 Mar 2007, 5:18:58 UTC - in response to Message 2970. OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent. But if you shut BOINC down and restart, that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of CPU time spent so far, but that might be a can of worms. Worth the time? I think it's a better use of our time to figure out what's going wrong with the Mac's preempt/resume so that most users will not need to shut down BOINC and restart like Anders n has been doing! And we'll spend time getting in those checkpoints... ID: 2972 · Reply Quote
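
The "more self-sufficient" option weighed in Message 2972, where the app keeps its own record of CPU time so that a BOINC shutdown and restart does not reset the estimate, could look something like the sketch below. Everything here is hypothetical: the class name, the JSON file format, and the idea of calling `save` whenever a model finishes are assumptions for illustration, and no real BOINC API calls are used.

```python
import json
import os


class CpuTimeLedger:
    """Keep a running total of CPU seconds across application restarts.

    'banked' holds the CPU time recorded by earlier sessions; the caller
    supplies the CPU time used in the current session.
    """

    def __init__(self, path):
        self.path = path
        self.banked = self._load()

    def _load(self):
        # A missing or unreadable file means no prior work is recorded.
        try:
            with open(self.path) as handle:
                return float(json.load(handle)["cpu_seconds"])
        except (OSError, ValueError, KeyError, TypeError):
            return 0.0

    def total(self, session_seconds):
        """Total CPU time for the workunit, including earlier sessions."""
        return self.banked + session_seconds

    def save(self, session_seconds):
        """Write the current total atomically, e.g. after each model."""
        temp_path = self.path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump({"cpu_seconds": self.total(session_seconds)}, handle)
        os.replace(temp_path, self.path)
```

In feet1st's case the task with 2:20 of CPU time would reload roughly 8400 banked seconds after the restart instead of starting from zero, so its percent complete would no longer match a task that had just begun. The "can of worms" is real, though: work lost since the last save would still be counted as time spent unless the ledger is written together with a checkpoint.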

alexpoon Send message Joined: 9 Sep 06 Posts: 4 Credit: 87 RAC: 0	Message 2973 - Posted: 31 Mar 2007, 9:11:15 UTC I found that after suspending the WU (not left in memory), if I start it again, it restarts the % finished count, but the work itself continues (starting at model 5, for example). ID: 2973 · Reply Quote

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 2974 - Posted: 31 Mar 2007, 10:52:30 UTC Update on my Mac: Ralph 5.57 and Einstein have been switching all night without any errors. On to 5.58 :) Anders n ID: 2974 · Reply Quote

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 2975 - Posted: 31 Mar 2007, 14:24:23 UTC - in response to Message 2972. Just so we have all the facts right: when a Ralph WU on 5.58 is resumed after preemption, the % done goes back to 0 and the time to complete goes very high. I just had one that was preempted at 2 h, and when it restarted the time to complete was at nearly 6 h (rapidly going down as the % was going up). I have a 4 h setting for Ralph on that computer. Anders n ID: 2975 · Reply Quote

Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0	Message 2976 - Posted: 1 Apr 2007, 0:04:18 UTC - in response to Message 2975. Last modified: 1 Apr 2007, 0:06:29 UTC Anders n, I think the behavior you observe is partly due to an additional "correction" that the BOINC API applies when estimating time to completion -- it should never really be over 4 hours, right? We really don't have any control over that extra "correction". But we do have control over percent complete, and that shouldn't go to zero upon resuming Ralph! So I'm still worried. On my Intel Mac, I just tried to suspend a Ralph WU and ran einstein@home for a few minutes; then I suspended the einstein@home workunit and resumed the Ralph WU. Everything was fine (pct complete never dropped to zero)... when you try this, does pct complete drop to zero? [edit] Another question: you posted that 5.57 was fine; are you seeing an issue only with 5.58? If so, this is totally puzzling, since the small change I made to the Mac app shouldn't affect the behavior of pct complete. ID: 2976 · Reply Quote

[PST]Howard Send message Joined: 16 Feb 06 Posts: 1 Credit: 200,001 RAC: 0	Message 2977 - Posted: 1 Apr 2007, 5:17:00 UTC Last modified: 1 Apr 2007, 5:17:39 UTC On an AMD 2000XP running WinXP, symm_fold WUs on Rosetta 5.57 stick at 97.672%; I aborted them after 7+ hrs of run time. Other types of WUs have run OK on this box. Run time is set in preferences to 2 hrs. E.g.: https://ralph.bakerlab.org/result.php?resultid=479508 ID: 2977 · Reply Quote

anders n Send message Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0	Message 2978 - Posted: 1 Apr 2007, 6:01:42 UTC - in response to Message 2976. Last modified: 1 Apr 2007, 6:07:42 UTC Oops, sorry, I should have said that it was on a Windows XP computer that the % went to 0. It works OK on the Mac. As a side note, it does not happen when I suspend and resume in the middle of a model; it only happens when BOINC itself switches at a model change. (I have "Leave applications in memory while suspended" set to yes.) [edit] The Mac has done one more night of switching with Einstein without any trouble, now with 5.58 [/edit] Anders n ID: 2978 · Reply Quote