Posts by Rhiju

21) Message boards : RALPH@home bug list : Bug reports for 5.71 (Message 3264)
Posted 3 Jul 2007 by Rhiju
Post:
Thanks for reporting errors!

22) Message boards : RALPH@home bug list : Bug reports for 5.69-5.70 (Message 3235)
Posted 26 Jun 2007 by Rhiju
Post:
Great, that's what I was hoping for actually. We're testing a mode of ralph in which we can run an old app version and a new app version at the same time. This is to allow some of our workunits to continue with a stable version to enable consistent results for publication, while other workunits can take advantage of later bug fixes and features. That workunit has a checkpointing issue with 5.68, but works well with 5.69 (now 5.70); I wanted to make sure I could send out one batch of jobs for the old app and one for the newer app!


This WU failed after trying to restart from the last checkpoint.

(I updated BOINC, so it was taken out of memory.)

http://ralph.bakerlab.org/result.php?resultid=569191

Anders n

23) Message boards : RALPH@home bug list : Bug reports for 5.69-5.70 (Message 3230)
Posted 25 Jun 2007 by Rhiju
Post:
Hi: Thanks, I figured out the problem with this WU!


Result ID 566072
Name 1csl__BOINC_EXCLUDENB_PERMITZERO_RNA_ABINITIO-1csl_-_2156_10_1
Workunit 500961
Created 25 Jun 2007 4:05:10 UTC
Sent 25 Jun 2007 4:47:22 UTC
Received 25 Jun 2007 8:41:02 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 7180
Report deadline 29 Jun 2007 4:47:22 UTC
CPU time 0
stderr out

<core_client_version>5.10.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 7200
ERROR:: Exit from: pose.cc line: 769

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0
Granted credit 0
application version 5.69

24) Message boards : RALPH@home bug list : Bug reports for 5.69-5.70 (Message 3219)
Posted 24 Jun 2007 by Rhiju
Post:
Thanks for continuing to post problems...

25) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3215)
Posted 24 Jun 2007 by Rhiju
Post:
Hi:

We're looking at these now..


Result ID 563617
Name 1FAB_BOINC_MFR_ABRELAX_2144_24_1
Workunit 499540
Created 23 Jun 2007 6:10:18 UTC
Sent 23 Jun 2007 6:10:25 UTC
Received 23 Jun 2007 17:42:55 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
Computer ID 8763
Report deadline 27 Jun 2007 6:10:25 UTC
CPU time 6887.484375
stderr out

<core_client_version>5.10.2</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 2605369
======================================================
DONE :: 1 starting structures 6886.77 cpu seconds
This process generated 4 decoys from 4 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>1FAB_BOINC_MFR_ABRELAX_2144_24_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

Validate state Invalid
Claimed credit 28.4623439083363
Granted credit 0
application version 5.68

26) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3195)
Posted 12 Jun 2007 by Rhiju
Post:
Hi: Yea, about half the workunits failed on all the platforms. I'm looking into this now...

I'm 8.5hrs into this symm fold dock relax task and still have not completed the second model. Seems significantly higher than the 1hr/model mentioned previously.

27) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3185)
Posted 30 May 2007 by Rhiju
Post:
Hi: Yea I wish I'd seen all these crashes before sending out the same job to Rosetta@home. The first jobs that came back seemed OK -- now I realize that it's because all the adversely affected computers were taking suuuper long and then crashed. We didn't expect those workunits to have such big memory footprints, so we'll have to spend a bit of time debugging.

After a fair run of successes my G5 Mac got a computation error (no crash AFAICT) with exit code 1 (0x1), running v5.68 on gp04__BOINC_SYMM_FOLD_AND_DOCK_RELAX_SUBSYSTEM-gp04_-delC126__2078_10 after a little over six hours of crunching. The system had been running with the screensaver blacked out and the display sleeping for at least twelve hours. The output ends with
ERROR:: Exit from: hbonds.cc line: 636


28) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3158)
Posted 25 May 2007 by Rhiju
Post:
Thanks for the post. I doubt that it is the graphics start and stop, but it might be. Please do post again if you find your mac crashing when you play with graphics. Those are tough bugs to fix, because a lot of the graphics stuff is out of our direct control. The good news (well, maybe bad to start with) is that the BOINC infrastructure will be moving to a new way of doing graphics that is apparently more robust, I think by the end of the summer. So after we iron out the kinks, that might help the graphics-related errors...

Incidentally, those workunits do take a long time (we have implemented checkpointing so that work should be saved frequently in case of crashes), and Mac G4s are pretty slow for running Rosetta, unfortunately.


My Mac G4/733 crashed after more than ten hours of crunching 1gidA_BOINC_MG_SASAPAIR_ALLRES_RNA_ABINITIO_SAVE_ALL_OUT_BARCODE_RNA_CONTACT_RNA_LONG_RANGE_CONTACT_RNA_SASA-1gidA-_2068_172; last time I looked it was showing only about ten minutes to go but hadn’t decremented that time for quite a while. (The percent done was over 98% and continuing to increment.) Exit status 1 (0x1), with the all-too-familiar “SIGBUS: bus error” message in the output file. Once again, the crash occurred either while the display was blacked out (having displayed the screensaver for a minute) or when I interrupted it. BTW this system has always been set to work while in use and to keep apps in memory, so I don’t understand why starting and stopping the graphics should be a problem—if that’s indeed the case.

29) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3157)
Posted 25 May 2007 by Rhiju
Post:
Hi feet1st -- yea, it's because Rosetta changes its fold while the graphics thread finishes its drawing. We considered at one point freezing Rosetta until each graphics frame finishes, but were worried about the performance cost! So these large molecules may continue to get rendered in freaky ways!


My sidechains still fall off on 5.67, as they did with 5.65
Example screenshot
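
The rendering glitch described in the reply above is a classic reader/writer race: the simulation rewrites coordinates while the graphics thread is partway through reading them. The sketch below is illustrative only and is not how Rosetta or the BOINC graphics API handles this. It shows one common alternative to freezing the simulation for a whole frame: the renderer copies the coordinates under a short lock and then draws from its private copy. The `SharedPose` class and its method names are hypothetical.

```python
import threading


class SharedPose:
    """Hypothetical holder for coordinates shared by a simulation and a renderer."""

    def __init__(self, coords):
        self._lock = threading.Lock()
        self._coords = list(coords)

    def update(self, coords):
        # Simulation thread: swap in a complete new set of coordinates.
        with self._lock:
            self._coords = list(coords)

    def snapshot(self):
        # Graphics thread: hold the lock only long enough to copy,
        # then draw from the copy without blocking the simulation.
        with self._lock:
            return list(self._coords)


pose = SharedPose([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)])
frame = pose.snapshot()                       # renderer takes a consistent copy
pose.update([(0.1, 0.2, 0.0), (1.4, 0.3, 0.0)])  # simulation moves on
print(frame)                                  # the frame still shows the old, consistent fold
```

The trade-off is the cost of one copy per frame, which for very large molecules is exactly the kind of performance concern the reply mentions.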

30) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3146)
Posted 24 May 2007 by Rhiju
Post:
Ralph 5.66 fixed a problem where the graphics thread was crashing when sidechains were shown.
Ralph 5.67 fixes an issue in the output of symmetric proteins.
Thanks in advance for your posts! The posts for 5.65 helped a lot.

31) Message boards : RALPH@home bug list : Bug reports for 5.65 (Message 3131)
Posted 23 May 2007 by Rhiju
Post:
Hi everybody:

Looks like there are a lot of problems with this version, actually -- a very high error rate. I'll track it down! Thanks for posting.


Error:
http://ralph.bakerlab.org/result.php?resultid=523659

<core_client_version>5.8.16</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 2662174


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x009B93DB read attempt to address 0x1133FE5C

Engaging BOINC Windows Runtime Debugger...



********************
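
A note on reading the exit code in the report above: the negative decimal value and the hexadecimal value are the same 32-bit number, shown once as a signed integer and once as unsigned. The helper below is a small illustrative sketch (the function name `to_unsigned_hex` is made up for this example) of the conversion.

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Render a signed 32-bit exit code in the unsigned hex form shown in the log."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(to_unsigned_hex(-1073741819))  # 0xc0000005, the Access Violation code
```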

32) Message boards : RALPH@home bug list : Bug reports for 5.65 (Message 3119)
Posted 22 May 2007 by Rhiju
Post:
So far things have been pretty stable with 5.64; thanks to everyone for posting about crashes on ralph, it's helped us fine-tune our workunits. This update just has a small addition to give us more control over the energy function assumed in RNA workunits.

33) Message boards : RALPH@home bug list : Bug reports for 5.63 (Message 3066)
Posted 5 May 2007 by Rhiju
Post:
Hi Odyssesus:

Thanks much for the detailed post! The good news is that we've fixed the bug you reported. We're glad to hear you can see the graphics, apologize a little for that "time to complete" weirdness (it happens on Macs, unfortunately) and the occasional lack of work -- welcome to ralph! Soon you'll be able to run on rosetta@home (we'll probably update the application on Sunday), and there will be plenty of work for your computer to do.

Hello, everyone! I seem to have been recruited into my first alpha project (the first that identifies itself as such, at least). I regret that my first posting here has to be negative. Well, mixed …

After getting “No work from project” messages for a few hours after attaching, my G4/400 running BOINC v5.4.9 under Mac OS 10.3.9 received its first WU, 2tif__DIVERSE_ABRELAX__NEWRELAXFLAGS_BCFROMFRAGS20_SAVE_ALL_OUT_frags83__1986_4. It got a “Compute error” in less than twenty seconds:
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
  Original size:            0.
  Maximum size:      8388608.
  RLIM_INFINITY            0
# cpu_run_time_pref: 3600
ERROR:: Unable to determine sequence length from pdb file
# random seed: 2680619
ERROR:: Exit from: pose.cc line: 1929
</stderr_txt>

The other two hosts that crunched it, both Windows/AMD, also had errors.

On the brighter side, the next WU I received seems to be doing OK. It began with an estimate of around 25 hours and kicked my host into “panic mode”, but got most of the way through the task in just a couple of hours. Then it seemed to slow down; when I left the shop it was somewhat over 90% done, having found a configuration that vaguely resembled the ‘template’, but still showing about 13 hours to go. When I got home, however, I was encouraged to see that the host had downloaded another WU, indicating that BOINC thinks the current one is almost done.

Awesome graphics, BTW. :)

P.S. Re checkpoints: I forgot to mention that my Mac crashed while running this task (@#$%! Acrobat!) but when I rebooted it resumed very near where it had been when interrupted.

34) Message boards : RALPH@home bug list : Bug reports for 5.63 (Message 3057)
Posted 4 May 2007 by Rhiju
Post:
Thanks to all for posting -- I think we found the source of the error, and we're sending out some more test jobs.

Same for me

http://ralph.bakerlab.org/result.php?resultid=505151
http://ralph.bakerlab.org/result.php?resultid=505150

=Lupus=

35) Message boards : RALPH@home bug list : Bug reports for 5.63 (Message 3048)
Posted 4 May 2007 by Rhiju
Post:
Please especially report any problems you might notice with checkpointing, or running on powerpc macs.

36) Message boards : RALPH@home bug list : Bug reports for 5.56-5.59 (Message 2986)
Posted 2 Apr 2007 by Rhiju
Post:
Thanks, Feet1st, that's a great explanation. We indeed try to keep the avg time per model at less than one hour; actually our ralph runs help us calibrate this!

Maion, I believe your time remaining is working just the way Rhiju intended. Once the remaining-time estimate drops below 10 minutes, the displayed time starts moving more slowly. This is to avoid exceeding 100%. So once you get below a 10-minute estimated time remaining, the estimate is no longer on track: the client is unsure exactly when it will finish. But in each case, the 15- and 17-minute estimates were not far from right.

...But Rhiju assures us they won't be sending WUs which take more than an hour per model on Rosetta. And so on Rosetta, with shorter WUs, the estimates should appear better. The 1hr time preference is always going to be the toughest to provide a good estimate on, as it is the time preference that will see the most variation (in percentage terms) between the actual time and the preference.
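
The behavior described in the quoted explanation (progress tracks elapsed time until about ten minutes remain, then slows so it cannot pass 100%) can be sketched as follows. This is only an illustration of the idea, not the actual Rosetta or BOINC code; the function name, the exponential slowdown, and the 600-second window are assumptions chosen for the example.

```python
import math

SLOWDOWN_WINDOW = 600.0  # seconds; assumed from the "10 minutes" in the post


def fraction_done(elapsed: float, target: float) -> float:
    """Hypothetical progress estimate that slows down near the end.

    Linear in elapsed CPU time until SLOWDOWN_WINDOW seconds remain,
    then approaches 1.0 ever more slowly instead of overshooting it.
    """
    threshold = max(target - SLOWDOWN_WINDOW, 0.0)
    if elapsed <= threshold:
        return elapsed / target
    base = threshold / target
    overshoot = elapsed - threshold
    # The remaining fraction shrinks exponentially, so the result stays <= 1.0.
    return base + (1.0 - base) * (1.0 - math.exp(-overshoot / SLOWDOWN_WINDOW))


for minutes in (30, 50, 60, 90):
    print(minutes, round(fraction_done(minutes * 60.0, 3600.0), 4))
```

With a one-hour preference, a task that runs to 90 minutes still reports just under 100%, which matches the observation that the time remaining stops decrementing while the percentage keeps creeping upward.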

37) Message boards : RALPH@home bug list : Bug reports for 5.56-5.59 (Message 2980)
Posted 2 Apr 2007 by Rhiju
Post:
Updates in 5.59
I think this is the last update. Everything ran pretty smoothly in 5.58. This just has some small updates in the science, to get back some useful scores for each decoy and a small set of fixes for the symmetric FOLD_AND_DOCK workunits.

38) Message boards : RALPH@home bug list : Bug reports for 5.56-5.59 (Message 2976)
Posted 1 Apr 2007 by Rhiju
Post:
Anders n, I think the behavior you observe is partly due to an additional "correction" that the BOINC API applies when estimating time to completion -- it should never really be over 4 hours, right? We really don't have any control over that extra "correction".

But we do have control over percent complete, and that shouldn't go to zero upon resuming ralph! So I'm still worried. On my mac intel machine, I just tried to suspend a ralph WU, and ran einstein@home for a few minutes; then suspended the einstein@home workunit, and resumed the ralph WU. Everything was fine (pct complete never dropped to zero)... when you try this, does pct complete drop to zero?

[edit]
Another question: you posted that 5.57 was fine; are you seeing an issue only with 5.58? If so, this is totally puzzling, since the small change I made to the Mac app shouldn't affect behavior of pct complete.


OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent.

But if you shut BOINC down and restart that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of cpu time spent so far, but that might be a can of worms. Worth the time?


Just so we have all the facts right: when a Ralph WU 5.58 is resumed after preempt, the % done goes back to 0 and time to complete goes very high. I just had one that was preempted at 2 H, and when restarted, time to complete was at nearly 6 H (rapidly going down as % was going up). I have a 4 H setting for Ralph on that computer.

Anders n

39) Message boards : RALPH@home bug list : Bug reports for 5.56-5.59 (Message 2972)
Posted 31 Mar 2007 by Rhiju
Post:
OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent.

But if you shut BOINC down and restart that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of cpu time spent so far, but that might be a can of worms. Worth the time? I think it's a better use of our time to figure out what's going wrong with Mac's preempt/resume so that most users will not need to shut down BOINC and restart like Anders n has been doing! And we'll spend time getting in those checkpoints...
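
For readers wondering what a "more self-sufficient" app might look like, here is a minimal sketch, assuming the app persists its own accumulated CPU time alongside each checkpoint. It is not the Rosetta implementation and does not use the BOINC API; the `CpuTimeTracker` class and the JSON file layout are hypothetical.

```python
import json
import os
import time


class CpuTimeTracker:
    """Hypothetical tracker that carries CPU time across restarts of the app."""

    def __init__(self, path):
        self.path = path
        self.saved = self._load()                  # CPU seconds from earlier sessions
        self.session_start = time.process_time()   # CPU clock for this session

    def _load(self):
        try:
            with open(self.path) as f:
                return float(json.load(f)["cpu_seconds"])
        except (OSError, ValueError, KeyError, TypeError):
            return 0.0  # no usable record yet: start from zero

    def total(self):
        # Earlier sessions plus CPU time consumed since this process started.
        return self.saved + (time.process_time() - self.session_start)

    def checkpoint(self):
        # Write to a temporary file and rename, so a crash mid-write
        # cannot leave a truncated record behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"cpu_seconds": self.total()}, f)
        os.replace(tmp, self.path)
```

Calling `checkpoint()` whenever the science code writes its own checkpoint would let a restarted task compute percent complete from `total()` rather than from time since the last BOINC start. The "can of worms" is real, though: the record can drift from what the BOINC manager reports, and work lost since the last checkpoint is still counted as time spent.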


...our solution to this problem is to be careful -- we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!

It's funny you should say that RIGHT when I've got 4 Ralph WUs and all 4 are taking more than an hour for their first model. (but TRUE, you said "on Rosetta" not "on Ralph").

I ended and started BOINC again (I've got 5.57), this time they didn't end and report in the way they did before... but the completion % of the two tasks is the same, even though one had completed two models, and the other lost all of its work on model 1. So, right now one has 2:20:xx of CPU, and is showing same %completed as the one that just restarted 5min ago. So it isn't taking total CPU time for the WU into account, just CPU since last BOINC start up.

See also anders n's post where they observed similar behavior.

40) Message boards : RALPH@home bug list : Bug reports for 5.56-5.59 (Message 2971)
Posted 31 Mar 2007 by Rhiju
Post:
Yes, I did send out some massively long workunits -- just testing out the system!

Hmm, I hadn't carefully thought about what would happen if two models were completed on the first pass. Let me see if I can figure out a fix...

...our solution to this problem is to be careful -- we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!

It's funny you should say that RIGHT when I've got 4 Ralph WUs and all 4 are taking more than an hour for their first model. (but TRUE, you said "on Rosetta" not "on Ralph").

I ended and started BOINC again (I've got 5.57), this time they didn't end and report in the way they did before... but the completion % of the two tasks is the same, even though one had completed two models, and the other lost all of its work on model 1. So, right now one has 2:20:xx of CPU, and is showing same %completed as the one that just restarted 5min ago. So it isn't taking total CPU time for the WU into account, just CPU since last BOINC start up.

See also anders n's post where they observed similar behavior.





