minirosetta v1.48-1.51 bug thread

mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4449 - Posted: 19 Jan 2009, 1:38:00 UTC
Last modified: 22 Jan 2009, 22:16:01 UTC

This (long anticipated, yes I know) new release has a whole lot of new features and bug fixes, mainly aimed at making the app more stable, making it obey user runtimes more precisely, and giving us more information when things go wrong. We are (as always, but with this one particularly) keen to hear how this app runs out there on your computers. Any feedback is invaluable to us in getting minirosetta as stable as rosetta++ and other projects. There is more new science on its way too, but the priority right now is code stability.


1.48 Release CHANGELOG

Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

Bug fix concerning intermittent crashes in _rlbd_ jobs.

Bug fix for a potential instability in handling text files (affects all types of WUs).

Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

Increased the density of checkpoints to lose less time on restarts and to address the weird "backjumping" of the time reported in this thread.

Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

Added checkpointing to Looprelax.

The Watchdog has been checked and improved, and now returns information on the aborted jobs to help us figure out how the remaining long-running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words, the WUs should not overrun by more than around 4 hours. If they do, please let us know!!
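
The abort rule in the last changelog item can be sketched roughly as follows. This is only an illustration of the arithmetic: the function and constant names are hypothetical and not taken from the minirosetta source, and whether the real watchdog measures CPU time or wall-clock time is not stated in the changelog.

```python
# Hypothetical sketch of the abort rule described in the changelog;
# these names are illustrative and do not come from the minirosetta source.
WATCHDOG_GRACE_SECONDS = 4 * 60 * 60  # the "+ 4 hours" margin


def watchdog_should_abort(elapsed_seconds, preferred_runtime_seconds):
    """Return True once the run has exceeded preferred runtime + 4 hours."""
    return elapsed_seconds > preferred_runtime_seconds + WATCHDOG_GRACE_SECONDS


# With a 4-hour preferred runtime the ceiling is 8 hours.
inside_margin = watchdog_should_abort(7.5 * 3600, 4 * 3600)
past_margin = watchdog_should_abort(8.5 * 3600, 4 * 3600)
print(inside_margin, past_margin)
```

Under this reading, a WU with a 4-hour preference that is still running after more than 8 hours is one worth reporting.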


Thank you all for helping us fix all these problems, especially all of you who temporarily switched over from Rosetta@HOME. We really appreciate your effort.

What's next ?

We will be submitting a whole variety of different WUs on RALPH to see whether we have improved the stability or have inadvertently created new problems. Then we will either release another app here (1.49) to address outstanding issues or move this version (1.48) directly to BOINC. Fingers crossed.


EDIT:

1.49 CHANGELOG

Something screwed up during the 1.48 release of the database. Supplying the database after the fact seemed to help only those who hadn't already grabbed the new app, so this release is merely a copy of the previous one with the proper database, to make sure all clients download the database correctly.
The apps are identical to 1.48 though!



EDIT:

1.50 CHANGELOG

Minor update - essentially identical to 1.49. Added another error-reporting mechanism and repaired the symbol store mechanism to help us figure out remaining problems.
ID: 4449
Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 4450 - Posted: 19 Jan 2009, 8:17:50 UTC - in response to Message 4449.  

As a first preliminary report:
This (long anticipated, yes i know ) new release ...
... is mistakenly dated to December 12, 2008 (probably a copy of the 1.47 release).

Peter
ID: 4450
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4451 - Posted: 19 Jan 2009, 17:46:15 UTC - in response to Message 4450.  

whoops yes - thanks for noticing that ;)
ID: 4451
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4452 - Posted: 19 Jan 2009, 18:22:53 UTC - in response to Message 4451.  

Update: something went wrong with the database during the update. This probably has nothing to do with the new application itself, but with the fact that our update machine went down last week, so this update was done from a new machine that evidently failed to update the database correctly. We're fixing that right now; the project will be down for a few hours before we get this sorted out. Sorry for the delay.

Mike
ID: 4452
sslickerson
Joined: 15 Feb 06
Posts: 17
Credit: 4,006
RAC: 0
Message 4453 - Posted: 19 Jan 2009, 19:27:56 UTC - in response to Message 4452.  

Would this be why the last 26 WU or so have failed on my PC with the following message, or is this something different?

<core_client_version>6.5.0</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: ....srcappspublicboincminirosetta.cc line: 83
called boinc_finish

</stderr_txt>
]]>
ID: 4453
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4454 - Posted: 19 Jan 2009, 20:22:07 UTC - in response to Message 4453.  

Yes! Ignore that database error message. For some reason the database did not get uploaded to the server when I did the update on Sunday. Something to do with the move to a new update machine, I suspect.

ID: 4454
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4455 - Posted: 20 Jan 2009, 7:51:17 UTC

I don't know if you like success stories, but I have run 4 tasks now I think on OS-X Intel and they all have completed successfully.
ID: 4455
I _ quit
Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4456 - Posted: 20 Jan 2009, 15:34:38 UTC

8 tasks on win xp home sp3 and no errors so far
had a few 1 hour runs before i updated the prefs. to 4hrs
ID: 4456
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4457 - Posted: 20 Jan 2009, 17:11:56 UTC

Awesome guys! Keep me posted on what you see out there. The error rate so far is looking fabulous.

I'll probably update the app once more today to fix an issue with the symbol store such that we get code traces in cases where it still fails.

Mike :)
ID: 4457
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4458 - Posted: 20 Jan 2009, 19:55:04 UTC - in response to Message 4457.  

Awesome guys! Keep me posted on what you see out there. The error rate so far is looking fabulous.

I'll probably update the app once more today to fix an issue with the symbol store such that we get code traces in cases where it still fails.

Mike :)


I only got one 1.5 task so that will be all I have to report ... so, my latest bug report is that version 1.5 is repelling the creation of new tasks ...
ID: 4458
Ian_D
Joined: 16 Feb 06
Posts: 16
Credit: 39,518
RAC: 0
Message 4459 - Posted: 20 Jan 2009, 20:52:47 UTC

https://ralph.bakerlab.org/result.php?resultid=1250838

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Watchdog active.

ERROR: target_strands.size()
ERROR:: Exit from: ....srcprotocolsabinitioTemplateJumpSetup.cc line: 94
called boinc_finish

</stderr_txt>
]]>

ID: 4459
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4460 - Posted: 20 Jan 2009, 21:42:33 UTC
Last modified: 20 Jan 2009, 21:44:44 UTC

More checkpointing is great! But... this is a bit extreme. My "write to disk at most every..." setting is at 1800 seconds. My hard drive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace.


1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:32:39 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:33:14 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:33:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:02 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:34:04 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:42 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:44 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:35:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:35:28 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:36:04 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:36:17 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:36:44 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:36:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:37:24 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:37:39 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:38:04 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:38:21 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:38:43 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:39:02 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:39:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:39:44 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
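
To put a number on the checkpoint pace in a log like the one above, the gaps between consecutive checkpoints of each task can be computed from the message lines. The following is a minimal sketch that assumes the `timestamp|project|message` layout seen in this paste; the function name is made up for illustration, and the sample holds only the first four lines of the log.

```python
from collections import defaultdict
from datetime import datetime


def checkpoint_intervals(log_lines):
    """Return {task_name: [seconds between consecutive checkpoints]}."""
    last_seen = {}
    gaps = defaultdict(list)
    for line in log_lines:
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue
        words = parts[2].split()
        # Expected message: "[checkpoint_debug] result <task> checkpointed"
        if len(words) != 4 or words[0] != "[checkpoint_debug]" or words[3] != "checkpointed":
            continue
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %I:%M:%S %p")
        task = words[2]
        if task in last_seen:
            gaps[task].append((stamp - last_seen[task]).total_seconds())
        last_seen[task] = stamp
    return dict(gaps)


sample = [
    "1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed",
    "1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed",
    "1/20/2009 3:32:39 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed",
    "1/20/2009 3:33:14 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed",
]
gaps = checkpoint_intervals(sample)
for task, seconds in gaps.items():
    print(task, seconds)
```

For these four lines the gaps come out at 41 and 46 seconds, against a preference of 1800 seconds.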

ID: 4460
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4461 - Posted: 20 Jan 2009, 22:00:26 UTC - in response to Message 4459.  

https://ralph.bakerlab.org/result.php?resultid=1250838

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Watchdog active.

ERROR: target_strands.size()
ERROR:: Exit from: ....srcprotocolsabinitioTemplateJumpSetup.cc line: 94
called boinc_finish

</stderr_txt>
]]>


Awesome!! Our new debug tools are working. This rare error (I've never seen it in thousands of runs) would have gone unnoticed before and led to a segfault. Now it at least gets caught and we can find its cause.

Thanks!

ID: 4461
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4462 - Posted: 20 Jan 2009, 22:01:10 UTC - in response to Message 4460.  

More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My harddrive will never be able to spin down and go in to power saver mode all night long if the checkpoints continue at this pace.


1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result


Hmm ok, i'll look into this.
ID: 4462
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4463 - Posted: 20 Jan 2009, 23:11:44 UTC

Need more work ...

I have run with 1.50 and both tasks were successful ... of course the Mac application has been stable for me, even the awful 1.47 which really farbled up my XP machines ... well, I be doing my part ... :)
ID: 4463
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4464 - Posted: 21 Jan 2009, 0:56:23 UTC - in response to Message 4463.  

as you wish ...
ID: 4464
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4465 - Posted: 21 Jan 2009, 5:31:15 UTC - in response to Message 4464.  

as you wish ...


Well, it is the only defect I have found so far on OS-X ...I can't get work ... :)

Of course, 1.47 works well on OS-X: no hung tasks, no long running tasks ... no illegal functions ... so ... well, I will be looking to add a Windows machine in the next drop ...

Anyway, got three more tasks ... thanks ...
ID: 4465
HA-SOFT, s.r.o.
Joined: 19 Jan 09
Posts: 6
Credit: 19,644
RAC: 0
Message 4466 - Posted: 21 Jan 2009, 9:10:21 UTC

I still have problems on my new W2008 X64 server.
Every task of 1.5 minirosetta hangs at startup with 3MB memory and stdout:

[2009- 1-21 9:52:36:] :: BOINC :: boinc_init()
Created shared memory segment

These tasks hang and I have to kill them from the taskbar. After killing, stderr is:


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x778806CF read attempt to address 0x00000004

Engaging BOINC Windows Runtime Debugger...
ID: 4466
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4467 - Posted: 21 Jan 2009, 9:35:58 UTC

First ever error Task on OS-X ... I got this error:


<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Watchdog active.

ERROR: target_strands.size()
ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94
called boinc_finish

</stderr_txt>


Which seems to be the same error reported in Message 4459 ...
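
One way to check that two such reports are the same failure is to reduce each stderr text to a (message, source file, line) signature and compare. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and because the Windows paste in this thread has lost its path separators, the sketch drops separators and leading dots from both paths rather than trying to reconstruct them.

```python
import re

# Matches the "Exit from" line seen in the stderr pastes in this thread.
EXIT_RE = re.compile(r"ERROR:: Exit from: (?P<path>\S+) line: (?P<line>\d+)")


def error_signature(stderr_text):
    """Return (message, normalized_path, line_number), or None if no exit line."""
    message = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        found = EXIT_RE.match(line)
        if found:
            # Remove slashes/backslashes and leading dots so both pastes compare equal.
            path = re.sub(r"[\\/]", "", found.group("path")).lstrip(".")
            return (message, path, int(found.group("line")))
        if line.startswith("ERROR: "):
            message = line[len("ERROR: "):]
    return None


windows_report = """Watchdog active.

ERROR: target_strands.size()
ERROR:: Exit from: ....srcprotocolsabinitioTemplateJumpSetup.cc line: 94
called boinc_finish
"""
osx_report = """Watchdog active.

ERROR: target_strands.size()
ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94
called boinc_finish
"""
same_error = error_signature(windows_report) == error_signature(osx_report)
print(same_error)
```

With the two pastes from this thread, both signatures reduce to the same message, file, and line 94.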
ID: 4467
sslickerson
Joined: 15 Feb 06
Posts: 17
Credit: 4,006
RAC: 0
Message 4468 - Posted: 21 Jan 2009, 13:54:50 UTC

My Windows Vista 64 laptop has received about 8 WUs and all have completed successfully without error, so this looks good. Hopefully I will be able to attach to Rosetta soon!


ID: 4468
©2024 University of Washington
http://www.bakerlab.org