minirosetta 1.58

Message boards : RALPH@home bug list : minirosetta 1.58

[Toscana]SickBoy88
Joined: 27 Jan 09
Posts: 3
Credit: 17,581
RAC: 0
Message 4744 - Posted: 18 Mar 2009, 12:53:20 UTC

This WU gave me a compute error:
https://ralph.bakerlab.org/result.php?resultid=1371832

svincent
Joined: 4 Apr 08
Posts: 34
Credit: 51,768
RAC: 0
Message 4745 - Posted: 18 Mar 2009, 20:09:54 UTC

Just had a bunch of WUs fail at the start (Mac OS X 10.4.11), all in the same way; sample output below.

1213821
1213853
1213852
1213701
1213693
1213692
1213677

ERROR: [ERROR] Unable to open constraints file: t297_.cst.best.multi
ERROR:: Exit from: src/core/scoring/constraints/ConstraintIO.cc line: 330
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
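
For anyone comparing their own failed results against the ones listed here, the two-line error pattern above (an `ERROR:` message followed by an `ERROR:: Exit from:` location) can be pulled out of the stderr text mechanically. This is only a minimal sketch: the function names `parse_failure` and `count_by_exit_point` are made up for illustration, and it assumes the stderr output has already been saved as plain text.

```python
import re
from collections import Counter

# Matches the location line seen in the reports:
#   ERROR:: Exit from: <source file> line: <number>
EXIT_RE = re.compile(r"^ERROR:: Exit from: (?P<file>\S+) line: (?P<line>\d+)\s*$")
# Matches the message line that precedes it:
#   ERROR: <message>
MSG_RE = re.compile(r"^ERROR: (?P<msg>.+?)\s*$")


def parse_failure(stderr_text):
    """Return (message, source_file, line_number) for the first exit point, or None."""
    message = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        exit_match = EXIT_RE.match(line)
        if exit_match:
            return message, exit_match.group("file"), int(exit_match.group("line"))
        msg_match = MSG_RE.match(line)
        if msg_match:
            # Remember the most recent message; the exit line follows it.
            message = msg_match.group("msg")
    return None


def count_by_exit_point(stderr_texts):
    """Count failures per (source_file, line_number) across several stderr dumps."""
    counts = Counter()
    for text in stderr_texts:
        parsed = parse_failure(text)
        if parsed is not None:
            _, source_file, line_number = parsed
            counts[(source_file, line_number)] += 1
    return counts
```

Grouping by exit point rather than by message keeps results together even when the message contains a per-task file name.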



Pentti Kiesi
Joined: 2 Jan 09
Posts: 2
Credit: 111,437
RAC: 0
Message 4747 - Posted: 19 Mar 2009, 12:15:40 UTC - in response to Message 4728.  


https://ralph.bakerlab.org/workunit.php?wuid=1185064

One WU seems not willing to upload at all. Others before and after it are
uploading correctly:
13.3.2009 21:38:40|ralph@home|[error] Error reported by file upload server: [loopbuild_mamaln_full_hb_t303__IGNORE_THE_REST_1te2_99_8360_1_0_0] locked by file_upload_handler PID=-1

As a reminder: this is still hanging in my upload queue, with 12 hours of CPU time on a quad-core 2.66 GHz machine. Should I finally cancel it, or are you still interested in it?

robertmiles
Joined: 13 Jan 09
Posts: 95
Credit: 327,911
RAC: 0
Message 4748 - Posted: 22 Mar 2009, 14:01:52 UTC

Another lockfile problem:

https://ralph.bakerlab.org/result.php?resultid=1374201

Running at 95% CPU, with BOINC 6.2.28 under Vista SP1 with graphics disabled.

Will try resetting Ralph@home soon.

robertmiles
Joined: 13 Jan 09
Posts: 95
Credit: 327,911
RAC: 0
Message 4749 - Posted: 22 Mar 2009, 15:20:49 UTC - in response to Message 4748.  

Tried resetting Ralph@home, got these error messages:

3/22/2009 9:04:01 AM|ralph@home|Resetting project
3/22/2009 9:04:06 AM|ralph@home|[error] Couldn't delete file projects/ralph.bakerlab.org/minirosetta_1.58_windows_intelx86.exe

Similar problem with Rosetta@home, except with the 1.54 executable.

I use BOINC 6.2.28 under Vista SP1, with graphics not enabled.

An attempt to manually delete this file failed when I couldn't find the directory containing it, or even anything under the BOINC directory specific to the Ralph@home project.

I intend to leave both Ralph@home and Rosetta@home on no new tasks until I get some usable advice on how to complete the resets.

Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4750 - Posted: 22 Mar 2009, 18:54:11 UTC

Validate errors
1374039
1374038
1374037

All ran for around 5 hours. No doubt the second run will take half the time or less, as has happened with previous work units.


robertmiles
Joined: 13 Jan 09
Posts: 95
Credit: 327,911
RAC: 0
Message 4751 - Posted: 23 Mar 2009, 2:38:44 UTC - in response to Message 4749.  
Last modified: 23 Mar 2009, 2:39:46 UTC

More on the lockfile problem:

When this problem shows up, expect a few subdirectories of BOINC\slots to have three files each, unrelated to any workunit in progress and including the lockfile. I suspect that Rosetta@home and Ralph@home workunits are unable to run successfully if assigned to any of these slots, even if workunits from other BOINC projects can. Attempts to manually delete these files also fail.
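
A read-only check along these lines might help others confirm whether they have the same leftover slots. It is a rough sketch under stated assumptions: the lockfile name `boinc_lockfile` is assumed rather than taken from this thread, the helper name `find_suspect_slots` is hypothetical, and the caller has to supply which slots currently hold running tasks. It only lists files and deletes nothing.

```python
from pathlib import Path

# Assumption: the lockfile left in a slot directory has this name.
LOCKFILE_NAME = "boinc_lockfile"


def find_suspect_slots(slots_dir, active_slots=()):
    """List slot directories that hold a lockfile but are not in active_slots.

    Returns a list of (slot_name, sorted_file_names) tuples.
    """
    active = {str(name) for name in active_slots}
    suspects = []
    for slot in sorted(Path(slots_dir).iterdir()):
        # Skip stray files and slots known to be running a task right now.
        if not slot.is_dir() or slot.name in active:
            continue
        if (slot / LOCKFILE_NAME).exists():
            leftover = sorted(p.name for p in slot.iterdir())
            suspects.append((slot.name, leftover))
    return suspects
```

Passing the slot numbers of tasks that are actually running as `active_slots` keeps healthy slots out of the report.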

However, the following may have helped for me: Set Rosetta@home and/or Ralph@home to no new tasks and wait until all tasks for either of them complete. Do an update for either that has tasks not reported yet. Suspend all workunits and network activity. Shut down the BOINC client, then find process boinc.exe and kill it. Reboot. If these subdirectories of BOINC\slots have disappeared, enable network activity and do another reset on Rosetta@home and/or Ralph@home. If these resets complete without error messages, it's safe to resume activity on any other BOINC projects, then allow new tasks on Rosetta@home and/or Ralph@home.

However, I haven't completed any workunits for either Rosetta@home or Ralph@home since doing this, so it will be at least tomorrow before I can check if this actually took care of the lockfile problem, at least temporarily.

I wouldn't be surprised if this procedure includes some unnecessary steps, but I wanted to report it before making any effort to find out.

robertmiles
Joined: 13 Jan 09
Posts: 95
Credit: 327,911
RAC: 0
Message 4752 - Posted: 23 Mar 2009, 16:46:06 UTC - in response to Message 4751.  

That didn't help enough: the first Rosetta@home 1.54 workunit to complete after the above procedure had the lockfile problem again. Two more that started since then have not shown the problem so far, but they haven't completed yet.

Suggestion: Modify minirosetta so that it checks for a lockfile as it starts up, preferably before trying to create one, and if this first check finds a lockfile, reduce the number of times minirosetta is allowed to restart before it is able to write the first checkpoint.

Suggestion: Modify minirosetta so that it reports which slot it was run under if it is able to do this, since the problem looks likely to repeat for any minirosetta workunit run in a slot where a previous workunit's lockfile was not erased when the previous workunit completed and was reported.

Suggestion: Check the procedure used for failed workunits to see if it leaves a lockfile behind after abandoning efforts to restart the workunit.

Suggestion: Check what program is supposed to delete the lockfiles for workunits that have been completed and reported.

Suggestion: Check if BOINC allows any way to request that a workunit be restarted, but in a different slot.

Suggestion: If BOINC is supposed to clean up the slots after workunits complete and are reported, check if BOINC 6.2.28 is known to have any problems with doing this.
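
To make the first two suggestions concrete, here is a small sketch of the startup logic I have in mind. It is written in Python purely for illustration (minirosetta itself is not Python), and everything in it is an assumption: the lockfile name `boinc_lockfile`, the restart limits, and the function name `startup_check` are all hypothetical.

```python
from pathlib import Path

# All three values below are illustrative assumptions.
LOCKFILE_NAME = "boinc_lockfile"
DEFAULT_MAX_RESTARTS = 5
REDUCED_MAX_RESTARTS = 1


def startup_check(slot_dir, has_checkpoint):
    """Inspect the slot before any attempt to create a lockfile.

    Returns a dict with the slot name (for the stderr report), whether a
    lockfile was already present, and how many restarts to allow before the
    first checkpoint is written.
    """
    slot = Path(slot_dir)
    stale_lockfile = (slot / LOCKFILE_NAME).exists()
    if stale_lockfile and not has_checkpoint:
        # A lockfile exists before we created one and no progress is saved:
        # give up quickly instead of restarting many times in a bad slot.
        max_restarts = REDUCED_MAX_RESTARTS
    else:
        max_restarts = DEFAULT_MAX_RESTARTS
    return {
        "slot": slot.name,
        "stale_lockfile": stale_lockfile,
        "max_restarts": max_restarts,
    }
```

Reporting the `slot` value with each failure would show whether the failures cluster in the same few slots.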

I haven't had any 1.58 workunits since trying the procedure, so I don't know whether these continued problems also apply to 1.58.

I often let BOINC run for a few days between reboots.

I still use BOINC 6.2.28 under Vista SP1, with 95% CPU time.

Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4753 - Posted: 23 Mar 2009, 17:06:10 UTC

Robert,

Thanks for all the work on the lock file ... I hope we can figure out what is going on with this ...

For my part, I have a Validate error, though the task seems to have failed with a different error that was not reported as one:

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338

What does this mean? Beats me ...

Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4754 - Posted: 24 Mar 2009, 18:54:31 UTC

New error:
ERROR: aFrame->nr_frags()
ERROR:: Exit from: ..\..\src\core\fragment\FragSet.cc line: 168

svincent
Joined: 4 Apr 08
Posts: 34
Credit: 51,768
RAC: 0
Message 4755 - Posted: 24 Mar 2009, 21:24:14 UTC

More problems on Mac OS X 10.4.11

WUs 1376869, 1376870, 1376871 failed; see below.

ERROR: Conformation: fold_tree nres should match conformation nres. conformation nres: 137 fold_tree nres: 156589050
ERROR:: Exit from: src/core/conformation/Conformation.cc line: 224
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Chu
Volunteer moderator
Project developer
Project scientist
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 4756 - Posted: 24 Mar 2009, 22:27:50 UTC - in response to Message 4755.  
Last modified: 25 Mar 2009, 5:27:38 UTC

Thanks for reporting this. Some input and output files were not compressed properly for the WUs ending with "BOINC_MPZN_with_zinc_loop_modeling", which caused premature failures/exits. Sorry about that.

More problems on Mac OS X 10.4.11

WUs 1376869, 1376870, 1376871 failed; see below.

ERROR: Conformation: fold_tree nres should match conformation nres. conformation nres: 137 fold_tree nres: 156589050
ERROR:: Exit from: src/core/conformation/Conformation.cc line: 224
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish




Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4757 - Posted: 27 Mar 2009, 12:00:35 UTC
Last modified: 27 Mar 2009, 12:01:43 UTC

It seems that the work units that I downloaded this morning have an incomplete nomenclature. They are missing the final _0 or _1 that indicates whether it is a first or second attempt.

Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4758 - Posted: 27 Mar 2009, 13:32:14 UTC - in response to Message 4757.  

It seems that the work units that I downloaded this morning have an incomplete nomenclature. They are missing the final _0 or _1 that indicates whether it is a first or second attempt.


Correction! They are correct on the task list but missing on the work details on the website

svincent
Joined: 4 Apr 08
Posts: 34
Credit: 51,768
RAC: 0
Message 4759 - Posted: 27 Mar 2009, 17:48:08 UTC
Last modified: 27 Mar 2009, 17:49:06 UTC

Workunit 1223308 gave a Validate Error on Mac: it claimed to generate 99 decoys from 99 attempts in 12 minutes. That seems unlikely.

The end of stderr output

Starting work on structure: _1UFBA_5_00097
Starting work on structure: _1UFBA_5_00098
Starting work on structure: _1UFBA_5_00099
======================================================
DONE :: 1 starting structures 782.2 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
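
As a quick sanity check on those numbers, 782.2 CPU seconds spread over 99 decoys works out to roughly 7.9 seconds per decoy. The sketch below pulls the figures out of the two summary lines and does the division; the function name `decoy_rate` is made up for illustration, and it assumes the summary lines keep the wording shown above.

```python
import re


def decoy_rate(stderr_text):
    """Return (cpu_seconds, decoys, attempts, seconds_per_decoy), or None.

    Reads the "DONE ::" and "This process generated" summary lines.
    """
    done = re.search(
        r"DONE ::\s+\d+ starting structures\s+([\d.]+) cpu seconds", stderr_text
    )
    made = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    if not done or not made:
        return None
    cpu_seconds = float(done.group(1))
    decoys, attempts = int(made.group(1)), int(made.group(2))
    # Guard against a run that reports zero decoys.
    per_decoy = cpu_seconds / decoys if decoys else float("inf")
    return cpu_seconds, decoys, attempts, per_decoy
```

Comparing the per-decoy time with other results from the same batch would show whether this run is really an outlier.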


svincent
Joined: 4 Apr 08
Posts: 34
Credit: 51,768
RAC: 0
Message 4760 - Posted: 27 Mar 2009, 18:01:50 UTC

Another unzipping issue with workunit 1223005 on Mac

Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip
Unpacking WU data ...
Unpacking data: ../../projects/ralph.bakerlab.org/frb_0_8_el_chosen.foldcst_chunk_general_cf.t325_.mtyka.boinc_files.zip
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...

ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 245
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Tonno
Joined: 23 Nov 06
Posts: 16
Credit: 49,841
RAC: 0
Message 4761 - Posted: 27 Mar 2009, 21:42:15 UTC - in response to Message 4760.  
Last modified: 27 Mar 2009, 21:44:23 UTC

Error after a few seconds on Windows XP.
The other WU also gives an error.
It is the same error already posted for Mac:
ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz

Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4762 - Posted: 28 Mar 2009, 9:38:55 UTC

Several incidents of the error reported by svincent below,

ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 245

Task IDs:
1378040
1378221
1384682
1379800

I noted on at least two of them that the other wingman also had the task fail ... configuration issue?

Another error in this latest batch is:

ERROR: aFrame->nr_frags()
ERROR:: Exit from: ..\..\src\core\fragment\FragSet.cc line: 168

Task ID: 1376838

The only good news I suppose is that the failures happen almost right away...

Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4763 - Posted: 28 Mar 2009, 19:32:58 UTC

compute error with:

1390773
1390766

both with message:

ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz
ERROR:: Exit from: ..\..\src\core\fragment\FragmentIO.cc line: 245
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Looks like a similar fault to some already posted.

Tonno
Joined: 23 Nov 06
Posts: 16
Credit: 49,841
RAC: 0
Message 4764 - Posted: 29 Mar 2009, 9:19:07 UTC - in response to Message 4763.  

Some WUs take longer to complete than the default runtime.
I'm not sure, but it seems they are all frb_1_8_template_enriched_hb_t286.
For example, this one: link



©2021 University of Washington
http://www.bakerlab.org