Message boards : RALPH@home bug list : minirosetta 1.58
Author | Message |
---|---|
[Toscana]SickBoy88 Send message Joined: 27 Jan 09 Posts: 3 Credit: 17,581 RAC: 0 |
This WU gave me a compute error: https://ralph.bakerlab.org/result.php?resultid=1371832 |
svincent Send message Joined: 4 Apr 08 Posts: 34 Credit: 51,768 RAC: 0 |
Just had a bunch of WUs fail at the start (Mac OS X 10.4.11), all in the same way; sample output below. 1213821 1213853 1213852 1213701 1213693 1213692 1213677 ERROR: [ERROR] Unable to open constraints file: t297_.cst.best.multi ERROR:: Exit from: src/core/scoring/constraints/ConstraintIO.cc line: 330 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish |
Pentti Kiesi Send message Joined: 2 Jan 09 Posts: 2 Credit: 111,437 RAC: 0 |
https://ralph.bakerlab.org/workunit.php?wuid=1185064 One WU seems unwilling to upload at all. Others before and after it are uploading correctly: 13.3.2009 21:38:40|ralph@home|[error] Error reported by file upload server: [loopbuild_mamaln_full_hb_t303__IGNORE_THE_REST_1te2_99_8360_1_0_0] locked by file_upload_handler PID=-1 As a reminder: it is still hanging in my upload queue. 12h CPU time on Quad 2.66GHz. Should I finally cancel it, or are you still interested in it? |
robertmiles Send message Joined: 13 Jan 09 Posts: 103 Credit: 331,865 RAC: 0 |
Another lockfile problem: https://ralph.bakerlab.org/result.php?resultid=1374201 Running at 95% CPU, with BOINC 6.2.28 under Vista SP1 with graphics disabled. Will try resetting Ralph@home soon. |
robertmiles Send message Joined: 13 Jan 09 Posts: 103 Credit: 331,865 RAC: 0 |
Tried resetting Ralph@home, got these error messages: 3/22/2009 9:04:01 AM|ralph@home|Resetting project 3/22/2009 9:04:06 AM|ralph@home|[error] Couldn't delete file projects/ralph.bakerlab.org/minirosetta_1.58_windows_intelx86.exe Similar problem with Rosetta@home, except with the 1.54 executable. I use BOINC 6.2.28 under Vista SP1, with graphics not enabled. An attempt to manually delete this file failed when I couldn't find the directory containing it, or even anything under the BOINC directory specific to the Ralph@home project. I intend to leave both Ralph@home and Rosetta@home on no new tasks until I get some usable advice on how to complete the resets. |
Evan Send message Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
|
robertmiles Send message Joined: 13 Jan 09 Posts: 103 Credit: 331,865 RAC: 0 |
More on the lockfile problem: When this problem shows up, expect a few subdirectories of BOINC\slots to have three files each, unrelated to any workunit in progress and including the lockfile. I suspect that Rosetta@home and Ralph@home workunits are unable to run successfully if assigned to any of these slots, even if workunits from other BOINC projects can. Attempts to manually delete these files also fail. However, the following may have helped for me: (1) Set Rosetta@home and/or Ralph@home to no new tasks and wait until all tasks for either of them complete. (2) Do an update for either that has tasks not reported yet. (3) Suspend all workunits and network activity. (4) Shut down the BOINC client, then find process boinc.exe and kill it. (5) Reboot. (6) If these subdirectories of BOINC\slots have disappeared, enable network activity and do another reset on Rosetta@home and/or Ralph@home. (7) If these resets complete without error messages, it's safe to resume activity on any other BOINC projects, then allow new tasks on Rosetta@home and/or Ralph@home. However, I haven't completed any workunits for either Rosetta@home or Ralph@home since doing this, so it will be at least tomorrow before I can check whether this actually took care of the lockfile problem, at least temporarily. I wouldn't be surprised if this procedure includes some unnecessary steps, but I wanted to report it before making any effort to find out. |
robertmiles Send message Joined: 13 Jan 09 Posts: 103 Credit: 331,865 RAC: 0 |
Didn't help enough: the first Rosetta@home 1.54 workunit completed after the above procedure had the lockfile problem again, but two more that started since then and are not yet complete haven't had the problem so far. Suggestion: Modify minirosetta so that it checks for a lockfile as it starts up, preferably before trying to create one, and if this first check finds a lockfile, reduce the number of times minirosetta is allowed to restart before it is able to write the first checkpoint. Suggestion: Modify minirosetta so that it reports which slot it was run under, if it is able to do this, since the problem looks likely to repeat for any minirosetta workunit run in a slot where a previous workunit's lockfile was not erased when the previous workunit completed and was reported. Suggestion: Check the procedure used for failed workunits to see if it leaves a lockfile behind after abandoning efforts to restart the workunit. Suggestion: Check what program is supposed to delete the lockfiles for workunits that have been completed and reported. Suggestion: Check if BOINC allows any way to request that a workunit be restarted, but in a different slot. Suggestion: If BOINC is supposed to clean up the slots after workunits complete and are reported, check if BOINC 6.2.28 is known to have any problems with doing this. I haven't had any 1.58 workunits since trying the procedure, so I don't know whether these continued problems also apply to 1.58. I often let BOINC run for a few days between reboots. I still use BOINC 6.2.28 under Vista SP1, with 95% CPU time. |
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Robert, Thanks for all the work on the lock file ... I hope we can figure out what is going on with this ... On my part I have a Validate error though the task seems to have failed with another error that did not get reported as an error: ERROR: dis==0 in pairtermderiv! ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338 What does this mean? Beats me ... |
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
|
svincent Send message Joined: 4 Apr 08 Posts: 34 Credit: 51,768 RAC: 0 |
More problems on Mac OS X 10.4.11. WUs 1376869, 1376870 and 1376871 failed; see below. ERROR: Conformation: fold_tree nres should match conformation nres. conformation nres: 137 fold_tree nres: 156589050 ERROR:: Exit from: src/core/conformation/Conformation.cc line: 224 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> |
Chu Volunteer moderator Project developer Project scientist Send message Joined: 26 Sep 06 Posts: 61 Credit: 12,545 RAC: 0 |
Thanks for reporting. Some input and output files were not compressed properly for the WUs ending with "BOINC_MPZN_with_zinc_loop_modeling", which caused premature failures/exits. Sorry about that. (Quoted post: "More problems on Mac OS X 10.4.11") |
Evan Send message Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
It seems that the work units that I downloaded this morning have an incomplete nomenclature. They are missing the final _0 or _1 that indicates whether it is a first or second attempt. |
Evan Send message Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
"It seems that the work units that I downloaded this morning have an incomplete nomenclature. They are missing the final _0 or _1 that indicates whether it is a first or second attempt." Correction! They are correct on the task list but missing from the work details on the website. |
svincent Send message Joined: 4 Apr 08 Posts: 34 Credit: 51,768 RAC: 0 |
This workunit 1223308 gave a Validate Error on Mac: it claimed to generate 99 decoys from 99 attempts in 12 minutes. Seems unlikely. The end of the stderr output: Starting work on structure: _1UFBA_5_00097 Starting work on structure: _1UFBA_5_00098 Starting work on structure: _1UFBA_5_00099 ====================================================== DONE :: 1 starting structures 782.2 cpu seconds This process generated 99 decoys from 99 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> |
svincent Send message Joined: 4 Apr 08 Posts: 34 Credit: 51,768 RAC: 0 |
Another unzipping issue with workunit 1223005 on Mac: Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip Unpacking WU data ... Unpacking data: ../../projects/ralph.bakerlab.org/frb_0_8_el_chosen.foldcst_chunk_general_cf.t325_.mtyka.boinc_files.zip Setting database description ... Setting up checkpointing ... Setting up folding (abrelax) ... ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 245 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> |
Tonno Send message Joined: 23 Nov 06 Posts: 16 Credit: 49,841 RAC: 0 |
|
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Several incidents of the error reported by svincent: ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 245 Task IDs: 1378040 1378221 1384682 1379800 I noted on at least two of them that the other wingman also had the task fail ... configuration issue? Another error in this latest batch is: ERROR: aFrame->nr_frags() ERROR:: Exit from: ..\..\src\core\fragment\FragSet.cc line: 168 Task ID: 1376838 The only good news, I suppose, is that the failures happen almost right away... |
Evan Send message Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
Compute error with 1390773 and 1390766, both with the message: ERROR: ERROR: FragmentIO: could not open file aa9mer.1_3.gz ERROR:: Exit from: ..\..\src\core\fragment\FragmentIO.cc line: 245 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish Looks like a similar fault to some already posted. |
Tonno Send message Joined: 23 Nov 06 Posts: 16 Credit: 49,841 RAC: 0 |
Some WUs take longer to complete than the default runtime. I'm not sure, but it seems they are all frb_1_8_template_enriched_hb_t286. For example, this one: link |
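
Going back to robertmiles's first two suggestions in this thread (check for a leftover lockfile at startup, before trying to create one, and report which slot the task ran in), here is a minimal sketch of what such a check might look like. It is written in Python for brevity and is not minirosetta's own code. The lockfile name `boinc_lockfile`, the two restart budgets, and the function names `check_slot` and `startup_report` are assumptions made for illustration; nothing in the thread confirms them.

```python
import os
import tempfile

# Assumption: the lockfile left in a slot directory has this name.
LOCKFILE_NAME = "boinc_lockfile"
# Hypothetical restart budgets; the thread gives no actual numbers.
DEFAULT_MAX_RESTARTS = 5
REDUCED_MAX_RESTARTS = 1


def check_slot(slot_dir):
    """Return (stale_lock_found, max_restarts) for a slot directory.

    The check runs before any attempt to create a new lockfile, so a
    file found here must have been left behind by an earlier task.
    """
    lock_path = os.path.join(slot_dir, LOCKFILE_NAME)
    stale = os.path.exists(lock_path)
    max_restarts = REDUCED_MAX_RESTARTS if stale else DEFAULT_MAX_RESTARTS
    return stale, max_restarts


def startup_report(slot_dir):
    """Build a one-line report that names the slot the task is running in."""
    stale, max_restarts = check_slot(slot_dir)
    slot_name = os.path.basename(os.path.normpath(slot_dir))
    state = "stale lockfile found" if stale else "no lockfile"
    return f"slot={slot_name} {state} max_restarts={max_restarts}"


if __name__ == "__main__":
    # Demo in a throwaway directory standing in for a BOINC slot.
    with tempfile.TemporaryDirectory() as root:
        slot = os.path.join(root, "3")
        os.mkdir(slot)
        print(startup_report(slot))
        # Simulate a lockfile left behind by a previous task.
        open(os.path.join(slot, LOCKFILE_NAME), "w").close()
        print(startup_report(slot))
```

Writing the report line to stderr would put the slot name into the task's result page, which is where the failures in this thread were being read. Whether the real application can see its slot path this way is not established here.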
©2024 University of Washington
http://www.bakerlab.org