| Author | Message |
|
|
|
Please post issues and bugs here. We are particularly interested in excessive disk usage and memory errors. We do expect some jobs to use up to 600-700MB of memory, and we'll submit these to higher-memory clients. We are also interested in a possible deadlock between the main application and the graphics app, where CPU usage drops to zero for both.
____________
|
|
|
|
|
|
Had one blow up on a sin/cos range error.
=Mike
____________
Don't believe everything you think. |
|
|
|
|
|
Thanks for the info. That's a known issue with that type of job.
____________
|
|
|
|
|
|
One where BOINC thinks the workunit is still running, but it's using no CPU time at all now:
http://ralph.bakerlab.org/workunit.php?wuid=1802588
Elapsed 07:01:04
48.46% progress and no longer changing
To completion 06:27:09
I normally don't have the graphics portion showing, but when I asked for it, it came up solid black.
Anything special I need to do to send back useful information on why? |
|
|
|
|
|
A few more details:
The workunit not using CPU time had a 530 MB maximum working set size.
Was running in 32-bit mode. Any plans to offer a 64-bit version of this application, even if its main advantage is to help computers like mine that seem to have a limit of around 4 GB on the maximum amount of memory that can be assigned to the entire set of 32-bit programs (BOINC or not) that are in memory at once?
More memory is installed, but seems useful mainly for 64-bit programs.
I haven't found a task name for the graphics app. What should I be looking for?
My other computer also has a 3.14 workunit, running in high priority mode but at least still showing an increasing progress. |
|
|
|
|
|
I've now found something that might be the graphics application:
Minirosetta Beta 3.14 - Windows Internet Explorer
Listed under Applications under Windows Task Manager, not under Processes, and therefore shown without any task name.
Have not found any way to show the resource usage of anything listed only as an application.
Total disk usage by all programs about 1 MB per minute, and mainly by system programs.
Total network usage about 1 MB per minute, mainly by boincmgr.exe and boinc.exe.
BOINC 6.10.58
64-bit Vista SP2, with almost all updates offered except Internet Explorer 9
My other computer has already returned its 3.14 workunit hours sooner than its previous estimated time to completion; already marked as a success. Same versions of BOINC and Windows. |
|
|
|
|
|
I've now identified:
Minirosetta Beta 3.14 - Windows Internet Explorer
It was the browser window under which I entered the last few messages.
CPU time at last checkpoint of the faulty workunit: 03:33:00
CPU time for the workunit: 03:33:15
Could this indicate a problem with resuming normal operation after checkpoints? I've forgotten just which BOINC project has often been showing workunits stopping any use of CPU time about that soon after a checkpoint lately. Would a separate thread used mainly for checking for such conditions be useful?
I've added up the memory currently reported as in use by 32-bit programs. About 1.7 GB total, so I don't expect any problem from that.
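The separate-thread idea asked about above can be sketched roughly as follows. This is only an illustration in Python, not how minirosetta or the BOINC client implement their watchdog; `StallWatchdog`, `get_cpu_seconds`, and `on_stall` are hypothetical names. The thread samples the worker's CPU time at a fixed wall-clock interval and reports a stall once the value has not advanced for a chosen wall-clock duration.

```python
import threading
import time


class StallWatchdog(threading.Thread):
    """Hypothetical watchdog: flags a worker whose CPU time stops advancing."""

    def __init__(self, get_cpu_seconds, on_stall, stall_after=600.0, poll=10.0):
        super().__init__(daemon=True)
        self.get_cpu_seconds = get_cpu_seconds  # callable returning CPU seconds used so far
        self.on_stall = on_stall                # callable invoked once when a stall is seen
        self.stall_after = stall_after          # wall-clock seconds allowed without progress
        self.poll = poll                        # wall-clock seconds between samples
        self._stop_event = threading.Event()

    def stop(self):
        """Ask the watchdog to exit at its next wake-up."""
        self._stop_event.set()

    def run(self):
        last_cpu = self.get_cpu_seconds()
        last_change = time.monotonic()
        # Event.wait() returns False on timeout and True once stop() was called.
        while not self._stop_event.wait(self.poll):
            cpu = self.get_cpu_seconds()
            now = time.monotonic()
            if cpu > last_cpu:
                # The worker made progress; restart the stall timer.
                last_cpu, last_change = cpu, now
            elif now - last_change >= self.stall_after:
                self.on_stall(cpu)
                return
```

The key design choice is that the stall timer runs on wall-clock time while progress is measured in CPU time, so a worker frozen at 03:33:15 of CPU time would still trip the check as elapsed time keeps growing.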
|
|
|
|
|
|
I decided to inspect the list of files in the slot for the failed workunit; it appears that the last file modified there was about 6 hours ago.
I also inspected the file lists under minirosetta-database and found that the sections for metal ions do not appear to list aluminum, even though it is connected to the brain damage in one of the later stages of Alzheimer's, or copper, even though the human brain's natural defense against Alzheimer's uses a copper-binding protein. I assume that is not important for this workunit, but how important is it for Rosetta@Home workunits aimed at Alzheimer's? |
|
|
|
|
|
Still more:
I clicked on the workunit, then Show graphics. Another window opened, all black inside. I clicked on the X to close that window and got a Windows error message for minirosetta_graphics_3.13_windows_x86_64.exe. The details are too long to copy, but I used the snipping tool to capture pictures of it.
If those details would be useful, how do I send the pictures?
Windows Task Manager does not list any program with that name among the programs now running or suspended, and did not when I started this series of messages. |
|
|
|
|
|
robertmiles,
Sounds like it might be a deadlock issue. You can manually kill the minirosetta process. We'll look into this further. Let us know if it happens again.
____________
|
|
|
|
|
|
Thanks for replying. |
|
|
|
|
|
All of these errored out after a few seconds on Win7:
2056476
2056475
2056474
2056471
2056465
ERROR: unrecognized aa LIG
ERROR:: Exit from: ..\..\..\src\core\io\pdb\file_data.cc line: 641
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish |
|
|
|
|
|
Failing on Mac also. Slightly different error message
ERROR: Cannot open PDB file "2p9hA_suc_0001.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Task 2056185
|
|
|
|
|
ERROR: unrecognized aa LIG
Sorry about that - there was a file missing from the input files. It should be corrected in newer submissions.
ERROR: Cannot open PDB file "2p9hA_suc_0001.pdb"
A different input file issue - also should be corrected with newer submissions.
--
(I will double check my input files before submitting.
I will double check my input files before submitting.
I will double check my input files before submitting. ...) |
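A pre-submission check along these lines could be automated. The sketch below assumes the job's inputs are packed into a single zip archive and that the submitter keeps a list of the file names the job needs; both the function name `missing_inputs` and that workflow are assumptions for illustration, not the project's actual submission tooling.

```python
import zipfile


def missing_inputs(zip_path, required_names):
    """Return the required file names that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        present = set(archive.namelist())
    # Preserve the caller's order so the report is easy to read.
    return [name for name in required_names if name not in present]
```

An empty result means every listed file is in the archive; anything else names the files to add before submitting.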
|
|
|
|
(I will double check my input files before submitting.
I will double check my input files before submitting.
I will double check my input files before submitting. ...)
:-) |
|
|
|
|
|
Had this error on 3 of my last few work units
ERROR: unrecognized aa LIG
ERROR:: Exit from: src/core/io/pdb/file_data.cc line: 641
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
See 2056632
2056714
2057573
Also had the following error on another 2 work units
ERROR: Cannot open PDB file "2p9hA_suc_0001.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
See 2057602
2057618
Conan
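The same failure is reported in this thread with Windows-style paths (`..\..\..\src\...`) and Unix-style paths (`src/...`). A small sketch, assuming the `ERROR:: Exit from: <path> line: <n>` layout seen in the logs above, shows how reports could be reduced to one comparable signature; `exit_signature` is a hypothetical helper name.

```python
import re

# Matches lines such as: ERROR:: Exit from: src/core/io/pdb/file_data.cc line: 641
EXIT_RE = re.compile(r"ERROR:: Exit from:\s*(\S+)\s+line:\s*(\d+)")


def exit_signature(stderr_text):
    """Return (source_path, line_number) from the first 'Exit from' line, or None."""
    match = EXIT_RE.search(stderr_text)
    if match is None:
        return None
    # Use forward slashes everywhere so both platforms look alike.
    path = match.group(1).replace("\\", "/")
    # Drop leading "../" segments that only the Windows build prints.
    while path.startswith("../"):
        path = path[3:]
    return path, int(match.group(2))
```

With this, the Windows and Mac reports of the `file_data.cc` line 641 failure map to the same tuple and can be counted together.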
____________
 |
|
|
|
|
|
2058828
ERROR: ERROR: FragmentIO: could not open file frags_w_cs_wt_200.11mers
ERROR:: Exit from: ..\..\..\src\core\fragment\FragmentIO.cc line: 230
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish |
|
|
|
|
|
Had two errors with the following error code
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\..\src\core\scoring\rms_util.cc line: 524
BOINC:: Error reading and gzipping output datafile: default.out
On 2078013
and 2078108
Both failed for the resend as well.
Conan
____________
 |
|
|
|
|
|
Same error here on 2077145; my wingman's unit died also.
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\..\src\core\scoring\rms_util.cc line: 524
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
|
|
|
|
|
|
Anyone having watchdog problems with the cleft.cyca.CYCA... units? I have three all gone past the 12hr target point and bouncing between 9:59 and 10:00 minutes remaining. Longest one is at about 13 hrs 25 mins. Going to let them run this morning to see if they finish on their own.
edit: morning eyes, time for a shower, changed 'deft' to 'cleft' |
|
|
|
|
|
Here is another error
ERROR: ERROR: FragmentIO: could not open file frags_w_cs_wt_200.11mers
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 230
BOINC:: Error reading and gzipping output datafile: default.out
On 2083601
Conan
____________
 |
|
|
|
|
|
2078238
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\..\src\core\scoring\rms_util.cc line: 524
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish |
|
|
|
|
|
Work Units 2081438 and 2081487
Both failed with "Maximum elapsed time exceeded"
Conan
____________
 |
|
|
|
|
|
2092449
ERROR: ERROR: FragmentIO: could not open file frags_w_cs_wt_200.11mers
ERROR:: Exit from: ..\..\..\src\core\fragment\FragmentIO.cc line: 230
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish |
|
|
|
|
|
33 failures (complete batch) this morning all failing with 'Client Error'
http://ralph.bakerlab.org/results.php?userid=527
Example error:-
<core_client_version>6.12.33</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
[2011- 8- 2 10:28:58:] :: BOINC:: Initializing ... ok.
[2011- 8- 2 10:28:58:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev42272.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Tag::read - parse error, printing backtrace.
Tag::read - parse error - file:istream line:5 column:1 - </SFXN5>
Tag::read - parse error - file:istream line:5 column:1 - ^
Tag::read - parse error - file:istream line:6 column:1 - </SCOREFXNS>
Tag::read - parse error - file:istream line:6 column:1 - ^
Tag::read - parse error - file:istream line:9 column:1 - </FILTERS>
Tag::read - parse error - file:istream line:9 column:1 - ^
Tag::read - parse error - file:istream line:13 column:1 - </TASKOPERATIONS>
Tag::read - parse error - file:istream line:13 column:1 - ^
Tag::read - parse error - file:istream line:15 column:1 - <FlxbbDesign name=flxbb ncycles=3 sfxn_design=SFXN5 sfxn_relax=SFXN5 SFXN5 clear_all_residues=1 task_operations=limitchi2,layer_allclear_all_residues=0 blueprint="master.blueprint" constraints_NtoC=1.0 />
Tag::read - parse error - file:istream line:15 column:1 - ^
Tag::read - parse error - file:istream line:14 column:1 - <MOVERS>
Tag::read - parse error - file:istream line:14 column:1 - ^
Tag::read - parse error - file:istream line:1 column:1 - <dock_design>
Tag::read - parse error - file:istream line:1 column:1 - ^
ERROR: false
ERROR:: Exit from: ..\..\..\src\utility\tag\Tag.cc line: 387
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
</stderr_txt>
]]>
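Reading the backtrace above, the tag on line 15 of the script contains a token `SFXN5` with no `=`, and a value `limitchi2,layer_allclear_all_residues=0` that may be two attributes run together. Whether either of these is what the Rosetta tag parser rejected is an assumption; the log only says that parsing failed. The sketch below is a standalone Python check, unrelated to Rosetta's own `Tag.cc`, that flags such tokens; `suspicious_attributes` is a hypothetical name.

```python
import shlex


def suspicious_attributes(tag_text):
    """Return (token, reason) pairs for attribute tokens that look malformed."""
    # Strip the angle brackets and any self-closing slash.
    inner = tag_text.strip().lstrip("<").rstrip(">").rstrip("/").strip()
    # shlex keeps quoted values together and removes the quotes.
    tokens = shlex.split(inner)
    findings = []
    for token in tokens[1:]:  # tokens[0] is the tag name
        if "=" not in token:
            findings.append((token, "no '=' (bare token)"))
        elif token.count("=") > 1:
            # Heuristic only: a quoted value that contains '=' is also flagged.
            findings.append((token, "more than one '=' (missing space?)"))
    return findings
```

Run on the `FlxbbDesign` line from the log, it reports the bare `SFXN5` token and the `task_operations=...` token with two `=` signs.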
____________
  |
|
|
|
|
33 failures (complete batch) this morning all failing with 'Client Error'
ERROR: false
ERROR:: Exit from: ..\..\..\src\utility\tag\Tag.cc line: 387
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Same here after a few seconds
2111200
and 30 other WUs... |
|
|
|
|
|
Yes, same error here as well (although I did a few work units that ran OK).
Tag::read - parse error, printing backtrace.
Tag::read - parse error - file:istream line:5 column:1 - </SFXN5>
Tag::read - parse error - file:istream line:5 column:1 - ^
Tag::read - parse error - file:istream line:6 column:1 - </SCOREFXNS>
Tag::read - parse error - file:istream line:6 column:1 - ^
Tag::read - parse error - file:istream line:9 column:1 - </FILTERS>
Tag::read - parse error - file:istream line:9 column:1 - ^
Tag::read - parse error - file:istream line:13 column:1 - </TASKOPERATIONS>
Tag::read - parse error - file:istream line:13 column:1 - ^
Tag::read - parse error - file:istream line:15 column:1 - <FlxbbDesign name=flxbb ncycles=3 sfxn_design=SFXN5 sfxn_relax=SFXN5 SFXN5 clear_all_residues=1 task_operations=limitchi2,layer_allclear_all_residues=0 blueprint="master.blueprint" constraints_NtoC=1.0 />
Tag::read - parse error - file:istream line:15 column:1 - ^
Tag::read - parse error - file:istream line:14 column:1 - <MOVERS>
Tag::read - parse error - file:istream line:14 column:1 - ^
Tag::read - parse error - file:istream line:1 column:1 - <dock_design>
Tag::read - parse error - file:istream line:1 column:1 - ^
ERROR: false
ERROR:: Exit from: ..\..\..\src\utility\tag\Tag.cc line: 387
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
On Work units
2112936
2112877
2112918
2112933
2112861
2112941
2112898
2112851
2112893
2112934
2112900
2112847
2112940
Conan
____________
 |
|
|
|
|
|
Task 2136909 (3stub_patch_CYCA_1sq2_ProteinInterfaceDesign_11Aug2011_15516_1_0) failed on Mac with a Compute Error.
SIGPIPE: write on a pipe with no reader
0 0x00ba68b9 SIGPIPE: write on a pipe with no reader
etc. |
|
|
|
|
|
2171363
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00996545 read attempt to address 0x00000000
Engaging BOINC Windows Runtime Debugger...
********************
BOINC Windows Runtime Debugger Version 6.5.0
Dump Timestamp : 08/26/11 09:15:30
Install Directory : C:\Programmi\BOINC\
Data Directory : C:\Documents and Settings\All Users\Dati applicazioni\BOINC
Project Symstore : http://boinc.bakerlab.org/rosetta/symstore
Loaded Library : C:\Programmi\BOINC\\dbghelp.dll
Loaded Library : C:\Programmi\BOINC\\symsrv.dll
Loaded Library : C:\Programmi\BOINC\\srcsrv.dll
LoadLibraryA( C:\Programmi\BOINC\\version.dll ): GetLastError = 126
Loaded Library : version.dll
</stderr_txt>
]]> |
|
|
|
|
|
Please check this one out. It ran for 21.5 ksec and looks like it finished fine after 901 decoys, but someone else had already sent in a bad one, so mine was marked as a 'validate error'?
2178636 |
|
|
|
|
|
All WUs in error:
2179787
2179786
2179786
etc, etc
ERROR: in::file::boinc_wu_zip 1uaoA.zip does not exist!
ERROR:: Exit from: ..\..\..\src\apps\public\boinc\minirosetta.cc line: 167
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish |
|
|
|
|
|
All 80 abrelax ended with client error / compute error
Task ID 2179800
Name 2WWEA_abrelax_15647_3_0
Workunit 1921159
Created 12 Sep 2011 5:49:30 UTC
Sent 12 Sep 2011 5:52:07 UTC
Received 12 Sep 2011 14:18:59 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 20656
Report deadline 16 Sep 2011 5:52:07 UTC
CPU time 2.059213
stderr out <core_client_version>6.12.33</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
[2011- 9-12 15:49:49:] :: BOINC:: Initializing ... ok.
[2011- 9-12 15:49:49:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev42272.zip
Unpacking WU data ...
ERROR: in::file::boinc_wu_zip 1uaoA.zip does not exist!
ERROR:: Exit from: ..\..\..\src\apps\public\boinc\minirosetta.cc line: 167
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
</stderr_txt>
]]>
Validate state Invalid
|
|
|
|
|
|
Had this one fail
Result 2185423
ERROR: in::file::boinc_wu_zip 1uaoA.zip does not exist!
ERROR:: Exit from: ..\..\..\src\apps\public\boinc\minirosetta.cc line: 167
BOINC:: Error reading and gzipping output datafile: default.out
Conan
____________
 |
|
|
|
|
|
Why are we running so many 3.14 tasks? The latest version is now 3.17. |
|
|
|
|
|
Could be trying to track down a bug with 3.14 that I saw several times over on Rosetta@Home. If so, something to watch for: Shortly after a checkpoint, the workunit stops using any CPU time at all, WITHOUT telling BOINC it has encountered a problem so that some other workunit can be started instead. If so, the time limit checking cannot run, so the workunit can easily sit there looking like it's running, but not actually doing anything, for many times as long as you've selected for workunits to run. |
|
|