Posts by Thomas Leibold

1) Message boards : RALPH@home bug list : minirosetta v1.26 bug thread (Message 4099)
Posted 1 Jun 2008 by Thomas Leibold
Post:
CASP8_t0391_BOINC_LOOP_IGNORE_THE_REST-S25-10-S3-5--1a19A-_4062_1_1
Workunit http://ralph.bakerlab.org/workunit.php?wuid=903281
Crashed after about 10 seconds and has been stuck like this ever since. It is now overdue and boinc manager reports the workunit as "Running, high priority" (but not consuming any cpu time). I'm going to abort it now.

The same workunit also failed for two other users, but apparently didn't get stuck on their systems.

I didn't see anything unusual in the stdout.txt file, lots of 'inflating ...' messages followed by the line 'Starting watchdog ...' and then silence.

Contents of stderr.txt (note that crash occurred prior to reporting the cpu runtime preference):
::::::::::::::
stderr.txt
::::::::::::::
WARNING: Override of option -out:nstruct sets a different value
SIGABRT: abort called
Stack trace (19 frames):
[0x87e0bcb]
[0x880b110]
[0xffffe420]
[0x886dcd4]
[0x8883553]
[0x88885b9]
[0x8888897]
[0x8859221]
[0x885acc9]
[0x846d7de]
[0x81e0b39]
[0x846da2d]
[0x847134b]
[0x81d858e]
[0x813ba92]
[0x80803c4]
[0x804bd84]
[0x8866bdc]
[0x8048111]

Exiting...
# cpu_run_time_pref: 14400
2) Message boards : RALPH@home bug list : minirosetta v1.18 bug thread (Message 3987)
Posted 4 May 2008 by Thomas Leibold
Post:
How many (or few) workunits were sent out for version 1.18 ? I have not been able to get a single 1.18 workunit. The last work from Ralph was a 1.17 workunit 2 days ago.

2008-05-04 14:28:31 [ralph@home] Scheduler RPC succeeded [server version 509]
2008-05-04 14:28:31 [ralph@home] Deferring communication for 3 min 17 sec
2008-05-04 14:28:31 [ralph@home] Reason: no work from project
3) Message boards : RALPH@home bug list : minirosetta 1.15 bug thread (Message 3974)
Posted 27 Apr 2008 by Thomas Leibold
Post:
About half of the 1.15 workunits failed with the already reported "smooth_etables" error (the rest were successful).

Of course since 1.15 went to Rosetta the same day it went to Ralph, there may not be any value in reporting these issues :(
4) Message boards : RALPH@home bug list : minirosetta 1.13 bug thread (Message 3944)
Posted 22 Apr 2008 by Thomas Leibold
Post:
Running Linux, Boinc 5.10.45 this WU failed.

Same error here: Workunit 826331
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

ERROR: Cannot open file 1vie.pdb
ERROR:: Exit from: src/core/io/pdb/pose_io.cc line: 157
called boinc_finish

</stderr_txt>
]]>

However there is clearly progress since most minirosetta 1.13 are now succeeding on my two Ralph Linux servers.

Are those minirosetta workunits supposed to have graphics ? The "Show Graphics" button stays grayed out.
5) Message boards : RALPH@home bug list : Bug Reports for Rosetta Mini Versions 1.+ (Message 3829)
Posted 14 Mar 2008 by Thomas Leibold
Post:
I had two 1.09 workunits fail with the message:
"<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
ERROR: Unique best command line context option match not found for -weights

</stderr_txt>"

Workunits 740191 and 740147.

This is on SuSE Linux (32bit) with Boinc 5.10.21.
6) Message boards : RALPH@home bug list : Bug Reports for Rosetta Mini Versions 1.+ (Message 3813)
Posted 8 Mar 2008 by Thomas Leibold
Post:
Have there been enough workunits for Mini Rosetta 1.09 distributed ? I never got any of those. After the last Mini Rosetta 1.08 workunit there was about a day of the usual 'no work from project' for Ralph and I'm now getting Rosetta Beta 5.95 workunits (which makes sense since this too is a new client).

Just wondering if Mini Rosetta 1.09 gets the testing it deserves.


Edit: Looking again, it seems 1.09 was released after 5.95 so perhaps workunits are still coming. Now I wonder why I thought 1.09 had been out a day earlier ?
7) Message boards : RALPH@home bug list : Bug Reports for Rosetta Mini Versions 1.+ (Message 3787)
Posted 23 Feb 2008 by Thomas Leibold
Post:
Linux x86 (SuSE 9.3), Boinc 5.10.21: While most workunits succeed I saw very similar errors for the following workunits:

674358 with mini-Rosetta 1.07
697535 with mini-Rosetta 1.08
698221 with mini-Rosetta 1.08

Stderr.txt shows: "Too many restarts with no progress. Keep application in memory while preempted." as well as exit code -161.
8) Message boards : RALPH@home bug list : Bug Reports for Rosetta Mini Versions 1.+ (Message 3686)
Posted 5 Feb 2008 by Thomas Leibold
Post:
Another bad batch of work units went out. Please just ignore these failures.


I'm assuming the bad batch were the Rosetta Mini 1.04 workunits ? All of the ones I got failed immediately. I only got a single 1.05 workunit and it was successful.

Currently my system is working on a 1.06 workunit and so far it seems to be running ok, with just a little oddity in the task display of Boinc Manager.
I have Ralph configured for 4 hour preferred runtime and the task display shows 2:37:51 cpu time, 63.244%, 7:09:26 to completion.
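To spell out why that display looks odd, here is a rough check in Python. It assumes the percentage is simply cpu time divided by the expected total, which is only my guess at how the figures relate and not something I have confirmed for Boinc Manager. The helper names are made up for illustration.

```python
def hms_to_seconds(text):
    """Convert 'H:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Format a number of seconds as 'H:MM:SS' (rounded to whole seconds)."""
    total = int(round(total))
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


def remaining_from_fraction(cpu_seconds, fraction_done):
    """Remaining time, ASSUMING fraction_done = cpu time / total time."""
    return cpu_seconds / fraction_done - cpu_seconds


cpu = hms_to_seconds("2:37:51")                      # cpu time shown in the task display
by_fraction = remaining_from_fraction(cpu, 0.63244)  # from the 63.244% figure
by_preference = 4 * 3600 - cpu                       # from the 4 hour preferred runtime

print("remaining by percentage:", seconds_to_hms(by_fraction))
print("remaining by preference:", seconds_to_hms(by_preference))
```

Both estimates come out at roughly an hour and a half or less (1:31:44 and 1:22:09), so the 7:09:26 shown does not match either the percentage or the runtime preference.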
9) Message boards : RALPH@home bug list : Bug reports for 5.90/5.91 (Message 3581)
Posted 28 Dec 2007 by Thomas Leibold
Post:


Actually 4 times its preference time. Observed and proven.


Correct, but probably not in the sense you meant it: it has been proven that the 5.90 workunits (with the defective Linux client) will NOT terminate by themselves regardless of how much they exceed the preferred run time!

While cleaning up the 5.90 mess I did find that this is apparently not entirely new either. On one server I found a stuck 5.81 workunit that had not only exceeded its preferred run time over ONE HUNDRED TIMES, but also exceeded the workunit deadline by a MONTH!

It would not surprise me if the failure to terminate those kinds of rogue workunits is in some way related to the longstanding issues regarding Ralph/Rosetta watchdog on Linux (e.g.: timing dependencies during watchdog initiated client shutdown, corrupting the memory allocation chain by freeing the same block of memory twice resulting in subsequent segmentation faults which sometimes crashes the watchdog and fails to terminate the client IIRC).
10) Message boards : RALPH@home bug list : Bug reports for 5.90/5.91 (Message 3573)
Posted 22 Dec 2007 by Thomas Leibold
Post:

Hello Thomas,
The ones that I had only stopped when I stopped Boinc Manager and then restarted it. The Work Units then stopped and uploaded their results, also giving time taken and other stats.

Hi Conan, it appears that restarting Boinc is indeed the only way to dislodge one of those stuck 5.90 workunits. Left unattended those 5.90 workunits will continue indefinitely! Unfortunately restarting Boinc is not a complete fix either: as soon as Boinc starts up again it will process the next 5.90 that it still has in its queue.

I did not have any of the ones I had error by doing this.

Out of around 250 stuck 5.90 workunits less than 10 have made it back to the Rosetta servers on their own (some with 299 and more decoys completed). I don't know why those finished when most others don't, but most of those that did finish failed in the Rosetta validator (presumably too little cpu time such as 0.0004 seconds reported to be credible ?). On the other hand two of them did get a large amount of credit for the high number of completed decoys.

The two 5.90 workunits on my home server (one Ralph, one Rosetta) both passed the Validator and did get huge amounts of credit when I restarted Boinc on that machine. More importantly the amount of cpu time reported back to Rosetta was correct for those two workunits! At least part of the Boinc/Rosetta system must be aware of the correct amount of time spent on the workunit and record it properly so that this amount of cpu time spent is available on restart.


Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time, I have yet to verify this rule or prove it works.

I wish that was true, but I have proof to the contrary. That Ralph 5.90 task was running 50 hours which is a lot more than 6 times the preference time of 4 hours.

I believe the only options available to clean up this mess for Linux users are:
Option A:
1.) Where possible, use Boinc Manager to abort all 5.90 workunits that are in "Ready to Start" state. This is necessary so that they will not become stuck too!
2.) Stop and restart the Boinc client. This should immediately send home all those 5.90 workunits that have been in progress and that continued to run beyond the allocated preference time. Those workunits should validate fine and will contribute to the Rosetta Science (the time was not wasted).
3.) Use Boinc Manager to verify that all remaining 5.90 tasks in the system are now in the "Ready to Report" state. Note: if you have a large Boinc cache and/or run multiple Boinc projects there is a possibility of a currently running 5.90 workunit not yet having exceeded the preferred runtime. In this case you will need to restart Boinc again when that workunit has exceeded its preferred run time.

Option B:
If you are unable to use the Boinc Manager gui (perhaps on headless servers in remote locations) you need to reset the Rosetta project. Within the BOINC directory run "./boinc_cmd --project http://boinc.bakerlab.org/rosetta reset". Note: this will remove all workunits and clients for the Rosetta project. This means all completed, but not yet reported work and all work in progress for this project will be lost! The boinc client will perform a fresh load of 5.91 workunits along with the 5.91 client as a result of the reset operation.
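For anyone who has to repeat Option B on several headless machines, here is a minimal Python sketch. It only wraps the reset command quoted above. The function names and the dry_run switch are my own invention, and the directory path is a placeholder you would have to change for your installation.

```python
import subprocess

ROSETTA_URL = "http://boinc.bakerlab.org/rosetta"


def build_reset_command(project_url=ROSETTA_URL):
    """Return the argument list for the reset command from Option B."""
    return ["./boinc_cmd", "--project", project_url, "reset"]


def reset_project(boinc_dir, project_url=ROSETTA_URL, dry_run=True):
    """Print the command (dry run) or run it from inside boinc_dir."""
    command = build_reset_command(project_url)
    if dry_run:
        print("would run in", boinc_dir, ":", " ".join(command))
        return None
    # Warning: the reset discards unreported and in-progress work for this project.
    return subprocess.run(command, cwd=boinc_dir, check=True)


# Hypothetical path; nothing is executed because dry_run defaults to True.
reset_project("/path/to/BOINC")
```

Keeping the dry run as the default seemed the safer choice, since the reset throws away completed but unreported work.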
11) Message boards : RALPH@home bug list : Bug reports for 5.90/5.91 (Message 3571)
Posted 22 Dec 2007 by Thomas Leibold
Post:
Rosetta Thread

I have problems with 5.90 in both Ralph and Rosetta (details posted in the Rosetta thread above). Besides the task display, it also appears that the Ralph task is running indefinitely (with 4 hour runtime configuration the workunit is already running over 24 hours).


Update: the Ralph 5.90 task is now running for 48.5 hours (over 2 days). Is there anything that will stop those 5.90 tasks from running indefinitely ?

What happens when the task/workunit reaches the Report Deadline (other than no longer getting credit for it) ? Will that stop those 5.90 workunits or will they still continue on ?

12) Message boards : RALPH@home bug list : Bug reports for 5.90/5.91 (Message 3559)
Posted 21 Dec 2007 by Thomas Leibold
Post:
Rosetta Thread

I have problems with 5.90 in both Ralph and Rosetta (details posted in the Rosetta thread above). Besides the task display, it also appears that the Ralph task is running indefinitely (with 4 hour runtime configuration the workunit is already running over 24 hours).
13) Message boards : RALPH@home bug list : Bug reports for 5.72-5.76 (Message 3310)
Posted 8 Aug 2007 by Thomas Leibold
Post:
Two that have failed with the same error
This one
And this one



> Same fault and again it is the "1vie" type Work Unit, they ALL fail,
600557


Just a 'me too' post :-)
Boinc 5.8.15, Ralph 5.73, Workunit 531614, Result 599889, OS Linux 32-bit.
14) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3216)
Posted 24 Jun 2007 by Thomas Leibold
Post:
Hi:

We're looking at these now..



I got some of those too. Here is what it looks like on the boinc console:

2007-06-23 17:05:39 [ralph@home] Computation for task 1FAB_BOINC_MFR_ABRELAX_2144_38_1 finished
2007-06-23 17:05:39 [ralph@home] Output file 1FAB_BOINC_MFR_ABRELAX_2144_38_1_0 for task 1FAB_BOINC_MFR_ABRELAX_2144_38_1 absent
2007-06-23 17:05:39 [rosetta@home] Resuming task BENCH_051207_ABRELAX_SAVE_ALL_OUT_-1a19A-_BARCODE_R55_filters_1804_1116_0 using rosetta version 568
2007-06-23 17:05:40 [ralph@home] Deferring communication for 2 hr 27 min 5 sec
2007-06-23 17:05:40 [ralph@home] Reason: Unrecoverable error for result 1FAB_BOINC_MFR_ABRELAX_2144_38_1 (<file_xfer_error>
<file_name>1FAB_BOINC_MFR_ABRELAX_2144_38_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
)
15) Message boards : RALPH@home bug list : not sure if this is a bug (Message 3178)
Posted 27 May 2007 by Thomas Leibold
Post:
I try to download new WU but everytime it tries to connect to server it says:

'27/05/2007 10:21:26|ralph@home|Scheduler RPC succeeded [server version 509]
27/05/2007 10:21:26|ralph@home|Deferring communication for 4 min 2 sec
27/05/2007 10:21:26|ralph@home|Reason: requested by project'

but on the homepage there are WU available to compute.


The messages you posted are normal after every connection to the scheduler: to keep any single user from overloading the scheduler with repeated requests, the scheduler's response indicates a minimum amount of time to wait before trying again.
The lines just before the ones you posted would show why the scheduler was contacted. This may include messages such as "requested by user" or "to report completed tasks" and "not requesting new work", in which case the scheduler won't send work even if there is some available. On the other hand, if work is available and the messages indicate something along the lines of "requesting XYZ seconds of new work", then you should get one or more new workunits after the completion of the scheduler request. The messages immediately following the ones you posted will show whether you did get work or say something along the lines of "deferring communication for ABC" (where ABC is different from the scheduler requested communications delay; this delay is calculated by your own boinc client) followed by "Reason: no work from project".
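If you would rather not read the message log by eye, the deferral and its reason can be pulled out with a few lines of Python. This is only a sketch: it assumes the pipe-separated format of the lines you posted (timestamp|project|message), and the function names are mine.

```python
import re

UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def parse_delay(text):
    """Turn a phrase like '2 hr 27 min 5 sec' into a number of seconds."""
    pairs = re.findall(r"(\d+) (hr|min|sec)", text)
    return sum(int(count) * UNIT_SECONDS[unit] for count, unit in pairs)


def summarize(lines):
    """Collect the last deferral delay and reason seen for each project."""
    summary = {}
    for line in lines:
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a 'timestamp|project|message' line
        _stamp, project, text = parts
        entry = summary.setdefault(project, {"delay": None, "reason": None})
        if text.startswith("Deferring communication for "):
            entry["delay"] = parse_delay(text)
        elif text.startswith("Reason: "):
            entry["reason"] = text[len("Reason: "):]
    return summary


log = [
    "27/05/2007 10:21:26|ralph@home|Scheduler RPC succeeded [server version 509]",
    "27/05/2007 10:21:26|ralph@home|Deferring communication for 4 min 2 sec",
    "27/05/2007 10:21:26|ralph@home|Reason: requested by project",
]
print(summarize(log))
```

For your three lines this gives a delay of 242 seconds with the reason "requested by project", which is the scheduler-requested wait and not a sign that work was refused.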

The Ralph home page shows 0 queued workunits meaning no work is available for assignment to clients. A look at the server status right now shows one single Ralph workunit available but I'm not sure what this discrepancy between stats means.

The number of 'in progress' workunits is not an indicator of workunits being available. Workunits are in progress after they have been assigned to clients.
16) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3174)
Posted 27 May 2007 by Thomas Leibold
Post:
Error on workunit 460745 and 463219:

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
trouble finding jump_templates_RNA_basepairs_v2.dat
ERROR:: Exit from: read_paths.cc line: 360

</stderr_txt>
]]>

Same error in both cases. Both workunits failed for other users as well.
17) Message boards : RALPH@home bug list : 64bit app just added. (Message 3118)
Posted 20 May 2007 by Thomas Leibold
Post:
This doesn't seem to be the case. the process isn't even running (when rosetta claims to be running, top doesn't show it. it shows all other tasks). I'm very confused now


With the 'keep application in memory' option on this is rare, but still seems to happen occasionally. I recently had one workunit in that state with the 32-bit linux client. What appears to be happening is that during an abnormal shutdown the communication between the processes for the task (you may have seen that each rosetta task uses 4 processes of which one does all the computation and therefore consumes most cpu time) gets in a state where they wait for each other indefinitely. Suspending and resuming the task in that situation does not help (except that Boinc will run another task while this stuck one is suspended). The rosetta watchdog timer doesn't help either since it is one of the processes involved in the communications deadlock. Boinc itself is unaware of the problem since the processes still exist (the fact that they don't actually consume cpu cycles isn't something the boinc client monitors, nor could it do that without introducing dependencies on the various project clients).

Stopping and restarting the entire boinc client with all project tasks will restart such a workunit at the last checkpoint before it encountered the problem. If the problem is repeatable the workunit will be processed until it once again reaches the trouble spot and will again stop consuming cpu time. In such a case the only remedy is to abort the workunit.
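Since Boinc does not watch for this, a stuck task can be spotted from outside by sampling the cpu time of its process a few minutes apart. The Python sketch below shows the idea for Linux. It is an illustration and not part of Boinc or Rosetta: the function names and the 0.5 second threshold are my own choices, and you would still have to find the pid of the computing process yourself (for example with top).

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Extract utime + stime (fields 14 and 15 of /proc/<pid>/stat) in seconds."""
    # The command name is in parentheses and may contain spaces,
    # so split after the last ')' before counting fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    return (utime + stime) / ticks_per_second


def read_cpu_seconds(pid):
    """Read the cpu time consumed so far by one process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return cpu_seconds_from_stat(handle.read(), os.sysconf("SC_CLK_TCK"))


def looks_stuck(samples, min_progress=0.5):
    """True if cpu time grew by less than min_progress seconds across the samples."""
    return len(samples) >= 2 and samples[-1] - samples[0] < min_progress


# Made-up samples taken a few minutes apart: the first task is stalled,
# the second one is still accumulating cpu time.
print(looks_stuck([9471.0, 9471.0, 9471.1]))
print(looks_stuck([9471.0, 9590.3, 9711.8]))
```

A task flagged this way while Boinc reports it as running is a candidate for the restart (and, if it sticks again at the same spot, the abort) described above.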
18) Message boards : RALPH@home bug list : bug report for version 5.64 (Message 3090)
Posted 11 May 2007 by Thomas Leibold
Post:
This error on workunit 448473 is almost identical to the earlier one posted by Dr. Who Fan:

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 2679311
Unable to find fullatom number for constraint atom:
Atom Type5Residue type 6 Type CB Fullatom aa 6 Fullatom aav 1
ERROR:: Exit from: constraints.cc line: 464

</stderr_txt>
]]>
19) Message boards : RALPH@home bug list : 64bit app just added. (Message 3076)
Posted 6 May 2007 by Thomas Leibold
Post:

======================================================
DONE :: 1 starting structures 2883.67 cpu seconds
This process generated 4 decoys from 4 attempts
======================================================

should that really be valid?


It looks like your client produced 4 models (possibly before the error ?) and therefore should get credit for those.

The issue you mention in your previous post ('running' state, but cpu time not increasing) sounds like something I see when the preference "leave application in memory" is set to "no" and Boinc switches tasks between different projects or suspends a task because of other activity on the system.
20) Message boards : RALPH@home bug list : bug report for version 5.64 (Message 3075)
Posted 6 May 2007 by Thomas Leibold
Post:
Please report problem you encountered for version 5.64 here. Thanks!


First 5.64 workunit 448473 failed pretty quickly:

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 2679311
Unable to find fullatom number for constraint atom:
Atom Type5Residue type 6 Type CB Fullatom aa 6 Fullatom aav 1
ERROR:: Exit from: constraints.cc line: 464

</stderr_txt>
]]>




