I also have one that has run 2 hours past the run-time preference and seems to be stuck at 9.57 minutes to go. Progress goes up at a rate of 0.001 percent about every 10 seconds, and the time remaining has not decreased in the last 2 hours.
This is similar to a problem from about 2 application versions ago. I doubt it will get to 100% by the time the Watchdog terminates the WU in about 1 hour's time.
process exited with code 1 (0x1, -255)
recovering checkpoint of tag S_1VL7A_12_00000001 with id abrelax_rg_state
ERROR: Loops::add_loop error -- overlapping loop regions
existing loop begin/end: 31/40
new loop begin/end: 40/54
ERROR:: Exit from: src/protocols/loops/LoopClass.cc line: 232
James Forum moderator Project developer Project scientist Joined: Jun 22 06 Posts: 19 ID: 1548 Credit: 278 RAC: 0
feet1st: I've solicited information from the developers in our lab about what went into our update of Ralph v1.42, and I'll let some of the other project developers speak for themselves.
Here are two update messages from our protein interface design team:
- "We've revised the way that filters for protein-interface design runs are executed and reported in minirosetta 1.42. This will mean shorter run times for the Rosetta @ Home participants with more meaningful output for us."
- "On the design front, we've made an effort to significantly reduce the memory overhead associated with design, so users with less RAM should be able to run these tasks without it bringing their system to a halt. However, design calculations are by nature very memory-intensive. Therefore, we have also restricted all design WUs to machines that allocate no less than 512MB for Rosetta @ Home."
Here's a list of updates from our structure prediction team:
"Constraints: Can now specify multiple constraints.
Can now specify separate constraints for centroid/fullatom
Silent files now in AbrelaxApplication (FoldCST)
Relax: A bunch of previously hardcoded parameters in Relax are now parameters.
Native constraints to keep natives from drifting away.
Start structure constraints with Loop Selection to restrain homology modelling cores from drifting too far.
BugFix in Gaussian Constraints.
Added a Pose Recombiner Mode that allows proteins to be spliced together.
Job Distribution: Added a shuffle mode which allows us to run the large scale relax benchmark without destroying the BOINC database by flooding it with millions of command lines."
Conan - those loop boundary errors were input errors by the person who submitted those workunits. The validate errors are the result of a newly added format that's not yet supported by the BOINC server, and we'll have to update our server code to deal with it over the weekend. That slow workunit bug looks like something that we fixed several months ago; we've alerted the person who submitted those jobs and he's looking into them.
Thanks for crunching, and from those of us in America - Happy Thanksgiving!
The loop errors are indeed an error in the input data. It shows why the ralph project is so useful to us. Without ralph I wouldn't have been able to spot these errors before running the jobs on a bigger scale on boinc.
For the current project - development of a general and automatable comparative modelling machinery - we have ca. 40 target proteins, each coming with 200 alignments to homologous proteins. These homologous proteins are somewhat similar to the target and hence provide valuable structural clues; however, some parts are wrong and other parts are missing. A typical strategy is to rebuild everything which is missing and a couple of residues around that region.
We are curious, however, if we can improve on that by also rebuilding other parts of the aligned regions, since these can be quite far from the target structure.
Right now we are trying to find out where exactly we should strike the balance between rebuilding and copying homologous structure. This requires scanning a range of cutoffs. This is done by a script that creates loop files, which encode exactly what has to be rebuilt and what shall be kept rigid.
The script generated 40*200*10 = 80,000 files. A lot of files. Due to a bug in the script, however, some contained subtle errors. I checked a handful of input conditions on our local machines and they were fine. So I went ahead and checked a larger number of input conditions on ralph. This revealed errors in some cases and thus valuable information to revise the script.
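For anyone curious what such an input error looks like, here is a minimal sketch of an overlap check on loop definitions. It is not the Rosetta code and not the actual script; the function name and the tuple representation are made up for illustration, and the real loop-file format is not shown here. It assumes loop endpoints are inclusive residue numbers, which is what the error log earlier in this thread suggests (31/40 and 40/54 were rejected because they share residue 40).

```python
from typing import List, Tuple

# Hypothetical representation: a loop is (begin, end), both residue
# numbers inclusive. The real loop-file format is not reproduced here.
Loop = Tuple[int, int]


def find_overlaps(loops: List[Loop]) -> List[Tuple[Loop, Loop]]:
    """Return (earlier_loop, later_loop) pairs for every loop that
    collides with a loop starting at or before it."""
    overlaps = []
    widest = None  # loop seen so far that reaches the furthest residue
    for cur in sorted(loops):
        # Inclusive endpoints: sharing a single residue counts as overlap.
        if widest is not None and cur[0] <= widest[1]:
            overlaps.append((widest, cur))
        if widest is None or cur[1] > widest[1]:
            widest = cur
    return overlaps


# Number of loop files mentioned above: targets * alignments * cutoffs.
n_files = 40 * 200 * 10

# The pair from the error log is flagged; shifting the first loop's end
# to residue 39 removes the collision.
bad = find_overlaps([(31, 40), (40, 54)])
good = find_overlaps([(31, 39), (40, 54)])
```

Running a check like this over every generated file before submission would catch this class of error locally, although a test run on ralph also exercises everything else about the inputs.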
Thank you for your interest and your help crunching for our science,
These details are what will be needed to release on Rosetta, right? So, if we're testing here, let's be testing the descriptions as well, and if they don't make sense to lay-people, or whatever, we can work through those kinks as well.
Those descriptions are great, but perhaps the summarized version that will appear in the Rosetta news box would be good too: "Brought memory usage down for tasks that were previously using more than normal, reduced per-model runtimes for most of the previously long-running models resulting in more consistent runtimes, additional refinements to the modeling logic...more details here"
What about the problem where suspended tasks keep running? I've not seen it occur here, but is that because it has been addressed? Or, luck of the task draw?
*** G'Day feet1st,
I have often had the problem where both Ralph and Rosetta keep running even though Boinc Manager has switched the jobs and shows different tasks running.
I have a 4 core computer and the other night had 7 tasks running. This has happened a number of times.
If I stop and restart Boinc all is back to normal. If I let them run then it keeps happening till all the started jobs finish.
Can even happen when say Ralph and Cosmology are running together as well.
It is usually Ralph that has been doing this but I noticed Rosetta do it with Ralph only two nights ago.
*** G'Day to you James
and thanks for the follow up information, we then at least know we are helping. So if you know about the Validate errors then I will just say I did have another 7 of these (WU's 1186686, 1186687, 1186763, 1186909, 1186910 and 1186916).
*** G'Day olange,
Thanks for the feedback. Not everyone processing work for this project reports problems, so I like to report what I find. The other testers who report not only help you but also help me know when problems are occurring.
This WU did not do anything when it started. Unknown how long it was running before I realised it was not doing much as Boinc Manager said it was running but no cpu time showed and no percent done had happened. I aborted the WU.
I also have had three work units over the past week request access to my trusted zone and also internet access.
I have allowed these 3 requests but after allowing them the Work Units then error out anyway.
See 1187543 1187544 1187570
Trust this all helps.
mtyka Forum moderator Project developer Project scientist Joined: Mar 19 08 Posts: 79 ID: 4144 Credit: 0 RAC: 0
OK, I've just checked in a fix for
a) the validator errors.
b) the checkpoint errors.
c) NANs in hbonding
This version will go out tomorrow onto ralph.
In total this version should address the following bugs (as far as I'm aware):
- Excessive memory usage (design team)
- Long running jobs (design team)
- Validator errors
- Check point errors
- NANs in hbonding
- Restarting jobs (there's finer checkpointing now in relaxmode)