Ralph v1.42

James
Volunteer moderator
Project developer
Project scientist

Joined: 22 Jun 06
Posts: 19
Credit: 278
RAC: 0
Message 4357 - Posted: 25 Nov 2008, 9:28:29 UTC

I've just updated Ralph to minirosetta version 1.42. Please post any bugs or issues here. Cheers,

James


feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4359 - Posted: 25 Nov 2008, 13:56:42 UTC

What bugs/issues do you feel you have resolved?

Are there steps we can take to test any of the frequently encountered problems seen on Rosetta?
Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4361 - Posted: 26 Nov 2008, 10:33:35 UTC

Another Validate error on this WU

I also have one that has run 2 hours past the preferred run time and seems to be stuck at 9.57 minutes to go. The percentage goes up by 0.001 about every 10 seconds, but the time remaining has not decreased in the last 2 hours.

This is similar to a problem from about 2 application versions ago. I doubt it will get to 100% by the time the Watchdog terminates the WU in about 1 hour's time.
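A back-of-the-envelope sketch of why the task looks stuck, using only the rate quoted above (0.001 percent about every 10 seconds). The function name and its defaults are made up for illustration, and the real remaining percentage is not known from the post.

```python
def seconds_to_gain(percent_points: float,
                    step_percent: float = 0.001,
                    step_seconds: float = 10.0) -> float:
    """Time needed to advance by `percent_points` at the observed rate.

    Defaults come from the post: +0.001 percent roughly every 10 seconds.
    """
    steps = percent_points / step_percent  # number of 0.001% increments
    return steps * step_seconds


if __name__ == "__main__":
    one_percent = seconds_to_gain(1.0)
    # At this rate a single percentage point takes on the order of hours.
    print(f"1% takes about {one_percent:.0f} s ({one_percent / 3600:.1f} h)")
```

At that rate one percentage point alone takes close to three hours, so a task that is more than a fraction of a percent short of 100% would not finish within the hour before the Watchdog steps in.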
Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4363 - Posted: 27 Nov 2008, 1:20:28 UTC

More "Validate Errors" on 1185665
1185675
1186564
1186565
1186810
1186813
1186824

All ran to completion but got no credit, so a lot of time was wasted on this lot.
Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4364 - Posted: 27 Nov 2008, 1:23:05 UTC

"Compute Errors" on these two 1187124 and 1187126

<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
recovering checkpoint of tag S_1VL7A_12_00000001 with id abrelax_rg_state

ERROR: Loops::add_loop error -- overlapping loop regions
existing loop begin/end: 31/40
new loop begin/end: 40/54
ERROR:: Exit from: src/protocols/loops/LoopClass.cc line: 232
called boinc_finish
James
Volunteer moderator
Project developer
Project scientist

Joined: 22 Jun 06
Posts: 19
Credit: 278
RAC: 0
Message 4365 - Posted: 27 Nov 2008, 3:54:25 UTC
Last modified: 27 Nov 2008, 6:03:12 UTC

feet1st: I've solicited information from the developers in our lab about what went into our update of Ralph v1.42, and I'll let some of the other project developers speak for themselves.

Here are two update messages from our protein interface design team:
- "We've revised the way that filters for protein-interface design runs are executed and reported in minirosetta 1.42. This will mean shorter run times for the Rosetta @ Home participants with more meaningful output for us."

- "On the design front, we've made an effort to significantly reduce the memory overhead associated with design, so users with less RAM should be able to run these tasks without it bringing their system to a halt. However, design calculations are by nature very memory-intensive. Therefore, we have also restricted all design WUs to machines that allocate no less than 512MB for Rosetta @ Home."

Here's a list of updates from our structure prediction team:
"Constraints: Can now specify multiple constraints.
Can now specify separate constraints for centroid/fullatom

Silent files now in AbrelaxApplication (FoldCST)

Relax: A bunch of previously hardcoded parameters in Relax are now parameters.

Native constraints to keep natives from drifting away.
Start structure constraints with Loop Selection to restrain homology modelling cores from drifting too far.

BugFix in Gaussian Constraints.

Added a Pose Recombiner Mode that allows proteins to be spliced together.

Job Distribution: Added a shuffle mode which allows us to run the large scale relax benchmark without destroying the BOINC database by flooding it with millions of command lines."



Conan - those loop boundary errors were input errors by the person who submitted those workunits. The validate errors are the result of a newly added format that's not yet supported by the BOINC server, and we'll have to update our server code to deal with it over the weekend. The slow workunit bug looks like something that we fixed several months ago; we've alerted the person who submitted those jobs and he's looking into it.

Thanks for crunching, and from those of us in America - Happy Thanksgiving!

Cheers,

James
olange

Joined: 27 Nov 08
Posts: 2
Credit: 0
RAC: 0
Message 4366 - Posted: 27 Nov 2008, 6:45:07 UTC - in response to Message 4364.  

Hi Conan,

the loop-errors are indeed an error in the input data. It shows why the ralph-project is so useful to us. Without ralph I wouldn't have been able to spot these errors before running the jobs on a bigger scale on boinc.

For the current project, the development of a general and automatable comparative modelling machinery, we have ca. 40 target proteins, each coming with 200 alignments to homologous proteins. These homologous proteins are somewhat similar to the target and hence provide valuable structural clues; however, some parts are wrong and other parts are missing. A typical strategy is to rebuild everything that is missing, plus a couple of residues around each such region.
We are curious, however, whether we can improve on that by also rebuilding other parts of the aligned regions, since these can be quite far from the target structure.
Right now we are trying to find out where exactly we should strike the balance between rebuilding and copying the homologous structure. This requires scanning a range of cutoffs, which is done by a script that creates loop files encoding exactly what has to be rebuilt and what shall be kept rigid.
The script generated 40*200*10 = 80,000 files. A lot of files. Due to a bug in the script, however, some contained subtle errors. I checked a handful of input conditions on our local machines and they were fine. So I went ahead and checked a larger number of input conditions on ralph. This revealed errors in some cases and thus gave me valuable information for revising the script.
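A rough sketch of the kind of pre-submission check that would catch the overlap reported earlier in this thread (existing loop 31/40, new loop 40/54). It assumes each loop is simply an inclusive (begin, end) pair of residue numbers; the function name and data layout are made up for illustration and are not Rosetta's actual loop-file format or code.

```python
from typing import List, Optional, Tuple

Loop = Tuple[int, int]  # (begin, end), both residue numbers inclusive


def find_overlaps(loops: List[Loop]) -> List[Tuple[Loop, Loop]]:
    """Return (earlier, later) pairs of loops that share at least one residue.

    Loops are sorted by start residue; each loop is compared against the
    previously seen loop that reaches furthest along the chain.
    """
    overlaps: List[Tuple[Loop, Loop]] = []
    furthest: Optional[Loop] = None
    for cur in sorted(loops):
        if furthest is not None and cur[0] <= furthest[1]:
            overlaps.append((furthest, cur))
        if furthest is None or cur[1] > furthest[1]:
            furthest = cur
    return overlaps


if __name__ == "__main__":
    # The case from the error log: both loops claim residue 40.
    for earlier, later in find_overlaps([(31, 40), (40, 54)]):
        print(f"overlap: {earlier[0]}/{earlier[1]} and {later[0]}/{later[1]}")
```

Because the ends are inclusive, 31/40 and 40/54 collide on residue 40, while 31/39 and 40/54 would pass. Running a check like this over every generated file is cheap compared with finding the problem only after the jobs have been sent out.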

Thank you for your interest and your help crunching for our science,
Oliver
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4367 - Posted: 27 Nov 2008, 16:23:17 UTC

These details are what will be needed to release on Rosetta, right? So, if we're testing here, let's be testing the descriptions as well, and if they don't make sense to lay-people, or whatever, we can work through those kinks as well.

Those descriptions are great, but perhaps the summarized version that will appear in the Rosetta news box would be good too: "Brought memory usage down for tasks that were previously using more than normal, reduced per-model runtimes for most of the previously long-running models resulting in more consistent runtimes, additional refinements to the modeling logic...more details here"

What about the problem where suspended tasks keep running? I've not seen it occur here, but is that because it has been addressed? Or, luck of the task draw?
Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4368 - Posted: 29 Nov 2008, 7:41:06 UTC - in response to Message 4367.  

*** G'Day feet1st,
I have often had the problem where both Ralph and Rosetta keep running even though Boinc Manager has switched the jobs and shows different tasks running.

I have a 4 core computer and the other night had 7 tasks running. This has happened a number of times.

If I stop and restart Boinc all is back to normal. If I let them run then it keeps happening till all the started jobs finish.

Can even happen when say Ralph and Cosmology are running together as well.

It is usually Ralph that has been doing this but I noticed Rosetta do it with Ralph only two nights ago.


*** G'Day to you James
and thanks for the follow up information, we then at least know we are helping. So if you know about the Validate errors then I will just say I did have another 7 of these (WU's 1186686, 1186687, 1186763, 1186909, 1186910 and 1186916).

*** G'Day olange,
Thanks for the feedback. Not everyone processing work for this project reports problems, so I like to report what I find. The other testers who report not only help you but also help me know when problems are occurring.



This WU did not do anything when it started. I don't know how long it was running before I realised it was not doing much: Boinc Manager said it was running, but no cpu time showed and the percent done had not moved. I aborted the WU.


I also have had three work units over the past week request access to my trusted zone and also internet access.
I have allowed these 3 requests but after allowing them the Work Units then error out anyway.
See 1187543
1187544
1187570

Trust this all helps.
mtyka
Volunteer moderator
Project developer
Project scientist

Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4369 - Posted: 29 Nov 2008, 22:53:55 UTC

Ok, I've just checked in a fix for

a) the validator errors.
b) the checkpoint errors.
c) NANs in hbonding

This version will go out tomorrow onto ralph.
In total this version should address the following bugs (as far as I'm aware):

- Excessive memory usage (design team)
- Long running jobs (design team)
- Validator errors
- Checkpoint errors
- NANs in hbonding
- Restarting jobs (there's finer checkpointing now in relaxmode)

Mike



©2024 University of Washington
http://www.bakerlab.org