minirosetta v1.46 bug thread

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 4402 - Posted: 12 Dec 2008, 22:48:58 UTC

Please post bugs/issues regarding minirosetta version 1.46. This app update includes fixes for an access violation (segmentation fault) that was occurring frequently with jobs running constraints.
ID: 4402
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4403 - Posted: 13 Dec 2008, 4:33:02 UTC

Wow! I'm only 3 hours into a 24hr runtime preference and this one has already generated 14.8 million page faults! It is called:
cc2_1_8_mammoth_cst_hb_t315__IGNORE_THE_REST_1YIXA_3_6349_1_0

There also appear to still be scaling problems with the accepted energy... and I'm still on model 1, step 361,000, so it looks like long-running models.

Peak memory usage shown is 305MB.

Response time on Rosetta message boards is about 2 minutes right now.
ID: 4403
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4404 - Posted: 13 Dec 2008, 5:45:52 UTC

It completed model 1 after 4 hours.
ID: 4404
Path7

Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4405 - Posted: 13 Dec 2008, 9:58:47 UTC
Last modified: 13 Dec 2008, 10:21:17 UTC

Long-running model

The next WU:
ccc_1_8_native_cst_homo_bench_foldcst_chunk_general_t305__mtyka_IGNORE_THE_REST_1FPRA_4_6306_1_0
ran for over 6 hours (22054.45 seconds) and generated one decoy.

CPU_run_time_pref: 7200 seconds; outcome: Success.

Edit: BOINC setting: switch between applications every 60 minutes. This WU ran its 22054.45 seconds straight through (without switching to another application). Did it do any checkpointing?

Path7.
ID: 4405
AdeB

Joined: 22 Dec 07
Posts: 61
Credit: 161,367
RAC: 0
Message 4406 - Posted: 13 Dec 2008, 13:47:04 UTC

Error in ccc_1_8_mammoth_mix_cst_homo_bench_foldcst_chunk_general_mammoth_cst_t302__mtyka_IGNORE_THE_REST_2CRPA_8_6270_1_0.

stderr out:
<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

ERROR: not able to build valid fold-tree in JumpingFoldConstraints::setup_foldtree
ERROR:: Exit from: src/protocols/abinitio/LoopJumpFoldCst.cc line: 108
called boinc_finish
# cpu_run_time_pref: 14400

</stderr_txt>
]]>


AdeB
ID: 4406
Profile cenit

Joined: 26 Apr 08
Posts: 5
Credit: 25,392
RAC: 0
Message 4407 - Posted: 13 Dec 2008, 16:19:09 UTC

I have some 1.47 WUs...
ID: 4407
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4411 - Posted: 13 Dec 2008, 22:59:42 UTC - in response to Message 4408.  
Last modified: 13 Dec 2008, 23:02:55 UTC

I'm well familiar with runtime per task vs. per model, and with the defaults and expectations... in my case, I don't have what one would call a slow computer, I have plenty of memory, and I have a 24hr runtime preference, so I am expecting 24hrs total... but 4hrs per model exceeds the guidelines described here.

So, either the guideline should change, or the tasks should change.

The task now has 11.5 hours in and is on model 5. So it got 3 more models done in just over 7 hours; in other words, that first model took roughly twice as long as the others. This will cause significant variation in credit, depending on which models you were assigned.

Has the reporting of runtime on a per-model basis been completed yet, so the tasks are reporting back that level of detail?
ID: 4411
ramostol

Joined: 29 Mar 07
Posts: 24
Credit: 31,121
RAC: 0
Message 4412 - Posted: 14 Dec 2008, 12:06:06 UTC - in response to Message 4411.  

> So, either the guideline should change, or the tasks should change.


I suggest a look at the guidelines. In my Rosetta experience, some current errors tend to appear only after at least 3-4 hours of computing, which makes testing on Ralph with a default runtime of 1 hour somewhat questionable.


ID: 4412
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4415 - Posted: 15 Dec 2008, 21:15:26 UTC
Last modified: 15 Dec 2008, 21:37:36 UTC

...as I said, well acquainted with the problem.

The entire credit system is model-centric. We've always maintained that some models take longer than others, but that over time it averages out. ...but that was before anyone really had specific data on the degree of variability in per-model runtimes. If you could produce a chart (or a set of charts for a batch) that would show us more specifically what we're talking about, that would be great.

Many volunteers have very low tolerance for being "ripped off" and not being granted "fair" credit for their work (ask DK about the "credit wars").

Some ideas:

Grant more credit for long-running models
I would hope you have obfuscated the model runtimes enough in the task results that it would be very difficult to locate and falsify them. Assuming that is the case, it should be feasible to prorate some additional credit for long models, and perhaps apply a negative adjustment for short models.
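
Something along these lines, purely as a sketch (the names here are invented, not anything from the actual credit code): scale the batch's base per-model credit by how long the model ran relative to the batch average seen on Ralph, with a clamp so a falsified runtime can't swing credit without bound.

#include <algorithm>

// hypothetical proration of per-model credit by observed runtime
double prorated_credit(double base_credit_per_model,
                       double model_cpu_seconds,
                       double batch_avg_model_seconds)
{
    if (batch_avg_model_seconds <= 0.0) return base_credit_per_model;
    double ratio = model_cpu_seconds / batch_avg_model_seconds;
    // clamp the adjustment (e.g. to [0.5, 3.0]) so outliers and tampering
    // can't inflate or collapse the granted credit
    ratio = std::max(0.5, std::min(ratio, 3.0));
    return base_credit_per_model * ratio;
}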

Revise the expected per-model runtime guidelines
How can people have any confidence that some of the odd behaviors are "normal" if their observations fall outside the project's own published definition of what normal is? Why not publish details in the active WU log? "...this batch includes the following 20 proteins, with the following observed min/max/avg times seen on Ralph."

Increase the minimum runtime preference allowed
...as already under discussion on the Rosetta boards. The longer the runtime, the less likely the user is to observe any significant variation or "slow down". One of the main arguments against long runtimes seems to be that the project wants results back ASAP... but how often do you actually look at the data within 24hrs of its arrival? So it COULD have run for 24hrs longer and still been reported back by the time you are actually using the data.

Support trickle-up messages
Why not report each and every completed model in a trickle-up message? That way you get instant results and reduced scheduler load, and I get minimal download bandwidth because I can run the same task for several days if I like.
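
A minimal sketch of what I mean, using the BOINC API's boinc_send_trickle_up() call. The message format below is just something I made up for illustration, and the project would also need a server-side trickle handler to consume these.

#include <cstdio>
#include "boinc_api.h"

// report one finished model to the project via a trickle-up message
void report_model_done(int model_number, double energy, double cpu_seconds)
{
    char buf[256];
    std::snprintf(buf, sizeof(buf),
        "<model_done>\n"
        "  <model>%d</model>\n"
        "  <energy>%f</energy>\n"
        "  <cpu_seconds>%f</cpu_seconds>\n"
        "</model_done>\n",
        model_number, energy, cpu_seconds);
    boinc_send_trickle_up(const_cast<char*>("minirosetta"), buf);
}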

Support partial completions
Each time a checkpoint is completed, check how we're doing on runtime. If the next checkpoint would exceed the runtime preference, report the task back with its last checkpoint, and devise a process for redistributing that partial result to another host for completion. In other words, move all end-of-model checking to each checkpoint.
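
Roughly, on the application side it could look like this (a sketch only: cpu_run_time_pref and the estimate of time to the next checkpoint are my assumptions, and re-issuing the partial model would need new server-side support):

#include "boinc_api.h"

// decide, right after writing a checkpoint, whether starting work toward
// the next checkpoint would blow past the user's runtime preference
bool should_report_now(double cpu_run_time_pref,
                       double est_seconds_to_next_checkpoint)
{
    double cpu_so_far = 0;
    boinc_wu_cpu_time(cpu_so_far);   // CPU time used by this task so far
    return cpu_so_far + est_seconds_to_next_checkpoint > cpu_run_time_pref;
}

// called from the main model loop each time the app reaches a safe point
void maybe_checkpoint_and_stop(double cpu_run_time_pref, double est_next)
{
    if (boinc_time_to_checkpoint()) {
        // write_model_state();      // app-specific: save the partial model
        boinc_checkpoint_completed();
    }
    if (should_report_now(cpu_run_time_pref, est_next)) {
        // report back with the last checkpoint instead of finishing the model
        boinc_finish(0);
    }
}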

Revise the scheduler to be runtime-aware
You have a pretty good idea (from Ralph results, if nothing else) how long models will take for a given protein within a batch. If the model runtime exceeds a host's runtime preference, then find another task with shorter models to assign to it.
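
As a rough illustration of the matching rule (this is not actual BOINC scheduler code; all names are invented): estimate how long one model would take on a given host and skip batches that can't finish even a single model within that host's preference.

struct BatchEstimate {
    double avg_model_flops;   // estimated FLOPs per model, e.g. from Ralph results
};

// hypothetical feeder check: can this host finish at least one model
// of this batch within its runtime preference?
bool fits_host(const BatchEstimate& batch,
               double host_flops_per_sec,
               double host_runtime_pref_sec)
{
    double est_model_seconds = batch.avg_model_flops / host_flops_per_sec;
    return est_model_seconds <= host_runtime_pref_sec;
}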

Become more aggressive about cutting off long searches
Rather than throwing away data when you cut off a search, why not build new tasks based on that knowledge? Have them report higher FLOPS estimates, and send out a new task explicitly to pursue that model in far greater detail than you do at present. So the preliminary task would cut off earlier than it does now, and the rework task would cut off later.

Support opt-in to specific subprojects
Several people have suggested you grant users some control over what they want to crunch; WCG does this. You could make long models, or model rework tasks, a separate subproject choice. You could make the default for new subprojects be that only users who have opted in to "new" subprojects crunch them, unless they explicitly opt in to the specific new subproject. Then, over time, only once the error rate reaches a defined low threshold does the default expand to everyone who has not explicitly opted out.

If you need more ideas, or clarification, just say the word.
ID: 4415
Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4418 - Posted: 15 Dec 2008, 23:19:38 UTC

> I generally don't understand why credit is granted on a model basis

*DO* be sure to discuss the "credit wars" with DK! He'll show you the scars.

> The sooner I can convince myself that things are working fine, I'll move stuff over to BOINC.

...and this explains why so many bugs make it through to Rosetta! It is EASY to convince yourself your work is perfect. Let's raise the bar a little higher. I set my Ralph runtime preference to 24hrs; SOMEONE needs to test the maximum. And if my task fails, it gets dropped into an average, and by the time my result is returned for analysis that would show all 24hr runtimes are failing... the work is already out causing problems on Rosetta.

> So we'd have to relaunch *exactly* the same task with the same random seed. Not impossible, but quite a bit of bookkeeping.


No, actually, you'd build a new task which runs (and runs differently) only one specific model (or a list of specific models). So it wouldn't use the seed.

> We could try to send the longer WUs to faster computers and the shorter ones to slower PCs, balancing the load a little bit. I mean, the biggest problems arise when an older machine gets a particularly difficult job.


This would also require changes to the scheduler, and/or the definition of subprojects.
ID: 4418



