Message boards : RALPH@home bug list : minirosetta v1.46 bug thread
Author | Message |
---|---|
dekim Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Please post bugs/issues regarding minirosetta version 1.46. This app update includes fixes for an access violation/segmentation fault that was occurring frequently with jobs running constraints. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Wow! I'm only 3 hours into a 24hr preference and this one has already generated 14.8 million page faults! It is called: cc2_1_8_mammoth_cst_hb_t315__IGNORE_THE_REST_1YIXA_3_6349_1_0 There also appear to still be scaling problems on the accepted energy... and I'm still on model 1, step 361,000, so it looks like long-running models. Peak memory usage shown is 305MB. Response time on the Rosetta message boards is about 2 minutes right now. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
It completed model 1 after 4 hours. |
Path7 Send message Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0 |
Long-running model: the next WU, ccc_1_8_native_cst_homo_bench_foldcst_chunk_general_t305__mtyka_IGNORE_THE_REST_1FPRA_4_6306_1_0, ran for over 6 hours (22054.45 seconds) and generated one decoy. CPU_run_time_pref: 7200 seconds. Outcome: Success. Edit: BOINC setting: switch between applications every 60 minutes. This WU ran its 22054.45 seconds continuously (without switching to another application). Did it do any checkpointing? Path7. |
AdeB Send message Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0 |
Error in ccc_1_8_mammoth_mix_cst_homo_bench_foldcst_chunk_general_mammoth_cst_t302__mtyka_IGNORE_THE_REST_2CRPA_8_6270_1_0. stderr out:
<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
ERROR: not able to build valid fold-tree in JumpingFoldConstraints::setup_foldtree
ERROR:: Exit from: src/protocols/abinitio/LoopJumpFoldCst.cc line: 108
called boinc_finish
# cpu_run_time_pref: 14400
</stderr_txt>
]]>
AdeB |
cenit Send message Joined: 26 Apr 08 Posts: 5 Credit: 25,392 RAC: 0 |
I have some 1.47 WU.... |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
Sadly this bug was introduced by one of our coders only hours before we went ahead with the RALPH update, and it is causing these errors. Exquisitely bad luck and bad timing. Hence I did another RALPH update last night, so we're now at 1.47 (as you have undoubtedly noticed!). The 1.47 WUs should not crash with this anymore. WRT long-running WUs: I know that the RALPH default WU length is 1 hour, and it is right that it is set to that. This allows us to get the tests through more quickly. But since Rosetta@home's default WU length is more like 3-4 hours, it's not uncommon to see this on RALPH too. So don't worry if you see overrunning WUs in the 3-4 hr range; I'd say even 6 hrs is not always avoidable, especially on a slower PC. More than that, though, is of concern, so thanks for pointing these out! |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
WRT the graphics problem: I've reproduced this here locally, so I will fix it when I get to it. I am most concerned about stability issues right now though - as you know, things have been pretty rough over on Rosetta@home. So I reeeeally want to concentrate on getting the bugs in the actual computation core sorted out - that's my main priority. I think over Xmas I might give the graphics an overhaul anyway, so I'll fix the scaling error then too. IMHO our graphics are a little ugly compared to Climate@HOME etc., and I used to be into graphics coding, so I'll try and tinker over the hols - that'll be fun! :) - Watch this space. Mike |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I'm well familiar with the runtime per task vs per model, and defaults and expectations... in my case, I don't really have what one would call a slower computer, I have plenty of memory, and I have 24hr runtime preference so I am expecting 24hrs total... but 4hrs per model exceeds the guidelines described here. So, either the guideline should change, or the tasks should change. The task now has 11.5 hours in and is on model 5. So it got 3 more models done in just over 7 hours. In other words, that first model took twice as long as the others. This will cause significant variation in credit, depending on which model you were assigned. Has the reporting of runtime on a per model basis now been completed, so the tasks are reporting back that level of detail? |
ramostol Send message Joined: 29 Mar 07 Posts: 24 Credit: 31,121 RAC: 0 |
So, either the guideline should change, or the tasks should change. I suggest a look at the guidelines. In my Rosetta experience some current errors tend to appear only after at least 3-4 hours of computing, which makes testing on Ralph with a default runtime of 1 hour somewhat questionable. |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
I'm well familiar with the runtime per task vs per model, and defaults and expectations... in my case, I don't really have what one would call a slower computer, I have plenty of memory, and I have 24hr runtime preference so I am expecting 24hrs total... but 4hrs per model exceeds the guidelines described here. Hmm, here's the problem: the minimum amount of work is 1 decoy. Sometimes it's not possible to create that first decoy in 1 hr. Sometimes it will simply take 4 or 5 hours, particularly if the protein is large. Worse still, the precise time a decoy will take, even within the exact same batch or subbatch, will vary hugely and largely beyond our control. For example, the search might get stuck for some time and take much longer than its mates to, say, close a loop in the protein. Now, you could argue we should stop and abort if some time limit is exceeded. But then we're throwing away valuable data! Worse still, we cannot predict how long a model will take from the outset. This is the reason the "percentage finished" bar is so useless and appears to stall and slow down towards the end of a job. You see? How can we display a percentage of the job if we don't know what 100% is? It's easy if individual decoys take much less than the set runtime, because then the program will run more or less exactly that long. But what if even the first decoy takes longer than that? When you see the bar "stall" at 98% or so, it's not that the program has stalled; the program is actually only at, say, 50%, but we got the 100% point wrong, so to speak, simply because we can't predict the runtime very well. We've mulled over this problem again and again. It's hard to think of a good solution. If you have any good ideas then *do* let us know :) Now, on RALPH we have set the default runtime to 1 hr even though we are well aware that some tasks take much, much longer. The reason is that for those jobs that actually take less than 1 hr we want the clients to finish and report the results as soon as possible. We don't want them to crunch for 10 hrs and produce thousands of models - that's what Rosetta@home is for. You see our dilemma? And it is totally right that sometimes errors don't occur till 3-4 hours into the job. Hence we won't abort halfway through the first decoy. At least one decoy must finish for us to get a sense of bug-freeness.
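To illustrate the point: a minimal sketch of why the bar saturates when the first decoy overruns the runtime preference. The function and numbers here are illustrative assumptions, not the actual minirosetta progress code.

```cpp
// Illustrative only -- not the actual minirosetta code. If a decoy really
// needs 4 hours but the only available estimate of "100%" is the 1 hr
// runtime preference, the reported fraction saturates and appears to stall.
#include <algorithm>
#include <iostream>

double reported_progress(double elapsed_sec, double runtime_pref_sec) {
    double frac = elapsed_sec / runtime_pref_sec;  // assumes the job ends at the preference
    return std::min(frac, 0.99);                   // never claim 100% until the decoy finishes
}

int main() {
    const double pref = 3600.0;  // 1 hr preference
    for (double t = 1800.0; t <= 4 * 3600.0; t += 1800.0)
        std::cout << t / 3600.0 << " h elapsed -> "
                  << reported_progress(t, pref) * 100.0 << "% shown\n";
}
```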
see my comments above. It sucks. I know. Ideas appreciated :)
Yes, it has been there for a while now. I can plot you a histogram of runtimes if you like :) It's useful for catching rogue decoys that take abnormal runtimes sometimes, but as I said - sometimes a decoy will simply take that long. We'll try and think of algorithmic changes we could implement to tackle these better, but at the end of the day *some* variability will always remain. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
...as I said, well acquainted with the problem. The entire credit system is model-centric. We've always maintained that some models are longer than others, but over time it would average out. ...but that was before anyone really had specific data on the degree of variability in per-model runtimes. If you could produce a chart (or set of charts for a batch) that would show us more specifically what we're talking about, that would be great. Many volunteers have very low tolerance for being "ripped off" and not being granted "fair" credit for their work (ask DK about the "credit wars"). Some ideas:
Grant more credit for long-running models. I would hope you obfuscated the model runtimes enough in the task results that it would be very difficult to locate and falsify them. Assuming that is the case, it should be feasible to prorate some additional credit for long models, and perhaps a negative adjustment for short models.
Revise the expected per-model runtime guidelines. How can people have any confidence that some of the odd behaviors are "normal" if their observations are outside of the project's own published definition of what normal is? Why not publish details in the active WU log? "...this batch includes the following 20 proteins, with the following observed min/max/avg time seen on Ralph"
Increase the minimum runtime preference allowed. ...as already under discussion on the Rosetta boards. The longer the runtime, the less likely the user observes any significant variation or "slow down". One of the main arguments against long runtimes seems to be that the project wants to get results back ASAP. ...but how often do you actually look at the data within 24hrs of its arrival? So it COULD have run for 24hrs longer and still been reported by the time you are actually using the data.
Support trickle-up messages. Why not report each and every completed model in trickle-up messages? That way you get instant results and reduced scheduler load, and I get minimum download bandwidth because I can run the same task for several days if I like.
Support partial completions (see the sketch after this post). Each time a checkpoint is completed, check how we're doing on runtime. If the next checkpoint would exceed the runtime preference, report the task back with its last checkpoint, and devise a process for redistributing that to another host for completion. In other words, move all end-of-model checking to each checkpoint.
Revise the scheduler to be runtime-aware. You have a pretty good idea (from Ralph results if nothing else) how long models will take for a given protein within a batch. If model runtime exceeds the runtime preference for a host, then find another task to assign to it which has shorter models.
Become aggressive about cutting off long searches. Rather than throwing away data when you cut off a search, why not build new tasks based on that knowledge? Have them report higher FLOPS estimates, and send out a new task explicitly to pursue this model in far greater detail than you do at present. So the preliminary task will cut earlier than it does now, and the rework task will cut later.
Support opt-in to specific subprojects. Several people have suggested you grant some control to users over what they want to crunch. WCG does this. You could make long models, or model rework tasks, a separate subproject choice. You could make the default for new subprojects be that only users who "opt in to new" will crunch them unless they explicitly opt in to the specific new subproject. Then over time, only once the error rate reaches a defined low threshold, does the default expand to anyone who has not explicitly opted out.
Need more ideas, or clarification? Say the word. |
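As a rough illustration of the "partial completions" idea above, a minimal sketch of the checkpoint-time decision it describes; the function name, checkpoint interval, and thresholds are hypothetical, not existing minirosetta logic.

```cpp
// Hypothetical sketch of the "partial completions" proposal: at each
// checkpoint, report the task back if one more checkpoint interval would
// overrun the user's runtime preference. Not existing minirosetta code.
#include <iostream>

bool report_at_this_checkpoint(double elapsed_sec,
                               double avg_checkpoint_interval_sec,
                               double runtime_pref_sec) {
    return elapsed_sec + avg_checkpoint_interval_sec > runtime_pref_sec;
}

int main() {
    const double pref = 6 * 3600.0;     // 6 hr runtime preference
    const double interval = 20 * 60.0;  // checkpoints roughly every 20 minutes
    std::cout << report_at_this_checkpoint(3.0 * 3600.0, interval, pref) << "\n";  // 0: keep going
    std::cout << report_at_this_checkpoint(5.8 * 3600.0, interval, pref) << "\n";  // 1: report now
}
```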
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
...as I said, well acquainted with the problem. LOL - I just saw your FAQ-style post over on Rosetta (which is awesome btw). So I see you're an expert - sorry, didn't mean to be patronising ;) |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
I generally don't understand why credit is granted on a model basis rather than simply like this: credit = computer_time * correction_factor, where the correction factor essentially encodes the computer speed. This would be determined by a quick internal benchmark for every WU at the beginning, which would determine how much a CPU minute is "worth" on that particular machine. End of problem - right? Or am I missing something?
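For illustration, a minimal sketch of that idea; the benchmark numbers and the scaling constant are assumptions for the example, not the actual BOINC or Rosetta@home credit code.

```cpp
// Sketch of time-based credit: scale CPU time by a per-host correction
// factor from a quick benchmark. All constants here are illustrative.
#include <iostream>

double correction_factor(double host_benchmark, double reference_benchmark) {
    return host_benchmark / reference_benchmark;  // 1.0 == reference-speed CPU
}

double proposed_credit(double cpu_seconds, double factor) {
    const double credit_per_reference_cpu_hour = 10.0;  // assumed constant
    return (cpu_seconds / 3600.0) * credit_per_reference_cpu_hour * factor;
}

int main() {
    double f = correction_factor(2.0e9, 1.0e9);                   // host twice as fast as reference
    std::cout << proposed_credit(4 * 3600.0, f) << " credits\n";  // 4 CPU hours -> 80
}
```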
I'll post one in a second.
See, this is a different story on RALPH vs R@H. On RALPH I will submit a job and will be itching for results to come back even 4-5 *hours* later. The sooner I can convince myself that things are working fine, the sooner I'll move stuff over to BOINC. Obviously when I've just done a RALPH update I will test much more thoroughly. But if I'm just repeating a previous run with, say, different parameters, then I'll want the overhead time to be minimal! On Rosetta I typically wait one or two days before pre-analysing results. Then after 5 days I will rerun the analysis - conclusions rarely change at that point, but it makes the results way less noisy and hence more reliable. Hence the default/minimum runtime prefs difference.
Hmm - that's a technical issue - I'll have to talk to DK about this.
THAT would be perfect, of course, but believe me, it's much easier said than done! We have a ton of different protocols all being run on BOINC, and checkpointing even on a single host is a pain in the a***.
I like this idea. Lemme talk to DK about whether this is conceivably possible.
Interesting idea. Lots of bookkeeping, I fear. It also all depends on the *random number*. So we'd have to relaunch *exactly* the same task with the same random seed. Not impossible, but quite a bit of bookkeeping.
The last two ideas spawned a thought: we can certainly *estimate* that certain proteins will take longer than others, if nothing else from the returned runtime distributions off RALPH. We could try and send the longer WUs to faster computers and the shorter ones to slower PCs, that way balancing the load a little bit. I mean, the most major problems, I think, arise when an older machine gets a particularly difficult job. Thanks for all your thoughts, this is a good discussion. I'll email DK the thread too. Mike |
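A rough sketch of that matching heuristic: choose, for a given host, the batch whose median RALPH decoy time best fills that host's runtime preference. The structures and numbers are hypothetical; this is not the BOINC scheduler.

```cpp
// Hypothetical scheduler heuristic: match batches to hosts using the median
// decoy runtimes observed on RALPH. Illustrative only, not the BOINC scheduler.
#include <string>
#include <vector>

struct Batch {
    std::string name;
    double median_decoy_hours;  // median decoy runtime observed on RALPH (reference host)
};

// host_speed: relative CPU speed (1.0 == the reference host the medians came from)
std::string pick_batch(const std::vector<Batch>& batches,
                       double host_speed, double runtime_pref_hours) {
    std::string best;
    double best_gap = 1e30;
    for (const auto& b : batches) {
        double expected = b.median_decoy_hours / host_speed;
        if (expected > runtime_pref_hours) continue;  // first decoy would overrun this host
        double gap = runtime_pref_hours - expected;   // prefer the batch that fills the slot best
        if (gap < best_gap) { best_gap = gap; best = b.name; }
    }
    return best;  // empty string if nothing fits
}

int main() {
    std::vector<Batch> batches = {{"t302_small", 0.5}, {"t315_large", 4.0}};
    pick_batch(batches, 2.0, 3.0);  // fast host, 3 hr pref -> picks "t315_large" (~2 h expected)
}
```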
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I generally don't understand why credit is granted on a model basis *DO* be sure to discuss the "credit wars" with DK! He'll show you the scars. The sooner I can convince myself that things are working fine, the sooner I'll move stuff over to BOINC. ...and this explains why so many bugs make it through to Rosetta! It is EASY to convince yourself your work is perfect. Let's raise the bar a little higher. I set my Ralph runtime preference to 24hrs; SOMEONE needs to test the maximum. And if my task fails, it gets dropped into an average, and by the time my result is returned for analysis (which would show that all 24hr runtimes are failing), the work is already out causing problems on Rosetta. So we'd have to relaunch *exactly* the same task with the same random seed. Not impossible, but quite a bit of bookkeeping. No, actually, you'd build a new task which runs (and runs differently) only one specific model (or a list of specific models). So it wouldn't use the seed. We could try and send the longer WUs to faster computers and the shorter ones to slower PCs, that way balancing the load a little bit. I mean, the most major problems, I think, arise when an older machine gets a particularly difficult job. This would also require changes to the scheduler, and/or a definition of subprojects. |