Message boards : RALPH@home bug list : minirosetta v1.46 bug thread
Author | Message |
---|---|
dekim Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Please post bugs/issues regarding minirosetta version 1.46. This app update includes fixes for an access violation/segmentation fault that was occurring frequently with jobs running constraints. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Wow! I'm only 3 hours into a 24hr preference and this one has already generated 14.8 million page faults! It is called: cc2_1_8_mammoth_cst_hb_t315__IGNORE_THE_REST_1YIXA_3_6349_1_0 There also appear to still be scaling problems on the accepted energy... and I'm still on model 1, step 361,000, so it looks like long-running models. Peak memory usage shown is 305MB. Response time on the Rosetta message boards is about 2 minutes right now. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
It completed model 1 after 4 hours. |
Path7 Send message Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0 |
Long-running model: the next WU, ccc_1_8_native_cst_homo_bench_foldcst_chunk_general_t305__mtyka_IGNORE_THE_REST_1FPRA_4_6306_1_0, ran for over 6 hours (22054.45 seconds) and generated one decoy. CPU_run_time_pref: 7200 seconds. Outcome: Success. Edit: BOINC setting: switch between applications every 60 minutes. This WU ran its 22054.45 seconds continuously (without switching to another application). Did it do any checkpointing? Path7. |
AdeB Send message Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0 |
Error in ccc_1_8_mammoth_mix_cst_homo_bench_foldcst_chunk_general_mammoth_cst_t302__mtyka_IGNORE_THE_REST_2CRPA_8_6270_1_0. stderr out:
<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
ERROR: not able to build valid fold-tree in JumpingFoldConstraints::setup_foldtree
ERROR:: Exit from: src/protocols/abinitio/LoopJumpFoldCst.cc line: 108
called boinc_finish
# cpu_run_time_pref: 14400
</stderr_txt>
]]>
AdeB |
cenit Send message Joined: 26 Apr 08 Posts: 5 Credit: 25,392 RAC: 0 |
I have some 1.47 WU.... |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
Sadly this bug was introduced by one of our coders only hours before we went ahead with the RALPH update, and it is causing these errors. Exquisitely bad luck and bad timing. Hence I did another RALPH update last night, so we're now at 1.47 (as you have undoubtedly noticed!). The 1.47 WUs should not crash with this anymore. WRT long-running WUs: I know that the RALPH default WU length is 1 hour, and it is right that it is set to that. This allows us to get the tests through more quickly. But since Rosetta@home's default WU length is more like 3-4 hours, it's not uncommon to see this on RALPH too. So don't worry if you see overrunning WUs in the 3-4 hr range; I'd say even 6 hrs is not always avoidable, especially on a slower PC. More than that, though, is of concern, so thanks for pointing these out! |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
WRT the graphics problem: I've reproduced this here locally, so I will fix it when I get to it. I am most concerned about stability issues right now though - as you know, things have been pretty rough over on Rosetta@home. So I reeeeally want to concentrate on getting the bugs in the actual computation core sorted out - that's my main priority. I think over Xmas I might give the graphics an overhaul anyway, so I'll fix the scaling error then too. IMHO our graphics are a little ugly compared to Climate@HOME etc., and I used to be into graphics coding, so I'll try and tinker over the hols - that'll be fun! :) - Watch this space. Mike |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I'm well familiar with the runtime per task vs per model, and defaults and expectations... in my case, I don't really have what one would call a slower computer, I have plenty of memory, and I have 24hr runtime preference so I am expecting 24hrs total... but 4hrs per model exceeds the guidelines described here. So, either the guideline should change, or the tasks should change. The task now has 11.5 hours in and is on model 5. So it got 3 more models done in just over 7 hours. In other words, that first model took twice as long as the others. This will cause significant variation in credit, depending on which model you were assigned. Has the reporting of runtime on a per model basis now been completed, so the tasks are reporting back that level of detail? |
ramostol Send message Joined: 29 Mar 07 Posts: 24 Credit: 31,121 RAC: 0 |
So, either the guideline should change, or the tasks should change. I suggest a look at the guidelines. In my Rosetta experience some current errors tend to appear only after at least 3-4 hours of computing, which makes testing on Ralph with a default runtime of 1 hour somewhat questionable. |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
I'm well familiar with the runtime per task vs per model, and defaults and expectations... in my case, I don't really have what one would call a slower computer, I have plenty of memory, and I have 24hr runtime preference so I am expecting 24hrs total... but 4hrs per model exceeds the guidelines described here. Hmm, here's the problem: the minimum amount of work is 1 decoy. Sometimes it's not possible to create that first decoy in 1 hr. Sometimes it will simply take 4 or 5 hours, particularly if the protein is large. Worse still, the precise time a decoy will take, even within the exact same batch or subbatch, will vary hugely and largely beyond our control. For example, the search might get stuck for some time and take much longer than its mates to, say, close a loop in the protein. Now, you could argue we should stop and abort if some time limit is exceeded. But then we're throwing away valuable data! Worse still, we cannot predict how long a model will take from the outset. This is the reason the "percentage finished" bar is so useless and appears to stall and slow down towards the end of a job. You see? How can we display a percentage of the job if we don't know what 100% is? It's easy if individual decoys take much less than the set runtime, because then the program will run more or less exactly that long. But what if even the first decoy takes longer than that? When you see the bar "stall" at 98% or so, it's not that the program has stalled; the program is actually only at, say, 50%, but we got the 100% point wrong, so to speak, simply because we can't predict the runtime very well. We've mulled over this problem again and again. It's hard to think of a good solution. If you have any good ideas then *do* let us know :) Now, on RALPH we have set the default runtime to 1 hr even though we are well aware that some tasks take much, much longer. The reason is that for those jobs that actually take less than 1 hr we want the clients to finish and report the results as soon as possible. We don't want them to crunch for 10 hrs and produce thousands of models - that's what Rosetta@home is for. You see our dilemma? And it is totally right that sometimes errors don't occur till 3-4 hours into the job. Hence we won't abort halfway through the first decoy. At least one decoy must finish for us to get a sense of bug-freeness.
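To illustrate the point: a minimal sketch of why the bar saturates when the first decoy overruns the runtime preference. The function and numbers here are illustrative assumptions, not the actual minirosetta progress code.

```cpp
// Illustrative only -- not the actual minirosetta code. If a decoy really
// needs 4 hours but the only available estimate of "100%" is the 1 hr
// runtime preference, the reported fraction saturates and appears to stall.
#include <algorithm>
#include <iostream>

double reported_progress(double elapsed_sec, double runtime_pref_sec) {
    double frac = elapsed_sec / runtime_pref_sec;  // assumes the job ends at the preference
    return std::min(frac, 0.99);                   // never claim 100% until the decoy finishes
}

int main() {
    const double pref = 3600.0;  // 1 hr preference
    for (double t = 1800.0; t <= 4 * 3600.0; t += 1800.0)
        std::cout << t / 3600.0 << " h elapsed -> "
                  << reported_progress(t, pref) * 100.0 << "% shown\n";
}
```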
see my comments above. It sucks. I know. Ideas appreciated :)
Yes, it has been there for a while now. I can plot you a histogram of runtimes if you like :) It's useful for catching rogue decoys that take abnormal runtimes sometimes, but as I said - sometimes a decoy will simply take that long. We'll try and think of algorithmic changes we could implement to tackle these better, but at the end of the day *some* variability will always remain. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
...as I said, well acquainted with the problem. The entire credit system is model-centric. We've always maintained that some models are longer than others, but over time it would average out. ...but that was before anyone really had specific data on the degree of variability in per-model runtimes. If you could produce a chart (or set of charts for a batch) that would show us more specifically what we're talking about, that would be great. Many volunteers have very low tolerance for being "ripped off" and not being granted "fair" credit for their work (ask DK about the "credit wars"). Some ideas:
Grant more credit for long-running models. I would hope you obfuscated the model runtimes enough in the task results that it would be very difficult to locate and falsify them. Assuming that is the case, it should be feasible to prorate some additional credit for long models, and perhaps a negative adjustment for short models.
Revise the expected per-model runtime guidelines. How can people have any confidence that some of the odd behaviors are "normal" if their observations are outside of the project's own published definition of what normal is? Why not publish details in the active WU log? "...this batch includes the following 20 proteins, with the following observed min/max/avg time seen on Ralph"
Increase the minimum runtime preference allowed. ...as already under discussion on the Rosetta boards. The longer the runtime, the less likely the user observes any significant variation or "slow down". One of the main arguments against long runtimes seems to be that the project wants to get results back ASAP. ...but how often do you actually look at the data within 24hrs of its arrival? So it COULD have run for 24hrs longer and still been reported by the time you are actually using the data.
Support trickle-up messages. Why not report each and every completed model in trickle-up messages? That way you get instant results and reduced scheduler load, and I get minimum download bandwidth because I can run the same task for several days if I like.
Support partial completions (see the sketch after this post). Each time a checkpoint is completed, check how we're doing on runtime. If the next checkpoint would exceed the runtime preference, report the task back with its last checkpoint, and devise a process for redistributing that to another host for completion. In other words, move all end-of-model checking to each checkpoint.
Revise the scheduler to be runtime-aware. You have a pretty good idea (from Ralph results if nothing else) how long models will take for a given protein within a batch. If model runtime exceeds the runtime preference for a host, then find another task to assign to it which has shorter models.
Become aggressive about cutting off long searches. Rather than throwing away data when you cut off a search, why not build new tasks based on that knowledge? Have them report higher FLOPS estimates, and send out a new task explicitly to pursue this model in far greater detail than you do at present. So the preliminary task will cut earlier than it does now, and the rework task will cut later.
Support opt-in to specific subprojects. Several people have suggested you grant some control to users over what they want to crunch. WCG does this. You could make long models, or model rework tasks, a separate subproject choice. You could make the default for new subprojects be that only users who "opt in to new" will crunch them unless they explicitly opt in to the specific new subproject. Then over time, only once the error rate reaches a defined low threshold, does the default expand to anyone who has not explicitly opted out.
Need more ideas, or clarification? Say the word. |
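As a rough illustration of the "partial completions" idea above, a minimal sketch of the checkpoint-time decision it describes; the function name, checkpoint interval, and thresholds are hypothetical, not existing minirosetta logic.

```cpp
// Hypothetical sketch of the "partial completions" proposal: at each
// checkpoint, report the task back if one more checkpoint interval would
// overrun the user's runtime preference. Not existing minirosetta code.
#include <iostream>

bool report_at_this_checkpoint(double elapsed_sec,
                               double avg_checkpoint_interval_sec,
                               double runtime_pref_sec) {
    return elapsed_sec + avg_checkpoint_interval_sec > runtime_pref_sec;
}

int main() {
    const double pref = 6 * 3600.0;     // 6 hr runtime preference
    const double interval = 20 * 60.0;  // checkpoints roughly every 20 minutes
    std::cout << report_at_this_checkpoint(3.0 * 3600.0, interval, pref) << "\n";  // 0: keep going
    std::cout << report_at_this_checkpoint(5.8 * 3600.0, interval, pref) << "\n";  // 1: report now
}
```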
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
...as I said, well acquainted with the problem. LOL - I just saw your FAQ-style post over on Rosetta (which is awesome btw). So I see you're an expert - sorry, didn't mean to be patronising ;) |
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
I generally don't understand why credit is granted on a model basis rather than simply like this: credit = computer_time * correction_factor, where the correction factor essentially encodes the computer speed. This would be determined by a quick internal benchmark for every WU at the beginning, which would determine how much a CPU minute is "worth" on that particular machine. End of problem - right? Or am I missing something?
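For illustration, a minimal sketch of that idea; the benchmark numbers and the scaling constant are assumptions for the example, not the actual BOINC or Rosetta@home credit code.

```cpp
// Sketch of time-based credit: scale CPU time by a per-host correction
// factor from a quick benchmark. All constants here are illustrative.
#include <iostream>

double correction_factor(double host_benchmark, double reference_benchmark) {
    return host_benchmark / reference_benchmark;  // 1.0 == reference-speed CPU
}

double proposed_credit(double cpu_seconds, double factor) {
    const double credit_per_reference_cpu_hour = 10.0;  // assumed constant
    return (cpu_seconds / 3600.0) * credit_per_reference_cpu_hour * factor;
}

int main() {
    double f = correction_factor(2.0e9, 1.0e9);                   // host twice as fast as reference
    std::cout << proposed_credit(4 * 3600.0, f) << " credits\n";  // 4 CPU hours -> 80
}
```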
I'll post one in a second.
See, this is a different story on RALPH vs R@H. On RALPH I will submit a job and will be itching for results to come back even 4-5 *hours* later. The sooner I can convince myself that things are working fine, the sooner I'll move stuff over to BOINC. Obviously when I've just done a RALPH update I will test much more thoroughly. But if I'm just repeating a previous run with, say, different parameters, then I'll want the overhead time to be minimal! On Rosetta I typically wait one or two days before pre-analysing results. Then after 5 days I will rerun the analysis - conclusions rarely change at that point, but it makes the results way less noisy and hence more reliable. Hence the default/minimum runtime prefs difference.
Hmm - that's a technical issue - I'll have to talk to DK about this.
THAT would be perfect, of course, but believe me, it's much easier said than done! We have a ton of different protocols all being run on BOINC, and checkpointing even on a single host is a pain in the a***.
I like this idea. Lemme talk to DK about whether this is conceivably possible.
Interesting idea. Lots of bookkeeping, I fear. It also all depends on the *random number*. So we'd have to relaunch *exactly* the same task with the same random seed. Not impossible, but quite a bit of bookkeeping.
The last two ideas spawned a thought: we can certainly *estimate* that certain proteins will take longer than others, if nothing else from the returned runtime distributions off RALPH. We could try and send the longer WUs to faster computers and the shorter ones to slower PCs, that way balancing the load a little bit. I mean, the most major problems, I think, arise when an older machine gets a particularly difficult job. Thanks for all your thoughts, this is a good discussion. I'll email DK the thread too. Mike |
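A rough sketch of that matching heuristic: choose, for a given host, the batch whose median RALPH decoy time best fills that host's runtime preference. The structures and numbers are hypothetical; this is not the BOINC scheduler.

```cpp
// Hypothetical scheduler heuristic: match batches to hosts using the median
// decoy runtimes observed on RALPH. Illustrative only, not the BOINC scheduler.
#include <string>
#include <vector>

struct Batch {
    std::string name;
    double median_decoy_hours;  // median decoy runtime observed on RALPH (reference host)
};

// host_speed: relative CPU speed (1.0 == the reference host the medians came from)
std::string pick_batch(const std::vector<Batch>& batches,
                       double host_speed, double runtime_pref_hours) {
    std::string best;
    double best_gap = 1e30;
    for (const auto& b : batches) {
        double expected = b.median_decoy_hours / host_speed;
        if (expected > runtime_pref_hours) continue;  // first decoy would overrun this host
        double gap = runtime_pref_hours - expected;   // prefer the batch that fills the slot best
        if (gap < best_gap) { best_gap = gap; best = b.name; }
    }
    return best;  // empty string if nothing fits
}

int main() {
    std::vector<Batch> batches = {{"t302_small", 0.5}, {"t315_large", 4.0}};
    pick_batch(batches, 2.0, 3.0);  // fast host, 3 hr pref -> picks "t315_large" (~2 h expected)
}
```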
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I generally don't understand why credit is granted on a model basis *DO* be sure to discuss the "credit wars" with DK! He'll show you the scars. The sooner I can convince myself that things are working fine, the sooner I'll move stuff over to BOINC. ...and this explains why so many bugs make it through to Rosetta! It is EASY to convince yourself your work is perfect. Let's raise the bar a little higher. I set my Ralph runtime preference to 24hrs; SOMEONE needs to test the maximum. And if my task fails, it gets dropped into an average, and by the time my result is returned for analysis (which would show that all 24hr runtimes are failing), the work is already out causing problems on Rosetta. So we'd have to relaunch *exactly* the same task with the same random seed. Not impossible, but quite a bit of bookkeeping. No, actually, you'd build a new task which runs (and runs differently) only one specific model (or a list of specific models). So it wouldn't use the seed. We could try and send the longer WUs to faster computers and the shorter ones to slower PCs, that way balancing the load a little bit. I mean, the most major problems, I think, arise when an older machine gets a particularly difficult job. This would also require changes to the scheduler, and/or a definition of subprojects. |