41)
Message boards :
RALPH@home bug list :
minirosetta v1.54 bug thread
(Message 4555)
Posted 26 Jan 2009 by mtyka Post:
That error, however, is not OK. I know it exists; I've been seeing it for weeks now. Sadly it's always a little bit different and always fails in a slightly different place. I suspect the place where it fails is not actually where the problem lies. The problem is somewhere else in the code and presumably randomly corrupts other areas, which then fail. I have no idea how to track this one down :( I can only hope that I get a reproducible version here locally one day. Maybe I'll have to run tens of thousands of local debug jobs on the cluster or something like that. Mike
42)
minirosetta v1.54 bug thread
(Message 4554)
Posted 26 Jan 2009 by mtyka Post: And it goes on... This one is OK. This error and anything similar just happens because I've changed the names of a handful of options. So if an old WU gets sent out with the new executable, this happens. Not really a bug, merely the data lagging behind the code.
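One way to soften this kind of data-vs-code lag is to translate retired option names when a work unit's command line is read. The sketch below only illustrates the idea: the option names and the `RENAMED_OPTIONS` table are made up and are not minirosetta's actual options or code.

```python
# Hypothetical table of retired option names and their replacements.
# These names are invented for illustration only.
RENAMED_OPTIONS = {
    "-old_nstruct": "-nstruct",
    "-old_database": "-database",
}


def translate_options(args):
    """Return a copy of args with retired option names replaced."""
    return [RENAMED_OPTIONS.get(arg, arg) for arg in args]
```

With such a table in place, an old WU carrying `-old_nstruct 5` would be read as `-nstruct 5` by the new executable, and unknown arguments pass through unchanged.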
43)
minirosetta v1.54 bug thread
(Message 4545)
Posted 26 Jan 2009 by mtyka Post: 1.54 is here. It's been a while since I started a new thread, so here it is. 1.54 lays out a few more traps for potential problems, and I've had a stab at addressing the problem that made it crash in the middle of option initialization. Also I've limited the number of decoys to 99, so the WU will finish cleanly once it has reached that. That should limit upload problems, although it would be better to actually monitor output size instead. I'll add that soon. I have to say that this version is probably the last of this fast sequence of updates. As far as I can see, 1.53 has fixed all the issues that were reasonably tractable and reproducible. With only 600 testers on RALPH there's a limit to how much we can do. Provided I have not accidentally introduced new stupid problems into 1.54, I intend to farm it out to Rosetta@HOME. There, with so many more users, we might be able to get a better handle on the remaining problems. The current error rate is below 1-2%, which is about 7-10 fold better than previously and comparable with the old Rosetta app. So far so good. It's nearly midnight - I'm off to bed as soon as the update is out. Nighty night. ;)
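A minimal sketch of the stop condition described above, assuming a simple per-decoy check. The 99-decoy cap comes from the post; the output-size check is the not-yet-implemented idea, and `should_stop`, `output_path` and `max_output_bytes` are hypothetical names, not part of minirosetta.

```python
import os

MAX_DECOYS = 99  # cap mentioned in the post


def should_stop(decoys_done, output_path=None, max_output_bytes=None):
    """Return True when the work unit should finish cleanly."""
    if decoys_done >= MAX_DECOYS:
        return True
    # Hypothetical size check: stop once the output file reaches the limit.
    if output_path is not None and max_output_bytes is not None:
        if (os.path.exists(output_path)
                and os.path.getsize(output_path) >= max_output_bytes):
            return True
    return False
```

Checking the file size after each decoy would bound the upload directly, whereas the decoy count only bounds it indirectly.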
44)
minirosetta v1.48-1.51 bug thread
(Message 4544)
Posted 25 Jan 2009 by mtyka Post: Please mister, can I have some more? No worries - there shall be more work soon. I'm preparing another update.
45)
minirosetta v1.48-1.51 bug thread
(Message 4541)
Posted 24 Jan 2009 by mtyka Post: After Rosetta 1.47 totally crashing on my MacBook on 27 Dec 2008 (after one week of faultless computing, within hours after connecting to the Rosetta server; still haven't recovered) I have waited for an opportunity to test what is happening on Ralph. At last I received two 1.53-tasks - however, they show exactly the same symptoms as observed using 1.47 on Rosetta: Hi ramostol, thanks for joining Ralph. Your error is new; I've not seen it on any other machine yet, but at least now we have a chance to catch it. Shame the trace is giving so little information. Mike
46)
minirosetta v1.48-1.51 bug thread
(Message 4538)
Posted 24 Jan 2009 by mtyka Post:
Currently there is no monitoring of that. I'll look into putting a safety stop on that. I'm not even sure how to get that info from the BOINC API.
47)
minirosetta v1.48-1.51 bug thread
(Message 4516)
Posted 24 Jan 2009 by mtyka Post: 1.53 is out. This includes a fix for the API bug that caused crashes when unzipping .zip files. I hope. Fingers crossed ;) Also the graphics are updated and should not freeze. What I'd like to know from you:
- Do you ever get any long-running tasks? (longer than PrefRuntime + 4 hrs)
- Do the checkpoints work and honor the user's setting? (Feet1st?)
- Do you get any jobs that are stuck?
- Do the graphics behave properly again?
Thanks ya all. Enjoy the weekend ;) Mike
48)
minirosetta v1.48-1.51 bug thread
(Message 4513)
Posted 24 Jan 2009 by mtyka Post: This one failed (1st time) Yeah, guess what. I just found a bug in the BOINC API! Holy crap. Basically, as far as I can see, there's a memory leak when it's trying to unzip files. Mostly all you see is the application dying, kicking and screaming, just after initialization. One RALPH user though produced a suspicious trace (thank you philip in hongkong!). I have to stress that this is the only job out of hundreds of such failures that has returned with a trace.
It fails in the unzip code! OMG. I've been tinkering with the code; the bug probably stems from a single byte not being set to 0. This explains the sporadic nature - if the relevant byte happens to already be 0 then all is fine. I'll push out a version soon to see if the fix works. It's all speculation at this point. Also sorry about the graphics error: I increased the buffer sizes and (duh!) forgot to also update the graphics app .. I'll do that together with 1.53. Mike
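To see why a single unset byte gives sporadic failures, here is a small Python simulation of reading a C-style string from a reused buffer. It is an illustration of the general failure mode only and is not the BOINC unzip code; `read_c_string`, `write_name` and the buffer contents are made up.

```python
def read_c_string(buf):
    """Read bytes up to the first 0 byte, the way C string functions do."""
    end = buf.find(0)
    return bytes(buf if end == -1 else buf[:end])


def write_name(buf, name, terminate):
    """Copy name into buf; write the terminating 0 byte only if asked."""
    buf[:len(name)] = name
    if terminate:
        buf[len(name)] = 0


# Buffer that still holds junk from an earlier use.
dirty = bytearray(b"XXXXXXXXXXXX\x00")
write_name(dirty, b"a.zip", terminate=False)
dirty_result = read_c_string(dirty)   # runs on into the leftover junk

# Buffer that happens to be all zeros: the same bug goes unnoticed.
clean = bytearray(13)
write_name(clean, b"a.zip", terminate=False)
clean_result = read_c_string(clean)   # looks correct purely by luck

# With the terminator written explicitly, the old contents no longer matter.
fixed = bytearray(b"XXXXXXXXXXXX\x00")
write_name(fixed, b"a.zip", terminate=True)
fixed_result = read_c_string(fixed)
```

The same missing write produces a wrong name in one run and a correct one in another, depending only on what the memory held beforehand.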
49)
minirosetta v1.48-1.51 bug thread
(Message 4503)
Posted 23 Jan 2009 by mtyka Post: Thank you all for reporting the problems! The problem divides itself into three parts: first cornering the error (turning it into an error message), then reproducing it here, and then finding the root of the problem. At any stage there are three types of problem: a) Ones that fail with an error message like this:
These ones we've basically identified. Why they occur is merely a matter of tracking down the route. These ones I have already passed on to the relevant programmers, who are fixing them as we speak. b) Segfaults with traces: - Callstack - Segfault - bad. But at least I get a trace - I can thus look into the code and understand (or guess) why the program could fail at that point. Then I can turn them into a) type errors. c) Segfaults without traces:
These are random segfaults that occurred somewhere. I can do virtually nothing from here. If I rerun the *exact same commandline* from here, all runs fine. So these ones I can only tentatively bracket with stderr statements. Right now I am mainly trying to turn b) and c) into a). Once an error is in the form of a) it's usually trivial to solve! Anyway, just in case you guys were interested in what I'm trying to do. Oh, I guess there are also errors that are *behavioral* errors. Those are the ones I cannot see from here but you guys have to tell me about. Stuff like the graphics or overrunning models or other strange behaviour. :-)
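Bracketing a suspect region with stderr statements can be sketched as follows. This is a minimal Python illustration of the idea, not the actual minirosetta debug code; the `bracket` helper and its marker format are assumptions. A failure is simulated with an exception here, whereas a real segfault would kill the process outright, which is why each marker is flushed immediately.

```python
import sys
from contextlib import contextmanager


@contextmanager
def bracket(label, stream=sys.stderr):
    """Write enter/leave markers around a suspect region of code."""
    stream.write(f"[enter] {label}\n")
    stream.flush()  # flush right away so the marker survives a crash
    yield
    # Only reached if the region finished without failing.
    stream.write(f"[leave] {label}\n")
    stream.flush()
```

An `[enter]` line without a matching `[leave]` in the returned stderr narrows the failure down to that region, and nesting the brackets narrows it further.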
50)
minirosetta v1.48-1.51 bug thread
(Message 4502)
Posted 23 Jan 2009 by mtyka Post:
We already submit identical jobs to many, many machines - and they only fail *sometimes*. Even when the same random number seed is used.
Again .. we're already doing that. But the errors are sporadic in nature, so submitting the same task might not throw an error "this time". I have noticed though that some of the problems seem to be highly machine specific. Like one user will always produce this one kind of error and nothing else. Weird. Something strange about their setup? No idea.
51)
minirosetta v1.48-1.51 bug thread
(Message 4494)
Posted 23 Jan 2009 by mtyka Post: [quote]I tried loading up the docking task I got. It is 1.51. Displayed graphic just as the task was starting. Waited, waited... finally realized it was using more CPU than the thread working on the protein! Double checked Ralph settings for % of CPU for the graphic, set to default which is 10%. Here's my screenshot showing the graphic monopolizing one core, while the two running tasks are competing for the other. Net result, nothing shown in the graphic after several minutes, and graphic thread consuming much more than 10% of CPU. [/quote] Feet1st, this is awesome - debugging on an unprecedented level :) Nice to get an idea of what all this looks like from your point of view. These docking tasks are new and not mine - lemme track down the person submitting these and make sure the graphics app can deal with it. Mike
52)
minirosetta v1.48-1.51 bug thread
(Message 4487)
Posted 22 Jan 2009 by mtyka Post: Tasks can not be suspended, boinc can do nothing with process. After few days I have about 10-15 dead rosetta tasks with 3M RAM allocated. Is there a way to turn off the graphics? What's DEP? This is interesting, I will have to look at the code to see if there's a way for it to hang somewhere. I wonder what happens to mini if the graphics app fails .. I will have to try that locally. If you can describe your set-up in more detail that would be excellent. Is this task 1.51 or 1.50 or 1.48? Mike
53)
minirosetta v1.48-1.51 bug thread
(Message 4486)
Posted 22 Jan 2009 by mtyka Post: This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2. Absolutely reasonable, isn't it? 3-4 hours per model is what we expect these days with the mammoth tasks. > Step 900,000! Step 900000? What does that mean? "Step"?? > Its peak memory usage so far was 430MB. Hmm yeah, I guess we need to flag these. They're big, I know.
Not sure where I can get this info. If you see anomalies let us know.
54)
minirosetta v1.48-1.51 bug thread
(Message 4485)
Posted 22 Jan 2009 by mtyka Post: Morning. Things are moving forward - 1.51 is out. This should fix the checkpoint issue: checkpoints should not be dumped as frequently as before and should try to honor the user's setting. Now - I have to say this is not *always* possible. To answer your question, Feet1st: due to the structure of most simulation software (not just ours) you can't just checkpoint at any arbitrary point. Sometimes you *need* to checkpoint frequently or not at all, sometimes you can't checkpoint (or at least it would take huge amounts of data, which is not OK). So what happens is that we checkpoint, but we "hold on" to the data in memory (i.e. buffer it) until it is officially time to checkpoint, and then we dump all the gathered checkpoints. Now, of course there is a limit to how much we can hold on to in order not to overflow the memory, so occasionally we *have* to dump, even if it's not time. Also, at the end of a decoy, everything is dumped and deleted and dealt with; the user setting cannot have any control over that. Hope that explains it. However, there was a glitch in the buffering mechanism, so it was always dumping. 1.51 should fix that - could you confirm that that is so!? Otherwise this release has added debug information to let me figure out where all this stuff is failing. Believe me guys, we're now in the land where I cannot reproduce these errors here whatsoever. Not on the Linux boxes, Mac boxes or Windows boxes we have. Nowhere. Why these remaining segfaults occur is a total mystery to me, so please bear with me. This is going to be incredibly difficult to track down. Thanks for all your help! Every post is super useful to us!! Mike
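The buffering scheme described in this post can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the minirosetta implementation: the `CheckpointBuffer` class, its parameters and the explicit `now` timestamps are all hypothetical, and a real BOINC application would ask the BOINC API whether it is time to checkpoint rather than keep its own timer.

```python
class CheckpointBuffer:
    """Hold checkpoints in memory and write them out only when allowed."""

    def __init__(self, interval_s, max_buffered_bytes, write, start):
        self.interval_s = interval_s                  # user's disk-write interval
        self.max_buffered_bytes = max_buffered_bytes  # hypothetical memory cap
        self.write = write                            # callback that dumps a list
        self.pending = []
        self.pending_bytes = 0
        self.last_flush = start

    def add(self, data, now):
        """Buffer one checkpoint; dump only if it is time or memory is full."""
        self.pending.append(data)
        self.pending_bytes += len(data)
        due = now - self.last_flush >= self.interval_s
        full = self.pending_bytes >= self.max_buffered_bytes
        if due or full:
            self.flush(now)

    def flush(self, now):
        """Write out everything gathered so far and reset the buffer."""
        if self.pending:
            self.write(self.pending)
        self.pending = []
        self.pending_bytes = 0
        self.last_flush = now

    def end_of_decoy(self, now):
        """At the end of a decoy everything is dumped, whatever the setting."""
        self.flush(now)
```

With an 1800-second interval, checkpoints added in between stay in memory, and the disk is only touched when the interval has passed, the cap is hit, or a decoy ends.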
55)
minirosetta v1.48-1.51 bug thread
(Message 4473)
Posted 21 Jan 2009 by mtyka Post: I still have problems on my new W2008 X64 server. How do you know it *hangs*?? Rosetta will not print anything to stdout - that's normal. Are the graphics moving? What happens if you just let it run for a few hours? Mike
56)
minirosetta v1.48-1.51 bug thread
(Message 4464)
Posted 21 Jan 2009 by mtyka Post: As you wish ...
57)
minirosetta v1.48-1.51 bug thread
(Message 4462)
Posted 20 Jan 2009 by mtyka Post: More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My hard drive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace. Hmm OK, I'll look into this.
58)
minirosetta v1.48-1.51 bug thread
(Message 4461)
Posted 20 Jan 2009 by mtyka Post: http://ralph.bakerlab.org/result.php?resultid=1250838 Awesome!! Our new debug tools are working. This rare error (I've never seen it in 1000s of runs) would have gone unnoticed before and led to a segfault. Now it gets caught at least and we can find its cause. Thanks!
59)
minirosetta v1.48-1.51 bug thread
(Message 4457)
Posted 20 Jan 2009 by mtyka Post: Awesome, guys! Keep me posted on what you see out there. The error rate so far is looking fabulous. I'll probably update the app once more today to fix an issue with the symbol store, so that we get code traces in cases where it still fails. Mike :)
60)
minirosetta v1.48-1.51 bug thread
(Message 4454)
Posted 19 Jan 2009 by mtyka Post: Yes! Ignore that database error message - for some reason the database did not get uploaded to the server when I did the update on Sunday. Something to do with the move to a new update machine, I suspect..
©2024 University of Washington
http://www.bakerlab.org