Posts by mtyka

41) Message boards : RALPH@home bug list : minirosetta v1.54 bug thread (Message 4555)
Posted 26 Jan 2009 by mtyka
Post:

...
*** Dump of thread ID 3700 (state: Initialized): ***
...


That error, however, is not OK. I know it exists; I've been seeing it for weeks now. Sadly it's always a little bit different and always fails in a slightly different place. I suspect that's not actually where the problem lies. The problem is somewhere else in the code and presumably corrupts other areas at random, which then fail. I have no idea how to track this one down :(
I can only hope that I get a reproducible version here locally one day. Maybe I'll have to run tens of thousands of local debug jobs on the cluster or something like that.

Mike
42) Message boards : RALPH@home bug list : minirosetta v1.54 bug thread (Message 4554)
Posted 26 Jan 2009 by mtyka
Post:
And it goes on...

BOINC:: Initializing ... ok.
[2009- 1-26 12:40:10:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
ERROR: Option matching -loop:close_loops not found in command line top-level context

Peter



This one is OK. This error and anything similar just happens because I've changed the names of a handful of options. So if an old WU gets sent out with the new executable, this happens. Not really a bug, merely the data lagging behind the code.
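
To illustrate the mismatch, here is a minimal sketch assuming a strict option parser that rejects any flag it has no registration for. Only `-loop:close_loops` and the error text come from the log above; the renamed option, the second option, and all function names are hypothetical.

```python
class UnknownOptionError(Exception):
    """Raised when a command line names an option the executable does not know."""


def parse_options(argv, registered):
    """Parse '-name value' pairs strictly; an unknown name is fatal."""
    parsed = {}
    for i in range(0, len(argv), 2):
        name = argv[i]
        if name not in registered:
            raise UnknownOptionError(
                f"Option matching {name} not found in command line top-level context"
            )
        parsed[name] = argv[i + 1]
    return parsed


# Hypothetical registry of the NEW executable, after an option was renamed.
NEW_REGISTRY = {"-loops:close_loops", "-out:max_models"}

# An OLD work unit still carries the old spelling; a new one does not.
old_wu_args = ["-loop:close_loops", "true", "-out:max_models", "5"]
new_wu_args = ["-loops:close_loops", "true", "-out:max_models", "5"]
```

Parsing `new_wu_args` succeeds, while `old_wu_args` raises `UnknownOptionError` on the first flag. The code is fine; the work unit is simply older than the executable.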
43) Message boards : RALPH@home bug list : minirosetta v1.54 bug thread (Message 4545)
Posted 26 Jan 2009 by mtyka
Post:
1.54 is here.

It's been a while since I started a new thread, so here it is.

1.54 lays out a few more traps for potential problems, and I've had a stab at addressing the problem that made it crash in the middle of option initialization. I've also limited the number of decoys to 99, so the WU will finish cleanly once it reaches that. That should limit upload problems, although it would be better to actually monitor the output size instead. I'll add that soon.
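
As a rough sketch of the decoy cap (not the actual minirosetta loop), the work unit keeps producing decoys until either the runtime budget or the cap of 99 is reached, and then finishes normally. The function and parameter names are hypothetical.

```python
MAX_DECOYS = 99  # the cap mentioned in the post


def run_work_unit(make_decoy, time_left, max_decoys=MAX_DECOYS):
    """Generate decoys until the runtime budget or the decoy cap is hit.

    make_decoy(n) builds decoy number n; time_left() returns True while
    the user's preferred runtime has not been used up.
    """
    decoys = []
    while len(decoys) < max_decoys and time_left():
        decoys.append(make_decoy(len(decoys) + 1))
    return decoys  # the caller wraps up and reports whatever was produced
```

The cap is a blunt proxy for output size: it bounds the number of models, not the number of bytes, which is why monitoring the output size directly would be the better check.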

I have to say that this version is probably the last of this fast sequence of updates. As far as I can see, 1.53 has fixed all the issues that were reasonably tractable and reproducible. With only 600 testers on RALPH there's a limit to how much we can do. Provided I have not accidentally introduced new stupid problems into 1.54, I intend to farm it out to Rosetta@home. There, with so many more users, we might be able to get a better handle on the remaining problems. The current error rate is below 1-2%, which is about 7- to 10-fold better than previously and comparable with the old Rosetta app.

So far so good. It's nearly midnight - I'm off to bed as soon as the update is out. Nighty night. ;)
44) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4544)
Posted 25 Jan 2009 by mtyka
Post:
Please mister, can I have some more?

Can't test the version with no work to do ... :)


No worries - there shall be more work soon. I'm preparing another update.
45) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4541)
Posted 24 Jan 2009 by mtyka
Post:
After Rosetta 1.47 totally crashing on my MacBook on 27 Dec 2008 (after one week of faultless computing, within hours after connecting to the Rosetta server; still haven't recovered) I have waited for an opportunity to test what is happening on Ralph. At last I received two 1.53-tasks - however, they show exactly the same symptoms as observed using 1.47 on Rosetta:

abinitio_norelax_homfrag_natfrag_129_B_1pxuA_SAVE_ALL_OUT_7037_1_0

<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(69428,0xa031b720) malloc: *** error for object 0x1748220: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
minirosetta_1.47_i686-apple-darwin(69428,0xa031b720) malloc: *** error for object 0x1748220: incorrect checksum for freed object - object was probably modified after being freed.

*** set a breakpoint in malloc_error_break to debug
SIGBUS: bus error

Crashed executable name: minirosetta_1.47_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.5.6 build 9G55
Sat Dec 27 14:45:29 2008




Hi ramostol, thanks for joining RALPH. Your error is new; I've not seen it on any other machine yet, but at least now we have a chance to catch it. Shame the trace is giving so little information.

Mike
46) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4538)
Posted 24 Jan 2009 by mtyka
Post:

Hmmm, shouldn't that be a test? When it writes to the file it should check to see if it is nearing the limit, and if so, stop ... I mean, 24 hours worth of work down the tubes other than we discovered the error ... (thank you) ...

Paul, I think you are saying "shouldn't it have detected the problem much sooner?". It dawned on me later, looking at how far over the file size was, that there must be much more to it than just the last model. It was actually wrapping up the entire task when it failed.

There are file size limits in BOINC, and it does check them as the task runs (so far as I know anyway). But this one may not have been detected because I'll bet the last thing the task did was zip it all up to send in. So, there was no way to know the final size until the compression completed.

But yes, this seems to indicate that the task is producing too many models. And perhaps that there is some intermediate file size that could be further limited to ensure the compressed size does not exceed the limit there. The other way around it would be to limit the number of models the task is allowed to produce. But this would mess up my runtime preference, and my future requests for work, because it would end significantly sooner than my 24hr preference.

Actually, this would be another good factor for the watchdog to keep tabs on (or the logic about whether to start the next model). If he sees a problem, then he could get the current work reported rather than losing it.


Currently there is no monitoring of that. I'll look into putting a safety stop on that. I'm not even sure how to get that info from the BOINC API.
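
One possible shape for such a safety stop is sketched below. It is a sketch under assumptions: it measures the output files on disk directly and does not use the BOINC API at all, the limit value would have to come from somewhere else, and all names are hypothetical. It also only looks at uncompressed size, so it is conservative with respect to the zipped upload discussed above.

```python
import os


def output_size(paths):
    """Total size in bytes of the output files that exist so far."""
    return sum(os.path.getsize(p) for p in paths if os.path.exists(p))


def can_start_next_model(current_bytes, models_done, limit_bytes):
    """True if one more model of average size should still fit under the limit."""
    if models_done == 0:
        # Nothing to extrapolate from yet; only refuse if already at the limit.
        return current_bytes < limit_bytes
    avg_per_model = current_bytes / models_done
    return current_bytes + avg_per_model <= limit_bytes
```

The check would run between models: if `can_start_next_model(...)` is false, the task wraps up and reports what it has instead of starting a model whose output might push the upload over the limit.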
47) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4516)
Posted 24 Jan 2009 by mtyka
Post:
1.53 is out. This includes a fix for the API bug that caused crashes when unzipping .zip files. I hope. Fingers crossed ;)

Also the graphics are updated and should not freeze.

What I'd like to know from you:

- Do you ever get any long-running tasks? (longer than PrefRuntime + 4 hrs)
- Do the checkpoints work and honor the user's setting? (Feet1st?)
- Do you get any jobs that are stuck?
- Do the graphics behave properly again?

Thanks ya all.

Enjoy the weekend ;)

Mike
48) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4513)
Posted 24 Jan 2009 by mtyka
Post:
This one failed (1st time)
1259966

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044EFF4 read attempt to address 0x01062000

Engaging BOINC Windows Runtime Debugger...


Yeah, guess what. I just found a bug in the BOINC API! Holy crap. Basically, as far as I can see, there's a memory leak when it's trying to unzip files. Mostly all you see is the application dying kicking and screaming just after initialization. One RALPH user, though, produced a suspicious trace (thank you philip in hongkong!). I have to stress that this is the only job out of hundreds of such failures that has returned with a trace.


LENOVO-A05B19D0 (10.9.3.121) [16000]
User ID philip-in-hongkong [2191]
CPU time 0
XML doc in

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 17:43:47:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044FA48 read attempt to
- Callstack -
ChildEBP RetAddr Args to Child
0012ecbc 0044fbc9 0114f008 00000001 010f1ffc 00000004 minirosetta_1.51_windows_intelx!unzip+0x0 (d:boinc_buildminirosetta_windowsminiexternalboinczipunzipunzip.c:943)
0012ecd4 0044e8f0 00000004 010f1ff0 14f97ded 0000000f minirosetta_1.51_windows_intelx!unzip_main+0x0 (d:boinc_buildminirosetta_windowsminiexternalboinczipunzipunzip.c:629)
0012ed10 0044ea7d 00000000 00000000 0108df08 0000000f minirosetta_1.51_windows_intelx!boinc_zip+0x7 (d:boinc_buildminirosetta_windowsminiexternalboinczipboinc_zip.cpp:151)
0012ed68 0041981a 00000000 0000000f 0108bd58 0012ffc0 minirosetta_1.51_windows_intelx!boinc_zip+0x32 (d:boinc_buildminirosetta_windowsminiexternalboinczipboinc_zip.cpp:73)
0012ef14 0041a015 0000001f 0012ef2c 00152310 0012ef2c minirosetta_1.51_windows_intelx!main+0x55 (d:boinc_buildminirosetta_windowsminisrcappspublicboincminirosetta.cc:85)
0012ff28 0042cdf7 00400000 00000000 00152352 0000000a minirosetta_1.51_windows_intelx!WinMain+0x0 (d:boinc_buildminirosetta_windowsminisrcappspublicboincminirosetta.cc:159)
0012ffc0 7c816fd7 00000000 00000000 7ffd3000 c0000005 minirosetta_1.51_windows_intelx!__tmainCRTStartup+0x1c (f:spvctoolscrt_bldself_x86crtsrccrt0.c:324)
0012fff0 00000000 0042ce60 00000000 78746341 00000020 kernel32!_BaseProcessStart@4+0x0 (f:spvctoolscrt_bldself_x86crtsrccrt0.c:324)

*** Dump of thread ID 4252 (state: Waiting): ***



It fails in the unzip code! OMG.
I've been tinkering with the code; the bug probably stems from a single byte not being set to 0. This explains the sporadic nature: if the relevant byte happens to already be 0, then all is fine. I'll push out a version soon to see if the fix works. It's all speculation at this point.

Also, sorry about the graphics error. I increased the buffer sizes and (duh!) forgot to also update the graphics app. I'll do that together with 1.53.

Mike
49) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4503)
Posted 23 Jan 2009 by mtyka
Post:
Thank you all for reporting the problems! Dealing with a problem divides itself into four steps: first cornering the error, then turning it into an error message, then reproducing it here, and finally finding the root of the problem.

At any stage there are three types of problem:

a) Ones that fail with an error message like this:


ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish


These ones we've basically identified. Why they occur is merely a matter of tracking down the route that leads to them. I have already passed these on to the relevant programmers, who are fixing them as we speak.

b) Segfaults with traces:

- Callstack -
ChildEBP RetAddr Args to Child
0012d270 00426312 08000000 077bf590 00000010 08000000 minirosetta_1.50_windows_intelx!UnwindUpVec+0x0 (F:SPvctoolscrt_bldSELF_X86crtsrcIntelMEMCPY.ASM:312)
0012d28c 0046d11a 08000000 00000010 077bf590 00000010 minirosetta_1.50_windows_intelx!memmove_s+0xc (f:spvctoolscrt_bldself_x86crtsrcmemmove_s.c:58)
0012d2a8 0046f352 077bf590 077bf5a0 08000000 021a3d3c minirosetta_1.50_windows_intelx!stdext::unchecked_copy<double *,double *>+0x2a (c:program files (x86)microsoft visual studio 8vcincludexutility:3409)
0012d2c8 005b1d2e 021a3d40 3d0497dd 0012d500 02006520 minirosetta_1.50_windows_intelx!std::vector<double,std::allocator<double> >::operator=+0x0 (c:program files (x86)microsoft visual studio 8vcincludevector:565)
0012d3cc 005b1ed1 3d0496dd 02029f88 00000000 02006520 minirosetta_1.50_windows_intelx!core::scoring::Ramachandran::init_rama_sampling_table+0x2e (d:boinc_buildminirosetta_windowsminisrccorescoringramachandran.cc:200)
0012d3ec 005398fb 3d0496fd 02090718 02006520 0012d420 minirosetta_1.50_windows_intelx!core::scoring::Ramachandran::Ramachandran+0x0 (d:boinc_buildminirosetta_windowsminisrccorescoringramachandran.cc:72)
0012d408 005eccf9 3d049111 02029f88 3fc99999 02090718 minirosetta_1.50_windows_intelx!core::scoring::ScoringManager::get_Ramachandran+0x21 (d:boinc_buildminirosetta_windowsminisrccorescoringscoringmanager.cc:339)
0012d428 0053b7e1 3d049131 02012f48 00000001 7c00003d minirosetta_1.50_windows_intelx!core::scoring::methods::RamachandranEnergy::RamachandranEnergy+0x59 (d:boinc_buildminirosetta_windowsminisrccorescoringmethodsramachandranenergy.cc:42)
0012d4b4 004d8a28 0012d4e4 0012d500 02090718 3d0491c9 minirosetta_1.50_windows_intelx!core::scoring::ScoringManager::energy_method+0x2b (d:boinc_buildminirosetta_windowsminisrccorescoringscoringmanager.cc:566)
0012d4dc 004d9135 0012d500 0012d544 3d0491f5 0012da20 minirosetta_1.50_windows_intelx!core::scoring::ScoreFunction::set_weight+0x0 (d:boinc_buildminirosetta_windowsminisrccorescoringscorefunction.cc:1377)
0012d708 004ceacb 0012d738 3d04921d 020c68e0 01321ef0 minirosetta_1.50_windows_intelx!core::scoring::ScoreFunction::apply_patch_from_file+0x0 (d:boinc_buildminirosetta_windowsminisrccorescoringscorefunction.cc:235)
0012d9e0 004cf1b1 0012da20 0000001f 6e617473 64726164 minirosetta_1.50_windows_intelx!core::scoring::ScoreFunctionFactory::create_score_function+0x4e (d:boinc_buildminirosetta_windowsminisrccorescoringscorefunctionfactory.cc:72)
0012da90 005755b2 0012dac4 3d049fa5 00000000 0131af60 minirosetta_1.50_windows_intelx!core::scoring::getScoreFunction+0x31 (d:boinc_buildminirosetta_windowsminisrccorescoringscorefunctionfactory.cc:168)
0012daec 0051b6a9 0131af60 0012dc20 3d049e05 01304498 minirosetta_1.50_windows_intelx!core::pack::pack_missing_sidechains+0xa (d:boinc_buildminirosetta_windowsminisrccorepackpack_missing_sidechains.cc:83)
0012ddf8 0051bc8a 0131af60 01304498 3d049b09 00000010 minirosetta_1.50_windows_intelx!core::io::pdb::FileData::build_pose_as_is+0xe (d:boinc_buildminirosetta_windowsminisrccoreiopdbfile_data.cc:749)
0012de74 004dd0f2 0131af60 01304498 3d049b8d 0012ee68 minirosetta_1.50_windows_intelx!core::io::pdb::FileData::build_pose+0x0 (d:boinc_buildminirosetta_windowsminisrccoreiopdbfile_data.cc:116)
0012e01c 004dd1ae 0131af60 01304498 0012e070 00000000 minirosetta_1.50_windows_intelx!core::io::pdb::pose_from_pdb+0x0 (d:boinc_buildminirosetta_windowsminisrccoreiopdbpose_io.cc:194)
0012e034 00798370 0131af60 0012e070 00000000 3d04a549 minirosetta_1.50_windows_intelx!core::io::pdb::pose_from_pdb+0x0 (d:boinc_buildminirosetta_windowsminisrccoreiopdbpose_io.cc:210)
0012e0f8 007a7282 3d04a401 0000000f 00000007 00000000 minirosetta_1.50_windows_intelx!protocols::abinitio::AbrelaxApplication::setup+0x5d (d:boinc_buildminirosetta_windowsminisrcprotocolsabinitioabrelaxapplication.cc:496)
0012edb4 0041a0d8 3d04a8c5 00000a28 00000002 0012ffc0 minirosetta_1.50_windows_intelx!protocols::abinitio::AbrelaxApplication::run+0x0 (d:boinc_buildminirosetta_windowsminisrcprotocolsabinitioabrelaxapplication.cc:2176)
0012ef14 0041a465 0000001f 0012ef2c 00152310 0012ef2c minirosetta_1.50_windows_intelx!main+0x0 (d:boinc_buildminirosetta_windowsminisrcappspublicboincminirosetta.cc:117)
0012ff28 0042d207 00400000 00000000 00152352 0000000a minirosetta_1.50_windows_intelx!WinMain+0x0 (d:boinc_buildminirosetta_windowsminisrcappspublicboincminirosetta.cc:159)
0012ffc0 7c817067 0032005c 0064005c 7ffd8000 c0000005 minirosetta_1.50_windows_intelx!__tmainCRTStartup+0x1c (f:spvctoolscrt_bldself_x86crtsrccrt0.c:324)
0012fff0 00000000 0042d270 00000000 78746341 00000020 kernel32!_BaseProcessStart@4+0x0 (f:spvctoolscrt_bldself_x86crtsrccrt0.c:324)



Segfault - bad. But at least I get a trace, so I can look into the code and understand (or guess) why the program could fail at that point. Then I can turn these into a) type errors.

c) Segfaults without a usable trace:


<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
There are no child processes to wait for. (0x80) - exit code 128 (0x80)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 4:45: 1:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C81BD02 read attempt to address 0x00000002

Engaging BOINC Windows Runtime Debugger...



Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C81BD02 read attempt to address 0x00000002

Engaging BOINC Windows Runtime Debugger...


</stderr_txt>
]]>




These are random segfaults that occurred somewhere. I can do virtually nothing from here. If I rerun the *exact same command line* here, all runs fine. So these ones I can only tentatively bracket with stderr statements.
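
Bracketing with stderr statements can be sketched as follows. This is an illustration in Python rather than the actual C++ code, and the helper name is hypothetical. Each stage writes its name before it starts and "ok" after it finishes, flushing immediately, so if the process dies the last line of stderr names the stage that was running.

```python
import io
import sys
from contextlib import contextmanager


@contextmanager
def bracket(stage, stream=sys.stderr):
    """Log the start and end of a stage; flush so a crash cannot swallow the line."""
    print(f"{stage} ...", end=" ", file=stream, flush=True)
    yield  # if the stage fails here, the closing "ok" is never written
    print("ok", file=stream, flush=True)


# Usage with an in-memory stream standing in for stderr.
log = io.StringIO()
with bracket("Initializing options", log):
    pass  # the real work of the stage would go here
```

After this runs, `log.getvalue()` is `"Initializing options ... ok\n"`, the same shape as the "Initializing options.... ok" lines in the stderr dumps above. A stage that dies leaves its name on the last line without the trailing "ok".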




Right now I am mainly trying to turn b) and c) into a). Once an error is in the form of a), it's usually trivial to solve!
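
Turning a crash into an a) type error amounts to checking an assumption explicitly and exiting with a message that names the failed check and its location. The sketch below only shows the shape of such a check: the names are hypothetical, the exception stands in for printing the message and calling boinc_finish, and in Python the unchecked lookup would raise a ValueError rather than segfault as unchecked C++ might.

```python
class CheckedExit(Exception):
    """Stands in for printing the error and shutting the task down cleanly."""


def runtime_assert(condition, message, where):
    """Turn a would-be crash into a readable error that names its location."""
    if not condition:
        raise CheckedExit(f"ERROR: {message}\nERROR:: Exit from: {where}")


def model_index(model_names, name):
    # Check the assumption up front instead of failing somewhere downstream.
    runtime_assert(name in model_names, f"unknown model name: {name}", "model_index()")
    return model_names.index(name)
```

Calling `model_index(["1ABC_1"], "1DK8A_1")` raises `CheckedExit` with a message in the same style as the a) example above, which points straight at the failed assumption.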

Anyway... just in case you guys were interested in what I'm trying to do.


Oh, I guess there are also errors that are *behavioral* errors. Those are the ones I cannot see from here, so you guys have to tell me about them. Stuff like the graphics or overrunning models or other strange behaviour.

:-)
50) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4502)
Posted 23 Jan 2009 by mtyka
Post:

Why not grab the set and issue them to us with high replication counts ... then if they fail for everyone, that tells us one thing ... if they only fail for some and not others ... that tells us something else ...


We already submit identical jobs to many, many machines - and they only fail *sometimes*. Even when the same random number seed is used.



To my mind, the cases that fail are the ones that you should be saving and using as your issue tasks for each round of testing. Not to keep making new tasks in all cases, but to use those tasks that have proven their ability to cause a problem.


Again... we're already doing that. But the errors are sporadic in nature, so submitting the same task might not throw an error "this time".

I have noticed, though, that some of the problems seem to be highly machine specific. Like one user will always produce this one kind of error and nothing else. Weird. Something strange about their setup? No idea.
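
One way to spot such machine-specific patterns is to group the failed results by host and error signature. This is a minimal sketch with made-up data; the record layout, host names and threshold are hypothetical.

```python
from collections import Counter, defaultdict


def errors_by_host(results):
    """Map host -> Counter of error signatures, over failed results only."""
    table = defaultdict(Counter)
    for host, error in results:
        if error is not None:  # None marks a successful result
            table[host][error] += 1
    return table


def single_error_hosts(results, min_failures=3):
    """Hosts that failed at least min_failures times, always the same way."""
    return sorted(
        host
        for host, errs in errors_by_host(results).items()
        if len(errs) == 1 and sum(errs.values()) >= min_failures
    )


# Made-up results: (host, error signature or None for success).
sample = [
    ("hostA", "access violation"), ("hostA", "access violation"),
    ("hostA", "access violation"), ("hostB", None),
    ("hostB", "access violation"), ("hostC", "bus error"),
    ("hostC", "access violation"), ("hostC", "bus error"),
]
```

On this sample only `hostA` is reported: it failed three times with the same signature, whereas `hostB` failed once and `hostC` failed in more than one way.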


51) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4494)
Posted 23 Jan 2009 by mtyka
Post:
I tried loading up the docking task I got. It is 1.51. Displayed graphic just as the task was starting. Waited, waited... finally realized it was using more CPU than the thread working on the protein! Double checked Ralph settings for % of CPU for the graphic, set to default which is 10%.

Here's my screenshot showing the graphic monopolizing one core, while the two running tasks are competing for the other. Net result, nothing shown in the graphic after several minutes, and graphic thread consuming much more than 10% of CPU.

Feet1st, this is awesome - debugging on an unprecedented level :) Nice to get an idea of what all this looks like from your point of view.

These docking tasks are new and not mine - let me track down the person submitting these and make sure the graphics app can deal with them.

Mike
52) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4487)
Posted 22 Jan 2009 by mtyka
Post:
Tasks cannot be suspended; BOINC can do nothing with the process. After a few days I have about 10-15 dead rosetta tasks with 3M RAM allocated.

If I don't kill the app, it runs till pc restart (on server usually about 30 days till MS security update and restart).

Could the problem be that DEP is turned on?

Ver 1.51:

BOINC:: Initializing ... ok.
[2009- 1-22 11: 0:26:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.


Unhandled Exception Detected...

- Unhandled Exception Record -

In Graphics application I have got:

Starting graphics application...
Setting window title to 'minirosetta version 1.51 [workunit: lr6_D_score12_rlbd_2ccv_IGNORE_THE_REST_DECOY_6840_2]'.
OpenSemaphore failure
Successfully loaded '../../projects/ralph.bakerlab.org/Helvetica.txf'...
Close event (shmem not updated) detected, shutting down.
Shutting down graphics application...

but I'm running BOINC as a service (and connecting over RDP), and this may be the problem for graphics.



Is there a way to turn off the graphics ? What's DEP ?

This is interesting. I will have to look at the code to see if there's a way for it to hang somewhere. I wonder what happens to mini if the graphics app fails...

I will have to try that locally. If you can describe your set-up in more detail, that would be excellent. Is this task 1.51 or 1.50 or 1.48?

Mike

53) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4486)
Posted 22 Jan 2009 by mtyka
Post:
This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2.


Absolutely reasonable, isn't it? 3-4 hours per model is what we expect these days with the mammoth tasks.


> Step 900,000!

Step 900000? What does that mean? "Step"??


>It's peak memory usage so far was 430MB.

Hmm yeah, I guess we need to flag these. They're big, I know.


I believe the memory usage of the tasks is reported back with scheduler requests. Does the project process this data and query for anomalies in memory usage? Or do you need us to report such things?


Not sure where I can get this info. If you see anomalies, let us know.
54) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4485)
Posted 22 Jan 2009 by mtyka
Post:
Morning.

Things are moving forward - 1.51 is out.

This should fix the checkpoint issue: checkpoints should not be dumped as frequently as before, and the app should try to honor the user's setting. Now, I have to say this is not *always* possible. To answer your question, Feet1st: due to the structure of most simulation software (not just ours) you can't just checkpoint at any arbitrary point. Sometimes you *need* to checkpoint frequently or not at all; sometimes you can't checkpoint (or at least it would take huge amounts of data, which is not OK). So what happens is that we checkpoint, but we "hold on" to the data in memory (i.e. buffer it) until it is officially time to checkpoint, and then we dump all the gathered checkpoints. Of course, there is a limit to how much we can hold on to without overflowing the memory, so occasionally we *have* to dump even if it's not time. Also, at the end of a decoy everything is dumped, deleted and dealt with; the user setting has no control over that. Hope that explains it.
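
The buffering scheme described above can be sketched roughly like this. It is an illustration under assumptions, not the minirosetta implementation: the class and parameter names are hypothetical, time is passed in explicitly, and `write` stands in for whatever actually persists the data to disk.

```python
class CheckpointBuffer:
    """Hold checkpoints in memory and write them out only when allowed or forced."""

    def __init__(self, interval_s, max_buffered_bytes, write):
        self.interval_s = interval_s                  # user's "write to disk at most every" setting
        self.max_buffered_bytes = max_buffered_bytes  # memory cap for held checkpoints
        self.write = write                            # callable that persists a list of snapshots
        self.pending = []
        self.pending_bytes = 0
        self.last_flush = 0.0

    def checkpoint(self, snapshot, now):
        """Buffer a snapshot; dump if the interval has elapsed or memory is full."""
        self.pending.append(snapshot)
        self.pending_bytes += len(snapshot)
        due = now - self.last_flush >= self.interval_s
        full = self.pending_bytes > self.max_buffered_bytes
        if due or full:
            self.flush(now)

    def flush(self, now):
        """Write all gathered checkpoints and start a fresh buffer."""
        if self.pending:
            self.write(self.pending)
        self.pending = []
        self.pending_bytes = 0
        self.last_flush = now

    def end_of_decoy(self, now):
        """At the end of a decoy everything is dumped regardless of the setting."""
        self.flush(now)
```

With `interval_s=1800`, a checkpoint taken every 30 seconds stays in memory until 30 minutes have passed, unless the buffered data exceeds `max_buffered_bytes` first or the decoy ends. A buffering glitch of the kind mentioned next would correspond to `due or full` evaluating true on every call.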

However, there was a glitch in the buffering mechanism, so it was always dumping. 1.51 should fix that - could you confirm that this is so!?

Otherwise this release has added debug information to let me figure out where all this stuff is failing. Believe me guys, we're now in the land where I cannot reproduce these errors here whatsoever. Not on the Linux boxes, Mac boxes or Windows boxes we have. Nowhere. Why these remaining segfaults occur is a total mystery to me, so please bear with me. This is going to be incredibly difficult to track down.

Thanks for all your help! Every post is super useful to us!!

Mike
55) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4473)
Posted 21 Jan 2009 by mtyka
Post:
I still have problems on my new W2008 X64 server.
Every task of 1.5 minirosetta hangs at startup with 3MB memory and stdout:

[2009- 1-21 9:52:36:] :: BOINC :: boinc_init()
Created shared memory segment

These tasks hang and I have to kill them from the taskbar. After killing, stderr is:


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x778806CF read attempt to address 0x00000004

Engaging BOINC Windows Runtime Debugger...



How do you know it *hangs* ?? Rosetta will not print anything to stdout - that's normal. Are the graphics moving ? What happens if you just let it run for a few hours ?

Mike

56) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4464)
Posted 21 Jan 2009 by mtyka
Post:
as you wish ...
57) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4462)
Posted 20 Jan 2009 by mtyka
Post:
More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My harddrive will never be able to spin down and go in to power saver mode all night long if the checkpoints continue at this pace.


1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result


Hmm OK, I'll look into this.
58) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4461)
Posted 20 Jan 2009 by mtyka
Post:
http://ralph.bakerlab.org/result.php?resultid=1250838

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Watchdog active.

ERROR: target_strands.size()
ERROR:: Exit from: ....srcprotocolsabinitioTemplateJumpSetup.cc line: 94
called boinc_finish

</stderr_txt>
]]>


Awesome!! Our new debug tools are working. This rare error (I've never seen it in thousands of runs) would have gone unnoticed before and led to a segfault. Now it at least gets caught and we can find its cause.

Thanks!

59) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4457)
Posted 20 Jan 2009 by mtyka
Post:
Awesome guys! Keep me posted on what you see out there. The error rate so far is looking fabulous.

I'll probably update the app once more today to fix an issue with the symbol store such that we get code traces in cases where it still fails.

Mike :)
60) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4454)
Posted 19 Jan 2009 by mtyka
Post:
Yes! Ignore that database error message - for some reason the database did not get uploaded to the server when I did the update on Sunday. Something to do with the move to a new update machine, I suspect.






©2024 University of Washington
http://www.bakerlab.org