minirosetta v1.48-1.51 bug thread

Author	Message
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0	Message 4504 - Posted: 23 Jan 2009, 20:46:27 UTC Last modified: 23 Jan 2009, 20:47:39 UTC
Hey, Mike, thanks for all the info you've been giving us lately. It's interesting and certainly provides a bit of extra motivation to keep crunching and reporting what we see at our end. Snags

franfranlatulipe94@hotmail.com Joined: 12 Dec 08 Posts: 1 Credit: 447 RAC: 0	Message 4505 - Posted: 23 Jan 2009, 21:22:42 UTC
Hi, I'm running RALPH on an up-to-date Ubuntu Jaunty. My BOINC version is 6.2.18. When I click "Show graphics" (or something like that, I'm French), I see a small window, like the XP screenshot in this post, but all my memory (1.5 Gio) and all my swap (2.0 Gio) fill up… If I don't click "Show graphics", there isn't any visible problem. Rosetta mini 1.52 http://francois.linkmauve.fr/boinc.png The screenshot shows the state when the window was killed. If there are questions, ask me…

Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0	Message 4506 - Posted: 23 Jan 2009, 22:12:04 UTC - in response to Message 4505.
> Hi, I'm running RALPH on an up-to-date Ubuntu Jaunty. My BOINC version is 6.2.18. When I click "Show graphics" (or something like that, I'm French), I see a small window, like the XP screenshot in this post, but all my memory (1.5 Gio) and all my swap (2.0 Gio) fill up… If I don't click "Show graphics", there isn't any visible problem. Rosetta mini 1.52 http://francois.linkmauve.fr/boinc.png The screenshot shows the state when the window was killed. If there are questions, ask me…
I'm seeing the same or very similar behavior on the same type of workunit: lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909_1, running Leopard on an Intel Mac. No CPU use but tons of memory, and I don't even get the box, so there is nothing to show in a screen capture. Activity Monitor says the process has hung, and I killed it without trouble from there. Snags

tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0	Message 4507 - Posted: 23 Jan 2009, 22:12:53 UTC
Mtyka, can you disable graphics for some WUs and see if the error rate goes down? Might help turning c)'s into b)'s.

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4508 - Posted: 23 Jan 2009, 23:45:17 UTC
FYI, in French "Gio" is "GB" (gigabytes)

Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0	Message 4509 - Posted: 24 Jan 2009, 0:17:01 UTC Last modified: 24 Jan 2009, 0:38:32 UTC
> I'm showing the same or very similar behavior on the same type of workunit:lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909
I'm also getting problems on several work units starting: test_B_cc .... I get either a hung empty screen or a runtime error.
Edit: The graphics problem doesn't affect progress of the work unit.

Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0	Message 4510 - Posted: 24 Jan 2009, 0:17:28 UTC
Hi Mike, Hi Fellows! I'm just running Mini 1.52, this task. Actually, I only noticed that a RALPH WU had started because of an alert: BOINC wanted to turn the screensaver on, and I got a Windows message about a runtime error and sudden termination of the 1.40 graphics viewer. The process was killed by the system. So far, the WU seems to be continuing; I'm at 3.84% (0:11:35) and running.

Path7 Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0	Message 4511 - Posted: 24 Jan 2009, 0:25:28 UTC Last modified: 24 Jan 2009, 0:32:34 UTC
Hello all, While crunching lr6_D_score12_rlbn_1ynv_IGNORE_THE_REST_NATIVE_NOCON_6909_1_0 (minirosetta 1.52) I hit the “Show graphics” button, and Windows XP replied with an error message: Runtime error. Short message report: Hung application: minirosetta_graphics_1.40_windows_intelx86.exe, version: 0.0.0.0, hung module: hungapp, version: 0.0.0.0, hung at: 0x00000000.
Retrying “Show graphics” opened the graphics window, empty. At 16 minutes of processor time the processor time stood still, CPU usage = 0 %. I stopped and started BOINC, and the WU proceeded from where it had stopped. I tried the “Show graphics” button again, and the above repeated. Now the WU has started for the third time, and keeps on running. I don't dare to touch the “Show graphics” button again!
Windows XP SP3, single core AMD Sempron 3000+ 1.8 GHz, 1.5 GB RAM, BOINC 5.10.45. Path7.

Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0	Message 4512 - Posted: 24 Jan 2009, 0:27:20 UTC
This one failed (1st time) 1259966
Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0044EFF4 read attempt to address 0x01062000
Engaging BOINC Windows Runtime Debugger...

Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0	Message 4514 - Posted: 24 Jan 2009, 0:49:24 UTC - in response to Message 4502.
> I have noticed though that some of the problems seem to be highly machine specific. Like one user will always produce this one kind of an error and nothing else. Weird. Something strange about their setup? No idea.
Well, then it is time to start asking that user ... :) I know that some will OC their machine to the edge of instability and fail to recognize the implications of that ... Other cases can be because they are running other applications / projects and that might be the cause ... like running GPU Grid may raise the temperature enough at some times to cause the instability, but not at other times ... So, it may not be the program but the specific hardware ...
I remember once where I reported a problem where the device we were testing would cause a tape read error ... impossible, the engineers said ... I showed them ... still impossible ... not really ... the compiler was putting bad instructions on the tape and under an error condition the tape could not be read ... Anyway ... Just trying to stimulate the brain cells with some brainstorming ...
I do appreciate the feedback though ... and I got another task and it should be back in a little bit ... and yes, fascinated by what we as a collective are trying to do ...
Oh, and about the same RND seed: unless the machines are of the same specific type with identical CPU and OS, the RND generator may or may not issue the same sequence of random numbers. Especially if the core of the generator relies on the noise at the end of floating point numbers ... Virtual Prairie is having this exact problem with their work ...
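
The point about "noise at the end of floating point numbers" can be illustrated with a small sketch. It is only an illustration in Python and says nothing about how minirosetta's own generator is implemented; the helper name `first_draws` is made up for this example. An integer-based generator reproduces exactly from the same seed, while floating-point results can differ in the last bits when the same operations are merely grouped differently, which is what different compilers or CPUs may end up doing.

```python
import random

def first_draws(seed, n=3):
    """Hypothetical helper: first n draws from a freshly seeded generator."""
    rng = random.Random(seed)  # Mersenne Twister, integer state
    return [rng.random() for _ in range(n)]

# Same seed, same implementation: identical sequence.
same_sequence = first_draws(1234) == first_draws(1234)

# Floating-point addition is not associative, so regrouping the same
# three numbers changes the low-order bits of the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)

print("same seed reproduces:", same_sequence)
print("regrouped sums equal:", left == right)
print("difference:", abs(left - right))
```

A calculation that feeds such low-order bits back into later decisions can drift apart on two machines even when both start from the same seed.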

Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0	Message 4515 - Posted: 24 Jan 2009, 2:13:24 UTC Last modified: 24 Jan 2009, 2:25:53 UTC
O.K., I got the new 1.53 version - as soon as 1.52 finishes we will see how it behaves. :)
EDIT: So far O.K., and the graphics are working...

Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0	Message 4517 - Posted: 24 Jan 2009, 2:43:05 UTC - in response to Message 4516. Last modified: 24 Jan 2009, 3:24:16 UTC
Hi, I'm currently running sr213_t077_1_NMR_NESG_SAVE_ALL_OUT_6972_24_0.
> - Do the graphics behave properly again?
Yes, they are working. However, there was a funny thing: this WU has a native protein and RMSD, but at one point the RMSD part of the graphics (showing left/right how good the RMSD is) was not working. It was just blank. The low energy part was O.K. After I turned the graphics off and back on a minute or two later, everything was just fine.
Regarding checkpointing: according to boinccmd.exe it does make checkpoints every 100+ to 200+ seconds. That is quite close to my preference (120 sec).
EDIT: I ran an experiment to check the checkpointing: I turned off the client and then turned it back on. The task resumed from just over 26 minutes (and according to boinccmd the checkpoint was at 1591 seconds, so that would fit). However, the WU started from Model 1, step 0, and moved forward from there. On the other hand, the WU quickly went into SmoothFragmentMover_GunnCost and it didn't look that bad, so I am puzzled: I am not sure whether it really started from the beginning, or whether the checkpointing worked but the step information is misleading. I think this must be checked at some later step, when the protein is really pretty and the difference is easy to tell, but now I must go to sleep. :D
EDIT 2: I have repeated my experiment and I am pretty sure the checkpointing is not working correctly. I stopped the WU after 46 minutes (2524 seconds), when it was quite beautiful (energy of minus three hundred something, etc.), and restarted. Yet again it started from step 0, and then it started to behave in a strange manner. There was no really folded protein, only one straight chain, but the program was clearly trying to use methods appropriate to the last stage of the process (namely MoverBase+Minimization). In a minute or so the chain was bent at something like two points and it got "fractalized" by the last step of the procedure (when you have all these small strings moving to get the lowest possible energy). If you want, I can provide you with a snapshot (print screen of the graphics). Everything ended after circa 200 seconds (for a couple of seconds there was "stage: unknown") and got reported. Have a nice read. Interesting thing: it reduced the granted credit; claimed was 7.06, granted 4.68. Well, one must suffer for the science. =) And now let me get some sleep...
P.S. The only thing that interfered with the experiment was that the second time I had to halt a regular Rosetta task (I think I had halted Rosetta, but this time after restarting BOINC the manager didn't remember it, huh). I don't know if it could interfere; if it did, it would mean some variables are not being reset. If it were not for the experiment, I am pretty sure the task would have run all right.

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4518 - Posted: 24 Jan 2009, 2:53:05 UTC
The checkpoints seem more in line with my settings as of 1.52. I'm a bit unclear as to what exactly to expect now. Sounds like if I set my preference to "write at most..." 600 seconds (10 min), I should expect the following... I'll see checkpoint debug messages in my messages tab, but it may be that no data was written to disk? And I'll see, for a given task within a given model, that it won't write more than once every 10 min? ...except if it reaches the end of a model, at which point it will write regardless of how recently the last checkpoint was??
So, hypothetical example:
Time (mm:ss)  Event
00:00         Model start
01:15         checkpoint (but not written)
02:45         checkpoint (but not written)
05:15         checkpoint (but not written)
09:15         checkpoint (but not written)
11:30         checkpoint, all of the above written to disk
12:10         checkpoint (but not written)
13:40         checkpoint (but not written)
14:25         model completed, write to disk, even though only 3 min since last write
I think the above is what I should expect. Now... I've got a P4 running with HT, so 2 virtual cores... should I expect the tasks to behave independently, i.e. each on its own 10 min "write at most" timer? Or should I expect them both to buffer for 10 min unless they reach a model end? And if a model end is reached on one, will that cause the buffers for the other to be written as well?
Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes?
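
The hypothetical timeline above can be sketched as a tiny simulation. This models only the behaviour guessed at in the post (buffer checkpoints, flush when at least the "write at most" interval has passed since the last write, always flush at model end); it is not taken from the BOINC or minirosetta source, and the names `simulate_writes` and `mmss` are invented for the example.

```python
def mmss(text):
    """Convert 'mm:ss' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)

def simulate_writes(checkpoints, model_end, min_interval=600):
    """Return the times (in seconds) at which buffered checkpoints reach disk.

    Assumption: the interval timer starts at model start (t = 0), and the
    end of a model always forces a write.
    """
    writes = []
    last_write = 0
    for t in sorted(checkpoints):
        if t - last_write >= min_interval:
            writes.append(t)
            last_write = t
    writes.append(model_end)  # model completion flushes regardless of timer
    return writes

# Checkpoint times from the hypothetical example in the post.
checkpoints = [mmss(t) for t in
               ["01:15", "02:45", "05:15", "09:15", "11:30", "12:10", "13:40"]]
writes = simulate_writes(checkpoints, mmss("14:25"))
print(writes)
```

Under these assumptions the only disk writes fall at 11:30 and 14:25, which matches the table. Whether two tasks on a hyper-threaded CPU share one timer is a separate question the sketch does not address.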

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4519 - Posted: 24 Jan 2009, 3:03:50 UTC Last modified: 24 Jan 2009, 3:09:27 UTC
> Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes?
OH! Is that what all these new files are??
chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.out
chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.rng.state
chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.out
chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.rng.state
In fact I see many files now in my slot directory. They seem to come in blocks of 4, all written with the same timestamp. But the names vary, probably depending on which stage of the model took the checkpoint.
Ok, yes, my "boinc_checkpoint_count" file says 92, but I've got nowhere near that many files. And there are even fewer blocks of time where files were written. Ya, I've only got 14 different timestamps. So, that sounds good.
I've got another docking task that says it's on model 258 after 22hrs (my preference is 24hrs), and its checkpoint count says 257, and I don't have any of the other files. Perhaps it is not taking checkpoints? The default.out file on that one is 43MB so far! Man, is THAT upload going to take a while!
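
One rough way to compare checkpoint messages with actual disk writes is to count what sits in the slot directory. The sketch below assumes, based only on this post, that checkpoint files start with `chk_`, that files written in the same flush share a modification time, and that `boinc_checkpoint_count` holds a plain integer; the function name `checkpoint_summary` is made up for the example.

```python
import sys
from pathlib import Path

def checkpoint_summary(slot_dir):
    """Count checkpoint files and distinct write times in a BOINC slot directory."""
    slot = Path(slot_dir)
    files = sorted(slot.glob("chk_*"))
    # Files from one flush are assumed to share an mtime (to the second).
    write_times = {int(f.stat().st_mtime) for f in files}

    reported = None
    count_file = slot / "boinc_checkpoint_count"
    if count_file.is_file():
        text = count_file.read_text().strip()
        if text.isdigit():  # assumption: the file holds a bare integer
            reported = int(text)

    return {
        "reported_checkpoints": reported,
        "checkpoint_files": len(files),
        "distinct_write_times": len(write_times),
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    print(checkpoint_summary(sys.argv[1]))
```

Run it against a slot directory every few minutes (for example from a scheduled task) and log the output; a reported count that climbs much faster than the number of distinct write times would suggest that checkpoints are being buffered rather than written each time.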

Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0	Message 4520 - Posted: 24 Jan 2009, 3:31:43 UTC
feet1st, others: could you consider running a test similar to the one I described above? Or maybe you could find some further/better tests? I guess there is no better way to see if the checkpointing is really working. And I am awfully sorry for my subprime English ;) - it's 4:30 a.m. here and I am rushing to my bed. :)

Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0	Message 4521 - Posted: 24 Jan 2009, 4:21:08 UTC Last modified: 24 Jan 2009, 5:14:33 UTC
Graphics ... Well, for me on the Mac Pro (Intel), trying the graphics got the system to start initializing and then die ... never got the graphics window up ... I don't have a Rosetta task in work at the moment, but I am pretty sure that with 1.47 the graphics do work on OS-X ... they do not for me with 1.52 (maybe I need 1.53?) Well, I will see when the other tasks become available and I can run them and try the graphics ... Could it be because I am running GPU Grid on my other computers and the Mac is jealous?
{edit add} It looks like the task that failed for graphics did complete ... But this task failed for unknown model name ... Not sure why ... and I just caught it ...
Initialization complete.
Watchdog active.
ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4522 - Posted: 24 Jan 2009, 5:54:35 UTC Last modified: 24 Jan 2009, 5:56:02 UTC
Yep, I was afraid of that...
1/23/2009 11:12:31 PM|ralph@home|Computation for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 finished
1/23/2009 11:12:31 PM|ralph@home|Output file 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0_0 for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 exceeds size limit.
1/23/2009 11:12:31 PM|ralph@home|File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes
1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0
The file was too large, so it calls it a compute error. Looks like it doesn't actually send the file when it grows too large. Too bad. It was 24 seconds short of 24hrs when it "failed", apparently writing the last model to the file. Oh well... THIS is why I chose to run 24hr tasks on Ralph. To reveal these things. On Rosetta I do it because I like a tidy task list, and to minimize hits to the server.
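
To put a number on how far over the limit that output file was, the size line from the message log can be parsed directly. This is a minimal sketch: the regular expression is fitted to the single log line quoted in this post, and the function name `parse_size_limit` is made up for the example.

```python
import re

LINE = "File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes"

def parse_size_limit(message):
    """Extract (file_size, limit) in bytes from a size-limit log line, or None."""
    match = re.search(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes", message)
    if match is None:
        return None
    return int(float(match.group(1))), int(float(match.group(2)))

size, limit = parse_size_limit(LINE)
over = size - limit
print(f"{over} bytes over the limit ({over / limit:.1%})")
```

For the line above this works out to 5,524,004 bytes, roughly 22% over the 25,000,000-byte limit.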

Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0	Message 4523 - Posted: 24 Jan 2009, 6:13:44 UTC
I just checked a 1.53 task and the graphics window came up and seemed to post updates ... too tired to look too hard at it (sorry), but it does look like what I saw with 1.52 is fixed in 1.53 (or was it the task?) ...

Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0	Message 4524 - Posted: 24 Jan 2009, 6:16:04 UTC - in response to Message 4522.
> Yep, I was afraid of that...
> 1/23/2009 11:12:31 PM|ralph@home|Computation for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 finished
> 1/23/2009 11:12:31 PM|ralph@home|Output file 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0_0 for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 exceeds size limit.
> 1/23/2009 11:12:31 PM|ralph@home|File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes
> 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0
> The file was too large, so it calls it a compute error. Looks like it doesn't actually send the file when it grows too large. Too bad. It was 24 seconds short of 24hrs when it "failed", apparently writing the last model to the file. Oh well... THIS is why I chose to run 24hr tasks on Ralph. To reveal these things. On Rosetta I do it because I like a tidy task list, and to minimize hits to the server.
Hmmm, shouldn't that be a test? When it writes to the file it should check to see if it is nearing the limit, and if so, stop ... I mean, 24 hours' worth of work down the tubes, other than that we discovered the error ... (thank you) ...
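
The check suggested here could look roughly like the sketch below. It is written in Python purely for illustration (the application itself is not Python), the function name `can_append` is hypothetical, and the default limit is simply the 25,000,000 bytes reported in the log above; the real limit is whatever the project configures for that output file.

```python
import os

def can_append(path, pending_bytes, limit_bytes=25_000_000):
    """Return True if writing pending_bytes more would keep the file within the limit."""
    current = os.path.getsize(path) if os.path.exists(path) else 0
    return current + pending_bytes <= limit_bytes

def append_model(path, model_text, limit_bytes=25_000_000):
    """Append one model's output, or refuse and report that the task should stop."""
    data = model_text.encode("utf-8")
    if not can_append(path, len(data), limit_bytes):
        return False  # caller should finish the task cleanly instead of writing
    with open(path, "ab") as handle:
        handle.write(data)
    return True
```

With a guard like this, a task that is about to exceed the limit could stop after the last model that still fits and report what it has, rather than losing the whole run to a compute error.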

Ian_D Joined: 16 Feb 06 Posts: 16 Credit: 39,518 RAC: 0	Message 4525 - Posted: 24 Jan 2009, 9:40:03 UTC
version 1.53
https://ralph.bakerlab.org/result.php?resultid=1263355
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
 - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-24 8:50:37:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip> <-d> <./>
Firstarg=true; pp=-d
first check it's not -dexdir
firstarg: <./>
End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _1CYNA_10_00001
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00572136 read attempt to address 0xAB9812D1
Engaging BOINC Windows Runtime Debugger...
********************
BOINC Windows Runtime Debugger Version 6.5.0
Dump Timestamp : 01/24/09 09:25:27
Install Directory : D:\Program Files\BOINC
Data Directory : D:\ProgramData\BOINC
Project Symstore : https://boinc.bakerlab.org/rosetta/symstore
Loaded Library : D:\Program Files\BOINC\dbghelp.dll
Loaded Library : D:\Program Files\BOINC\symsrv.dll
Loaded Library : D:\Program Files\BOINC\srcsrv.dll
LoadLibraryA( D:\Program Files\BOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: D:\ProgramData\BOINC\slots\1;D:\ProgramData\BOINC\projects\ralph.bakerlab.org;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*http://msdl.microsoft.com/download/symbols;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*https://boinc.bakerlab.org/rosetta/symstore;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*http://boinc.berkeley.edu/symstore
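
The two numbers in the `<message>` line of that report are the same value written two ways: BOINC shows the Windows exit code as a signed 32-bit integer, with the unsigned hexadecimal form in parentheses. A short sketch of the conversion (the function name `to_ntstatus` is made up for the example):

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as the unsigned Windows status value."""
    return exit_code & 0xFFFFFFFF

code = to_ntstatus(-1073741819)
print(hex(code))  # 0xc0000005
```

0xC0000005 is the Windows status code for an access violation, which matches the "Reason: Access Violation (0xc0000005)" line in the debugger output.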