Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread
Author | Message |
---|---|
Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
I'm seeing the same or very similar behavior on the same type of workunit: lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909. I'm also getting problems on several work units starting with test_B_cc .... I get either a hung empty screen or a runtime error. Edit: the graphics problem doesn't affect progress of the work unit. |
Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0 |
Hi Mike, hi fellows! I'm just running Mini 1.52, this task. Actually, I only noticed that a RALPH WU had started because of an alert: BOINC wanted to turn the screensaver on, and I got a Windows message about a runtime error and sudden termination of the 1.40 graphics viewer. The process was killed by the system. So far the WU seems to be continuing; I'm at 3.84% (0:11:35) and running. |
Path7 Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0 |
Hello all, While crunching lr6_D_score12_rlbn_1ynv_IGNORE_THE_REST_NATIVE_NOCON_6909_1_0 (minirosetta 1.52) I hit the “Show graphics” button, and Windows XP replied with an error message: Runtime error. Short message report: Hung application: minirosetta_graphics_1.40_windows_intelx86.exe, version: 0.0.0.0, hung module: hungapp, version: 0.0.0.0, hung at: 0x00000000. Retrying “Show graphics” opened the graphics window, empty. At 16 minutes of processor time the processor time stood still, CPU usage = 0%. I stopped and restarted BOINC, and the WU proceeded from where it had stopped. I tried the “Show graphics” button again, and the above repeated. Now the WU has started for the third time and keeps on running. I don't dare to touch the “Show graphics” button again! Windows XP SP3, single core AMD Sempron 3000+ 1.8 GHz, 1.5 GB RAM, BOINC 5.10.45. Path7. |
Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
This one failed (1st time) 1259966 Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0044EFF4 read attempt to address 0x01062000 Engaging BOINC Windows Runtime Debugger... |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
This one failed (1st time) Yeah, guess what. I just found a bug in the BOINC API! Holy crap. Basically, as far as I can see, there's a memory leak when it's trying to unzip files. Mostly all you see is the application dying kicking and screaming just after initialization. One RALPH user, though, produced a suspicious trace (thank you philip in hongkong!). I have to stress that this is the only job out of hundreds of such failures that has returned with a trace.
It fails in the unzip code! OMG. I've been tinkering with the code; the bug probably stems from a single byte not being set to 0. This explains the sporadic nature: if the relevant byte happens to already be 0, then all is fine. I'll push out a version soon to see if the fix works. It's all speculation at this point. Also, sorry about the graphics error: I increased the buffer sizes and (duh!) forgot to also update the graphics app. I'll do that together with 1.53. Mike |
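To illustrate why a single unset byte would give such sporadic failures, here is a minimal sketch in Python. It is not the BOINC unzip code; the buffer contents and the helper names (`write_without_terminator`, `write_with_terminator`, `read_c_string`) are made up for illustration. It simulates a C-style string whose terminating 0 byte is only present when the buffer happened to be zeroed beforehand.

```python
def write_without_terminator(buf: bytearray, text: bytes) -> None:
    """Copy text into the buffer but forget to write the trailing 0 byte (the bug)."""
    buf[:len(text)] = text


def write_with_terminator(buf: bytearray, text: bytes) -> None:
    """Copy text into the buffer and explicitly set the terminating byte (the fix)."""
    buf[:len(text)] = text
    buf[len(text)] = 0


def read_c_string(buf: bytearray) -> bytes:
    """Read up to the first 0 byte, the way C string functions do."""
    end = buf.find(0)
    return bytes(buf) if end == -1 else bytes(buf[:end])


# Case 1: the memory happened to be all zeros, so the bug stays hidden.
clean = bytearray(16)
write_without_terminator(clean, b"./")
print(read_c_string(clean))   # b'./'

# Case 2: the memory holds leftovers from earlier use, so the bug shows up.
dirty = bytearray(b"old_path/data\x00\x00\x00")
write_without_terminator(dirty, b"./")
print(read_c_string(dirty))   # b'./d_path/data'

# With the terminator written explicitly, the result no longer depends on old contents.
fixed = bytearray(b"old_path/data\x00\x00\x00")
write_with_terminator(fixed, b"./")
print(read_c_string(fixed))   # b'./'
```

In real C code the second case can also read past the end of the buffer, which would fit an access violation rather than a merely wrong path, but that part is not modelled here.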
Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
I have noticed though that some of the problems seem to be highly machine specific. Like one user will always produce this one kind of an error and nothing else. Weird. Something strange about their setup? No idea. Well, then it is time to start asking that user ... :) I know that some will OC their machine to the edge of instability and fail to recognize the implications of that ... Other cases can be because they are running other applications / projects and that might be the cause ... like running GPU Grid may raise the temperature enough at some times to cause the instability, but not at other times ... So, the problem may not be the application but the specific hardware ... I remember once where I reported a problem where the device we were testing would cause a tape read error ... impossible, the engineers said ... I showed them ... still impossible ... not really ... the compiler was putting bad instructions on the tape, and under an error condition the tape could not be read ... Anyway ... Just trying to stimulate the brain cells with some brainstorming ... I do appreciate the feedback though ... and I got another task and it should be back in a little bit ... and yes, fascinated by what we as a collective are trying to do ... Oh, and about the same RND seed: unless the machines are of the same specific type with identical CPU and OS, the RND generator may or may not issue the same sequence of random numbers. Especially if the core of the generator relies on the noise at the end of floating point numbers ... Virtual Prairie is having this exact problem with their work ... |
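The point about "noise at the end of floating point numbers" can be shown with a small sketch in Python. It says nothing about how minirosetta's generator actually works; it only shows that regrouping a floating-point sum changes the last bits (which is how different compilers or CPUs can end up with different values), while an integer-only generator repeats exactly. The `lcg` function is a hypothetical example using commonly cited linear congruential constants.

```python
# Floating-point addition is not associative: regrouping changes the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False with IEEE 754 doubles


def lcg(seed, count, a=1664525, c=1013904223, m=2**32):
    """Integer-only linear congruential generator (illustrative, not Rosetta's RNG).

    Every step is exact integer arithmetic, so the same seed gives the
    same sequence regardless of how floating point behaves on the host.
    """
    state = seed % m
    values = []
    for _ in range(count):
        state = (a * state + c) % m
        values.append(state)
    return values


print(lcg(42, 3) == lcg(42, 3))  # True
```

A generator that feeds floating-point results back into its own state can inherit the first effect, which would match the behaviour described above.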
Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0 |
O.K., I got the new 1.53 version - as soon as 1.52 finishes we will see how it behaves. :) EDIT: So far O.K., and the graphics are working... |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
1.53 is out. This includes a fix in the API for the crashes when unzipping .zip files. I hope. Fingers crossed ;) Also the graphics are updated and should not freeze. What I'd like to know from you: - Do you ever get any long-running tasks? (longer than PrefRuntime + 4 hrs) - Do the checkpoints work and honor the user's setting? (Feet1st?) - Do you get any jobs that are stuck? - Do the graphics behave properly again? Thanks ya all. Enjoy the weekend ;) Mike |
Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0 |
Hi, currently I'm running sr213_t077_1_NMR_NESG_SAVE_ALL_OUT_6972_24_0. - Do the graphics behave properly again ? Yes, they are working - however, there was a funny thing. This WU has a native protein and RMSD, but at one point the RMSD part of the graphics (showing left/right how good the RMSD is) was not working. It was just blank. The low energy part was O.K. After turning the graphics off and on again a minute or two later, everything was just fine. Regarding checkpointing: according to boinccmd.exe it makes checkpoints every 100+ to 200+ seconds. That is quite close to my preference (120 sec). EDIT: I ran an experiment to check the checkpointing: I turned off the client and then turned it back on. The task resumed from over 26 minutes (and according to boinccmd the checkpoint was at 1591 seconds, so it would fit). However, the WU started from Model 1, step 0, and moved forward. On the other hand, the WU quickly went into SmoothFragmentMover_GunnCost and it didn't look that bad, so I am puzzled: I am not sure if it really started from the beginning, or if the checkpointing worked but the step information is misleading. I think it must be checked at some later step, when the protein is really pretty and the difference is easy to tell - but now I must go to sleep. :D EDIT 2: I have repeated my experiment and I am pretty sure the checkpointing is not working correctly. I stopped the WU after 46 minutes (2524 seconds) when it was quite beautiful (minus three hundred something energy etc.) and restarted. Yet again it started from step 0, and then it started to behave in a strange manner. There was no really folded protein, only one straight chain - but the program was clearly trying to use methods proper to the last stage of the process (namely MoverBase+Minimization).
In a minute or so the chain was bent at something like two points and it got "fractalized" by the last step of the procedure (when you have all these small strings moving to get the lowest possible energy). If you want, I can provide you with a snapshot (a screenshot of the graphics). Everything ended after circa 200 seconds (for a couple of seconds there was "stage: unknown") and got reported. Have a nice read. Interesting thing: it reduced the granted credit; claimed was 7.06, granted 4.68. Well, one must suffer for science. =) And now let me get some sleep... P.S. The only thing that interfered with the experiment was that the second time I had to halt a regular Rosetta task (I think I had halted the Rosetta task, but this time after restarting BOINC the manager didn't remember it, huh). I don't know if it could interfere - if it did, it would mean that some used variables are not being reset. If it were not for the experiment, I am pretty sure the task would have run all right. |
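The stop-and-restart experiment above boils down to one check: a run that is interrupted and resumed should end up in the same place as a run that was never interrupted. A minimal sketch of that check in Python, under the assumption that a checkpoint stores the step counter, the accumulated value and the random generator state (the `run` function and its fields are hypothetical, not minirosetta's checkpoint format):

```python
import random


def run(total_steps, seed=0, checkpoint=None, stop_at=None):
    """Toy 'folding' loop: accumulate random moves, optionally stopping or resuming.

    Returns (state, finished). The state dict plays the role of a checkpoint.
    """
    rng = random.Random(seed)
    step, energy = 0, 0.0
    if checkpoint is not None:
        # Resume: restore the step counter, the value so far AND the RNG state.
        step, energy = checkpoint["step"], checkpoint["energy"]
        rng.setstate(checkpoint["rng_state"])
    while step < total_steps:
        if stop_at is not None and step == stop_at:
            return {"step": step, "energy": energy, "rng_state": rng.getstate()}, False
        energy += rng.uniform(-1.0, 1.0)
        step += 1
    return {"step": step, "energy": energy, "rng_state": rng.getstate()}, True


uninterrupted, _ = run(1000)
saved, finished = run(1000, stop_at=400)       # simulate shutting the client down
resumed, _ = run(1000, checkpoint=saved)       # simulate restarting it

print(saved["step"])                                   # 400, not 0
print(resumed["energy"] == uninterrupted["energy"])    # True if the restore is complete
```

If the step counter were restored but the rest of the state were not (or the other way round), the resumed run would diverge from the uninterrupted one, which is roughly the kind of mismatch described in EDIT 2.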
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
The checkpoints seem more in line with my settings as of 1.52. I'm a bit unclear as to what exactly to expect now. Sounds like if I set my preference to "write at most..." 600 seconds (10 min), I should expect the following... I'll see checkpoint debug messages in my messages tab, but it may be that no data was written to disk? And I'll see for a given task, within a given model, that it won't write more often than every 10 min? ...except if it reaches the end of a model, at which point it will write, regardless of how recently the last checkpoint was?? So, a hypothetical example (time in mm:ss, then event): 00:00 model start; 01:15 checkpoint (but not written); 02:45 checkpoint (but not written); 05:15 checkpoint (but not written); 09:15 checkpoint (but not written); 11:30 checkpoint, all of the above written to disk; 12:10 checkpoint (but not written); 13:40 checkpoint (but not written); 14:25 model completed, write to disk, even though only 3 min since last write. I think the above is what I should expect. Now... I've got a P4 running with HT, so 2 virtual cores... should I expect the tasks to behave independently? i.e. each on their own 10 min "write at most" timer? Or should I expect them to both buffer for 10 min unless they reach a model end? And if a model end is reached on one, will that cause the buffers for the other to be written as well? Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes? |
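The behaviour guessed at in the hypothetical timeline above can be written down as a small simulation. This is only a sketch of that guess in Python, not the actual minirosetta or BOINC logic; the class name `CheckpointThrottle` and its fields are invented for the example.

```python
class CheckpointThrottle:
    """Buffer checkpoints and flush them to 'disk' at most every min_interval_s seconds,
    except at the end of a model, where a flush is always forced."""

    def __init__(self, min_interval_s, start_time_s=0):
        self.min_interval_s = min_interval_s
        self.last_write_s = start_time_s
        self.pending = 0          # checkpoints taken but not yet written
        self.writes = []          # (time in seconds, checkpoints flushed)

    def checkpoint(self, now_s, model_end=False):
        self.pending += 1
        if model_end or now_s - self.last_write_s >= self.min_interval_s:
            self.writes.append((now_s, self.pending))
            self.pending = 0
            self.last_write_s = now_s
            return True
        return False


def mmss(text):
    """Convert 'mm:ss' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


throttle = CheckpointThrottle(min_interval_s=600, start_time_s=mmss("00:00"))
for stamp in ["01:15", "02:45", "05:15", "09:15", "11:30", "12:10", "13:40"]:
    written = throttle.checkpoint(mmss(stamp))
    print(stamp, "written" if written else "buffered")
print("14:25", "written" if throttle.checkpoint(mmss("14:25"), model_end=True) else "buffered")

print(throttle.writes)  # [(690, 5), (865, 3)]
```

With a 600 second preference, eight checkpoint messages turn into two disk writes, one at 11:30 and one forced at the model end. One throttle object per task would correspond to the "independent timers" case asked about for the two HT cores.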
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes? OH! Is that what all these new files are?? chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.out chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.rng.state chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.out chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.rng.state In fact I see many files now in my slot directory. They seem to come in blocks of 4, all written with the same timestamp. But the names vary, probably depending on which stage of the model took the checkpoint. OK, yes, my "boinc_checkpoint_count" file says 92, but I've got nowhere near that many files, and even fewer blocks of time where files were written. Yeah, I've only got 14 different timestamps. So, that sounds good. I've got another docking task that says it's on model 258 after 22 hrs (my preference is 24 hrs), so its checkpoint count says 257, and I don't have any of the other files. Perhaps it is not taking checkpoints? The default.out file on that one is 43MB so far! Man, is *THAT* upload going to take a while! |
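The manual count above (number of `chk_` files versus number of distinct timestamps) can be automated. A minimal sketch in Python, assuming only what the post describes: checkpoint files in the slot directory start with `chk_`, and files flushed together share a modification time. The function name and the idea of grouping by whole seconds are choices made for this example.

```python
import os
from collections import Counter


def summarize_checkpoint_files(slot_dir, prefix="chk_"):
    """Count checkpoint files and the number of distinct write times among them.

    Files whose modification times fall in the same second are treated as
    one disk-write event.
    """
    per_second = Counter()
    for entry in os.scandir(slot_dir):
        if entry.is_file() and entry.name.startswith(prefix):
            per_second[int(entry.stat().st_mtime)] += 1
    return {
        "files": sum(per_second.values()),
        "write_events": len(per_second),
    }


# Usage (the path is a placeholder for your own BOINC slot directory):
#   summary = summarize_checkpoint_files("/path/to/BOINC/slots/0")
#   print(summary["files"], "files in", summary["write_events"], "write events")
```

Comparing `write_events` with the number in the "boinc_checkpoint_count" file then gives the ratio of checkpoint messages to actual disk writes.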
Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0 |
feet1st, others: could you consider running a test similar to the one I described above? Or maybe you can come up with some further/better tests? I guess there is no better way to see if the checkpointing is really working. And I am awfully sorry for my subprime English ;) - it's 4:30 a.m. here and I am rushing to my bed. :) |
Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Graphics ... Well, for me on the Mac Pro (Intel), trying the graphics got the system to start to initialize and then die ... never got the graphics window up ... I don't have a Rosetta task in work at the moment, but I am pretty sure that with 1.47 the graphics do work on OS X ... they do not for me with 1.52 (maybe I need 1.53?). Well, I will see when the other tasks become available and I can run them and try the graphics ... Could it be because I am running GPU Grid on my other computers and the Mac is jealous? {edit add} It looks like the task that failed for graphics did complete ... But this task failed for unknown model name ... Not sure why ... and I just caught it ...
|
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Yep, I was afraid of that... 1/23/2009 11:12:31 PM|ralph@home|Computation for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 finished 1/23/2009 11:12:31 PM|ralph@home|Output file 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0_0 for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 exceeds size limit. 1/23/2009 11:12:31 PM|ralph@home|File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 The file was too large, so it calls it a compute error. Looks like it doesn't actually send the file when it grows too large. Too bad. It was 24 seconds short of 24hrs when it "failed", apparently writing the last model to the file. Oh well... THIS is why I chose to run 24hr tasks on Ralph. To reveal these things. On Rosetta I do it because I like a tidy task list, and to minimize hits to the server. |
Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
I just checked a 1.53 task and the graphics window came up and seemed to post updates ... too tired to look too hard at it (sorry), but it does look like what I saw with 1.52 is fixed in 1.53 (or was it the task?) ... |
Paul D. Buck Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Yep, I was afraid of that... Hmmm, shouldn't that be a test? When it writes to the file it should check to see if it is nearing the limit, and if so, stop ... I mean, 24 hours' worth of work down the tubes, other than that we discovered the error ... (thank you) ... |
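The check suggested above could look something like the following. This is a sketch in Python of the idea only, not how minirosetta writes its output; the function `append_if_room` is hypothetical, and the only number taken from the thread is the 25,000,000 byte limit shown in the client log.

```python
import os

# Upload limit reported in the client log quoted earlier in the thread.
MAX_OUTPUT_BYTES = 25_000_000


def append_if_room(path, record: bytes, limit=MAX_OUTPUT_BYTES) -> bool:
    """Append record to the output file only if the file stays within the limit.

    Returns False (and writes nothing) when the record would push the file
    over the limit, so the caller can stop producing models and finish cleanly.
    """
    current = os.path.getsize(path) if os.path.exists(path) else 0
    if current + len(record) > limit:
        return False
    with open(path, "ab") as out:
        out.write(record)
    return True


# The failed task's file was 30,524,004 bytes, i.e. this far over the limit:
print(30_524_004 - MAX_OUTPUT_BYTES)  # 5524004
```

A caller would treat a `False` return as "stop here and report what has been written", rather than running on and losing the whole result at upload time.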
Ian_D Joined: 16 Feb 06 Posts: 16 Credit: 39,518 RAC: 0 |
version 1.53 https://ralph.bakerlab.org/result.php?resultid=1263355 <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> BOINC:: Initializing ... ok. [2009- 1-24 8:50:37:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing core... Initializing options.... ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip> <-d> <./> Firstarg=true; pp=-d first check it's not -dexdir firstarg: <./> End of unzipping. Unpacking WU data ... Unpacking data: ../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip> End of unzipping. Setting database description ... Setting up checkpointing ... Setting up folding (abrelax) ... Beginning folding (abrelax) ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Starting work on structure: _1CYNA_10_00001 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00572136 read attempt to address 0xAB9812D1 Engaging BOINC Windows Runtime Debugger... 
******************** BOINC Windows Runtime Debugger Version 6.5.0 Dump Timestamp : 01/24/09 09:25:27 Install Directory : D:\Program Files\BOINC Data Directory : D:\ProgramData\BOINC Project Symstore : https://boinc.bakerlab.org/rosetta/symstore Loaded Library : D:\Program Files\BOINC\dbghelp.dll Loaded Library : D:\Program Files\BOINC\symsrv.dll Loaded Library : D:\Program Files\BOINC\srcsrv.dll LoadLibraryA( D:\Program Files\BOINC\version.dll ): GetLastError = 126 Loaded Library : version.dll Debugger Engine : 4.0.5.0 Symbol Search Path: D:\ProgramData\BOINC\slots\1;D:\ProgramData\BOINC\projects\ralph.bakerlab.org;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*http://msdl.microsoft.com/download/symbols;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*https://boinc.bakerlab.org/rosetta/symstore;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*http://boinc.berkeley.edu/symstore |
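For anyone puzzled by the two forms of the exit code in the report above: -1073741819 and 0xc0000005 are the same 32 bits, read once as a signed and once as an unsigned number. The same applies to "code 193 (0xc1, -63)" reported elsewhere in this thread, read as 8 bits. A small Python sketch of the conversion (the helper names are made up for this example):

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def to_unsigned(value, bits):
    """Interpret a (possibly negative) integer as an unsigned value of `bits` bits."""
    return value & ((1 << bits) - 1)


print(to_signed(0xC0000005, 32))            # -1073741819
print(hex(to_unsigned(-1073741819, 32)))    # 0xc0000005
print(to_signed(0xC1, 8))                   # -63
print(to_unsigned(-63, 8))                  # 193
```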
Aegis Maelstrom Joined: 19 Jan 09 Posts: 12 Credit: 4,751 RAC: 0 |
...and now something completely different. Mike, do you need more information about the actual behaviour of your prediction models? Just to fine-tune the energy function, folding procedure etc.? One thing we all know is that some models of a particular WU take much more time than the others. It's clearly seen when we have "long running WUs"; however, it happens quite often as well in WUs which have very short models. In these cases you need to actually watch it or have good logs to see that, e.g., most of the models run in 5 minutes and one takes 15-20. I've been having such a task right now: testC_cc_1_8_nocst4_hb_t288__IGNORE_THE_REST_2FNEA_2_6976_1_0. For many minutes the accepted screen was showing a very simple chain split in two, a longer and a shorter part, and the shorter one was standing still, showing some quite folded structure. It took over 400,000 steps to get something semi-folded and more complicated on the "accepted" screen and go further. This model (the 12th) took much more time, and as the forecast runtime of a model increased, it was the last decoy allowed by the scheduler. All of this looked like the procedure needed much more time to hit something acceptable to start with. I'm sure you know about this issue, but I have seen such behaviour a couple of times before and just wanted to let you know it is quite common. P.S. Besides that, everything - except checkpointing, see my posts above - is working like a charm. P.P.S. Great work feet1st! |
Path7 Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0 |
Hello mtyka, - Do the graphics behave properly again ? After having problems with the graphics running minirosetta 1.52, the combination of minirosetta 1.53 & minirosetta graphics 1.53 seems to work smoothly again. Thanks and a good weekend, Path7. |
ramostol Joined: 29 Mar 07 Posts: 24 Credit: 31,121 RAC: 0 |
After Rosetta 1.47 totally crashing on my MacBook on 27 Dec 2008 (after one week of faultless computing, within hours after connecting to the Rosetta server; still haven't recovered) I have waited for an opportunity to test what is happening on Ralph. At last I received two 1.53-tasks - however, they show exactly the same symptoms as observed using 1.47 on Rosetta: abinitio_norelax_homfrag_natfrag_129_B_1pxuA_SAVE_ALL_OUT_7037_1_0 <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> This one crashed after 30 sec. ---- abinitio_norelax_homfrag_natfrag_129_B_1ctf__SAVE_ALL_OUT_7037_1_0 <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> This one is reported using Boinc 6.6.2 but computed with 6.6.1. It started out relatively normally. But after about one hour’s computing time, steadily increasing "To completion" time, progress stuck at 0.240%, I had to conclude that this wu would turn into an everlasting task. "Show graphics" did not respond, even "Quit Boinc" was greyed out. After rebooting to install Boinc 6.6.2, and logging on as administrative and non-administrative user the wu quit after restarting using 32 sec. I should add that I mostly use this computer as a non-administrator, which may have some bearing (on my PPC it influences my graphics display at least). However, I do not receive more units to check, so I can not test the behaviour when changing configurations. ---- PS For the curious: This is the report from the first wu crashing on Dec 27 2008 on Rosetta (I haven't too much time analyzing such data....). 
The error message is a bit more extensive than the messages for the rest of the bunch: https://boinc.bakerlab.org/rosetta/result.php?resultid=216101560 (obsolete link) Task ID 216101560 Name 1r9pA_BOINC_MPZN_vanilla_abrelax_5901_6648_0 Workunit 196949149 Created 21 Dec 2008 12:53:24 UTC Sent 21 Dec 2008 13:16:26 UTC Received 27 Dec 2008 14:15:44 UTC Server state Over Outcome Client error Client state Compute error Exit status 193 (0xc1) Computer ID 936507 Report deadline 31 Dec 2008 13:16:26 UTC CPU time 20.36099 stderr out <core_client_version>6.5.0</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> minirosetta_1.47_i686-apple-darwin(69428,0xa031b720) malloc: *** error for object 0x1748220: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug minirosetta_1.47_i686-apple-darwin(69428,0xa031b720) malloc: *** error for object 0x1748220: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug SIGBUS: bus error Crashed executable name: minirosetta_1.47_i686-apple-darwin built using BOINC library version 6.5.0 Machine type Intel 80486 (32-bit executable) System version: Macintosh OS 10.5.6 build 9G55 Sat Dec 27 14:45:29 2008 |
©2024 University of Washington
http://www.bakerlab.org