minirosetta v1.48-1.51 bug thread

Evan

Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4509 - Posted: 24 Jan 2009, 0:17:01 UTC
Last modified: 24 Jan 2009, 0:38:32 UTC

I'm showing the same or very similar behavior on the same type of workunit: lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909


I'm also getting problems on several work units starting: test_B_cc ....
I get either a hung empty screen or a runtime error.

Edit: The graphics problem doesn't affect progress of the work unit.
ID: 4509
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4510 - Posted: 24 Jan 2009, 0:17:28 UTC

Hi Mike, Hi Fellows!

I'm just running Mini 1.52, this task.

Actually, I only noticed that the RALPH WU had started because of an alert: BOINC wanted to turn the screensaver on, and I got a Windows message about a runtime error and the sudden termination of the 1.40 graphics viewer. The process was killed by the system.

So far the WU seems to be continuing; I'm at 3.84% (0:11:35) and running.
ID: 4510
Path7
Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4511 - Posted: 24 Jan 2009, 0:25:28 UTC
Last modified: 24 Jan 2009, 0:32:34 UTC

Hello all,

While crunching:
lr6_D_score12_rlbn_1ynv_IGNORE_THE_REST_NATIVE_NOCON_6909_1_0 (minirosetta 1.52)
I hit the “Show graphics” button and Windows XP replied with an error message: Runtime error.
Short error report:
Hung application: minirosetta_graphics_1.40_windows_intelx86.exe, version: 0.0.0.0, hung module: hungapp, version: 0.0.0.0, hung at: 0x00000000.

Retrying “Show graphics” opened the graphics window, but it was empty.
At 16 minutes of processor time the processor time stopped advancing; CPU usage = 0%.
I stopped and restarted BOINC, and the WU resumed from where it had stopped.

I hit the “Show graphics” button again, and the above repeated.
Now the WU has started for the third time and keeps on running. I don't dare to touch the “Show graphics” button again!

Windows XP SP3, single core AMD Sempron 3000+ 1.8 GHz, 1.5 GB RAM, BOINC 5.10.45.

Path7.
ID: 4511
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4512 - Posted: 24 Jan 2009, 0:27:20 UTC

This one failed (1st time)
1259966

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044EFF4 read attempt to address 0x01062000

Engaging BOINC Windows Runtime Debugger...
ID: 4512
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4513 - Posted: 24 Jan 2009, 0:43:10 UTC - in response to Message 4512.  
Last modified: 24 Jan 2009, 0:44:09 UTC

This one failed (1st time)
1259966

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044EFF4 read attempt to address 0x01062000

Engaging BOINC Windows Runtime Debugger...


Yeah, guess what: I just found a bug in the BOINC API! Holy crap. Basically, as far as I can see, there's a memory leak when it's trying to unzip files. Mostly all you see is the application dying, kicking and screaming, just after initialization. One RALPH user, though, produced a suspicious trace (thank you, philip in hongkong!). I have to stress that this is the only job out of hundreds of such failures that has returned with a trace.


LENOVO-A05B19D0 (10.9.3.121) [16000]
User ID philip-in-hongkong [2191]
CPU time 0
XML doc in

6.2.18

- exit code -1073741819 (0xc0000005)


BOINC:: Initializing ... ok.
[2009- 1-22 17:43:47:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044FA48 read attempt to
- Callstack -
ChildEBP RetAddr Args to Child
0012ecbc 0044fbc9 0114f008 00000001 010f1ffc 00000004 minirosetta_1.51_windows_intelx!unzip+0x0 (d:boinc_buildminirosetta_windowsminiexternalboinczipunzipunzip.c:943)
0012ecd4 0044e8f0 00000004 010f1ff0 14f97ded 0000000f minirosetta_1.51_windows_intelx!unzip_main+0x0 (d:boinc_buildminirosetta_windowsminiexternalboinczipunzipunzip.c:629)
0012ed10 0044ea7d 00000000 00000000 0108df08 0000000f minirosetta_1.51_windows_intelx!boinc_zip+0x7 (d:boinc_buildminirosetta_windowsminiexternalboinczipboinc_zip.cpp:151)
0012ed68 0041981a 00000000 0000000f 0108bd58 0012ffc0 minirosetta_1.51_windows_intelx!boinc_zip+0x32 (d:boinc_buildminirosetta_windowsminiexternalboinczipboinc_zip.cpp:73)
0012ef14 0041a015 0000001f 0012ef2c 00152310 0012ef2c minirosetta_1.51_windows_intelx!main+0x55 (d:boinc_buildminirosetta_windowsminisrcappspublicboincminirosetta.cc:85)
0012ff28 0042cdf7 00400000 00000000 00152352 0000000a minirosetta_1.51_windows_intelx!WinMain+0x0 (d:boinc_buildminirosetta_windowsminisrcappspublicboincminirosetta.cc:159)
0012ffc0 7c816fd7 00000000 00000000 7ffd3000 c0000005 minirosetta_1.51_windows_intelx!__tmainCRTStartup+0x1c (f:spvctoolscrt_bldself_x86crtsrccrt0.c:324)
0012fff0 00000000 0042ce60 00000000 78746341 00000020 kernel32!_BaseProcessStart@4+0x0 (f:spvctoolscrt_bldself_x86crtsrccrt0.c:324)

*** Dump of thread ID 4252 (state: Waiting): ***



It fails in the unzip code! OMG.
I've been tinkering with the code, and the bug probably stems from a single byte not being set to 0. This explains the sporadic nature: if the relevant byte happens to already be 0, then all is fine. I'll push out a version soon to see if the fix works. It's all speculation at this point.
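
To illustrate why one unset byte would give such sporadic crashes, here is a minimal Python sketch. It is not the BOINC unzip code: `make_buffer` and `read_c_string` are made-up names that only mimic a C string whose terminating 0 byte is never written, so the read succeeds only when the leftover byte happens to be 0.

```python
def make_buffer(text: bytes, leftover: int, terminate: bool) -> bytes:
    """Build a buffer of len(text) + 1 bytes pre-filled with `leftover`
    (standing in for whatever was in memory before), then copy text in.
    The final byte is only set to 0 when `terminate` is True."""
    buf = bytearray([leftover]) * (len(text) + 1)
    buf[:len(text)] = text
    if terminate:
        buf[len(text)] = 0
    return bytes(buf)


def read_c_string(memory: bytes, start: int = 0) -> bytes:
    """Scan forward until a 0 byte, the way C string functions do.
    Raise IndexError if the scan runs off the end of the buffer,
    which stands in for the access violation in the real program."""
    end = memory.find(b"\x00", start)
    if end == -1:
        raise IndexError("no terminating 0 byte: read ran past the buffer")
    return memory[start:end]


# Lucky case: the leftover byte already happens to be 0, so nothing fails.
print(read_c_string(make_buffer(b"-oq", leftover=0x00, terminate=False)))

# Unlucky case: same code, different leftover byte, and the read overruns.
try:
    read_c_string(make_buffer(b"-oq", leftover=0x41, terminate=False))
except IndexError as err:
    print("crash:", err)

# Fixed case: the byte is always set explicitly, whatever was there before.
print(read_c_string(make_buffer(b"-oq", leftover=0x41, terminate=True)))
```

The only difference between the lucky and unlucky runs is the leftover byte, which matches a failure that shows up on some machines and not on others.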

Also, sorry about the graphics error: I increased the buffer sizes and (duh!) forgot to also update the graphics app. I'll do that together with 1.53.

Mike
ID: 4513
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4514 - Posted: 24 Jan 2009, 0:49:24 UTC - in response to Message 4502.  

I have noticed, though, that some of the problems seem to be highly machine-specific. One user will always produce one particular kind of error and nothing else. Weird. Something strange about their setup? No idea.


Well, then it is time to start asking that user ... :)

I know that some will OC their machine to the edge of instability and fail to recognize the implications of that ...

Other cases can be because they are running other applications / projects and that might be the cause ... like running GPU Grid may raise the temperature enough at some times to cause the instability, but not other times ... So, it may not be the problem but the specific hardware ...

I remember once where I reported a problem where the device we were testing would cause a tape read error ... impossible the engineers said ... I showed them ... still impossible ... not really ... the compiler was putting bad instructions on the tape and under an error condition the tape could not be read ...

Anyway ...

Just trying to stimulate the brain cells with some brain storming ... I do appreciate the feedback though ... and I got another task and it should be back in a little bit ... and yes, fascinated in what we as a collective are trying to do ...

Oh, and about the same RND seed: unless the machines are of the same specific type with identical CPU and OS, the RND generator may or may not issue the same sequence of random numbers. Especially if the core of the generator relies on the noise at the end of floating point numbers ... Virtual Prairie is having this exact problem with their work ...
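
A tiny illustration of the "noise at the end of floating point numbers" point, assuming nothing about how the minirosetta generator actually works: adding the same three numbers in a different order already changes the last bits of the result, and anything that feeds on such values can then diverge between machines or compilers that order the operations differently.

```python
def sum_in_order(values):
    """Add the values one by one, in exactly the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

print(repr(forward), repr(backward))
print("identical:", forward == backward)
```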
ID: 4514
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4515 - Posted: 24 Jan 2009, 2:13:24 UTC
Last modified: 24 Jan 2009, 2:25:53 UTC

O.K., I got the new 1.53 version. As soon as the 1.52 task finishes we will see how it behaves. :)

EDIT: So far O.K., and the graphics are working...
ID: 4515
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4516 - Posted: 24 Jan 2009, 2:24:49 UTC

1.53 is out. This includes a fix for the API bug that caused crashes when unzipping .zip files. I hope. Fingers crossed ;)

Also the graphics are updated and should not freeze.

What I'd like to know from you:

- Do you ever get any long running tasks ? (longer than PrefRuntime + 4 hrs)
- Do the checkpoints work and honor the user's setting ? (Feet1st ? )
- Do you get any jobs that are stuck ?
- Do the graphics behave properly again ?

Thanks, y'all.

Enjoy the weekend ;)

Mike
ID: 4516
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4517 - Posted: 24 Jan 2009, 2:43:05 UTC - in response to Message 4516.  
Last modified: 24 Jan 2009, 3:24:16 UTC

Hi,

currently I'm running sr213_t077_1_NMR_NESG_SAVE_ALL_OUT_6972_24_0.

- Do the graphics behave properly again ?

Yes, it is working. However, there was a funny thing.
This WU has a native protein and RMSD, but at one point the RMSD part of the graphics (showing left/right how good the RMSD is) was not working. It was just blank.
The low energy part was O.K.

After turning the graphics off and back on a minute or two later, everything was just fine.

Regarding checkpointing: according to boinccmd.exe it does make checkpoints every 100 to 200-odd seconds. That is quite close to my preferences (120 sec).

EDIT: I ran an experiment to check the checkpointing: I turned off the client and afterwards turned it back on.

The task resumed from a little over 26 minutes (and according to boinccmd the checkpoint was at 1591 seconds, so that would fit). However, the WU started from Model 1, step 0, and moved forward from there.
On the other hand, the WU quickly went into SmoothFragmentMover_GunnCost and it didn't look that bad, so I am puzzled: I am not sure whether it really started from the beginning, or whether the checkpointing worked but the step information is misleading.
I think this must be checked at some later step, when the protein is really pretty and the difference is easy to tell, but now I must go to sleep. :D


EDIT 2: I have repeated my experiment and I am pretty sure the checkpointing is not working correctly.

I've stopped the WU after 46 minutes (2524 seconds) when it was quite beautiful (minus three hundred something energy etc.) and restarted.

Yet again it started from step 0, and then it started to behave in a strange manner. There was no really folded protein, only one straight chain, but the program was clearly trying to use methods appropriate to the last stage of the process (namely MoverBase+Minimization). In a minute or so the chain was bent in something like two points and it got "fractalized" by the last step of the procedure (when you have all these small strings moving to get the lowest possible energy).

If you want, I can provide you with a snapshot (printscreen of the graphics).

Everything ended after circa 200 seconds (for a couple of seconds there was "stage: unknown") and got reported.

Happy reading.

Interesting thing: it reduced the granted credit; claimed was 7.06, granted: 4.68.

Well, one must suffer for the science. =) And now let me get some sleep...


P.S. The only thing that collided with the experiment was that the second time I had to halt a regular Rosetta task (I think I had halted Rosetta, but this time after restarting BOINC the manager didn't remember it, huh). I don't know if it could interfere; if it did, it would mean that some used variables are not being reset.

If it were not for the experiment, I am pretty sure the task would run alright.
ID: 4517
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4518 - Posted: 24 Jan 2009, 2:53:05 UTC

The checkpoints seem more in line with my settings as of 1.52.

I'm a bit unclear as to what exactly to expect now. Sounds like if I set my preference to "write at most..." 600 seconds (10min) that I should expect the following...

I'll see checkpoint debug messages in my messages tab, but it may be that no data was written to disk?

And I'll see for a given task, within a given model, that it won't write more often than every 10min?

...except if it reaches the end of a model, at which point it will write, regardless of how recently the last checkpoint was??

So, hypothetical example:

Time (mm:ss)  Event
00:00 Model start
01:15 checkpoint (but not written)
02:45 checkpoint (but not written)
05:15 checkpoint (but not written)
09:15 checkpoint (but not written)
11:30 checkpoint, all of the above written to disk
12:10 checkpoint (but not written)
13:40 checkpoint (but not written)
14:25 model completed, write to disk, even though only 3min since last write.
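
A small sketch of the behaviour guessed at in the timeline above, assuming a single "write at most every N seconds" timer per task and an unconditional write at model end. The function name and the rule itself are assumptions taken from this post, not from the minirosetta source.

```python
def plan_disk_writes(checkpoints, model_end, min_interval=600):
    """Return (time_in_seconds, written_to_disk) for each checkpoint.

    A checkpoint is written only if at least `min_interval` seconds have
    passed since the last write (the model start counts as time 0).
    The model end is always written, whatever the timer says.
    """
    last_write = 0
    events = []
    for t in checkpoints:
        due = (t - last_write) >= min_interval
        if due:
            last_write = t
        events.append((t, due))
    events.append((model_end, True))
    return events


# The hypothetical times from the table, converted from mm:ss to seconds.
checkpoints = [75, 165, 315, 555, 690, 730, 820]
events = plan_disk_writes(checkpoints, model_end=865, min_interval=600)

for t, written in events:
    label = "written to disk" if written else "buffered only"
    print(f"{t // 60:02d}:{t % 60:02d}  {label}")
```

With the 10 minute setting this reproduces the table: only the 11:30 checkpoint and the 14:25 model end reach the disk.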

I think the above is what I should expect. Now... I've got a P4 running with HT, so 2 virtual cores... should I expect the tasks to behave independently? i.e. each on their own 10 min "write at most" timer? Or should I expect them to both buffer for 10min. unless they reach a model end? And if a model end is reached on one, will that cause the buffers for the other to be written as well?

Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes?
ID: 4518
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4519 - Posted: 24 Jan 2009, 3:03:50 UTC
Last modified: 24 Jan 2009, 3:09:27 UTC

Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes?


OH! Is that what all these new files are??
chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.out
chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.rng.state
chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.out
chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.rng.state

In fact I see many files now in my slot directory. They seem to come in blocks of 4, all written with the same timestamp. But the names vary, probably depending on which stage of the model took the checkpoint.

Ok, yes, my "boinc_checkpoint_count" file says 92, but I've got nowhere near that many files, and even fewer blocks of time where files were written. Ya, I've only got 14 different timestamps. So, that sounds good.
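
One way to track this over time might be a small script like the sketch below. It assumes the slot directory holds checkpoint files named chk_* (as listed in this post) and that boinc_checkpoint_count contains a plain integer; both are guesses from what shows up in the slot directory, not documented behaviour, and `summarize_checkpoints` is a made-up name.

```python
from pathlib import Path


def summarize_checkpoints(slot_dir):
    """Compare the reported checkpoint count with what is actually on disk.

    Files written in the same second are treated as one disk write,
    so distinct modification times approximate the number of writes.
    """
    slot = Path(slot_dir)
    chk_files = sorted(slot.glob("chk_*"))
    write_times = {int(f.stat().st_mtime) for f in chk_files}

    count_file = slot / "boinc_checkpoint_count"  # assumed: one integer
    reported = None
    if count_file.exists():
        reported = int(count_file.read_text().strip())

    return {
        "reported_checkpoints": reported,
        "checkpoint_files": len(chk_files),
        "distinct_write_times": len(write_times),
    }
```

Running it every few minutes against a slot directory and logging the three numbers would show whether the reported count keeps climbing while the number of distinct write times stays low.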

I've got another docking task that says it's on model 258 after 22hrs (my preference is 24hrs), so its checkpoint count says 257, and I don't have any of the other files. Perhaps it is not taking checkpoints? The default.out file on that one is 43MB so far! Man, is *THAT* upload going to take a while!
ID: 4519
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4520 - Posted: 24 Jan 2009, 3:31:43 UTC

feet1st, others: could you consider running a test similar to the one I have described above?

Or maybe you would find some further/better tests? I guess there is no better way to see if the checkpointing is really working.

And I am awfully sorry for my subprime English ;) - it's 4:30 a.m. here and I am rushing to my bed. :)
ID: 4520
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4521 - Posted: 24 Jan 2009, 4:21:08 UTC
Last modified: 24 Jan 2009, 5:14:33 UTC

Graphics ...

Well, for me on the Mac Pro (Intel), trying the graphics got the system to start initializing and then die ... never got the graphics window up ... I don't have a Rosetta task in work at the moment, but I am pretty sure that with 1.47 the graphics do work on OS-X ... they do not for me with 1.52 (maybe I need 1.53?). Well, I will see when the other tasks become available and I can run them and try the graphics ...

Could it be because I am running GPU Grid on my other computers and the Mac is jealous?

{edit add}
It looks like the task that failed for graphics did complete...

But this task failed for unknown model name ...

Not sure why ... and I just caught it ...


Initialization complete.
Watchdog active.

ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish


ID: 4521
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4522 - Posted: 24 Jan 2009, 5:54:35 UTC
Last modified: 24 Jan 2009, 5:56:02 UTC

Yep, I was afraid of that...

1/23/2009 11:12:31 PM|ralph@home|Computation for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 finished
1/23/2009 11:12:31 PM|ralph@home|Output file 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0_0 for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 exceeds size limit.
1/23/2009 11:12:31 PM|ralph@home|File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes


1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0

The file was too large, so it calls it a compute error. Looks like it doesn't actually send the file when it grows too large. Too bad. It was 24 seconds short of 24hrs when it "failed", apparently writing the last model to the file.

Oh well... THIS is why I chose to run 24hr tasks on Ralph. To reveal these things. On Rosetta I do it because I like a tidy task list, and to minimize hits to the server.
ID: 4522
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4523 - Posted: 24 Jan 2009, 6:13:44 UTC

I just checked a 1.53 task and the graphics window came up and seemed to post updates ... too tired to look too hard at it (sorry), but it does look like what I saw with 1.52 is fixed in 1.53 (or was it the task?) ...
ID: 4523
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4524 - Posted: 24 Jan 2009, 6:16:04 UTC - in response to Message 4522.  

Yep, I was afraid of that...

1/23/2009 11:12:31 PM|ralph@home|Computation for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 finished
1/23/2009 11:12:31 PM|ralph@home|Output file 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0_0 for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 exceeds size limit.
1/23/2009 11:12:31 PM|ralph@home|File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes


1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0

The file was too large, so it calls it a compute error. Looks like it doesn't actually send the file when it grows too large. Too bad. It was 24 seconds short of 24hrs when it "failed", apparently writing the last model to the file.

Oh well... THIS is why I chose to run 24hr tasks on Ralph. To reveal these things. On Rosetta I do it because I like a tidy task list, and to minimize hits to the server.


Hmmm, shouldn't that be a test? When it writes to the file it should check to see if it is nearing the limit, and if so, stop ... I mean, 24 hours worth of work down the tubes other than we discovered the error ... (thank you) ...
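
Something like the following sketch is the kind of test meant here. It is only an illustration: `append_model` is a made-up helper, and the 25,000,000 byte default is simply the limit quoted in the log above, not a value read from the application.

```python
import os


def append_model(path, record, limit_bytes=25_000_000):
    """Append `record` (bytes) to the output file only if the file would
    stay within `limit_bytes`. Return True if the record was written,
    False if writing it would push the file over the limit."""
    current = os.path.getsize(path) if os.path.exists(path) else 0
    if current + len(record) > limit_bytes:
        return False
    with open(path, "ab") as out:
        out.write(record)
    return True
```

A caller that gets False back could stop generating models and report what it already has, instead of losing the whole task at upload time.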



ID: 4524
Ian_D
Joined: 16 Feb 06
Posts: 16
Credit: 39,518
RAC: 0
Message 4525 - Posted: 24 Jan 2009, 9:40:03 UTC

version 1.53

https://ralph.bakerlab.org/result.php?resultid=1263355

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-24 8:50:37:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip> <-d> <./>
Firstarg=true; pp=-d
first check it's not -dexdir
firstarg: <./>
End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _1CYNA_10_00001


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00572136 read attempt to address 0xAB9812D1

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 01/24/09 09:25:27
Install Directory : D:Program FilesBOINC
Data Directory : D:ProgramDataBOINC
Project Symstore : https://boinc.bakerlab.org/rosetta/symstore
Loaded Library : D:Program FilesBOINCdbghelp.dll
Loaded Library : D:Program FilesBOINCsymsrv.dll
Loaded Library : D:Program FilesBOINCsrcsrv.dll
LoadLibraryA( D:Program FilesBOINCversion.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: D:ProgramDataBOINCslots1;D:ProgramDataBOINCprojectsralph.bakerlab.org;srv*D:ProgramDataBOINCprojectsralph.bakerlab.orgsymbols*http://msdl.microsoft.com/download/symbols;srv*D:ProgramDataBOINCprojectsralph.bakerlab.orgsymbols*https://boinc.bakerlab.org/rosetta/symstore;srv*D:ProgramDataBOINCprojectsralph.bakerlab.orgsymbols*http://boinc.berkeley.edu/symstore
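
The exit code in the log above, -1073741819, and the 0xc0000005 printed next to it are the same 32-bit value read as signed and as unsigned. A quick sketch of the conversion (the helper names are made up for illustration):

```python
def as_unsigned32(code):
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def as_signed32(code):
    """Reinterpret a 32-bit value as a signed integer (two's complement)."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code & 0x80000000 else code


print(hex(as_unsigned32(-1073741819)))
print(as_signed32(0xC0000005))
```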



ID: 4525
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4526 - Posted: 24 Jan 2009, 10:17:07 UTC

...and now something completely different.

Mike, do you need more information about actual behaviour of your predicting models? Just to fine tune the energy function, folding procedure etc.?

One thing we all know is that some models of a particular WU take much more time than the others. It's clearly seen when we have "long running WUs"; however, it happens quite often as well in WUs which have very short models. In these cases you need to actually watch it or have good logs to see that, e.g., most of the models run in 5 minutes and one takes 15-20.

I have such a task right now: testC_cc_1_8_nocst4_hb_t288__IGNORE_THE_REST_2FNEA_2_6976_1_0.
For many minutes the accepted screen was showing a very simple chain split in two, a longer and a shorter part, and the shorter one was standing still, showing some quite folded structure. It took over 400,000 steps to get something semifolded and more complicated on the "accepted" screen and go further.
This model (the 12th) took much more time, and as the forecast runtime of a model increased, it was the last decoy allowed by the scheduler.

All of this looked like the procedure needed much more time to hit something acceptable to start with.

I'm sure you know about this issue but I have seen such a behaviour a couple times before and just wanted to let you know it is quite common.

P.S. Besides that, everything (except checkpointing, see my posts above) is working like a charm.

P.P.S. Great work, feet1st!
ID: 4526
Path7
Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4528 - Posted: 24 Jan 2009, 12:38:47 UTC - in response to Message 4516.  

Hello mtyka,

- Do the graphics behave properly again ?

After having problems with the graphics running minirosetta 1.52,
the combination of minirosetta 1.53 & minirosetta graphics 1.53 seems to work smoothly again.
Thanks and a good weekend,
Path7.
ID: 4528
ramostol
Joined: 29 Mar 07
Posts: 24
Credit: 31,121
RAC: 0
Message 4529 - Posted: 24 Jan 2009, 12:55:36 UTC

After Rosetta 1.47 totally crashing on my MacBook on 27 Dec 2008 (after one week of faultless computing, within hours after connecting to the Rosetta server; still haven't recovered) I have waited for an opportunity to test what is happening on Ralph. At last I received two 1.53-tasks - however, they show exactly the same symptoms as observed using 1.47 on Rosetta:

abinitio_norelax_homfrag_natfrag_129_B_1pxuA_SAVE_ALL_OUT_7037_1_0

<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

This one crashed after 30 sec.

----

abinitio_norelax_homfrag_natfrag_129_B_1ctf__SAVE_ALL_OUT_7037_1_0

<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>

This one is reported using Boinc 6.6.2 but computed with 6.6.1.

It started out relatively normally. But after about one hour’s computing time, with a steadily increasing "To completion" time and progress stuck at 0.240%, I had to conclude that this wu would turn into an everlasting task. "Show graphics" did not respond; even "Quit Boinc" was greyed out. After rebooting to install Boinc 6.6.2 and logging on as both an administrative and a non-administrative user, the wu restarted and quit after using 32 sec.


I should add that I mostly use this computer as a non-administrator, which may have some bearing (on my PPC it influences my graphics display at least). However, I do not receive more units to check, so I can not test the behaviour when changing configurations.

----

PS
For the curious:
This is the report from the first wu that crashed on Dec 27 2008 on Rosetta (I haven't had much time to analyze such data....). The error message is a bit more extensive than the messages for the rest of the bunch:

https://boinc.bakerlab.org/rosetta/result.php?resultid=216101560 (obsolete link)

Task ID: 216101560
Name: 1r9pA_BOINC_MPZN_vanilla_abrelax_5901_6648_0
Workunit: 196949149
Created: 21 Dec 2008 12:53:24 UTC
Sent: 21 Dec 2008 13:16:26 UTC
Received: 27 Dec 2008 14:15:44 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: 193 (0xc1)
Computer ID: 936507
Report deadline: 31 Dec 2008 13:16:26 UTC
CPU time: 20.36099
stderr out:
<core_client_version>6.5.0</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(69428,0xa031b720) malloc: *** error for object 0x1748220: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
minirosetta_1.47_i686-apple-darwin(69428,0xa031b720) malloc: *** error for object 0x1748220: incorrect checksum for freed object - object was probably modified after being freed.

*** set a breakpoint in malloc_error_break to debug
SIGBUS: bus error

Crashed executable name: minirosetta_1.47_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.5.6 build 9G55
Sat Dec 27 14:45:29 2008

ID: 4529

©2024 University of Washington
http://www.bakerlab.org