minirosetta v1.48-1.51 bug thread

Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread

Snagletooth

Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 4504 - Posted: 23 Jan 2009, 20:46:27 UTC
Last modified: 23 Jan 2009, 20:47:39 UTC

Hey, Mike, thanks for all the info you've been giving us lately. It's interesting and certainly provides a bit of extra motivation to keep crunching and reporting what we see at our end.

Snags
franfranlatulipe94@hotmail.com
Joined: 12 Dec 08
Posts: 1
Credit: 447
RAC: 0
Message 4505 - Posted: 23 Jan 2009, 21:22:42 UTC

Hi,
I'm running RALPH with Ubuntu Jaunty up to date.
My BOINC version is 6.2.18.
When I click "Show graphics" (or something like that; my client is in French), I see a small window, like the one in the XP screenshot in this post, but all my memory (1.5 Gio) and all my swap (2.0 Gio) fill up…
If I don't click "Show graphics", there is no visible problem.

Rosetta mini 1.52
http://francois.linkmauve.fr/boinc.png
The screenshot shows the state after the window was killed.
If there are questions, ask me…
Snagletooth
Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 4506 - Posted: 23 Jan 2009, 22:12:04 UTC - in response to Message 4505.  

I'm seeing the same or very similar behavior on the same type of work unit, lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909_1, running Leopard on an Intel Mac. There is no CPU use but tons of memory use, and I don't even get the window, so there is nothing to show in a screen capture. Activity Monitor says the process has hung, and I killed it from there without trouble.

Snags
tralala
Joined: 12 Apr 06
Posts: 52
Credit: 15,257
RAC: 0
Message 4507 - Posted: 23 Jan 2009, 22:12:53 UTC

Mtyka,

can you disable graphics for some WUs and see if the error rate goes down? Might help turning c)'s into b)'s.
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4508 - Posted: 23 Jan 2009, 23:45:17 UTC

FYI, "Gio" is the French abbreviation for gibioctet, i.e. "GiB" (roughly gigabytes).
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4509 - Posted: 24 Jan 2009, 0:17:01 UTC
Last modified: 24 Jan 2009, 0:38:32 UTC

I'm showing the same or very similar behavior on the same type of workunit:lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909


I'm also getting problems on several work units whose names start with test_B_cc ....
I get either a hung, empty screen or a runtime error.

Edit: The graphics problem doesn't affect the progress of the work unit.
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4510 - Posted: 24 Jan 2009, 0:17:28 UTC

Hi Mike, Hi Fellows!

I'm just running Mini 1.52, this task.

Actually, I only noticed that the RALPH WU had started because of an error popup: BOINC wanted to turn the screensaver on, and I got a Windows message about a runtime error and the sudden termination of the 1.40 graphics viewer. The process was killed by the system.

So far the WU seems to be continuing; I'm at 3.84% (0:11:35) and running.
Path7
Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4511 - Posted: 24 Jan 2009, 0:25:28 UTC
Last modified: 24 Jan 2009, 0:32:34 UTC

Hello all,

While crunching:
lr6_D_score12_rlbn_1ynv_IGNORE_THE_REST_NATIVE_NOCON_6909_1_0 (minirosetta 1.52)
I hit the “Show graphics” button, and Windows XP replied with an error message: Runtime error.
Short error report:
Hung application: minirosetta_graphics_1.40_windows_intelx86.exe, version: 0.0.0.0, hung module: hungapp, version: 0.0.0.0, hung at: 0x00000000.

Retrying “Show graphics” opened the graphics window, but it was empty.
At 16 minutes of CPU time the CPU time counter stood still; CPU usage = 0%.
I stopped and restarted BOINC, and the WU resumed from where it had stopped.

I tried the “Show graphics” button again, and the above repeated.
Now the WU has started for the third time and keeps on running. I don't dare to touch the “Show graphics” button again!

Windows XP SP3, single-core AMD Sempron 3000+ 1.8 GHz, 1.5 GB RAM, BOINC 5.10.45.

Path7.
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4512 - Posted: 24 Jan 2009, 0:27:20 UTC

This one failed (1st time)
1259966

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044EFF4 read attempt to address 0x01062000

Engaging BOINC Windows Runtime Debugger...
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4514 - Posted: 24 Jan 2009, 0:49:24 UTC - in response to Message 4502.  

I have noticed, though, that some of the problems seem to be highly machine-specific. One user will always produce one particular kind of error and nothing else. Weird. Something strange about their setup? No idea.


Well, then it is time to start asking that user ... :)

I know that some will OC their machine to the edge of instability and fail to recognize the implications of that ...

Other cases can be because they are running other applications / projects and that might be the cause ... like running GPU Grid may raise the temperature enough at some times to cause the instability, but not other times ... So, it may not be the problem but the specific hardware ...

I remember once where I reported a problem where the device we were testing would cause a tape read error ... impossible the engineers said ... I showed them ... still impossible ... not really ... the compiler was putting bad instructions on the tape and under an error condition the tape could not be read ...

Anyway ...

Just trying to stimulate the brain cells with some brain storming ... I do appreciate the feedback though ... and I got another task and it should be back in a little bit ... and yes, fascinated in what we as a collective are trying to do ...

Oh, and about the same RND seed: unless the machines are of the same specific type, with identical CPU and OS, the random number generator may or may not produce the same sequence of random numbers, especially if the core of the generator relies on the noise at the end of floating-point numbers ... Virtual Prairie is having this exact problem with their work ...
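
To illustrate the point about floating-point noise, here is a minimal Python sketch. It does not reproduce Rosetta's generator, which this thread does not describe; it only assumes the general mechanism hinted at above. A seeded generator is deterministic on its own, but a floating-point result can differ in its last bits depending on the order of evaluation, and anything derived from those bits can then diverge between machines. The function name is hypothetical.

```python
import random


def same_seed_same_sequence(seed, n=5):
    """Two generators built from the same seed yield the same numbers."""
    a = random.Random(seed)
    b = random.Random(seed)
    return [a.random() for _ in range(n)] == [b.random() for _ in range(n)]


# The same three numbers, summed in a different order, differ in the
# last bits. A compiler or CPU that reorders such operations can
# therefore change the low-order "noise" of a result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
low_bits_differ = left != right
```

If a generator or an acceptance test depends on those last bits, two hosts can drift apart even when they start from the same seed.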
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4515 - Posted: 24 Jan 2009, 2:13:24 UTC
Last modified: 24 Jan 2009, 2:25:53 UTC

O.K., I got the new 1.53 version. As soon as the 1.52 task finishes we will see how it behaves. :)

EDIT: So far O.K., and the graphics is working...
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4517 - Posted: 24 Jan 2009, 2:43:05 UTC - in response to Message 4516.  
Last modified: 24 Jan 2009, 3:24:16 UTC

Hi,

currently I'm running sr213_t077_1_NMR_NESG_SAVE_ALL_OUT_6972_24_0.

- Do the graphics behave properly again ?

Yes, it is working - however there was a funny thing.
This WU has a native protein and RMSD, but at one point the RMSD part of the graphics (the left/right display of how good the RMSD is) was not working. It was just blank.
The low-energy part was O.K.

After closing the graphics and reopening them a minute or two later, everything was just fine.

Regarding checkpointing - according to boinccmd.exe it does make checkpoints every 100+ - 200+ seconds. It is quite close to my preferences (120 sec).

EDIT: I have made an experiment to check the checkpointing and I have turned off the client and afterwards turned it back on.

The task resumed from just over 26 minutes (and according to boinccmd the checkpoint was at 1591 seconds, so that fits). However, the WU started from Model 1, step 0, and moved forward from there.
On the other hand, the WU quickly went into SmoothFragmentMover_GunnCost and it didn't look that bad, so I am puzzled: I am not sure whether it really started from the beginning, or whether the checkpointing worked but the step information is misleading.
I think this must be checked at some later step, when the protein is really pretty and the difference is easy to tell, but now I must go to sleep. :D


EDIT 2: I have repeated my experiment and I am pretty sure the checkpointing is not working correctly.

I've stopped the WU after 46 minutes (2524 seconds) when it was quite beautiful (minus three hundred something energy etc.) and restarted.

Yet again it started from step 0 and then it started to behave in a strange manner. There was no really folded protein, only one straight chain - but the program was clearly trying to use methods proper in the last stage of the process (namely MoverBase+Minimization). In a minute or so the chain was bent in something like two points and it got "fractalized" by the last step of the procedure (when you have all these small strings moving to get the lowest possible energy).

If you want, I can provide you with a snapshot (printscreen of the graphics).

Everything ended after circa 200 seconds (for a couple of seconds there was "stage: unknown") and got reported.

Enjoy the read.

Interesting thing: it reduced the granted credit; claimed was 7.06, granted: 4.68.

Well, one must suffer for the science. =) And now let me get some sleep...


P.S. The only thing that interfered with the experiment was that the second time I had to suspend a regular Rosetta task (I think I had suspended Rosetta, but this time the manager didn't remember that after BOINC restarted, huh). I don't know if it could have interfered; if it did, it would mean that some variables in use are not being reset.

If it were not for the experiment, I am pretty sure the task would run alright.
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4518 - Posted: 24 Jan 2009, 2:53:05 UTC

The checkpoints seem more in line with my settings as of 1.52.

I'm a bit unclear as to what exactly to expect now. Sounds like if I set my preference to "write at most..." 600 seconds (10min) that I should expect the following...

I'll see checkpoint debug messages in my messages tab, but it may be that no data was written to disk?

And I'll see for a given task, within a given model, that it won't write more often than every 10min?

...except if it reaches the end of a model, at which point it will write, regardless of how recently the last checkpoint was??

So, hypothetical example:

Time (mm:ss)  Event
00:00         Model start
01:15         checkpoint (but not written)
02:45         checkpoint (but not written)
05:15         checkpoint (but not written)
09:15         checkpoint (but not written)
11:30         checkpoint, all of the above written to disk
12:10         checkpoint (but not written)
13:40         checkpoint (but not written)
14:25         model completed, write to disk, even though only 3min since last write.
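
The timeline above can be sketched in Python as a small simulation. This only models the behaviour guessed at in this post and is not the actual minirosetta checkpoint code. It assumes a write happens when at least the "write at most" interval has passed since the last write, or when a model ends. The names `mmss`, `simulate`, and `EVENTS` are hypothetical.

```python
def mmss(text):
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def simulate(events, min_write_interval):
    """Return (time, kind, action) for each checkpoint event.

    events: list of (seconds, kind), kind is 'checkpoint' or 'model_end'.
    Assumed policy: buffer checkpoints, and write to disk once
    min_write_interval seconds have passed since the last write,
    or unconditionally at the end of a model.
    """
    last_write = 0  # the model starts at 00:00
    pending = 0     # checkpoints buffered since the last write
    log = []
    for t, kind in events:
        pending += 1
        if kind == "model_end" or t - last_write >= min_write_interval:
            log.append((t, kind, f"written ({pending} buffered)"))
            last_write = t
            pending = 0
        else:
            log.append((t, kind, "buffered"))
    return log


# The hypothetical timeline from the post, with a 600 s preference.
EVENTS = [(mmss(t), "checkpoint") for t in
          ["01:15", "02:45", "05:15", "09:15", "11:30", "12:10", "13:40"]]
EVENTS.append((mmss("14:25"), "model_end"))
```

Calling `simulate(EVENTS, 600)` marks only the 11:30 checkpoint and the 14:25 model end as written, which matches the table. Whether two tasks on separate cores share one timer is not something this sketch can answer.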

I think the above is what I should expect. Now... I've got a P4 running with HT, so 2 virtual cores... should I expect the tasks to behave independently, i.e. each on its own 10 min "write at most" timer? Or should I expect them both to buffer for 10 min unless they reach a model end? And if a model end is reached on one, will that cause the buffers for the other to be written as well?

Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes?
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4519 - Posted: 24 Jan 2009, 3:03:50 UTC
Last modified: 24 Jan 2009, 3:09:27 UTC

Anyone have any good ideas on how to track over time the number of checkpoint messages, as compared to the number of disk writes?


OH! Is that what all these new files are??
chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.out
chk_S_1LARA_1_00000009_ClassicAbinitio___lc_3.rng.state
chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.out
chk_S_1LARA_1_00000010_ClassicAbinitio__stage_1.rng.state

In fact I see many files now in my slot directory. They seem to come in blocks of 4, all written with the same timestamp. But the names vary, probably depending on which stage of the model the checkpoint was taken in.

Ok, yes, my "boinc_checkpoint_count" file says 92, but I've got nowhere near that many files, and even fewer distinct times at which files were written. Yeah, I've only got 14 different timestamps. So, that sounds good.
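
The manual comparison above can be scripted. The following Python sketch groups the `chk_*` files in a slot directory by modification time and reads the count file. It assumes that `boinc_checkpoint_count` holds a single integer, which this thread does not confirm, and the function name `summarize_checkpoints` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def summarize_checkpoints(slot_dir):
    """Return (reported_count, distinct_write_times, chk_file_count).

    reported_count is None if the count file is missing or is not a
    plain integer (its format is an assumption here).
    """
    slot = Path(slot_dir)
    by_time = defaultdict(list)
    for f in slot.glob("chk_*"):
        # Files written in the same second are treated as one disk write.
        by_time[int(f.stat().st_mtime)].append(f.name)

    reported = None
    count_file = slot / "boinc_checkpoint_count"
    if count_file.exists():
        text = count_file.read_text().strip()
        if text.isdigit():
            reported = int(text)

    total_files = sum(len(names) for names in by_time.values())
    return reported, len(by_time), total_files
```

Run against a slot directory, a result such as `(92, 14, ...)` would correspond to the numbers quoted above: 92 checkpoint messages but only 14 distinct write times.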

I've got another docking task that says it's on model 258 after 22hrs (my preference is 24hrs), so its checkpoint count says 257, and I don't have any of the other files. Perhaps it is not taking checkpoints? The default.out file on that one is 43MB so far! Man, is *THAT* upload going to take a while!
Aegis Maelstrom
Joined: 19 Jan 09
Posts: 12
Credit: 4,751
RAC: 0
Message 4520 - Posted: 24 Jan 2009, 3:31:43 UTC

feet1st, others: could you consider running a test similar to the one I described above?

Or maybe you can come up with some further/better tests? I guess there is no better way to see if the checkpointing is really working.

And I am awfully sorry for my subprime English ;) - it's 4:30 a.m. here and I am rushing to my bed. :)
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4521 - Posted: 24 Jan 2009, 4:21:08 UTC
Last modified: 24 Jan 2009, 5:14:33 UTC

Graphics ...

Well, for me on the Mac Pro (Intel), trying the graphics made it start to initialize and then die ... the graphics window never came up ... I don't have a Rosetta task in progress at the moment, but I am pretty sure that the graphics do work on OS X with 1.47 ... they do not for me with 1.52 (maybe I need 1.53?). Well, I will see when the other tasks become available and I can run them and try the graphics ...

Could it be because I am running GPU Grid on my other computers and the Mac is jealous?

{edit add}
It looks like the task that failed for graphics did complete...

But this task failed for unknown model name ...

Not sure why ... and I just caught it ...


Initialization complete.
Watchdog active.

ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish


feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4522 - Posted: 24 Jan 2009, 5:54:35 UTC
Last modified: 24 Jan 2009, 5:56:02 UTC

Yep, I was afraid of that...

1/23/2009 11:12:31 PM|ralph@home|Computation for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 finished
1/23/2009 11:12:31 PM|ralph@home|Output file 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0_0 for task 1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0 exceeds size limit.
1/23/2009 11:12:31 PM|ralph@home|File size: 30524004.000000 bytes. Limit: 25000000.000000 bytes


1BJ1.mrtmdock.pdb_simpledocking.xml_3_8_6906_1_0

The file was too large, so it calls it a compute error. Looks like it doesn't actually send the file when it grows too large. Too bad. It was 24 seconds short of 24hrs when it "failed", apparently writing the last model to the file.

Oh well... THIS is why I chose to run 24hr tasks on Ralph. To reveal these things. On Rosetta I do it because I like a tidy task list, and to minimize hits to the server.
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4523 - Posted: 24 Jan 2009, 6:13:44 UTC

I just checked a 1.53 task and the graphics window came up and seemed to post updates ... too tired to look too hard at it (sorry), but it does look like what I saw with 1.52 is fixed in 1.53 (or was it the task?) ...
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4524 - Posted: 24 Jan 2009, 6:16:04 UTC - in response to Message 4522.  

Hmmm, shouldn't there be a test for that? When it writes to the file it should check whether it is nearing the limit, and if so, stop ... I mean, that's 24 hours' worth of work down the tubes, other than that we discovered the error ... (thank you) ...
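
The kind of check suggested here could look like the Python sketch below. It is not how minirosetta writes its output; it only illustrates refusing an append that would push a file past a limit. The 25,000,000-byte default comes from the client log in Message 4522, and the function name `append_if_room` is hypothetical.

```python
import os

# Limit reported by the BOINC client log in Message 4522.
SIZE_LIMIT_BYTES = 25_000_000


def append_if_room(path, data, limit=SIZE_LIMIT_BYTES):
    """Append data (bytes) to path only if the file stays within limit.

    Returns True if the data was written, False if it was refused.
    The caller can then stop producing models and report what it has.
    """
    current = os.path.getsize(path) if os.path.exists(path) else 0
    if current + len(data) > limit:
        return False
    with open(path, "ab") as f:
        f.write(data)
    return True
```

With a guard like this, the task in question would have stopped near 25,000,000 bytes instead of growing to 30,524,004 bytes, about 22% over the limit, and having the whole result rejected.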



Ian_D
Joined: 16 Feb 06
Posts: 16
Credit: 39,518
RAC: 0
Message 4525 - Posted: 24 Jan 2009, 9:40:03 UTC

version 1.53

https://ralph.bakerlab.org/result.php?resultid=1263355

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-24 8:50:37:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip> <-d> <./>
Firstarg=true; pp=-d
first check it's not -dexdir
firstarg: <./>
End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/testC_cc2_1_8_mammoth_mix_cen_cst.foldcst_chunk_general_mammoth_cst.t290_.mtyka.boinc_files.zip>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _1CYNA_10_00001


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00572136 read attempt to address 0xAB9812D1

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 01/24/09 09:25:27
Install Directory : D:\Program Files\BOINC\
Data Directory : D:\ProgramData\BOINC
Project Symstore : https://boinc.bakerlab.org/rosetta/symstore
Loaded Library : D:\Program Files\BOINC\dbghelp.dll
Loaded Library : D:\Program Files\BOINC\symsrv.dll
Loaded Library : D:\Program Files\BOINC\srcsrv.dll
LoadLibraryA( D:\Program Files\BOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: D:\ProgramData\BOINC\slots\1;D:\ProgramData\BOINC\projects\ralph.bakerlab.org;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*http://msdl.microsoft.com/download/symbols;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*https://boinc.bakerlab.org/rosetta/symstore;srv*D:\ProgramData\BOINC\projects\ralph.bakerlab.org\symbols*http://boinc.berkeley.edu/symstore
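
For anyone wondering how the exit code -1073741819 in the message above relates to 0xc0000005: the first is the signed 32-bit reading of the second, which is the Windows status code for an access violation. A small Python sketch of the conversion follows; the function names are hypothetical.

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    return value - 0x100000000 if value & 0x80000000 else value


# The exit code reported for this result, shown as a hex string.
status_hex = hex(to_ntstatus(-1073741819))
```

The same conversion works for any negative exit code that BOINC reports on Windows, which makes it easier to search for the hex form of the status.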





©2024 University of Washington
http://www.bakerlab.org