minirosetta v1.48-1.51 bug thread

Ian_D
Joined: 16 Feb 06
Posts: 16
Credit: 39,518
RAC: 0
Message 4478 - Posted: 22 Jan 2009, 9:30:50 UTC
Last modified: 22 Jan 2009, 9:32:15 UTC

Version 1.51

https://ralph.bakerlab.org/result.php?resultid=1258153


<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 9:15:49:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0081E942 read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...
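
The exit code and the exception code in the log above are the same 32-bit value written two ways: the signed decimal -1073741819 and the unsigned hex 0xc0000005 (the Windows access-violation status). A minimal sketch of the conversion follows; the helper name `to_status_hex` is made up for illustration and is not part of BOINC.

```python
def to_status_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status code."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit equivalents.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(to_status_hex(-1073741819))  # 0xc0000005
```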
ID: 4478
I _ quit

Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4479 - Posted: 22 Jan 2009, 9:37:42 UTC - in response to Message 4477.  
Last modified: 22 Jan 2009, 9:40:37 UTC

I don't see that in my boinc manager


You can find it in your slots directory. (v6.2.19) You will find the boinc_checkpoint_count file. Open it and you will see how many checkpoints you have. Also there is a list of the checkpoints.

My work unit has been running for about 34 minutes and I have accumulated 43 checkpoints.
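
For anyone who would rather not open each slot by hand, here is a rough sketch that collects every `boinc_checkpoint_count` file under a slots directory. The function name is hypothetical, the location of the slots directory depends on the installation, and the file's internal format is not documented in this thread, so the sketch only returns the raw text.

```python
from pathlib import Path


def read_checkpoint_files(slots_root):
    """Return {slot directory name: raw file text} for each slot that
    contains a boinc_checkpoint_count file."""
    results = {}
    # One level of subdirectories (slots/0, slots/1, ...) is assumed.
    for path in sorted(Path(slots_root).glob("*/boinc_checkpoint_count")):
        results[path.parent.name] = path.read_text(errors="replace")
    return results


# Usage (the path is an assumption; point it at your own BOINC data directory):
# for slot, text in read_checkpoint_files("/var/lib/boinc/slots").items():
#     print(slot, text)
```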



thanks for the info evan.
found the file, in 3 hrs+ run time one task accumulated 561 so far and the other 305. that's a pretty healthy number for 3 hrs. feet1st looks easy enough. i may try that later. thanks

mtyka - now into the 1.50 tasks and so far no errors on any of the tasks sent to my system.
ID: 4479
HA-SOFT, s.r.o.

Joined: 19 Jan 09
Posts: 6
Credit: 19,644
RAC: 0
Message 4480 - Posted: 22 Jan 2009, 9:54:07 UTC - in response to Message 4473.  
Last modified: 22 Jan 2009, 10:24:16 UTC

Tasks cannot be suspended; BOINC can do nothing with the process. After a few days I have about 10-15 dead rosetta tasks with 3M RAM allocated.

If I don't kill the app, it runs until the PC restarts (on a server, usually about 30 days, until an MS security update and restart).

Could DEP being turned on be the problem?

Ver 1.51:

BOINC:: Initializing ... ok.
[2009- 1-22 11: 0:26:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.


Unhandled Exception Detected...

- Unhandled Exception Record -

In the graphics application I got:

Starting graphics application...
Setting window title to 'minirosetta version 1.51 [workunit: lr6_D_score12_rlbd_2ccv_IGNORE_THE_REST_DECOY_6840_2]'.
OpenSemaphore failure
Successfully loaded '../../projects/ralph.bakerlab.org/Helvetica.txf'...
Close event (shmem not updated) detected, shutting down.
Shutting down graphics application...

but I'm running BOINC as a service (and connecting over RDP), and this may be the problem for graphics.
ID: 4480
Evan

Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4481 - Posted: 22 Jan 2009, 10:47:12 UTC

This one 1256978 has failed for the second time.

Maybe the same problem as reported by Zdenek Vasku

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 10:30:57:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.
Watchdog active.

ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish
ID: 4481
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4482 - Posted: 22 Jan 2009, 14:14:35 UTC

This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2. Step 900,000! Its peak memory usage so far was 430MB.

Probably what you were expecting for such a large protein, but it definitely needs a high memory flag.

I believe the memory usage of the tasks is reported back with scheduler requests. Does the project process this data and query for anomalies in memory usage? Or do you need us to report such things?
ID: 4482
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4483 - Posted: 22 Jan 2009, 14:22:56 UTC

This task completed my 24hr runtime preference on v1.48. But it reported 99 starting structures (not the usual "1") and 98 decoys. So, what happened to the last one?
ID: 4483
Evan

Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4484 - Posted: 22 Jan 2009, 17:36:44 UTC

Add these to the mammoth second time failures:

1257164
1257066
1257064
ID: 4484
robertmiles
Joined: 13 Jan 09
Posts: 103
Credit: 331,865
RAC: 0
Message 4488 - Posted: 22 Jan 2009, 21:46:42 UTC

Should the name of this thread be edited to include 1.51, or should a new thread be started for 1.51?
ID: 4488
robertmiles
Joined: 13 Jan 09
Posts: 103
Credit: 331,865
RAC: 0
Message 4489 - Posted: 22 Jan 2009, 21:53:35 UTC

Both copies of this test_cc 1.50 workunit failed quickly:

https://ralph.bakerlab.org/workunit.php?wuid=1105459
ID: 4489
Evan

Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4490 - Posted: 22 Jan 2009, 22:10:46 UTC

The mammoths have not been having a happy time. I have had a run of 23 failures (1st time) including:


unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0081E942 read attempt to address 0x00000000

and

ERROR: unknown model name: 1B0NA_10
ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish




ID: 4490
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4491 - Posted: 22 Jan 2009, 22:11:54 UTC - in response to Message 4486.  


> Step 900,000!

Step 900000? What does that mean? "Step"??


That was the step shown in the graphic. (to the right of the model number). I've just never seen such a large step number.
ID: 4491
Snagletooth

Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 4492 - Posted: 22 Jan 2009, 23:02:02 UTC

One more with v1.50: test_cc_1_8_nocst4_hb_t367__IGNORE_THE_REST_1UFBA_5_6830_3

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Watchdog active.
# cpu_run_time_pref: 14400

ERROR: target_strands.size()
ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94
called boinc_finish



And one with v1.51: test_cc2_1_8_mammoth_mix_cen_cst_hb_t311__IGNORE_THE_REST_1PERL_6_6852_1

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 11:56:36:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.
Watchdog active.

ERROR: unknown model name: 1B0NA_10
ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish

Snags
ID: 4492
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4493 - Posted: 23 Jan 2009, 1:04:16 UTC
Last modified: 23 Jan 2009, 1:13:19 UTC

I tried loading up the docking task I got. It is 1.51. Displayed graphic just as the task was starting. Waited, waited... finally realized it was using more CPU than the thread working on the protein! Double checked Ralph settings for % of CPU for the graphic, set to default which is 10%.

Here's my screenshot showing the graphic monopolizing one core, while the two running tasks are competing for the other. Net result, nothing shown in the graphic after several minutes, and graphic thread consuming much more than 10% of CPU.



[edit]
I gave up on it, captured the screen, uploaded the screenshot, reported it here... then when I opened the graphic a second time it was better behaved. Not overusing CPU... but was essentially unusable. Go to resize or rotate the images and it wouldn't respond for about 30 seconds each time.
ID: 4493
Ian_D
Joined: 16 Feb 06
Posts: 16
Credit: 39,518
RAC: 0
Message 4495 - Posted: 23 Jan 2009, 8:31:33 UTC
Last modified: 23 Jan 2009, 8:32:55 UTC

application version 1.51

https://ralph.bakerlab.org/result.php?resultid=1258676

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 19: 2:51:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.
Watchdog active.

ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish

</stderr_txt>
]]>
ID: 4495
Paul D. Buck
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4496 - Posted: 23 Jan 2009, 11:25:18 UTC - in response to Message 4485.  

Otherwise this release has added debug information to let me figure out where all this stuff is failing. Believe me guys, we're now in the land where I cannot reproduce these errors here whatsoever. Not on the Linux boxes, Mac boxes or Windows boxes we have. Nowhere. Why these remaining segfaults occur is a total mystery to me, so please bear with me. This is going to be incredibly difficult to track down.

Thanks for all your help! Every post is super useful to us!!


Questions I ask myself,

Are they happening in the same module?
Are they happening because of the same type of activity?
What is common to all the events?
Could it be something external?
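
One way to start on "what is common to all the events" is to reduce each pasted stderr to a signature (error kind, source file, line) and count them. The sketch below is only an illustration: the function names are invented here, the sample strings are copied from reports in this thread, and the regular expressions assume the `ERROR:` / `ERROR:: Exit from:` layout seen in those reports. Crashes that only show an access violation carry no such lines and are skipped.

```python
import re
from collections import Counter

# Assumed layout, taken from the stderr excerpts posted in this thread.
ERROR_RE = re.compile(r"^ERROR: (.+)$", re.MULTILINE)
EXIT_RE = re.compile(r"^ERROR:: Exit from: (\S+) line: (\d+)", re.MULTILINE)


def signature(stderr_text):
    """Return (error kind, source file name, line number), or None."""
    msg = ERROR_RE.search(stderr_text)
    loc = EXIT_RE.search(stderr_text)
    if msg is None or loc is None:
        return None
    # Drop the trailing detail ("unknown model name: 1DK8A_1" -> "unknown model name").
    kind = msg.group(1).strip().rsplit(": ", 1)[0]
    # Keep only the file name so Windows and Unix build paths group together.
    source_file = loc.group(1).rsplit("/", 1)[-1]
    return (kind, source_file, int(loc.group(2)))


def group_failures(reports):
    """Count how many reports share each signature."""
    return Counter(sig for sig in map(signature, reports) if sig is not None)


REPORTS = [
    "ERROR: unknown model name: 1DK8A_1\n"
    "ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170\n",
    "ERROR: unknown model name: 1B0NA_10\n"
    "ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170\n",
    "ERROR: target_strands.size()\n"
    "ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94\n",
]

for sig, count in group_failures(REPORTS).most_common():
    print(count, sig)
```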

I know BOINC is supposed to be keeping things isolated and that application A is not supposed to affect application B ... but I have seen enough odd things that I am not convinced that this is completely so ...

As to the tasks that fail ... you say they don't fail for you ...

Why not grab the set and issue them to us with high replication counts ... then if they fail for everyone, that tells us one thing ... if they only fail for some and not others ... that tells us something else ...

To my mind, the cases that fail are the ones that you should be saving and using as your issue tasks for each round of testing. Not to keep making new tasks in all cases, but to use those tasks that have proven their ability to cause a problem.
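
The replication idea boils down to a three-way decision once the results come back. A minimal sketch, with a hypothetical function name and labels, taking a mapping from host to whether the task failed there:

```python
def classify(outcomes):
    """outcomes: mapping of host id -> True if the reissued task failed there.

    Returns a coarse label for where to look next.
    """
    if not outcomes:
        raise ValueError("need at least one result")
    failures = sum(1 for failed in outcomes.values() if failed)
    if failures == len(outcomes):
        return "task-or-app"      # fails for everyone: suspect the task or the application
    if failures == 0:
        return "not-reproduced"   # nobody saw it: the failure did not reproduce
    return "host-specific"        # fails for some hosts only: suspect something on those hosts


# Host names below are placeholders, not real hosts from this project.
print(classify({"host_a": True, "host_b": False, "host_c": True}))  # host-specific
```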

Just some things to consider ... oh, and you are out of work again ...
ID: 4496
HA-SOFT, s.r.o.

Joined: 19 Jan 09
Posts: 6
Credit: 19,644
RAC: 0
Message 4497 - Posted: 23 Jan 2009, 12:56:15 UTC - in response to Message 4487.  

DEP (Data Execution Prevention) - execution protection on new Intel and AMD cores.

How can I turn graphics off?


ID: 4497
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4498 - Posted: 23 Jan 2009, 15:06:37 UTC
Last modified: 23 Jan 2009, 15:18:05 UTC

v1.51

ERROR: unknown model name: 2FRHA_10
ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish

on task
test_cc2_1_8_mammoth_mix_cen_cst_hb_t327__IGNORE_THE_REST_2F2EA_7_6860_1_1
ID: 4498
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4499 - Posted: 23 Jan 2009, 15:16:37 UTC - in response to Message 4494.  



Feet1st, this is awesome - debugging on an unprecedented level :) Nice to get an idea of what all this looks like from your point of view.

These docking tasks are new and not mine - lemme track down the person submitting these and make sure the graphics app can deal with it.

Mike


I've seen a number of other reports of the screen saver/graphic just displaying as a black window. And I had always assumed people just weren't waiting long enough for the display to refresh. It does often take a long time, and I always assumed this was due to the allowed % of CPU time for the graphic. But, on the other hand, this wasn't an issue before about 2 or 3 months ago.

I meant to point out that the screenshot shows the graphic thread has used 4:02 of CPU time, but the corresponding Ralph thread is shown with only 2:11 of CPU received so far. So the % CPU shown in the screenshot is only the last interval, but you can see from the totals that the number is roughly what it's been during the entire 4 minutes the task has been running.
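
To put numbers on those two totals, here is a small sketch of the arithmetic. How the 10% preference is measured is not stated in this thread, so the comparison is only indicative.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' CPU time string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


graphics = to_seconds("4:02")  # CPU time used by the graphic thread
worker = to_seconds("2:11")    # CPU time received by the Ralph task

ratio = graphics / worker
share = graphics / (graphics + worker)

print(f"graphic used {ratio:.2f}x the CPU time of the task")   # 1.85x
print(f"graphic share of the combined CPU time: {share:.1%}")  # 64.9%
```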

I was wondering if perhaps the graphic could always display at least its grid lines and text immediately, and perhaps a "protein information being retrieved... please wait" message in the frames. That way at least it would never just be "blank".
ID: 4499
I _ quit

Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4500 - Posted: 23 Jan 2009, 16:15:26 UTC - in response to Message 4499.  




Cool idea...I would love to see that as well.
ID: 4500
Snagletooth

Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 4501 - Posted: 23 Jan 2009, 19:42:23 UTC

test_cc_1_8_nocst4_hb_t327__IGNORE_THE_REST_2FSWA_6_6888_1
and test_cc_1_8_nocst4_hb_t327__IGNORE_THE_REST_2F2EA_10_6888_1_1

both ended with:

ERROR: unknown model name: 2FRHA_10
ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170

This time they ran a few minutes instead of a few seconds, claimed to be done instead of declaring a compute error and received a validate error instead of a client error.


Snags

ID: 4501


©2024 University of Washington
http://www.bakerlab.org