minirosetta v1.48-1.51 bug thread


I _ quit

Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4469 - Posted: 21 Jan 2009, 14:35:22 UTC

every task given to me so far has completed ok.
I see there are no more jobs in the queue, so i take it this test is coming to an end?
ID: 4469
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4470 - Posted: 21 Jan 2009, 15:55:30 UTC - in response to Message 4469.  

> so i take it this test is coming to an end?


Don't rush it! There were still 1000 failures and 3000 successes. I'm not sure what time the last of the DB packaging problems was cleared through. But you've barely given any time for a 24hr runtime to complete.

Greg, typically they release a few tasks. If there are no clear problems like the DB packaging, and successes start coming in, then they release a few thousand tasks over the course of a day or so. Then, when they've made the final adjustments, explained most of the reported errors and confirmed the results, they do a final push of 10,000+ tasks (again over the course of a day or two) to really seek out those rare and intermittent problems. THEN they send it over to Rosetta.
ID: 4470
I _ quit
Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4471 - Posted: 21 Jan 2009, 16:19:36 UTC - in response to Message 4470.  

ok...thanks for the explanation, just not sure what to expect here since this is the first time for me on ralph.
ID: 4471
Snagletooth
Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 4472 - Posted: 21 Jan 2009, 17:33:55 UTC

test_cc_1_8_nocst4_hb_t367__IGNORE_THE_REST_1WOLA_4_6830_2_0 ended with the same rare error as Ian D and Paul D Buck.

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Watchdog active.
# cpu_run_time_pref: 14400

ERROR: target_strands.size()
ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94
called boinc_finish

Snags

ID: 4472
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4473 - Posted: 21 Jan 2009, 17:58:14 UTC - in response to Message 4466.  

> I still have problems on my new W2008 X64 server.
> Every task of 1.5 minirosetta hangs at startup with 3MB memory and stdout:
>
> [2009- 1-21 9:52:36:] :: BOINC :: boinc_init()
> Created shared memory segment
>
> These tasks hang and I have to kill them from taskbar. After killing, stderr is:
>
> Unhandled Exception Detected...
>
> - Unhandled Exception Record -
> Reason: Access Violation (0xc0000005) at address 0x778806CF read attempt to address 0x00000004
>
> Engaging BOINC Windows Runtime Debugger...

How do you know it *hangs*? Rosetta will not print anything to stdout - that's normal. Are the graphics moving? What happens if you just let it run for a few hours?

Mike

ID: 4473
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4474 - Posted: 21 Jan 2009, 22:14:45 UTC - in response to Message 4462.  

> More checkpointing is great! But... this is a bit extreme. My "write to disk at MOST every..." setting is at 1800 seconds. My hard drive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace.
>
> 1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
> 1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result
>
> Hmm ok, i'll look into this.

Ya, this afternoon I've been running two tasks for about 4.25hrs and I've got 450 "checkpoint taken" messages in my messages tab, sometimes showing two checkpoints on the same task within the same second.

Task names:
test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0
test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0
(both are running v1.50)
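
To measure the spacing instead of eyeballing the messages tab, a minimal Python sketch along these lines could pull the timestamps out of the `[checkpoint_debug]` lines and report the gap between consecutive ones. The pipe-separated layout and the date format are assumptions taken only from the two sample lines quoted above, and `checkpoint_gaps` is a hypothetical helper name, not anything BOINC provides.

```python
from datetime import datetime

# Two sample lines copied from the post above. The second one is cut off
# in the original post, so only its timestamp is usable.
SAMPLE_LOG = """\
1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result
"""

def checkpoint_gaps(log_text):
    """Return the seconds between consecutive [checkpoint_debug] lines."""
    stamps = []
    for line in log_text.splitlines():
        # Assumed layout: timestamp|project|message
        fields = line.split("|", 2)
        if len(fields) == 3 and fields[2].startswith("[checkpoint_debug]"):
            stamps.append(datetime.strptime(fields[0], "%m/%d/%Y %I:%M:%S %p"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

gaps = checkpoint_gaps(SAMPLE_LOG)
print(gaps)
```

The sketch does not separate tasks, so with two tasks running it reports the gap between any two checkpoint messages rather than per-task spacing.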
ID: 4474
I _ quit
Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4475 - Posted: 22 Jan 2009, 1:00:19 UTC - in response to Message 4474.  
Last modified: 22 Jan 2009, 1:02:22 UTC

interesting how you guys are getting checkpoint messages, i don't see that in my boinc manager. is that due to me using version 6 and you using version 5?
ID: 4475
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4476 - Posted: 22 Jan 2009, 5:09:57 UTC

Greg, this is one of the BOINC debug messages. You get them by setting up a cc_config.xml file. In this case you need "checkpoint_debug" set to 1. You also need at least the first three set to 1.
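
As a rough illustration of that file, here is a small Python sketch that builds a cc_config.xml body with those log flags switched on. It assumes "the first three" means the `task`, `file_xfer` and `sched_ops` log flags, which is a guess rather than something stated in this thread, and `build_cc_config` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Build cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

# Assumption: "the first three" are task, file_xfer and sched_ops.
xml_text = build_cc_config(["task", "file_xfer", "sched_ops", "checkpoint_debug"])
print(xml_text)
```

The printed text would go into a file named cc_config.xml in the BOINC data directory, and the client has to re-read its config file (or be restarted) before the extra messages show up.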

Otherwise, the checkpoints are pretty transparent. But, in my case, my BOINC data directory is over on my second drive and it cannot spin down when idle now, because it is never idle. Normally, the drive is set to spin down when not in use, and then every 15-30 minutes BOINC kicks in and wants to write something and it spins up to do so, then goes back to sleep. The timer that makes the drive sleep is longer than the time between checkpoints with this new app.

It really should be honoring the BOINC setting for "write at most". I'm not clear why, but many projects do not honor that setting. They take checkpoints whenever they are able, regardless. ...which was fine, until they started checkpointing every 2 minutes :)
ID: 4476
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4477 - Posted: 22 Jan 2009, 9:28:05 UTC

> I don't see that in my boinc manager


You can find it in your slots directory (v6.2.19). You will find the boinc_checkpoint_count file there. Open it and you will see how many checkpoints you have. There is also a list of the checkpoints.

My work unit has been running for about 34 minutes and I have accumulated 43 checkpoints.
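
To put those numbers in perspective, a quick back-of-the-envelope calculation in Python. The 1800-second figure is feet1st's "write to disk at most every" setting from earlier in the thread, borrowed here purely for comparison, and the function name is made up for this sketch.

```python
def average_checkpoint_interval(runtime_seconds, checkpoint_count):
    """Average number of seconds between checkpoints over a run."""
    return runtime_seconds / checkpoint_count

runtime = 34 * 60        # 34 minutes of run time
interval = average_checkpoint_interval(runtime, 43)

preference = 1800        # feet1st's setting, used only as a reference point
allowed = runtime // preference  # whole preference intervals that fit in the run

print(f"about {interval:.1f} s between checkpoints")
print(f"an 1800 s preference would allow roughly {allowed} disk write(s) in that time")
```

So 43 checkpoints in 34 minutes works out to one roughly every three quarters of a minute, against about one write in the same period if the 1800-second preference were honored.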

ID: 4477
Ian_D
Joined: 16 Feb 06
Posts: 16
Credit: 39,518
RAC: 0
Message 4478 - Posted: 22 Jan 2009, 9:30:50 UTC
Last modified: 22 Jan 2009, 9:32:15 UTC

Version 1.51

https://ralph.bakerlab.org/result.php?resultid=1258153


<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 9:15:49:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0081E942 read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...
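
The two numbers in the `<message>` line above are the same value written two ways: -1073741819 is the signed 32-bit reading of 0xc0000005, the access violation code shown in the exception record. A short Python check, with `to_unsigned32` as an illustrative helper name:

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF

exit_code = -1073741819
hex_code = f"0x{to_unsigned32(exit_code):08x}"
print(hex_code)
```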
ID: 4478
I _ quit
Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4479 - Posted: 22 Jan 2009, 9:37:42 UTC - in response to Message 4477.  
Last modified: 22 Jan 2009, 9:40:37 UTC

thanks for the info evan.
found the file. in 3+ hrs of run time one task has accumulated 561 checkpoints so far and the other 305. that's a pretty healthy number for 3 hrs. feet1st, that looks easy enough. i may try that later. thanks

mtyka - now into the 1.50 tasks and so far no errors on any of the tasks sent to my system.
ID: 4479
HA-SOFT, s.r.o.
Joined: 19 Jan 09
Posts: 6
Credit: 19,644
RAC: 0
Message 4480 - Posted: 22 Jan 2009, 9:54:07 UTC - in response to Message 4473.  
Last modified: 22 Jan 2009, 10:24:16 UTC

Tasks cannot be suspended; BOINC can do nothing with the process. After a few days I have about 10-15 dead rosetta tasks with 3MB of RAM allocated.

If I don't kill the app, it runs until the PC restarts (on the server that is usually about 30 days, until an MS security update and restart).

Could this be a problem with DEP turned on?

Ver 1.51:

BOINC:: Initializing ... ok.
[2009- 1-22 11: 0:26:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.


Unhandled Exception Detected...

- Unhandled Exception Record -

In the graphics application I got:

Starting graphics application...
Setting window title to 'minirosetta version 1.51 [workunit: lr6_D_score12_rlbd_2ccv_IGNORE_THE_REST_DECOY_6840_2]'.
OpenSemaphore failure
Successfully loaded '../../projects/ralph.bakerlab.org/Helvetica.txf'...
Close event (shmem not updated) detected, shutting down.
Shutting down graphics application...

But I'm running BOINC as a service (and connecting over RDP), and this may be the problem for the graphics.
ID: 4480
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4481 - Posted: 22 Jan 2009, 10:47:12 UTC

This one 1256978 has failed for the second time.

Maybe the same problem as reported by Zdenek Vasku

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-22 10:30:57:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Initializing options.... ok
Initializing random generators... ok
Initialization complete.
Watchdog active.

ERROR: unknown model name: 1DK8A_1
ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170
called boinc_finish
ID: 4481
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4482 - Posted: 22 Jan 2009, 14:14:35 UTC

This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2. Step 900,000! Its peak memory usage so far was 430MB.

Probably what you were expecting for such a large protein, but it definitely needs a high memory flag.

I believe the memory usage of the tasks is reported back with scheduler requests. Does the project process this data and query for anomalies in memory usage? Or do you need us to report such things?
ID: 4482
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4483 - Posted: 22 Jan 2009, 14:22:56 UTC

This task completed my 24hr runtime preference on v1.48. But it reported 99 starting structures (not the usual "1") and 98 decoys. So, what happened to the last one?
ID: 4483
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 4484 - Posted: 22 Jan 2009, 17:36:44 UTC

Add these to the mammoth second time failures:

1257164
1257066
1257064
ID: 4484
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4485 - Posted: 22 Jan 2009, 20:41:50 UTC

Morning.

Things are moving forward - 1.51 is out.

This should fix the checkpoint issue - they should not dump as frequently as before and should try to honor the user's setting. Now - I have to say this is not *always* possible. To answer your question, Feet1st: due to the structure of most simulation software (not just ours) you can't just checkpoint at any arbitrary point. Sometimes you *need* to checkpoint frequently or not at all, and sometimes you can't checkpoint (or at least it would take huge amounts of data, which is not OK). So what happens is that we checkpoint, but we "hold on" to the data in memory (i.e. buffer it) until it is officially time to checkpoint, and then we dump all the gathered checkpoints. Now, of course there is a limit to how much we can hold on to in order not to overflow the memory, so occasionally we *have* to dump, even if it's not time. Also, at the end of a decoy, everything is dumped, deleted and dealt with; the user setting cannot have any control over that. Hope that explains it.
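
The buffering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the minirosetta code: the class name, the byte limit and the `flushes` list (standing in for real disk writes) are all invented for the example.

```python
class CheckpointBuffer:
    """Hold checkpoints in memory and write them out only when necessary."""

    def __init__(self, interval_s, max_bytes, start_time=0.0):
        self.interval_s = interval_s  # user's "write to disk at most every" setting
        self.max_bytes = max_bytes    # hypothetical cap on buffered data
        self.last_flush = start_time
        self.pending = []
        self.flushes = []             # stand-in for disk writes: (time, reason, items)

    def checkpoint(self, data, now):
        """Buffer a checkpoint, dumping if the time is up or memory is tight."""
        self.pending.append(data)
        if now - self.last_flush >= self.interval_s:
            self._flush(now, "interval elapsed")
        elif sum(len(item) for item in self.pending) > self.max_bytes:
            self._flush(now, "buffer full")

    def end_of_decoy(self, now):
        """At the end of a decoy everything is dumped, whatever the setting."""
        self._flush(now, "end of decoy")

    def _flush(self, now, reason):
        self.flushes.append((now, reason, len(self.pending)))
        self.pending.clear()
        self.last_flush = now


buf = CheckpointBuffer(interval_s=1800, max_bytes=1000)
buf.checkpoint(b"a" * 100, now=60)     # held in memory
buf.checkpoint(b"b" * 100, now=120)    # held in memory
buf.checkpoint(b"c" * 100, now=1800)   # interval reached, everything is written
buf.checkpoint(b"d" * 2000, now=1860)  # too big to hold, written early
buf.checkpoint(b"e" * 100, now=1920)   # held in memory
buf.end_of_decoy(now=1980)             # written regardless of the timer
for entry in buf.flushes:
    print(entry)
```

Passing the clock in as `now` keeps the sketch deterministic; a real application would read the time and the disk-interval preference from its runtime environment instead.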

However, there was a glitch in the buffering mechanism, so it was always dumping. 1.51 should fix that - could you confirm that that is so!?

Otherwise this release has added debug information to let me figure out where all this stuff is failing. Believe me guys, we're now in the land where I cannot reproduce these errors here whatsoever. Not on the Linux boxes, Mac boxes or Windows boxes we have. Nowhere. Why these remaining segfaults occur is a total mystery to me, so please bear with me. This is going to be incredibly difficult to track down.

Thanks for all your help! Every post is super useful to us!!

Mike
ID: 4485
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4486 - Posted: 22 Jan 2009, 20:46:38 UTC - in response to Message 4482.  

> This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2.

Absolutely reasonable, isn't it? 3-4 hours per model is what we expect these days with the mammoth tasks.

> Step 900,000!

Step 900000? What does that mean? "Step"??

> Its peak memory usage so far was 430MB.

Hmm yeah, I guess we need to flag these. They're big, I know.

> I believe the memory usage of the tasks is reported back with scheduler requests. Does the project process this data and query for anomalies in memory usage? Or do you need us to report such things?

Not sure where I can get this info. If you see anomalies, let us know.
ID: 4486
mtyka
Volunteer moderator
Project developer
Project scientist
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4487 - Posted: 22 Jan 2009, 20:49:59 UTC - in response to Message 4480.  

Is there a way to turn off the graphics? What's DEP?

This is interesting. I will have to look at the code to see if there's a way for it to hang somewhere. I wonder what happens to mini if the graphics app fails...

I will have to try that locally. If you can describe your set-up in more detail that would be excellent. Is this task 1.51, 1.50 or 1.48?

Mike

ID: 4487
robertmiles
Joined: 13 Jan 09
Posts: 100
Credit: 331,865
RAC: 0
Message 4488 - Posted: 22 Jan 2009, 21:46:42 UTC

Should the name of this thread be edited to include 1.51, or should a new thread be started for 1.51?
ID: 4488



©2024 University of Washington
http://www.bakerlab.org