Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread
Author | Message |
---|---|
I _ quit Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
every task given to me so far has completed ok. I see there are no more jobs in the queue, so i take it this test is coming to an end? |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
so i take it this test is coming to an end? Don't rush it! There were still 1000 failures and 3000 successes. I'm not sure what time the last of the DB packaging problems were cleared through. But, you've barely given any time for a 24hr runtime to complete. Greg. Typically, they release a few tasks; if there are no clear problems like the DB packaging and successes start coming in, they then release a few thousand tasks over the course of a day or so. And then, when they've made the final adjustments, explained most of the reported errors and confirmed the results, they do a final push of 10,000+ tasks (again over the course of a day or two) to really seek out those rare and intermittent problems. THEN they send it over to Rosetta. |
I _ quit Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
so i take it this test is coming to an end? ok...thanks for the explanation, just not sure what to expect here since this is the first time for me on ralph. |
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
test_cc_1_8_nocst4_hb_t367__IGNORE_THE_REST_1WOLA_4_6830_2_0 ended with same rare error as Ian D and Paul D Buck. <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> Watchdog active. # cpu_run_time_pref: 14400 ERROR: target_strands.size() ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94 called boinc_finish Snags |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
I still have problems on my new W2008 X64 server. How do you know it *hangs*? Rosetta will not print anything to stdout - that's normal. Are the graphics moving? What happens if you just let it run for a few hours? Mike |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My harddrive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace. Ya, this afternoon I've been running two tasks for about 4.25hrs and I've got 450 "checkpoint taken" messages in my messages tab, sometimes showing two checkpoints on the same task within the same second. Task names: test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 (both are running v1.50) |
I _ quit Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My harddrive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace. interesting how you guys are getting checkpoint messages, i don't see that in my boinc manager. is that due to me using version 6 and you're using version 5? |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Greg, this is one of the BOINC debug messages. You get them by setting up a cc_config.xml file. In this case you need "checkpoint_debug" set to 1. You also need at least the first three set to 1. Otherwise, the checkpoints are pretty transparent. But, in my case, my BOINC data directory is over on my second drive and it cannot spin down when idle now, because it is never idle. Normally, the drive is set to spin down when not in use, and then every 15-30 minutes BOINC kicks in and wants to write something and it spins up to do so, then goes back to sleep. The timer that makes the drive sleep is longer than the time between checkpoints with this new app. It really should be honoring the BOINC setting for "write at most". I'm not clear why, but many projects do not honor that setting. They take checkpoints as they are able, regardless. ...which was fine, until they started checkpointing every 2 minutes :) |
Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
I don't see that in my boinc manager You can find it in your slots directory. (v6.2.19) You will find the boinc_checkpoint_count file. Open it and you will see how many checkpoints you have. Also there is a list of the checkpoints. My work unit has been running for about 34 minutes and I have accumulated 43 checkpoints. |
Joined: 16 Feb 06 Posts: 16 Credit: 39,518 RAC: 0 |
Version 1.51 https://ralph.bakerlab.org/result.php?resultid=1258153 <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> BOINC:: Initializing ... ok. [2009- 1-22 9:15:49:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Initializing options.... ok Initializing random generators... ok Initialization complete. Watchdog active. Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0081E942 read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... |
I _ quit Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
I don't see that in my boinc manager thanks for the info evan. found the file, in 3 hrs+ run time one task accumulated 561 so far and the other 305. that's a pretty healthy number for 3 hrs. feet1st looks easy enough. i may try that later. thanks mtyka - now into the 1.50 tasks and so far no errors on any of the tasks sent to my system. |
HA-SOFT, s.r.o. Joined: 19 Jan 09 Posts: 6 Credit: 19,644 RAC: 0 |
Tasks cannot be suspended; BOINC can do nothing with the process. After a few days I have about 10-15 dead rosetta tasks with 3M RAM allocated. If I don't kill the app, it runs until the PC restarts (on a server, usually about 30 days until an MS security update and restart). Could this be a problem with DEP turned on? Ver 1.51: BOINC:: Initializing ... ok. [2009- 1-22 11: 0:26:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. Unhandled Exception Detected... - Unhandled Exception Record - In Graphics application I have got: Starting graphics application... Setting window title to 'minirosetta version 1.51 [workunit: lr6_D_score12_rlbd_2ccv_IGNORE_THE_REST_DECOY_6840_2]'. OpenSemaphore failure Successfully loaded '../../projects/ralph.bakerlab.org/Helvetica.txf'... Close event (shmem not updated) detected, shutting down. Shutting down graphics application... but I'm running BOINC as a service (and connecting over RDP) and this may be the problem for the graphics. |
Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
This one 1256978 has failed for the second time. Maybe the same problem as reported by Zdenek Vasku <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> BOINC:: Initializing ... ok. [2009- 1-22 10:30:57:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Initializing options.... ok Initializing random generators... ok Initialization complete. Watchdog active. ERROR: unknown model name: 1DK8A_1 ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170 called boinc_finish |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2. Step 900,000! Its peak memory usage so far was 430MB. Probably what you were expecting for such a large protein, but it definitely needs a high memory flag. I believe the memory usage of the tasks is reported back with scheduler requests. Does the project process this data and query for anomalies in memory usage? Or do you need us to report such things? |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
This task completed my 24hr runtime preference on v1.48. But it reported 99 starting structures (not the usual "1") and 98 decoys. So, what happened to the last one? |
Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
|
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
Morning. Things are moving forward - 1.51 is out. This should fix the checkpoint issue - checkpoints should not be dumped as frequently as before, and the app should try to honor the user's setting. Now - I have to say this is not *always* possible. To answer your question Feet1st, due to the structure of most simulation software (not just ours) you can't just checkpoint at any arbitrary point. Sometimes you *need* to checkpoint frequently or not at all, sometimes you can't checkpoint (or at least it would take huge amounts of data, which is not ok). So what happens is that we checkpoint, but we "hold on" to the data in memory (i.e. buffer it) until it is officially time to checkpoint, and then we dump all the gathered checkpoints. Now, of course there is a limit to how much we can hold on to in order not to overflow the memory, so occasionally we *have* to dump, even if it's not time. Also, at the end of a decoy, everything is dumped and deleted and dealt with; the user setting cannot have any control over that. Hope that explains it. However, there was a glitch in the buffering mechanism so it was always dumping. 1.51 should fix that - could you confirm that that is so!? Otherwise this release has added debug information to let me figure out where all this stuff is failing. Believe me guys, we're now in the land where I cannot reproduce these errors here whatsoever. Not on the Linux boxes, Mac boxes or Windows boxes we have. Nowhere. Why these remaining segfaults occur is a total mystery to me, so please bear with me. This is going to be incredibly difficult to track down. Thanks for all your help! Every post is super useful to us!! Mike |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
This task is running v1.51. It seems a bit on the large side in all dimensions here. It's been running for 6 hours, but is only on model 2. Absolutely reasonable, isn't it? 3-4 hours per model is what we expect these days with the mammoth tasks. > Step 900,000! Step 900000? What does that mean? "Step"?? > It's peak memory usage so far was 430MB. Hmm yeah, i guess we need to flag these. They're big, i know. Not sure where i can get this info. If you see anomalies let us know. |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
Tasks can not be suspended, boinc can do nothing with process. After few days I have about 10-15 death rosetta tasks with 3M RAM allocated. Is there a way to turn off the graphics? What's DEP? This is interesting, i will have to look at the code to see if there's a way for it to hang somewhere. I wonder what happens to mini if the graphics app fails .. I will have to try that locally. If you can describe your set-up in more detail that would be excellent. Is this task 1.51 or 1.50 or 1.48? Mike |
Joined: 13 Jan 09 Posts: 100 Credit: 331,865 RAC: 0 |
Should the name of this thread be edited to include 1.51, or should a new thread be started for 1.51? |
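
A rough sketch of the buffering Mike describes for 1.51 above: checkpoints are held in memory and only written out when the user's "write to disk at most every" interval has passed, when the buffer would otherwise grow too large, or at the end of a decoy. This is not the minirosetta code. The class name, its parameters, and the byte-based memory cap are all assumptions made for illustration.

```python
import time
from typing import Callable, List


class BufferedCheckpointer:
    """Hypothetical illustration: hold checkpoints in memory, write in batches."""

    def __init__(self, min_interval_s: float, max_buffered_bytes: int,
                 write: Callable[[List[bytes]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval_s = min_interval_s          # user's "write at most every" setting
        self.max_buffered_bytes = max_buffered_bytes  # assumed memory cap for the buffer
        self._write = write                           # callback that actually hits the disk
        self._clock = clock
        self._buffer: List[bytes] = []
        self._buffered_bytes = 0
        self._last_flush = clock()
        self.flush_count = 0

    def checkpoint(self, state: bytes) -> bool:
        """Record a checkpoint; return True if this call caused a disk write."""
        self._buffer.append(state)
        self._buffered_bytes += len(state)
        interval_elapsed = self._clock() - self._last_flush >= self.min_interval_s
        buffer_full = self._buffered_bytes > self.max_buffered_bytes
        if interval_elapsed or buffer_full:
            self.flush()
            return True
        return False

    def end_of_decoy(self) -> None:
        """At the end of a decoy everything is written, regardless of the timer."""
        self.flush()

    def flush(self) -> None:
        """Write all buffered checkpoints in one batch and reset the timer."""
        if self._buffer:
            self._write(self._buffer)
            self.flush_count += 1
        self._buffer = []
        self._buffered_bytes = 0
        self._last_flush = self._clock()
```

With an 1800 second interval, checkpoints taken every couple of minutes stay in memory and reach the disk as one batch, so a drive with a 15-30 minute sleep timer still gets a chance to spin down. The glitch reported in 1.50 would correspond to `flush()` being triggered on every call.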
©2023 University of Washington
http://www.bakerlab.org