Message boards : RALPH@home bug list : Bug Reports for Minirosetta v1.36
Author | Message |
---|---|
James Thompson Volunteer moderator Project developer Project scientist Send message Joined: 7 Jun 06 Posts: 16 Credit: 268 RAC: 0 |
Please post issues/bugs relating to minirosetta version 1.36 here. Version 1.35 has fixes for access violations, super-long workunit run times and communication problems with the BOINC manager. |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
May I ask where the heck Application Version 1.00 came from if we are supposed to be up to 1.36 ??? I have had 3 validate errors and all had version 1.00 stamped on them. All have the same messages in the result that version 1.35 has (such as "recovering checkpoint" etc. for about 30 lines for all new work units), but they do not validate. See 1107308 1107311 1107417. Also, still no RAC decay on this project for participants and hosts. Thanks, Conan. |
AdeB Send message Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0 |
"May I ask where the heck Application Version 1.00 came from if we are supposed to be up to 1.36 ???" A few days ago there was a new application running on my computer: minirosetta_split_terms version 100. |
Qui-Gon Jinn Send message Joined: 28 Sep 08 Posts: 3 Credit: 0 RAC: 0 |
I lost 1.5 hours of work when BOINC switched applications. The task is hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1YS9A_13_5083_1_0 Applications were not left in memory while others (including mass production minirosetta 1.34's) were running |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
I have 4 running work units (3 more waiting) that have been running for over 8 hours, with 3 of these running for over 11 hours (my preference is set to 6 hours). I had a power failure, and on restarting BOINC all was running fine for quite a number of minutes when I noticed that two of the 11 hour work units had disappeared. They had not uploaded, so on checking I noticed that I still had the same number of work units in my queue and they had restarted from zero. The other 11 hour WU then did the same thing, so now all of the 11 hour WUs have restarted from zero time, zero progress. From 7 hours of progress there were 9 minutes 57 seconds to go; this stayed the same for the next 4 hours until restarting. Not wanting to waste another day of processing, I have decided to abort all 4 of the ones that have gone past 8 and 11 hours. I am not happy about this. There are no error messages in the result to indicate what it did or why it did it. See 1109819 1109832 1109843 1109844 Conan. EDIT: After reading that this same problem has been going on with the Rosetta work units since "hombench" started back in September, I decided to abort all remaining work units on my Linux machines and leave the ones on the Windows machines, as I am not sure if it affects Windows or not. I had 1110265 also exhibit the same behavior as the ones already reported, so I believe they are all faulty and just wasting our time and energy. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
"I lost 1.5 hours of work when BOINC switched applications. The task is ..." That is normal when you do not leave applications in memory. They continue to work on increasing the frequency of checkpoints, but it is a process, not an event. Newer versions of BOINC try to wait until a checkpoint is reached before switching applications. This preserves more work. |
Path7 Send message Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0 |
The next WU was finished by the watchdog: hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t286__IGNORE_THE_REST_1A2OA_10_5077_1_0 Rosetta is going too long. Watchdog is ending the run! CPU time: 27579.8 seconds. Greater than 3X preferred time: 7200 seconds Have a nice day, Path7. |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
"I have 4 running work units (3 more waiting), that have been running for over 8 hours, with 3 of these running for over 11 hours (my preference is set to 6 hours)." Well, another day and more wasted effort: 1113445 1111201 1111140 1111132 1111123 1111049 1110265 1109843. All these work units went past my set preference and stuck with just under 10 minutes to go for ages. All WUs show the same symptoms as I have mentioned above. As I am going away, I have aborted all Ralph work units so that my computers don't get stuck for hours doing nothing and probably getting nothing for the effort. Please sort this out and I will gladly allow work when I return. It has been going on for close to a month now, here and on Rosetta, same problem and same work unit type. Conan. |
Qui-Gon Jinn Send message Joined: 28 Sep 08 Posts: 3 Credit: 0 RAC: 0 |
"Newer versions of BOINC try to wait until a checkpoint is reached before switching applications. This preserves more work." Yes, but losing (again) 84% of the work is not nice. The end of the task seems to take forever, though. I looked at BOINC yesterday and 1 hr later it had only advanced .5% at 95%. In contrast, my computer finished the first 85% in about 2 hrs. P.S. It is DEFINITELY not normal to go from 95% to 11% because I had to shut down my computer. There has to be a checkpoint in between. UPDATE: 20 minutes later... my computer advanced 40%. Isn't that strange? |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
"The end of the task seems to take forever, though. ...1 hr later it was only advanced .5% at 95%. ...the first 85%... in about 2 hrs." Now you understand why they are working hard to focus on, and eliminate, the long-running models! A checkpoint is made at the end of each model, and sometimes more frequently than that, depending on the type of work. And you are describing symptoms of a task that runs for 3 hours and still has not completed its first model. |
Qui-Gon Jinn Send message Joined: 28 Sep 08 Posts: 3 Credit: 0 RAC: 0 |
Ok I get it now. Thanks. |
Rabinovitch Send message Joined: 7 Oct 08 Posts: 3 Credit: 191,411 RAC: 0 |
08.10.2008 5:34:48|ralph@home|Computation for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0 finished 08.10.2008 5:34:48|ralph@home|Output file hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0_0 for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0 absent |
AdeB Send message Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0 |
task 1109308 and task 1111517 ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763 and Granted credit: 0 for both after running more than 4 hours. |
Ed and Harriet Griffith Send message Joined: 13 Apr 08 Posts: 2 Credit: 3,446 RAC: 0 |
Works great, but time to completion is off. (says needs 15 minutes when it takes 2 hours) |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Looking at Ed's results, the last two reported were both ended by the watchdog because the 1hr runtime preference was exceeded by 3 times. The two tasks: hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1FEZA_4_5083_1_0 hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1FEZA_3_5083_1_0 |
Pieface Send message Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0 |
These guys are strange! I'm running win xp x64, with leave apps in memory. have three running simultaneously and getting re-start messages every 10 minutes or so. One just 'finished' according to boinc, but in the stderr I found: # cpu_run_time_pref: 86400 failed to create shared mem segment CreateSemaphore failure! Cannot create semaphore! # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 1281.86 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== WU is: RESID 1114064 |
Rabinovitch Send message Joined: 7 Oct 08 Posts: 3 Credit: 191,411 RAC: 0 |
10.10.2008 1:07:39|ralph@home|Computation for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0 finished 10.10.2008 1:07:39|ralph@home|Output file hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0_0 for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0 absent |
Pieface Send message Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0 |
Second one ended the same as the first. I was watching the graphics for a bit, and just before one of the restarts I thought I saw a message go by saying something about being in the same step for 5 mins with no progress. Is a 2 GHz machine too slow for these guys? stderr this time: # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 failed to create shared mem segment CreateSemaphore failure! Cannot create semaphore! # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 5925.52 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== Wu output: Resid 1114065 |
EvoDude Send message Joined: 18 Feb 06 Posts: 28 Credit: 639,833 RAC: 0 |
Couple of computation errors today on the first series. 1111935 1111929 Same error message on both:- <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> ====================================================== DONE :: 1 starting structures 2852.38 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> <message> <file_xfer_error> <file_name>hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t325__IGNORE_THE_REST_1P0KA_9_5092_1_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> ]]> |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_2F3XA_12_5091_1_0 has been going 16hrs and still on model 1. |
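For anyone trying to work out when the watchdog should step in on a task like the ones reported in this thread, here is a minimal sketch of the rule that the "Greater than 3X preferred time" message and feet1st's "exceeded by 3 times" comment seem to describe. The function names and the `factor` parameter are hypothetical and are not taken from the minirosetta source; the real watchdog may check only at intervals or apply other conditions, which could be why the CPU time Path7 reported is well past the limit.

```python
def watchdog_limit(preferred_seconds: float, factor: float = 3.0) -> float:
    """CPU-time limit implied by the log line (assumed: factor x runtime preference)."""
    return factor * preferred_seconds


def watchdog_would_end(cpu_seconds: float, preferred_seconds: float,
                       factor: float = 3.0) -> bool:
    """True if the CPU time used so far is past the assumed watchdog limit."""
    return cpu_seconds > watchdog_limit(preferred_seconds, factor)


if __name__ == "__main__":
    # Numbers from Path7's post: 7200 s preference, 27579.8 s of CPU time.
    print(watchdog_limit(7200))               # 21600.0
    print(watchdog_would_end(27579.8, 7200))  # True
    # A 1 hr runtime preference, as in Ed's results.
    print(watchdog_limit(3600))               # 10800.0
```

With a 6 hour preference, as in Conan's reports, the same rule would put the limit at 18 hours of CPU time, so a task stuck on one model could run a long while before the watchdog ends it.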
©2024 University of Washington
http://www.bakerlab.org