Bug Reports for Minirosetta v1.36

Author	Message
James Thompson Volunteer moderator Project developer Project scientist Joined: 7 Jun 06 Posts: 16 Credit: 268 RAC: 0	Message 4238 - Posted: 5 Oct 2008, 23:42:36 UTC Please post issues/bugs relating to minirosetta version 1.36 here. Version 1.35 has fixes for access violations, super-long workunit run times and communication problems with the BOINC manager.

Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0	Message 4239 - Posted: 6 Oct 2008, 8:08:27 UTC Last modified: 6 Oct 2008, 8:09:59 UTC May I ask where the heck Application Version 1.00 came from if we are supposed to be up to 1.36? I have had 3 validate errors and all had version 1.00 stamped on them. All have the same messages in the result that version 1.35 has (such as "recovering checkpoint" etc. for about 30 lines for all new work units) but do not validate. See 1107308 1107311 1107417. Also, still no RAC decay on this project for participants and hosts. Thanks, Conan.

AdeB Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0	Message 4240 - Posted: 6 Oct 2008, 17:08:39 UTC - in response to Message 4239. A few days ago there was a new application running on my computer: minirosetta_split_terms version 100.

Qui-Gon Jinn Joined: 28 Sep 08 Posts: 3 Credit: 0 RAC: 0	Message 4241 - Posted: 7 Oct 2008, 0:36:07 UTC I lost 1.5 hours of work when BOINC switched applications. The task is hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1YS9A_13_5083_1_0. Applications were not left in memory while others (including mass production minirosetta 1.34's) were running.

Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0	Message 4242 - Posted: 7 Oct 2008, 11:36:07 UTC Last modified: 7 Oct 2008, 12:08:16 UTC I have 4 running work units (3 more waiting) that have been running for over 8 hours, with 3 of these running for over 11 hours (my preference is set to 6 hours). I had a power failure, and on restarting BOINC all was running fine for quite a number of minutes when I noticed that two of the 11-hour work units had disappeared. They had not uploaded, so on checking I noticed that I still had the same number of work units in my queue and they had restarted from zero. The other 11-hour WU then did the same thing, so now all of the 11-hour WUs have restarted from zero time, zero progress. From 7 hours of progress there were 9 minutes 57 seconds to go; this stayed the same for the next 4 hours until restarting. Not wanting to waste another day of processing, I have decided to abort all 4 of the ones that have gone past 8 and 11 hours. I am not happy about this. There are no error messages in the result to indicate what it did or why it did it. See 1109819 1109832 1109843 1109844. Conan. EDIT: After reading that this same problem has been going on with the Rosetta work units since "hombench" started back in September, I decided to abort all remaining work units on my Linux machines and leave the ones on the Windows machines, as I am not sure whether it affects Windows or not. I had 1110265 also exhibit the same behavior as the ones already reported, so I believe they are all faulty and just wasting our time and energy.

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4243 - Posted: 7 Oct 2008, 15:22:23 UTC - in response to Message 4241. "I lost 1.5 hours of work when BOINC switched applications. The task is hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1YS9A_13_5083_1_0 Applications were not left in memory while others (including mass production minirosetta 1.34's) were running" ...that is normal when you do not leave applications in memory. The developers continue to work on increasing the frequency of checkpoints, but it is a process, not an event. Newer versions of BOINC try to wait until a checkpoint is reached before switching applications. This preserves more work.

Path7 Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0	Message 4245 - Posted: 7 Oct 2008, 16:56:39 UTC The next WU was finished by the watchdog: hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t286__IGNORE_THE_REST_1A2OA_10_5077_1_0 Rosetta is going too long. Watchdog is ending the run! CPU time: 27579.8 seconds. Greater than 3X preferred time: 7200 seconds Have a nice day, Path7.
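
The watchdog message quoted in the post above compares the task's CPU time against three times the preferred runtime. A minimal Python sketch of that check, assuming the rule is exactly "CPU time greater than 3 x preferred time" as the log line suggests. The function name and the `multiplier` parameter are hypothetical and are not taken from the minirosetta source, which may apply further conditions:

```python
def watchdog_should_end(cpu_seconds: float,
                        preferred_seconds: float,
                        multiplier: float = 3.0) -> bool:
    """Return True when CPU time exceeds multiplier x the preferred runtime.

    Hypothetical helper for illustration only; the real watchdog may differ.
    """
    return cpu_seconds > multiplier * preferred_seconds


# Values from the log line above: 27579.8 s of CPU time, 7200 s preference.
limit = 3.0 * 7200  # 21600.0 seconds
print(watchdog_should_end(27579.8, 7200))  # prints True
```

With a 7200-second preference the threshold is 21600 seconds, so a task at 27579.8 seconds is well past the point where it would be ended.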

Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0	Message 4246 - Posted: 7 Oct 2008, 20:08:27 UTC - in response to Message 4242. Well, another day and more wasted effort: 1113445 1111201 1111140 1111132 1111123 1111049 1110265 1109843. All these work units went past my set preference and stuck with just under 10 minutes to go for ages. All WUs show the same symptoms as I have mentioned above. As I am going away, I have aborted all Ralph work units so that my computers don't get stuck for hours doing nothing and probably getting nothing for the effort. Please sort this out and I will gladly allow work when I return. It has been going on for close to a month now, here and on Rosetta: same problem and same work unit type. Conan.

Qui-Gon Jinn Joined: 28 Sep 08 Posts: 3 Credit: 0 RAC: 0	Message 4247 - Posted: 7 Oct 2008, 23:17:10 UTC - in response to Message 4243. "Newer versions of BOINC try to wait until a checkpoint is reached before switching applications. This preserves more work." Yes, but losing (again) 84% of the work is not nice. The end of the task seems to take forever, though. I looked at BOINC yesterday and 1 hr later it had only advanced 0.5%, at 95%. In contrast, my computer finished the first 85% in about 2 hrs. P.S. It is DEFINITELY not normal to go from 95% to 11% because I had to shut down my computer. There has to be a checkpoint in between. UPDATE: 20 minutes later... my computer advanced 40%. Isn't that strange?

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4248 - Posted: 8 Oct 2008, 1:37:15 UTC - in response to Message 4247. "The end of the task seems to take forever, though. ...1 hr later it was only advanced .5% at 95%. ...the first 85%... in about 2 hrs." Now you understand why they are working hard to focus on, and eliminate, the long-running models! A checkpoint is made at the end of each model, and sometimes more frequently than that, depending on the type of work. And you are describing symptoms of a task that runs for 3 hours and still has not completed its first model.
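
To illustrate why a restart can discard so much progress: if a checkpoint is written only at the end of each model, all CPU time since the last completed model is lost when the application is removed from memory. A rough Python sketch, assuming checkpoints are recorded as CPU-time offsets in seconds. `cpu_time_lost` and its arguments are hypothetical names for illustration and are not part of BOINC or minirosetta; as the post notes, some work types checkpoint more often than once per model:

```python
def cpu_time_lost(elapsed_seconds, checkpoint_times):
    """CPU seconds discarded if the task restarts at elapsed_seconds.

    checkpoint_times: CPU-time offsets (seconds) at which checkpoints
    were written. With no checkpoint yet, the task restarts from zero.
    """
    reached = [t for t in checkpoint_times if t <= elapsed_seconds]
    last_checkpoint = max(reached, default=0.0)
    return elapsed_seconds - last_checkpoint


# A task 3 hours in that has not finished its first model has no checkpoint,
# so a restart loses all 3 hours.
print(cpu_time_lost(3 * 3600, []))  # prints 10800.0
```

By contrast, a task that checkpointed at 1800 and 3600 seconds and is interrupted at 5400 seconds would lose only the 1800 seconds since its last checkpoint.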

Qui-Gon Jinn Joined: 28 Sep 08 Posts: 3 Credit: 0 RAC: 0	Message 4249 - Posted: 8 Oct 2008, 2:16:58 UTC OK, I get it now. Thanks.

Rabinovitch Joined: 7 Oct 08 Posts: 3 Credit: 191,411 RAC: 0	Message 4250 - Posted: 8 Oct 2008, 2:59:03 UTC
08.10.2008 5:34:48|ralph@home|Computation for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0 finished
08.10.2008 5:34:48|ralph@home|Output file hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0_0 for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0 absent

AdeB Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0	Message 4251 - Posted: 8 Oct 2008, 22:04:50 UTC Last modified: 8 Oct 2008, 22:09:41 UTC Task 1109308 and task 1111517 both ended with: ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763. Granted credit: 0 for both after running more than 4 hours.

Ed and Harriet Griffith Joined: 13 Apr 08 Posts: 2 Credit: 3,446 RAC: 0	Message 4252 - Posted: 9 Oct 2008, 15:08:56 UTC Works great, but the time-to-completion estimate is off (it says 15 minutes are needed when the task takes 2 hours).

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4253 - Posted: 9 Oct 2008, 16:42:48 UTC Looking at Ed's results, the last two reported were both ended by the watchdog because they ran for more than 3 times the 1 hr runtime preference. The two tasks: hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1FEZA_4_5083_1_0 hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1FEZA_3_5083_1_0

Pieface Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0	Message 4254 - Posted: 9 Oct 2008, 17:17:52 UTC These guys are strange! I'm running Win XP x64 with "leave applications in memory" enabled. I have three running simultaneously and am getting restart messages every 10 minutes or so. One just 'finished' according to BOINC, but in the stderr I found: # cpu_run_time_pref: 86400 failed to create shared mem segment CreateSemaphore failure! Cannot create semaphore! # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 1281.86 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== WU is: RESID 1114064

Rabinovitch Joined: 7 Oct 08 Posts: 3 Credit: 191,411 RAC: 0	Message 4255 - Posted: 9 Oct 2008, 18:17:24 UTC
10.10.2008 1:07:39|ralph@home|Computation for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0 finished
10.10.2008 1:07:39|ralph@home|Output file hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0_0 for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0 absent

Pieface Joined: 16 Feb 06 Posts: 64 Credit: 203,513 RAC: 0	Message 4256 - Posted: 9 Oct 2008, 19:10:43 UTC The second one ended the same as the first. I was watching the graphics for a bit, and just before one of the restarts I thought I saw a message go by saying something about being in the same step for 5 mins with no progress. Is a 2 GHz machine too slow for these guys? stderr this time: # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 failed to create shared mem segment CreateSemaphore failure! Cannot create semaphore! # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 86400 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 5925.52 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== WU output: RESID 1114065

EvoDude Joined: 18 Feb 06 Posts: 28 Credit: 639,833 RAC: 0	Message 4257 - Posted: 10 Oct 2008, 18:01:00 UTC A couple of computation errors today on the first series: 1111935 1111929. Same error message on both: <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> ====================================================== DONE :: 1 starting structures 2852.38 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> <message> <file_xfer_error> <file_name>hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t325__IGNORE_THE_REST_1P0KA_9_5092_1_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> ]]>

feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 4258 - Posted: 10 Oct 2008, 19:35:57 UTC Last modified: 10 Oct 2008, 19:36:51 UTC hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_2F3XA_12_5091_1_0 has been running for 16 hrs and is still on model 1.