Bug Reports for Minirosetta v1.36

Message boards : RALPH@home bug list : Bug Reports for Minirosetta v1.36

James Thompson
Volunteer moderator
Project developer
Project scientist

Joined: 7 Jun 06
Posts: 16
Credit: 268
RAC: 0
Message 4238 - Posted: 5 Oct 2008, 23:42:36 UTC

Please post issues/bugs relating to minirosetta version 1.36 here. Version 1.35 has fixes for access violations, super-long workunit run times and communication problems with the BOINC manager.


Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4239 - Posted: 6 Oct 2008, 8:08:27 UTC
Last modified: 6 Oct 2008, 8:09:59 UTC

May I ask where the heck Application Version 1.00 came from if we are supposed to be up to 1.36 ???

I have had 3 validate errors and all had version 1.00 stamped on them.

All have the same messages in the result that version 1.35 has (such as "recovering checkpoint" etc., for about 30 lines) for all new work units, but they do not validate.

See 1107308
1107311
1107417


Also still no RAC decay on this project for participants and hosts.

Thanks, Conan.

AdeB
Joined: 22 Dec 07
Posts: 61
Credit: 161,367
RAC: 0
Message 4240 - Posted: 6 Oct 2008, 17:08:39 UTC - in response to Message 4239.  

May I ask where the heck Application Version 1.00 came from if we are supposed to be up to 1.36 ???

I have had 3 validate errors and all had version 1.00 stamped on them.

All have the same messages in the result that version 1.35 has (such as "recovering checkpoint" etc., for about 30 lines) for all new work units, but they do not validate.

See 1107308
1107311
1107417


Also still no RAC decay on this project for participants and hosts.

Thanks, Conan.


A few days ago there was a new application running on my computer: minirosetta_split_terms version 100.

Qui-Gon Jinn
Joined: 28 Sep 08
Posts: 3
Credit: 0
RAC: 0
Message 4241 - Posted: 7 Oct 2008, 0:36:07 UTC

I lost 1.5 hours of work when BOINC switched applications. The task is

hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1YS9A_13_5083_1_0

Applications were not left in memory while others (including mass production minirosetta 1.34's) were running

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4242 - Posted: 7 Oct 2008, 11:36:07 UTC
Last modified: 7 Oct 2008, 12:08:16 UTC

I have 4 running work units (3 more waiting), that have been running for over 8 hours, with 3 of these running for over 11 hours (my preference is set to 6 hours).

I had a power failure and on restarting Boinc all was running fine for quite a number of minutes when I noticed that two of the 11 hour work units had disappeared.
They had not uploaded so on checking I noticed that I still had the same number of work units in my queue and they had restarted from zero. The other 11 hour WU then did the same thing, so now all of the 11 hour WU's have restarted from zero time, zero progress.

From 7 hours progress there were 9 minutes 57 seconds to go, this stayed the same for the next 4 hours until restarting.

Not wanting to waste another day of processing I have decided to abort all 4 of the ones that have gone past 8 and 11 hours.
I am not happy about this.

There are no error messages in the result to indicate what it did or why it did it.
See 1109819
1109832
1109843
1109844

Conan.

EDIT:: After reading that this same problem has been going on with the Rosetta work units since "hombench" started back in September I decided to abort all remaining work units on my Linux machines and leave the ones on the Windows machines as not sure if it affects Windows or not.
I had 1110265 also exhibit the same behavior as the ones already reported, so I believe they are all faulty and just wasting our time and energy.

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4243 - Posted: 7 Oct 2008, 15:22:23 UTC - in response to Message 4241.  

I lost 1.5 hours of work when BOINC switched applications. The task is

hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1YS9A_13_5083_1_0

Applications were not left in memory while others (including mass production minirosetta 1.34's) were running


...that is normal when you do not leave applications in memory. The developers continue to work on increasing the frequency of checkpoints, but it is a process, not an event.

Newer versions of BOINC try to wait until a checkpoint is reached before switching applications. This preserves more work.

Path7
Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4245 - Posted: 7 Oct 2008, 16:56:39 UTC

The following WU was finished by the watchdog:
hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t286__IGNORE_THE_REST_1A2OA_10_5077_1_0

Rosetta is going too long. Watchdog is ending the run!
CPU time: 27579.8 seconds. Greater than 3X preferred time: 7200 seconds

Have a nice day,
Path7.
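
The watchdog message above can be checked by hand. A minimal sketch, assuming the rule is simply "end the run once CPU time exceeds three times the preferred runtime" and reading the log as a preferred runtime of 7200 seconds; the function name and the `factor` parameter are hypothetical and not taken from the minirosetta source:

```python
def watchdog_should_end(cpu_time_s: float, preferred_s: float, factor: float = 3.0) -> bool:
    """Assumed rule: end the run when CPU time exceeds factor x preferred runtime."""
    return cpu_time_s > factor * preferred_s


# Values copied from the watchdog message in the post above.
cpu_time = 27579.8      # seconds of CPU time used
preferred = 7200.0      # preferred runtime in seconds (assumed reading of the log)

limit = 3.0 * preferred         # 21600.0 seconds
overrun = cpu_time - limit      # how far past the limit the task ran

print(watchdog_should_end(cpu_time, preferred))
print(round(overrun, 1))
```

With the logged values the limit works out to 21600 seconds, so under this reading the task ran roughly 5980 seconds past it before being stopped.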

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4246 - Posted: 7 Oct 2008, 20:08:27 UTC - in response to Message 4242.  

I have 4 running work units (3 more waiting), that have been running for over 8 hours, with 3 of these running for over 11 hours (my preference is set to 6 hours).

I had a power failure and on restarting Boinc all was running fine for quite a number of minutes when I noticed that two of the 11 hour work units had disappeared.
They had not uploaded so on checking I noticed that I still had the same number of work units in my queue and they had restarted from zero. The other 11 hour WU then did the same thing, so now all of the 11 hour WU's have restarted from zero time, zero progress.

From 7 hours progress there were 9 minutes 57 seconds to go, this stayed the same for the next 4 hours until restarting.

Not wanting to waste another day of processing I have decided to abort all 4 of the ones that have gone past 8 and 11 hours.
I am not happy about this.

There are no error messages in the result to indicate what it did or why it did it.
See 1109819
1109832
1109843
1109844

Conan.

EDIT:: After reading that this same problem has been going on with the Rosetta work units since "hombench" started back in September I decided to abort all remaining work units on my Linux machines and leave the ones on the Windows machines as not sure if it affects Windows or not.
I had 1110265 also exhibit the same behavior as the ones already reported, so I believe they are all faulty and just wasting our time and energy.


Well another day and more wasted effort,
1113445
1111201
1111140
1111132
1111123
1111049
1110265
1109843

All these work units went past my set preference and stuck with just under 10 minutes to go for ages. All WU's show the same symptoms as I have mentioned above.
As I am going away, I have aborted all Ralph work units so that my computers don't get stuck for hours doing nothing and probably getting nothing for the effort.

Please sort this out and I will gladly allow work when I return. It has been going on for close to a month now, here and on Rosetta: same problem and same work unit type.

Conan.

Qui-Gon Jinn
Joined: 28 Sep 08
Posts: 3
Credit: 0
RAC: 0
Message 4247 - Posted: 7 Oct 2008, 23:17:10 UTC - in response to Message 4243.  

Newer versions of BOINC try to wait until a checkpoint is reached before switching applications. This preserves more work.


Yes, but losing (again) 84% of the work is not nice. The end of the task seems to take forever, though. I looked at BOINC yesterday and 1 hr later it had only advanced 0.5%, at 95%. In contrast, my computer finished the first 85% in about 2 hrs.

P.S. It is DEFINITELY not normal to go from 95% to 11% because I had to shut down my computer. There has to be a checkpoint in between.

UPDATE: 20 minutes later... my computer advanced 40%. Isn't that strange?

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4248 - Posted: 8 Oct 2008, 1:37:15 UTC - in response to Message 4247.  

The end of the task seems to take forever, though. ...1 hr later it was only advanced .5% at 95%. ...the first 85%... in about 2 hrs.


Now you understand why they are working hard to focus on, and eliminate, the long running models! A checkpoint is made at the end of each model, and sometimes more frequently than that, depending on the type of work. And you are describing symptoms of a task that runs for 3 hours and still has not completed its first model.
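
To illustrate the point about checkpoints, here is a rough sketch of how much CPU time survives when a task is removed from memory. It assumes work is preserved only up to the most recent checkpoint; the function names and the example checkpoint times are made up for illustration and are not minirosetta's actual checkpoint logic:

```python
from bisect import bisect_right


def work_kept(checkpoints_s, stop_s):
    """CPU seconds preserved if the task is stopped at stop_s.

    checkpoints_s: sorted CPU times (seconds) at which a checkpoint was written.
    """
    i = bisect_right(checkpoints_s, stop_s)   # number of checkpoints at or before stop_s
    return checkpoints_s[i - 1] if i else 0.0


def work_lost(checkpoints_s, stop_s):
    """CPU seconds that must be redone after a restart."""
    return stop_s - work_kept(checkpoints_s, stop_s)


# Hypothetical case 1: the first model is still running after 1.5 hours,
# so no checkpoint exists yet and everything is lost.
print(work_lost([], 5400.0))                  # 5400.0

# Hypothetical case 2: a checkpoint every 30 minutes, so only the time
# since the last checkpoint is lost.
print(work_lost([1800.0, 3600.0], 5400.0))    # 1800.0
```

Under this model, a task whose first model takes several hours loses everything on each restart, which would match the behaviour described in this thread.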

Qui-Gon Jinn
Joined: 28 Sep 08
Posts: 3
Credit: 0
RAC: 0
Message 4249 - Posted: 8 Oct 2008, 2:16:58 UTC

Ok I get it now. Thanks.

Rabinovitch
Joined: 7 Oct 08
Posts: 3
Credit: 191,411
RAC: 0
Message 4250 - Posted: 8 Oct 2008, 2:59:03 UTC

08.10.2008 5:34:48|ralph@home|Computation for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0 finished
08.10.2008 5:34:48|ralph@home|Output file hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0_0 for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1S5UA_15_5091_1_0 absent

AdeB
Joined: 22 Dec 07
Posts: 61
Credit: 161,367
RAC: 0
Message 4251 - Posted: 8 Oct 2008, 22:04:50 UTC
Last modified: 8 Oct 2008, 22:09:41 UTC

task 1109308 and task 1111517

ERROR: NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763

and Granted credit: 0 for both after running more than 4 hours.

Ed and Harriet Griffith
Joined: 13 Apr 08
Posts: 2
Credit: 3,446
RAC: 0
Message 4252 - Posted: 9 Oct 2008, 15:08:56 UTC

Works great, but the time-to-completion estimate is off (it says 15 minutes are needed when the task takes 2 hours).

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4253 - Posted: 9 Oct 2008, 16:42:48 UTC

Looking at Ed's results, the last two reported were both ended by the watchdog because the 1hr runtime preference was exceeded by 3 times.

The two tasks:
hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1FEZA_4_5083_1_0
hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t303__IGNORE_THE_REST_1FEZA_3_5083_1_0

Pieface
Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 4254 - Posted: 9 Oct 2008, 17:17:52 UTC

These guys are strange! I'm running Win XP x64, with "leave applications in memory" enabled. I have three running simultaneously and am getting restart messages every 10 minutes or so. One just 'finished' according to BOINC, but in the stderr I found:

# cpu_run_time_pref: 86400
failed to create shared mem segment
CreateSemaphore failure! Cannot create semaphore!
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 1281.86 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

WU is: RESID 1114064

Rabinovitch
Joined: 7 Oct 08
Posts: 3
Credit: 191,411
RAC: 0
Message 4255 - Posted: 9 Oct 2008, 18:17:24 UTC

10.10.2008 1:07:39|ralph@home|Computation for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0 finished
10.10.2008 1:07:39|ralph@home|Output file hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0_0 for task hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t322__IGNORE_THE_REST_1VPMA_17_5091_1_0 absent

Pieface
Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 4256 - Posted: 9 Oct 2008, 19:10:43 UTC

The second one ended the same as the first. I was watching the graphics for a bit, and just before one of the restarts I thought I saw a message go by saying something about being in the same step for 5 mins with no progress. Is a 2 GHz machine too slow for these guys?

stderr this time:

# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
failed to create shared mem segment
CreateSemaphore failure! Cannot create semaphore!
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 5925.52 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

Wu output:

Resid 1114065

EvoDude
Joined: 18 Feb 06
Posts: 28
Credit: 639,833
RAC: 0
Message 4257 - Posted: 10 Oct 2008, 18:01:00 UTC

Couple of computation errors today on the first series.

1111935
1111929

Same error message on both:-

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 2852.38 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t325__IGNORE_THE_REST_1P0KA_9_5092_1_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
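
For anyone collecting these reports, the failing file and error code can be pulled out of a stderr dump like the one above with a small script. This is only a sketch: the regular expression assumes the `<file_xfer_error>` block always has the layout shown in this post, and the helper name `transfer_errors` is made up. As far as I know, -161 is `ERR_NOT_FOUND` in BOINC's error numbers, which would fit the "Output file ... absent" messages earlier in the thread, but treat that mapping as an assumption:

```python
import re

# Fragment copied from the stderr output in the post above.
stderr_fragment = """
<message>
<file_xfer_error>
<file_name>hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t325__IGNORE_THE_REST_1P0KA_9_5092_1_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
"""

# Matches one <file_xfer_error> block and captures the file name and numeric code.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def transfer_errors(text):
    """Return a list of (file_name, error_code) pairs found in the text."""
    return [(m.group("name"), int(m.group("code"))) for m in XFER_ERROR.finditer(text)]


for name, code in transfer_errors(stderr_fragment):
    print(code, name)
```

A regular expression is used rather than an XML parser because the full stderr dump mixes a CDATA section with free text and is not a well-formed XML document on its own.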

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4258 - Posted: 10 Oct 2008, 19:35:57 UTC
Last modified: 10 Oct 2008, 19:36:51 UTC




©2024 University of Washington
http://www.bakerlab.org