minirosetta v1.54 bug thread

Profile Brotherbard

Send message
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 4568 - Posted: 27 Jan 2009, 17:01:18 UTC - in response to Message 4566.  


Did you ever increase the size of the shared memory segment?

It is POSSIBLE that the original configuration is too limiting and that may be causing the error ... there are directions for increasing the size that *MAY* help ...


Yes, I did that some time ago.

When I checked it this morning, I had three workunits (1 from ralph and 2 from rosetta) that were stuck at a very low completion (around 0.111 to 0.240) and the two rosetta ones were just shy of 8 hours with my time setting at 2 hours. Show graphics does not show anything.

I needed to reboot due to installing some system updates and when BOINC came back up they started over at zero time and two of them failed right away, with one getting stuck again. I'm down to just this one workunit on my machine and I noticed that it is running around 200% CPU usage. It appears that both the main thread and the watchdog thread are stuck in __spin_lock.

--Nathan

ID: 4568 · Report as offensive    Reply Quote
Path7

Send message
Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4569 - Posted: 27 Jan 2009, 19:14:29 UTC

Hello all,

Running Ubuntu 8.04 & BOINC 5.10.45, the following WUs had an error after 0 seconds:
ccc_1_8_mammoth_cst_homo_bench_foldcst_chunk_general_t331__mtyka_IGNORE_THE_REST_1RFEA_12_6251_9_1

ccc_1_8_mammoth_cst_homo_bench_foldcst_chunk_general_t328__mtyka_IGNORE_THE_REST_2GVKA_9_6250_9_0

ccc_1_8_mammoth_cst_homo_bench_foldcst_chunk_general_t331__mtyka_IGNORE_THE_REST_1RFEA_12_6251_5_0


All three: process exited with code 1 (0x1, -255)
ERROR: Option matching -loop:close_loops not found in command line top-level context

From BOINC (one example):
di 27 jan 2009 19:00:50 CET|ralph@home|Computation for task ccc_1_8_mammoth_cst_homo_bench_foldcst_chunk_general_t331__mtyka_IGNORE_THE_REST_1RFEA_12_6251_5_0 finished

di 27 jan 2009 19:00:50 CET|ralph@home|Output file ccc_1_8_mammoth_cst_homo_bench_foldcst_chunk_general_t331__mtyka_IGNORE_THE_REST_1RFEA_12_6251_5_0_0 for task ccc_1_8_mammoth_cst_homo_bench_foldcst_chunk_general_t331__mtyka_IGNORE_THE_REST_1RFEA_12_6251_5_0 absent


I hope this will be helpful,
Path7.

ID: 4569 · Report as offensive    Reply Quote
Profile Paul D. Buck

Send message
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4570 - Posted: 27 Jan 2009, 20:18:47 UTC - in response to Message 4568.  


Did you ever increase the size of the shared memory segment?

It is POSSIBLE that the original configuration is too limiting and that may be causing the error ... there are directions for increasing the size that *MAY* help ...


Yes, I did that some time ago.

When I checked it this morning, I had three workunits (1 from ralph and 2 from rosetta) that were stuck at a very low completion (around 0.111 to 0.240) and the two rosetta ones were just shy of 8 hours with my time setting at 2 hours. Show graphics does not show anything.

I needed to reboot due to installing some system updates and when BOINC came back up they started over at zero time and two of them failed right away, with one getting stuck again. I'm down to just this one workunit on my machine and I noticed that it is running around 200% CPU usage. It appears that both the main thread and the watchdog thread are stuck in __spin_lock.

--Nathan



Ugh ...

Well, it was my best shot ...

I have had zero problems with virtually all versions of Rosetta on OS-X (Intel) even when I was having 50% failure rate with another 10-25% overrun problems on Windows with mini-Rosetta ... heck, with 1.54 I have put several machines back onto Rosetta now ...

Sorry I could not help ...

ID: 4570 · Report as offensive    Reply Quote
I _ quit

Send message
Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4571 - Posted: 27 Jan 2009, 20:20:47 UTC

pretty good success rate on my machine
25 tasks downloaded: 10 still to do, 1 failure and 14 successful completions so far.
ID: 4571 · Report as offensive    Reply Quote
Profile Paul D. Buck

Send message
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4572 - Posted: 27 Jan 2009, 21:13:12 UTC - in response to Message 4571.  

pretty good success rate on my machine
25 tasks downloaded: 10 still to do, 1 failure and 14 successful completions so far.


I lost track of how many I have done to this point, but with 1.54 I have not had a single failure, 5 systems including windows XP Pro and OS-X ... as a side note, no failures on the Rosetta side with 1.54 either that I have seen so far ....
ID: 4572 · Report as offensive    Reply Quote
mtyka
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4573 - Posted: 27 Jan 2009, 22:12:56 UTC - in response to Message 4566.  

Having problems with Mac OS X on a Mac Pro on both ralph and rosetta. So far not a single 1.54 task has completed successfully. I've stopped downloading more.

https://ralph.bakerlab.org/results.php?hostid=16351
https://boinc.bakerlab.org/rosetta/results.php?hostid=585071



Did you ever increase the size of the shared memory segment?

It is POSSIBLE that the original configuration is too limiting and that may be causing the error ... there are directions for increasing the size that *MAY* help ...


Did you mean *me* or brotherbard ?

Now, *I* did indeed increase the shared memory buffer size required for mini -
I think it will now use about 3MB (per app, I guess).

To quote the page that Paul D. Buck pointed out:
"The amount of shared memory available on a Mac is configured at boot time. Once the shared memory system has been initiallized it is not possible to change the shared memory configuration[1]. At present the same amount of shared memory is configured on any Mac (about 4MB), regardless of the number of processors or the amount of total memory available."

Holy smokes ?! When setting this app up, I considered 3MB to be quite conservative on, you know, machines that routinely have 1GB and more. But this might be an issue if the default config is closer to 4MB.

Let me talk to the original authors of the graphics app and find out more - we might be onto something here...
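
To put rough numbers on the concern above, here is a minimal sketch. It assumes each running minirosetta task needs one shared memory segment of about 3 MB (mtyka's own estimate) and that the system-wide limit is about 4 MB, as the quoted page states. The function name and the simple "segments add up" model are illustrative assumptions, not a description of how BOINC or the kernel actually account for shared memory.

```python
MB = 1024 * 1024

def tasks_that_fit(total_shared_bytes, per_task_bytes):
    """Return how many whole per-task segments fit under a system-wide limit."""
    return total_shared_bytes // per_task_bytes

default_limit = 4 * MB   # "about 4MB" according to the quoted page
per_task = 3 * MB        # mtyka's estimate for one minirosetta task (assumption)

print(tasks_that_fit(default_limit, per_task))  # 1
print(tasks_that_fit(16 * MB, per_task))        # 5
```

Under these assumptions, a multi-core machine running one task per core could run out of shared memory after the first task with the default configuration, which would fit the suspicion that the default is too tight.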


What else ?

glad to hear the 99 decoy limit is working ! yeah!

There is still something seriously fishy in the options system: I see a bunch of traces that end straight after "Initializing options..ok". 1.56 is in preparation and will hopefully reveal more about this bug.


This:
ERROR: Option matching -loop:close_loops not found in command line top-level context

is OK; it's just old WUs executing with the new version, which no longer supports this option. Not to worry.




ID: 4573 · Report as offensive    Reply Quote
mtyka
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4574 - Posted: 27 Jan 2009, 22:32:50 UTC - in response to Message 4568.  


Did you ever increase the size of the shared memory segment?

It is POSSIBLE that the original configuration is too limiting and that may be causing the error ... there are directions for increasing the size that *MAY* help ...


Yes, I did that some time ago.

When I checked it this morning, I had three workunits (1 from ralph and 2 from rosetta) that were stuck at a very low completion (around 0.111 to 0.240) and the two rosetta ones were just shy of 8 hours with my time setting at 2 hours. Show graphics does not show anything.

I needed to reboot due to installing some system updates and when BOINC came back up they started over at zero time and two of them failed right away, with one getting stuck again. I'm down to just this one workunit on my machine and I noticed that it is running around 200% CPU usage. It appears that both the main thread and the watchdog thread are stuck in __spin_lock.

--Nathan



Hi Nathan,

You, as well as ramostol, are having a strange problem on Mac OS X that I've not seen anywhere else yet. It always seems to start with an error just after the semaphore initialization and then fails a little bit further down.
Not sure how to approach this. I could send you a directory with a debug build and see if I can get a trace or something. But it's going to be nigh impossible to debug this from here since, I'm sad to say, it does not happen on our Mac OS X machines.
ID: 4574 · Report as offensive    Reply Quote
Profile feet1st

Send message
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4577 - Posted: 27 Jan 2009, 22:56:28 UTC

FYI, another task ended after 99 models completed if you wanted to review it.
https://ralph.bakerlab.org/result.php?resultid=1267792
ID: 4577 · Report as offensive    Reply Quote
mtyka
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4580 - Posted: 28 Jan 2009, 2:22:46 UTC


Ok, you power users, i need your help. We are seeing in our statistics
that a lot of people are seeing these errors over on BOINC:


too many exit(0)s


or lots of these:

Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting


These are one and the same problem; sometimes the actual error messages just don't get saved. If you see this, please let me know how it looks from your point of view.

Those of you who had lockfile problems - how did you solve them ?

What Client versions do you use ?

Are these clients somehow stuck ?

Need info - am pretty stuck with this one - it accounts for a hell of a lot of failures.


Some people never seem to get them and some get them all the time, if not every time.

Mike
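
For anyone who wants to check their own results for the two symptoms listed above, here is a minimal sketch that counts the signature strings in a saved stderr dump. The two strings are taken from this post; the function name, dictionary keys and sample text are illustrative assumptions.

```python
# Signature strings quoted in the post above; the keys are arbitrary labels.
SIGNATURES = {
    "lockfile": "Can't acquire lockfile",
    "exit0": "too many exit(0)s",
}

def classify_stderr(text):
    """Count how often each known failure signature appears in a stderr dump."""
    return {name: text.count(needle) for name, needle in SIGNATURES.items()}

sample = """Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
"""

print(classify_stderr(sample))  # {'lockfile': 3, 'exit0': 0}
```

A result with only the "exit0" count set would correspond to the case where the underlying lockfile messages were not saved.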
ID: 4580 · Report as offensive    Reply Quote
Profile Paul D. Buck

Send message
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4581 - Posted: 28 Jan 2009, 6:21:04 UTC
Last modified: 28 Jan 2009, 6:24:28 UTC

Tell them to change "Use CPU" to 100% ... the lock file error comes from the processor stopping and restarting ... so if you set "Use CPU 95% of the time" you should expect to see this error, though it may not occur for all tasks.


I am not sure what the other interaction is that gives rise to this issue .. but I did post a note about it in RaH forums awhile ago in that I learned of it in Einstein forums ... :)

This seems to be a bug in the BOINC Manager that has gotten no attention ... feel free to bug the BOINC Developers, just don't mention my name or they will certainly ignore you ... they always ignore me if I speak up ...

{edit}

I can see that this is not clear ...

If you set lower CPU limits, the CPU stops and starts, and SOMETIMES this causes the lock-file problem ... what the other essential ingredient is, is not clear. But if you run the CPU at 100% this problem does not seem to occur ...

I had a 50% failure rate for this problem at Rosetta along with non-stopping models on XP Pro which is why I stopped doing RaH work on windows systems ... now we are using 1.54 I have restarted running models there ....
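
To illustrate the kind of failure being described, here is a minimal sketch of a lock-file guard: a second instance that starts while the first one's lock still exists gives up. This is not BOINC's actual code. The exclusive-create approach, the function names and the file name are all assumptions chosen for a portable demonstration, and whether CPU throttling really triggers the error this way is only the hypothesis put forward in this post.

```python
import os
import tempfile

def acquire_lockfile(path):
    """Try to create the lock file exclusively; return True on success."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance still appears to hold the lock
    os.close(fd)
    return True

def release_lockfile(path):
    """Remove the lock file so that a later instance can start."""
    os.remove(path)

with tempfile.TemporaryDirectory() as slot_dir:
    lock = os.path.join(slot_dir, "lockfile_demo")  # hypothetical name
    first = acquire_lockfile(lock)    # the first instance gets the lock
    second = acquire_lockfile(lock)   # a restart while the lock is still held
    release_lockfile(lock)
    third = acquire_lockfile(lock)    # succeeds once the old lock is gone
    release_lockfile(lock)

print(first, second, third)  # True False True
```

In this model the "second" attempt corresponds to the "Can't acquire lockfile - exiting" message: the new instance starts before the previous one has released its lock.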
ID: 4581 · Report as offensive    Reply Quote
Profile Paul D. Buck

Send message
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4582 - Posted: 28 Jan 2009, 6:28:09 UTC - in response to Message 4573.  
Last modified: 28 Jan 2009, 6:40:52 UTC

Having problems with Mac OS X on a Mac Pro on both ralph and rosetta. So far not a single 1.54 task has completed successfully. I've stopped downloading more.

https://ralph.bakerlab.org/results.php?hostid=16351
https://boinc.bakerlab.org/rosetta/results.php?hostid=585071



Did you ever increase the size of the shared memory segment?

It is POSSIBLE that the original configuration is too limiting and that may be causing the error ... there are directions for increasing the size that *MAY* help ...


Did you mean *me* or brotherbard ?

Now, *I* did indeed increase the shared memory buffer size required for mini -
I think it will now use about 3MB (per app, I guess).

To quote the page that Paul D. Buck pointed out:
"The amount of shared memory available on a Mac is configured at boot time. Once the shared memory system has been initiallized it is not possible to change the shared memory configuration[1]. At present the same amount of shared memory is configured on any Mac (about 4MB), regardless of the number of processors or the amount of total memory available."

Holy smokes ?! When setting this app up, I considered 3MB to be quite conservative on, you know, machines that routinely have 1GB and more. But this might be an issue if the default config is closer to 4MB.

Let me talk to the original authors of the graphics app and find out more - we might be onto something here...


Which may be why my Mac Pro runs so well because I set that value very high (I think I went even bigger than the settings of the page I cited) because I have 16 G RAM in the Mac Pro I run ... just on the off chance that I would use some additional room ...

Not at all sure what the shared memory size is on XP machines but that too may be an issue ...

Lastly, Linux probably has some limit set though where and how it is set and what the default size is may not be easy to determine ...

{edit}
the numbers look ok for linux not sure how they are determined...

{edit 2}
I am using Ubuntu 8.10, two core system with 2G memory, disk space 58.5 G

shmmax = 33554432
shmall = 2097152
shmmni = 4096

the citation's note at the top on determining the values did work on Ubuntu ... if someone has another version of Linux could you post your numbers?

sysctl -A | grep shm

is the command to run in a terminal; you may get some access errors, which I suppose you could suppress by running it with sudo

{edit 3}

My OS-X is:

kern.sysv.shmall: 4096
kern.sysv.shmseg: 32
kern.sysv.shmmni: 128
kern.sysv.shmmin: 1
kern.sysv.shmmax: 16777216
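
For comparing numbers across machines, here is a minimal sketch that parses `sysctl` output in both forms seen in this thread ("name = value" on Linux, "name: value" on OS X) into a dictionary. The function name is an illustrative assumption; the sample values are the ones posted here.

```python
import re

def parse_shm_settings(text):
    """Parse sysctl lines such as 'kernel.shmmax = 33554432' or
    'kern.sysv.shmmax: 16777216' into {short_name: int}."""
    settings = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w.]+)\s*[=:]\s*(\d+)\s*$", line)
        if m and "shm" in m.group(1):
            short_name = m.group(1).rsplit(".", 1)[-1]
            settings[short_name] = int(m.group(2))
    return settings

linux = parse_shm_settings(
    "kernel.shmmax = 33554432\nkernel.shmall = 2097152\nkernel.shmmni = 4096"
)
osx = parse_shm_settings(
    "kern.sysv.shmall: 4096\nkern.sysv.shmseg: 32\nkern.sysv.shmmni: 128\n"
    "kern.sysv.shmmin: 1\nkern.sysv.shmmax: 16777216"
)

# Largest single segment allowed, in MiB
print(linux["shmmax"] // 2**20, osx["shmmax"] // 2**20)  # 32 16
```

Feeding it the output of the `sysctl -A | grep shm` command above should give directly comparable values.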
ID: 4582 · Report as offensive    Reply Quote
Snagletooth

Send message
Joined: 4 May 07
Posts: 67
Credit: 134,427
RAC: 0
Message 4583 - Posted: 28 Jan 2009, 9:24:56 UTC

Just confirming that the 99 model limit works for folks other than Feet1st :)
testD_cc_1_8_nocst4_hb_t362__IGNORE_THE_REST_2GF6A_10_7075_3_0

ended after 29642.2 secs with a 36000 sec target runtime.

Snags
ID: 4583 · Report as offensive    Reply Quote
I _ quit

Send message
Joined: 13 Jan 09
Posts: 44
Credit: 88,562
RAC: 0
Message 4584 - Posted: 28 Jan 2009, 9:32:45 UTC

testD_cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_19_7107_1_0 had a validate error happen. the task ran ok producing 15 decoys.
ID: 4584 · Report as offensive    Reply Quote
AdeB
Avatar

Send message
Joined: 22 Dec 07
Posts: 61
Credit: 161,367
RAC: 0
Message 4585 - Posted: 28 Jan 2009, 9:56:43 UTC - in response to Message 4582.  
Last modified: 28 Jan 2009, 9:58:59 UTC


I am using Ubuntu 8.10, two core system with 2G memory, disk space 58.5 G

shmmax = 33554432
shmall = 2097152
shmmni = 4096

the citation's note at the top on determining the values did work on Ubuntu ... if someone has another version of Linux could you post your numbers?

sysctl -A | grep shm

is the command to run in a terminal; you may get some access errors, which I suppose you could suppress by running it with sudo



Both of my Gentoo Linux systems (Athlon XP with 512M memory) show exactly the same numbers:

kernel.shmmax = 33554432
kernel.shmall = 2097152
kernel.shmmni = 4096
ID: 4585 · Report as offensive    Reply Quote
Profile Paul D. Buck

Send message
Joined: 14 Jan 09
Posts: 62
Credit: 33,293
RAC: 0
Message 4586 - Posted: 28 Jan 2009, 11:06:16 UTC - in response to Message 4585.  

Both of my Gentoo Linux systems (Athlon XP with 512M memory) show exactly the same numbers:

kernel.shmmax = 33554432
kernel.shmall = 2097152
kernel.shmmni = 4096


Cool ...

It may be a problem localized to OS-X ... though changing the application may be the best route forward ... assuming that this is the cause, or one of the causes, of some of the problems ...

Anyone running Red Hat? Suse? Others I have never heard of? :)
ID: 4586 · Report as offensive    Reply Quote
Profile Brotherbard

Send message
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 4587 - Posted: 28 Jan 2009, 15:32:18 UTC - in response to Message 4586.  

My Mac Pro was set to:
	kern.sysv.shmall: 8192
	kern.sysv.shmseg: 32
	kern.sysv.shmmni: 128
	kern.sysv.shmmin: 1
	kern.sysv.shmmax: 33554432

I changed it to match what Paul has on his Mac but that did not fix anything.

--Nathan

ID: 4587 · Report as offensive    Reply Quote
lazypug

Send message
Joined: 15 Jun 08
Posts: 2
Credit: 14,214
RAC: 0
Message 4588 - Posted: 28 Jan 2009, 18:19:23 UTC

using p4 1.8 winpro sp3


Task ID 1273176
Name testD_cc_1_8_nocst4_hb_t313__IGNORE_THE_REST_1BG2A_8_7063_4_0
Workunit 1123469
Created 27 Jan 2009 8:49:15 UTC
Sent 27 Jan 2009 8:54:13 UTC
Received 28 Jan 2009 14:17:36 UTC
Server state Over
Outcome Validate error
Client state Done
Exit status 0 (0x0)
Computer ID 14204
Report deadline 31 Jan 2009 8:54:13 UTC
CPU time 18011.97
stderr out <core_client_version>6.2.28</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-28 4: 2:22:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/ralph.bakerlab.org/testD_cc_1_8_nocst4.foldcst_chunk_general.t313_.mtyka.boinc_files.zip
<unzip> <-oq> <../../projects/ralph.bakerlab.org/testD_cc_1_8_nocst4.foldcst_chunk_general.t313_.mtyka.boinc_files.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _1BG2A_8_00001
====>
called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 28.8840523942973
Granted credit 0
application version 1.54
ID: 4588 · Report as offensive    Reply Quote
Profile Brotherbard

Send message
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 4589 - Posted: 28 Jan 2009, 19:00:16 UTC - in response to Message 4587.  

I ran the minirosetta 1.54 app in gdb and here is the stack trace:

[code]Breakpoint 1, 0x9603b4a9 in malloc_error_break ()
(gdb
ID: 4589 · Report as offensive    Reply Quote
Profile Brotherbard

Send message
Joined: 16 Feb 06
Posts: 15
Credit: 76,109
RAC: 0
Message 4590 - Posted: 28 Jan 2009, 19:10:14 UTC - in response to Message 4587.  

I ran the minirosetta 1.54 app in gdb and here is the stack trace:

Breakpoint 1, 0x9603b4a9 in malloc_error_break ()
(gdb) bt
#0  0x9603b4a9 in malloc_error_break ()
#1  0x96036497 in szone_error ()
#2  0x95f60463 in szone_free ()
#3  0x95f602cd in free ()
#4  0x005d8576 in WEEK_PREFS::~WEEK_PREFS () at /usr/include/c++/4.0.0/bits/basic_string.h:227
#5  0x00d13e76 in GLOBAL_PREFS::~GLOBAL_PREFS ()
#6  0x00103d8d in protocols::boinc::Boinc::initialize_worker () at /usr/include/c++/4.0.0/bits/basic_string.h:227
#7  0x000034a1 in main () 

And here is the beginning of stderr.txt

BOINC:: Initializing ... ok.
[2009- 1-28  8:51: 4:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
minirosetta_1.54_i686-apple-darwin(1142,0xa07b2720) malloc: *** error for object 0x1a3d2e0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
minirosetta_1.54_i686-apple-darwin(1142,0xa07b2720) malloc: *** error for object 0x1a3c270: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.

I do have two day-of-the-week overrides set, and turning them off fixed the problem! I had one weekday set in CPU Usage and one in Network Usage, and there were two malloc errors; a quick test shows you get one error from each weekday set. The only reason I had them set was because I was playing with the GUI-RPCs; I don't actually need them for anything. Also, the daily time settings do not have this problem.

I'm not sure if minirosetta is doing anything special with the global prefs; I would suspect this is a BOINC defect. I'm running BOINC 6.2.18 and have not tested this on other versions, nor do I have work from any other project on this machine at the moment, so I cannot test whether other projects fail like this too.

--Nathan
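
To make the reading of the stack trace above explicit, here is a minimal sketch that parses gdb `bt` output and reports the first frame that is not part of the allocator, i.e. the caller that handed the bad pointer to `free()`. The regular expression only targets the frame format shown in this post, and the set of allocator symbols is an assumption based on this one trace.

```python
import re

# Matches frames like '#4  0x005d8576 in SYMBOL () at file:line'
FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-f]+) in (.+?) \(\)(?: at (\S+))?$")

# Allocator-internal symbols seen in the trace above (assumed list).
ALLOCATOR_SYMBOLS = {"malloc_error_break", "szone_error", "szone_free", "free"}

def parse_backtrace(text):
    """Return a list of (frame_number, symbol) tuples from gdb 'bt' output."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), m.group(3)))
    return frames

def first_caller_outside_allocator(frames):
    """Return the first symbol that is not an allocator internal, or None."""
    for _, symbol in frames:
        if symbol not in ALLOCATOR_SYMBOLS:
            return symbol
    return None

TRACE = """#0  0x9603b4a9 in malloc_error_break ()
#1  0x96036497 in szone_error ()
#2  0x95f60463 in szone_free ()
#3  0x95f602cd in free ()
#4  0x005d8576 in WEEK_PREFS::~WEEK_PREFS () at /usr/include/c++/4.0.0/bits/basic_string.h:227
#5  0x00d13e76 in GLOBAL_PREFS::~GLOBAL_PREFS ()
#6  0x00103d8d in protocols::boinc::Boinc::initialize_worker () at /usr/include/c++/4.0.0/bits/basic_string.h:227
#7  0x000034a1 in main () """

frames = parse_backtrace(TRACE)
print(first_caller_outside_allocator(frames))  # WEEK_PREFS::~WEEK_PREFS
```

The result points at the `WEEK_PREFS` destructor, which is consistent with the observation that each day-of-the-week override produced one malloc error.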

ID: 4590 · Report as offensive    Reply Quote
mtyka
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4591 - Posted: 28 Jan 2009, 20:05:07 UTC - in response to Message 4590.  
Last modified: 28 Jan 2009, 20:08:30 UTC

I ran the minirosetta 1.54 app in gdb and here is the stack trace:


genius.



I'm not sure if minirosetta is doing anything special with the global prefs,

nope.


I would suspect this is a BOINC defect. I'm running BOINC 6.2.18, and have not tested this on other versions, nor do I have work from any other project on this machine at the moment so cannot test if other projects fail like this too.

--Nathan



Nathan - awesome ! Let me have a look at the code now, at least we have a handle on what failed. Maybe i can fix it in the next release. We should notify the boinc people too if this is an API error.

So what exactly is the week override setting ?

What happens if you remove it ?

Does that app run fine and through to the end ?


-- Mike
ID: 4591 · Report as offensive    Reply Quote



©2024 University of Washington
http://www.bakerlab.org