Message boards : RALPH@home bug list : minirosetta v1.54 bug thread
Author | Message |
---|---|
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
pretty good success rate on my machine I lost track of how many I have done to this point, but with 1.54 I have not had a single failure, 5 systems including windows XP Pro and OS-X ... as a side note, no failures on the Rosetta side with 1.54 either that I have seen so far .... |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
FYI, another task ended after 99 models completed if you wanted to review it. https://ralph.bakerlab.org/result.php?resultid=1267792 |
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Tell them to change use CPU to 100% ... the lock file error is from the processor stopping and restarting ... so you cannot set "Use CPU 95% of the time" and not see this error, though it may not occur for all tasks. I am not sure what the other interaction is that gives rise to this issue ... but I did post a note about it in the RaH forums a while ago, after I learned of it in the Einstein forums ... :) This seems to be a bug in the BOINC Manager that has gotten no attention ... feel free to bug the BOINC Developers, just don't mention my name or they will certainly ignore you ... they always ignore me if I speak up ... {edit} I can see that this is not clear ... If you set lower CPU limits the CPU stops and starts, and SOMETIMES this causes the lock-file problem ... what the other essential ingredient is, is not clear. But if you run the CPU at 100% this problem does not seem to occur ... I had a 50% failure rate for this problem at Rosetta along with non-stopping models on XP Pro, which is why I stopped doing RaH work on Windows systems ... now that we are using 1.54 I have restarted running models there .... |
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Having problems with Mac OS X on a Mac Pro on both ralph and rosetta. So far not a single 1.54 task has completed successfully. I've stopped downloading more. Which may be why my Mac Pro runs so well, because I set that value very high (I think I went even bigger than the settings of the page I cited) because I have 16 G RAM in the Mac Pro I run ... just on the off chance that I would use some additional room ... Not at all sure what the shared memory size is on XP machines but that too may be an issue ... Lastly, Linux probably has some limit set, though where and how it is set and what the default size is may not be easy to determine ... {edit} the numbers look ok for Linux, not sure how they are determined ... {edit 2} I am using Ubuntu 8.10, two core system with 2G memory, disk space 58.5 G: shmmax = 33554432, shmall = 2097152, shmmni = 4096. The citation's note at the top on determining the values did work on Ubuntu ... if someone has another version of Linux could you post your numbers? sysctl -A | grep shm is the command to run in terminal; you may get some access errors, which I suppose you could suppress by running it as sudo. {edit 3} My OS-X is: kern.sysv.shmall: 4096, kern.sysv.shmseg: 32, kern.sysv.shmmni: 128, kern.sysv.shmmin: 1, kern.sysv.shmmax: 16777216 |
Snagletooth Send message Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
Just confirming that the 99 model limit works for folks other than Feet1st :) testD_cc_1_8_nocst4_hb_t362__IGNORE_THE_REST_2GF6A_10_7075_3_0 ended after 29642.2 secs with a 36000 sec target runtime. Snags |
I _ quit Send message Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
testD_cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_19_7107_1_0 had a validate error happen. the task ran ok producing 15 decoys. |
AdeB Send message Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0 |
Both of my Gentoo Linux systems (Athlon XP with 512M memory) show exactly the same numbers: kernel.shmmax = 33554432 kernel.shmall = 2097152 kernel.shmmni = 4096 |
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Both of my Gentoo Linux systems (Athlon XP with 512M memory) show exactly the same numbers: Cool ... It may be a problem localized to OS-X ... though changing the application may be the best route forward ... assuming that this is the cause, or one of the causes, of some of the problems ... Anyone running Red Hat? Suse? Others I have never heard of? :) |
Brotherbard Send message Joined: 16 Feb 06 Posts: 15 Credit: 76,109 RAC: 0 |
My Mac Pro was set to: kern.sysv.shmall: 8192 kern.sysv.shmseg: 32 kern.sysv.shmmni: 128 kern.sysv.shmmin: 1 kern.sysv.shmmax: 33554432 I changed it to match what Paul has on his Mac but that did not fix anything. --Nathan |
lazypug Send message Joined: 15 Jun 08 Posts: 2 Credit: 14,214 RAC: 0 |
using p4 1.8 winpro sp3 Task ID 1273176 Name testD_cc_1_8_nocst4_hb_t313__IGNORE_THE_REST_1BG2A_8_7063_4_0 Workunit 1123469 Created 27 Jan 2009 8:49:15 UTC Sent 27 Jan 2009 8:54:13 UTC Received 28 Jan 2009 14:17:36 UTC Server state Over Outcome Validate error Client state Done Exit status 0 (0x0) Computer ID 14204 Report deadline 31 Jan 2009 8:54:13 UTC CPU time 18011.97 stderr out <core_client_version>6.2.28</core_client_version> <![CDATA[ <stderr_txt> BOINC:: Initializing ... ok. [2009- 1-28 4: 2:22:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing core... Initializing options.... ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/ralph.bakerlab.org/minirosetta_database_rev26003.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Unpacking WU data ... Unpacking data: ../../projects/ralph.bakerlab.org/testD_cc_1_8_nocst4.foldcst_chunk_general.t313_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/ralph.bakerlab.org/testD_cc_1_8_nocst4.foldcst_chunk_general.t313_.mtyka.boinc_files.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Setting database description ... Setting up checkpointing ... Setting up folding (abrelax) ... Beginning folding (abrelax) ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Starting work on structure: _1BG2A_8_00001 ====> called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 28.8840523942973 Granted credit 0 application version 1.54 |
Brotherbard Send message Joined: 16 Feb 06 Posts: 15 Credit: 76,109 RAC: 0 |
I ran the minirosetta 1.54 app in gdb and here is the stack trace: Breakpoint 1, 0x9603b4a9 in malloc_error_break () (gdb) bt #0 0x9603b4a9 in malloc_error_break () #1 0x96036497 in szone_error () #2 0x95f60463 in szone_free () #3 0x95f602cd in free () #4 0x005d8576 in WEEK_PREFS::~WEEK_PREFS () at /usr/include/c++/4.0.0/bits/basic_string.h:227 #5 0x00d13e76 in GLOBAL_PREFS::~GLOBAL_PREFS () #6 0x00103d8d in protocols::boinc::Boinc::initialize_worker () at /usr/include/c++/4.0.0/bits/basic_string.h:227 #7 0x000034a1 in main () And here is the beginning of stderr.txt BOINC:: Initializing ... ok. [2009- 1-28 8:51: 4:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. minirosetta_1.54_i686-apple-darwin(1142,0xa07b2720) malloc: *** error for object 0x1a3d2e0: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug minirosetta_1.54_i686-apple-darwin(1142,0xa07b2720) malloc: *** error for object 0x1a3c270: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. I do have two day-of-the-week overrides set and turning them off fixed the problem! I had one weekday set in CPU Usage and one in Network Usage and there were two malloc errors, a quick test shows you get one error from each weekday set. The only reason I had them set was because I was playing with the GUI-RPCs, I don't actually need them for anything. Also the daily time settings do not have this problem. I'm not sure if minirosetta is doing anything special with the global prefs, I would suspect this is a BOINC defect. I'm running BOINC 6.2.18, and have not tested this on other versions, nor do I have work from any other project on this machine at the moment so cannot test if other projects fail like this too. --Nathan |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
So what exactly is the week override setting? In the preferences on the advanced view of the BOINC Manager, you can define specific days of the week on which you wish BOINC to use CPU and/or network. This lets a school or corporate environment run only on weekends, for example. |
Brotherbard Send message Joined: 16 Feb 06 Posts: 15 Credit: 76,109 RAC: 0 |
I have 8 minirosetta 1.54 workunits from r@h that have completed successfully now without failures. --Nathan |
Paul D. Buck Send message Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
Has anyone with the lock-file problem tried the solution I suggested? I am curious if that is the cause of that problem ... |
Snagletooth Send message Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
Not Paul, but I just came from Einstein so I think I remember right where to look. It's a long thread, so for the short, extremely lucid version go straight to Gary Roberts. Hope this helps. Snags |
I _ quit Send message Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
Not Paul, but I just came from Einstein so I think I remember right where to look. we had a discussion about the lockfile problem in rosie. I can't seem to find the thread(s) related to that. essentially it was just deleting the file from the slots folder where the file/folder equaled 0 bytes. once you cleaned that up, restarted boinc and had all your usual settings of memory and such set up, it seemed to go away. also with boinc 6.4.5 the problem went away. but this was mostly with windows machines. |
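
Several posts in this thread compare shared-memory limits by eye from `sysctl -A | grep shm` output. As a rough sketch only, the Python below parses both output styles that were posted (Linux `kernel.shmmax = ...` and OS X `kern.sysv.shmmax: ...`) and reports which values differ between two machines. The function names `parse_shm_settings` and `diff_settings` are made up for illustration, and the sample text is copied from the two Mac Pro posts; nothing here is part of BOINC or minirosetta.

```python
import re

# One setting per line, in either style seen in this thread:
#   kernel.shmmax = 33554432      (Linux)
#   kern.sysv.shmmax: 16777216    (OS X)
_SETTING = re.compile(r"^\s*([\w.]+)\s*[:=]\s*(\d+)\s*$")


def parse_shm_settings(text):
    """Return {short_name: value} for every shm* setting found in text.

    Lines that do not look like a numeric setting (for example the
    access errors mentioned above) are skipped.
    """
    settings = {}
    for line in text.splitlines():
        match = _SETTING.match(line)
        if match and "shm" in match.group(1):
            # Keep only the last dotted component, e.g. "shmmax".
            short_name = match.group(1).rsplit(".", 1)[-1]
            settings[short_name] = int(match.group(2))
    return settings


def diff_settings(a, b):
    """Return {name: (value_in_a, value_in_b)} for settings that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Values as posted for the two Mac Pros in this thread.
paul_osx = """kern.sysv.shmall: 4096
kern.sysv.shmseg: 32
kern.sysv.shmmni: 128
kern.sysv.shmmin: 1
kern.sysv.shmmax: 16777216"""

nathan_osx = """kern.sysv.shmall: 8192
kern.sysv.shmseg: 32
kern.sysv.shmmni: 128
kern.sysv.shmmin: 1
kern.sysv.shmmax: 33554432"""

print(diff_settings(parse_shm_settings(paul_osx), parse_shm_settings(nathan_osx)))
# prints {'shmall': (4096, 8192), 'shmmax': (16777216, 33554432)}
```

To compare your own machine, paste the output of `sysctl -A | grep shm` into a string and pass it to `parse_shm_settings`. Note that matching these values did not fix the OS X failures reported above, so this is only a convenience for comparing numbers, not a fix.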
©2024 University of Washington
http://www.bakerlab.org