Dead boinc

Message boards : RALPH@home bug list : Dead boinc

Darren

Joined: 16 Feb 06
Posts: 5
Credit: 777
RAC: 0
Message 517 - Posted: 23 Feb 2006, 5:43:40 UTC

Had a ralph and a rosetta unit running and now have a dead boinc. Ralph was this unit and rosetta was this unit. Ralph was about 6 hours into 8, and rosetta was about 12 hours into 24. 4 seti units were in the cache also, but had not started any processing. Don't know which triggered the problem, but I suddenly got a screen full of nothing but pages of this:

abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT:


Any attempt to restart boinc only generates the same thing. I can't catch any message yet at the beginning of the error - I can see the normal startup beginning (showing my projects, computer IDs, etc.) but the entire buffer then immediately fills with the above. I'll keep trying to catch the buffer and see if any message appears at onset. Thus far I have edited the client_state file to suspend all activity in every project and can now get boinc to restart. I'll keep digging and see if I can get any further than this.


Darren

Joined: 16 Feb 06
Posts: 5
Credit: 777
RAC: 0
Message 518 - Posted: 23 Feb 2006, 6:38:51 UTC

OK, I got at least one step further here.

In digging a bit deeper, I was able to determine that the SIGABRT is somehow related to network activity being enabled. When I disabled all network activity in addition to all the projects and workunits being suspended, boinc would restart. Enabling work again for ralph and rosetta gave me the following:

Thu 23 Feb 2006 12:49:26 AM EST|rosetta@home|ACTIVE_TASKS::restart_tasks(); missing files
Thu 23 Feb 2006 12:49:26 AM EST|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_INCREASECYCLES50_1npsA_317_662_0 (One or more missing files)
Thu 23 Feb 2006 12:49:26 AM EST|ralph@home|ACTIVE_TASKS::restart_tasks(); missing files
Thu 23 Feb 2006 12:49:26 AM EST|ralph@home|Unrecoverable error for result BARCODE_30_1scjB_215_50_0 (One or more missing files)

In digging into that, I see that every single project folder on my system is totally empty. The 2 slots still had the files for those units, but no files in the project folders (for any project).
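For anyone wanting to run the same check on their own install, here is a minimal Python sketch that lists which subfolders are completely empty. It assumes the BOINC folder contains subfolders named `projects` and `slots`; the helper name `empty_subdirs` is made up for illustration and is not part of boinc.

```python
from pathlib import Path


def empty_subdirs(parent):
    """Return the sorted names of immediate subfolders of `parent` that hold no entries.

    `parent` is assumed to be something like <boinc folder>/projects or
    <boinc folder>/slots; a missing folder simply yields an empty list.
    """
    parent = Path(parent)
    if not parent.is_dir():
        return []
    return sorted(
        d.name
        for d in parent.iterdir()
        if d.is_dir() and not any(d.iterdir())
    )
```

Calling `empty_subdirs("projects")` from inside the boinc folder would list every project folder that has lost its files, and `empty_subdirs("slots")` does the same for the slots.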

Simply enabling network activity - even with everything suspended so no network activity was actually trying to occur - immediately gave the SIGABRT as before.

My next step was to download a new copy of the boinc program. Putting the new copy in my existing boinc folder and executing it gives something that may be a bit more useful than pages of "SIGABRT". It's nothing but gibberish to me, but maybe someone will get something from it. With the new copy, when I enable network activity I get the following:

2006-02-23 01:18:21 [---] request_reschedule_cpus: Resuming activities
2006-02-23 01:19:29 [---] request_reschedule_cpus: project op
2006-02-23 01:19:50 [---] Resuming network activity
SIGSEGV: segmentation violation
Stack trace (10 frames):
./boinc[0x80845b2]
[0xffffe420]
./boinc[0x805e21c]
./boinc[0x8079e9e]
./boinc[0x8066fb9]
./boinc[0x8058110]
./boinc[0x8078819]
./boinc[0x807895f]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xd2)[0xb7e6cea2]
./boinc(shmat+0x59)[0x804bf21]

Exiting...
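
The raw addresses in the trace above are only useful once they are pulled out per module. A rough Python sketch for doing that follows; it assumes each frame looks like `module(symbol+offset)[0xADDR]`, with the module and symbol parts optional, which matches the lines above. The function names are hypothetical.

```python
import re

# One frame per line: optional module path, optional "(symbol+offset)",
# then the address in square brackets.
FRAME_RE = re.compile(
    r"^(?P<module>[^\[\(]*)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(trace_text):
    """Return a list of (module, symbol, address) tuples, one per stack frame.

    Lines that are not frames (timestamps, the SIGSEGV banner) are skipped.
    module and symbol are None when the trace does not give them.
    """
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                (
                    match.group("module") or None,
                    match.group("symbol"),
                    int(match.group("addr"), 16),
                )
            )
    return frames


def addresses_in(frames, module):
    """Return the hex addresses of all frames that belong to `module`."""
    return [hex(addr) for mod, _sym, addr in frames if mod == module]
```

The addresses returned by `addresses_in(frames, "./boinc")` could then be fed to binutils, e.g. `addr2line -f -e ./boinc 0x80845b2`, though that only produces function names if the binary still carries symbol information.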


If I launch the new download in a clean folder, it starts up ok and brings me to the "attach to a project" point. So at that point I shut it back down and copied a couple of the account_*** files into the folder and it contacted the projects and seems to be good to go.

I'll dig through the old files tomorrow and see if I can find anything that looks out of whack.


Darren

Joined: 16 Feb 06
Posts: 5
Credit: 777
RAC: 0
Message 545 - Posted: 24 Feb 2006, 1:38:05 UTC
Last modified: 24 Feb 2006, 1:41:26 UTC

Well, it doesn't seem like I'm going to get any further with this. The output files in the rosetta slot all just end abruptly with no apparent errors. The output files in the ralph slot (stdout* & stderr*) all exist but are empty, even though the ralph wu ran somewhere in the neighborhood of 6 hours of 8. The boinc stderr* and stdout* files are completely trashed, containing nothing but fragments of documents and such that were in memory, scattered among the normal boinc entries, and much worse than I've ever seen before (instead of several lines of something, there are only several characters before it changes to something else). I can find nothing in the file indicating any kind of error before the explosion.

My client_state file seems fully intact and confirms my active tasks at the error were rosetta and ralph. Here's that part of the file:

<active_task>
    <project_master_url>https://boinc.bakerlab.org/rosetta/</project_master_url>
    <result_name>PRODUCTION_ABINITIO_INCREASECYCLES50_1npsA_317_662_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>481</app_version_num>
    <slot>2</slot>
    <scheduler_state>2</scheduler_state>
    <checkpoint_cpu_time>36037.556206</checkpoint_cpu_time>
    <fraction_done>0.379433</fraction_done>
    <current_cpu_time>36329.966480</current_cpu_time>
    <vm_bytes>0.000000</vm_bytes>
    <rss_bytes>0.000000</rss_bytes>
</active_task>
<active_task>
    <project_master_url>https://ralph.bakerlab.org/</project_master_url>
    <result_name>BARCODE_30_1scjB_215_50_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>484</app_version_num>
    <slot>1</slot>
    <scheduler_state>2</scheduler_state>
    <checkpoint_cpu_time>21994.982600</checkpoint_cpu_time>
    <fraction_done>0.763715</fraction_done>
    <current_cpu_time>21995.278618</current_cpu_time>
    <vm_bytes>0.000000</vm_bytes>
    <rss_bytes>0.000000</rss_bytes>
</active_task>
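
To read those entries without squinting at raw seconds, here is a small Python sketch that turns the `<active_task>` blocks into a summary. It relies only on the tag names visible above; the function name and the output keys (`cpu_hours`, `unsaved_seconds`) are invented for illustration, and the fragment is wrapped in a dummy root element because it has several top-level elements.

```python
import xml.etree.ElementTree as ET


def summarize_active_tasks(xml_fragment):
    """Summarize each <active_task> block in a client_state fragment.

    Returns one dict per task with the result name, slot, fraction done,
    CPU time in hours, and the CPU seconds accumulated since the last
    checkpoint (current_cpu_time minus checkpoint_cpu_time).
    """
    # Wrap in a dummy root so a fragment with several siblings parses.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    rows = []
    for task in root.iter("active_task"):
        cpu_seconds = float(task.findtext("current_cpu_time"))
        checkpoint_seconds = float(task.findtext("checkpoint_cpu_time"))
        rows.append(
            {
                "result": task.findtext("result_name"),
                "slot": int(task.findtext("slot")),
                "fraction_done": float(task.findtext("fraction_done")),
                "cpu_hours": cpu_seconds / 3600.0,
                "unsaved_seconds": cpu_seconds - checkpoint_seconds,
            }
        )
    return rows
```

Run on the two blocks above, this puts the ralph task at roughly 6.1 CPU hours and the rosetta task at roughly 10.1, with both tasks having checkpointed less than five minutes of CPU time before the crash.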


The only other thing I can find out of place is a file in my main boinc directory that I don't recognize. The file name is blcSgNo5S, and it's just an empty text file. I don't think it has anything to do with anything, I just have no clue where it came from.

There have been no other problems with my system before or after this, so it definitely seems confined to boinc. I did upgrade my kernel today - you know, why limit the things that can cause problems to only one at a time :)

Anyway, my new install has run several seti and einstein units now (in both the old and new kernel), so, being a glutton for punishment, I've just "unsuspended" ralph and let it download some work, but I still have rosetta suspended. Hopefully, we won't get to have all this fun again.





©2024 University of Washington
http://www.bakerlab.org