Posts by Darren

1) Message boards : RALPH@home bug list : Dead boinc (Message 545)
Posted 24 Feb 2006 by Darren
Post:
Well, it doesn't seem like I'm going to get any further with this. The output files in the rosetta slot all just end abruptly with no apparent errors. The output files in the ralph slot (stdout* & stderr*) all exist but are empty, even though the ralph WU ran somewhere in the neighborhood of 6 hours of 8. The boinc stderr* and stdout* files are completely trashed: fragments of documents and other things that were in memory are scattered among the normal boinc entries, and it's much worse than I've ever seen before (instead of several lines of something, there are only several characters before it changes to something else). I can find nothing in the files indicating any kind of error before the explosion.

My client_state file seems fully intact and confirms my active tasks at the error were rosetta and ralph. Here's that part of the file:

<active_task>
    <project_master_url>http://boinc.bakerlab.org/rosetta/</project_master_url>
    <result_name>PRODUCTION_ABINITIO_INCREASECYCLES50_1npsA_317_662_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>481</app_version_num>
    <slot>2</slot>
    <scheduler_state>2</scheduler_state>
    <checkpoint_cpu_time>36037.556206</checkpoint_cpu_time>
    <fraction_done>0.379433</fraction_done>
    <current_cpu_time>36329.966480</current_cpu_time>
    <vm_bytes>0.000000</vm_bytes>
    <rss_bytes>0.000000</rss_bytes>
</active_task>
<active_task>
    <project_master_url>http://ralph.bakerlab.org/</project_master_url>
    <result_name>BARCODE_30_1scjB_215_50_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>484</app_version_num>
    <slot>1</slot>
    <scheduler_state>2</scheduler_state>
    <checkpoint_cpu_time>21994.982600</checkpoint_cpu_time>
    <fraction_done>0.763715</fraction_done>
    <current_cpu_time>21995.278618</current_cpu_time>
    <vm_bytes>0.000000</vm_bytes>
    <rss_bytes>0.000000</rss_bytes>
</active_task>
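
For anyone who wants to pull the same numbers out of a client_state file, here is a minimal sketch, not anything the BOINC client itself provides. It assumes the `<active_task>` entries have been copied out as text (trimmed here to the fields it reads); the function name `summarize_active_tasks` and the "unsaved CPU seconds" figure (current CPU time minus checkpoint CPU time) are illustrative choices, not BOINC terminology.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two <active_task> entries quoted above.
FRAGMENT = """
<active_task>
    <result_name>PRODUCTION_ABINITIO_INCREASECYCLES50_1npsA_317_662_0</result_name>
    <slot>2</slot>
    <checkpoint_cpu_time>36037.556206</checkpoint_cpu_time>
    <fraction_done>0.379433</fraction_done>
    <current_cpu_time>36329.966480</current_cpu_time>
</active_task>
<active_task>
    <result_name>BARCODE_30_1scjB_215_50_0</result_name>
    <slot>1</slot>
    <checkpoint_cpu_time>21994.982600</checkpoint_cpu_time>
    <fraction_done>0.763715</fraction_done>
    <current_cpu_time>21995.278618</current_cpu_time>
</active_task>
"""


def summarize_active_tasks(fragment):
    """Return one dict per <active_task> (hypothetical helper, not a BOINC API)."""
    # The fragment has several top-level elements, so wrap it in a dummy root.
    root = ET.fromstring("<root>" + fragment + "</root>")
    tasks = []
    for task in root.iter("active_task"):
        checkpoint = float(task.findtext("checkpoint_cpu_time"))
        current = float(task.findtext("current_cpu_time"))
        tasks.append({
            "result": task.findtext("result_name"),
            "slot": int(task.findtext("slot")),
            "fraction_done": float(task.findtext("fraction_done")),
            # CPU time accumulated since the last recorded checkpoint.
            "unsaved_cpu_seconds": current - checkpoint,
        })
    return tasks


if __name__ == "__main__":
    for t in summarize_active_tasks(FRAGMENT):
        print(t["slot"], t["result"], t["fraction_done"], round(t["unsaved_cpu_seconds"], 3))
```

On these values, the rosetta task had roughly 292 CPU seconds of work since its last checkpoint, and the ralph task well under one second.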


The only other thing I can find out of place is a file in my main boinc directory that I don't recognize. The file name is blcSgNo5S, and it's just an empty text file. I don't think it has anything to do with anything, I just have no clue where it came from.

There have been no other problems with my system before or after this, so it definitely seems confined to boinc. I did upgrade my kernel today - you know, why limit the things that can cause problems to only one at a time :)

Anyway, my new install has run several seti and einstein units now (in both the old and new kernel), so, being a glutton for punishment, I've just "unsuspended" ralph and let it download some work, but I still have rosetta suspended. Hopefully, we won't get to have all this fun again.

2) Message boards : RALPH@home bug list : Dead boinc (Message 518)
Posted 23 Feb 2006 by Darren
Post:
OK, I got at least one step further here.

In digging a bit deeper, I was able to determine that the SIGABRT is somehow related to network activity being enabled. When I disabled all network activity in addition to all the projects and workunits being suspended, boinc would restart. Enabling work again for ralph and rosetta gave me the following:

Thu 23 Feb 2006 12:49:26 AM EST|rosetta@home|ACTIVE_TASKS::restart_tasks(); missing files
Thu 23 Feb 2006 12:49:26 AM EST|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_INCREASECYCLES50_1npsA_317_662_0 (One or more missing files)
Thu 23 Feb 2006 12:49:26 AM EST|ralph@home|ACTIVE_TASKS::restart_tasks(); missing files
Thu 23 Feb 2006 12:49:26 AM EST|ralph@home|Unrecoverable error for result BARCODE_30_1scjB_215_50_0 (One or more missing files)
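
Messages in this format are pipe-separated (timestamp, project, text), so the failed results can be listed per project with a short script. This is only a sketch: the function name `failed_results` is made up for illustration, and it assumes every line follows the same three-field layout as the four lines above.

```python
import re
from collections import defaultdict

# The four message lines quoted above.
LOG = """\
Thu 23 Feb 2006 12:49:26 AM EST|rosetta@home|ACTIVE_TASKS::restart_tasks(); missing files
Thu 23 Feb 2006 12:49:26 AM EST|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_INCREASECYCLES50_1npsA_317_662_0 (One or more missing files)
Thu 23 Feb 2006 12:49:26 AM EST|ralph@home|ACTIVE_TASKS::restart_tasks(); missing files
Thu 23 Feb 2006 12:49:26 AM EST|ralph@home|Unrecoverable error for result BARCODE_30_1scjB_215_50_0 (One or more missing files)
"""

# Matches the "Unrecoverable error" message text and captures result name and reason.
UNRECOVERABLE = re.compile(r"^Unrecoverable error for result (\S+) \((.*)\)$")


def failed_results(log_text):
    """Group (result name, reason) pairs by project (hypothetical helper)."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # Skip anything that is not timestamp|project|message.
        _stamp, project, message = parts
        match = UNRECOVERABLE.match(message)
        if match:
            failures[project].append((match.group(1), match.group(2)))
    return dict(failures)


if __name__ == "__main__":
    for project, items in failed_results(LOG).items():
        for name, reason in items:
            print(project, name, reason)
```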

In digging into that, I see that every single project folder on my system is totally empty. The 2 slots still had the files for those units, but no files in the project folders (for any project).

Simply enabling network activity - even with everything suspended so no network activity was actually trying to occur - immediately gave the SIGABRT as before.

My next step was to download a new copy of the boinc program. Putting the new copy in my existing boinc folder and executing it gives something that may be a bit more useful than pages of "SIGABRT". To me it's nothing but gibberish, but maybe someone will get something from it. With the new copy, when I enable network activity I get the following:

2006-02-23 01:18:21 [---] request_reschedule_cpus: Resuming activities
2006-02-23 01:19:29 [---] request_reschedule_cpus: project op
2006-02-23 01:19:50 [---] Resuming network activity
SIGSEGV: segmentation violationStack trace (10 frames):
./boinc[0x80845b2]
[0xffffe420]
./boinc[0x805e21c]
./boinc[0x8079e9e]
./boinc[0x8066fb9]
./boinc[0x8058110]
./boinc[0x8078819]
./boinc[0x807895f]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xd2)[0xb7e6cea2]
./boinc(shmat+0x59)[0x804bf21]

Exiting...
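
One way to make a trace like this less opaque is to feed the `./boinc` addresses to binutils' `addr2line`. The sketch below only extracts the addresses and builds the command line; it does not run anything. It assumes the frames look exactly like the ones quoted above, and `frames_for` is a name invented for this example. The output of `addr2line` is only meaningful if the binary carries symbol or debug information, which may not be the case for a downloaded release build.

```python
import re

# The ten frames quoted above.
TRACE = """\
./boinc[0x80845b2]
[0xffffe420]
./boinc[0x805e21c]
./boinc[0x8079e9e]
./boinc[0x8066fb9]
./boinc[0x8058110]
./boinc[0x8078819]
./boinc[0x807895f]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xd2)[0xb7e6cea2]
./boinc(shmat+0x59)[0x804bf21]
"""

# module path (may be empty), optional "(symbol+offset)", then "[0xADDRESS]".
FRAME = re.compile(
    r"^(?P<module>[^\[(]*)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def frames_for(trace, module):
    """Return the addresses of all frames that belong to `module`."""
    addrs = []
    for line in trace.splitlines():
        match = FRAME.match(line.strip())
        if match and match.group("module") == module:
            addrs.append(match.group("addr"))
    return addrs


if __name__ == "__main__":
    addrs = frames_for(TRACE, "./boinc")
    # -f prints function names, -e names the executable to look addresses up in.
    print(" ".join(["addr2line", "-f", "-e", "./boinc"] + addrs))
```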


If I launch the new download in a clean folder, it starts up ok and brings me to the "attach to a project" point. So at that point I shut it back down and copied a couple of the account_*** files into the folder and it contacted the projects and seems to be good to go.

I'll dig through the old files tomorrow and see if I can find anything that looks out of whack.

3) Message boards : RALPH@home bug list : Dead boinc (Message 517)
Posted 23 Feb 2006 by Darren
Post:
Had a ralph and a rosetta unit running and now have a dead boinc. Ralph was this unit and rosetta was this unit. Ralph was about 6 hours into 8, and rosetta was about 12 hours into 24. 4 seti units were in the cache also, but had not started any processing. Don't know which triggered the problem, but I suddenly got a screen full of nothing but pages of this:

abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT: abort calledSIGABRT:


Any attempt to restart boinc only generates the same thing. I can't catch any message yet at the beginning of the error - I can see the normal startup beginning (showing my projects, computer IDs, etc.) but the entire buffer then immediately fills with the above. I'll keep trying to catch the buffer and see if any message appears at onset. Thus far I have edited the client_state file to suspend all activity in every project and can now get boinc to restart. I'll keep digging and see if I can get any further than this.

4) Message boards : RALPH@home bug list : cpu_run_time_pref (Message 150)
Posted 17 Feb 2006 by Darren
Post:
This is more of a comment than a bug I guess, and it may be totally unimportant.

I had 2 WUs running on my HT machine with a run time setting of 4 hours. The second had 1 hour on it when the first was nearing completion. I then changed my online setting to 8 hours so any new work would run 8 hours instead of 4. To my surprise (and very neat), after reporting the first WU and doing the update triggered by that, it added 4 additional hours to the WU that was already underway.

My comment, though, is about the stamp in the stderr output file. There is no entry showing the time was ever set to increase, leaving only the entry for "cpu_run_time_pref: 14400". Considering some of the complaints/problems from standard Rosetta about WU run times and units getting stuck and such, should there not be an entry (if for no other reason than troubleshooting) showing that the cpu_run_time_pref was reset to 28,800 (or whatever it was really reset to)?

As it stands, just looking at this WU could lead to someone believing the program didn't work properly, as it says it should run 14,400 seconds, but it also says it really ran 29,014 seconds.
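
To put numbers on the mismatch, here is a small worked check. It assumes the preference really was reset to 28,800 seconds (8 hours), which the stderr output does not confirm; `run_time_report` is a hypothetical helper used only for this illustration.

```python
def run_time_report(pref_seconds, actual_seconds):
    """Compare a run time preference with the CPU time actually used."""
    return {
        "pref_hours": pref_seconds / 3600,
        "actual_hours": round(actual_seconds / 3600, 2),
        "overrun_seconds": actual_seconds - pref_seconds,
    }


if __name__ == "__main__":
    # What the stderr stamp says: 14,400 s preference, 29,014 s actually run.
    print(run_time_report(14400, 29014))
    # What the WU apparently used after the change (assumed 28,800 s).
    print(run_time_report(28800, 29014))
```

Against the logged 14,400 the WU appears to overrun by 14,614 seconds, while against an assumed 28,800 it overruns by only 214 seconds, which is the difference an extra stderr entry would make obvious.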

5) Message boards : Number crunching : Need for Linux testing? (Message 111)
Posted 17 Feb 2006 by Darren
Post:
Should I have my Linux machine attach to RALPH? (I've already joined the beta via WinXP)


There is work going out to linux machines, and there is a beta app for linux to crunch them. I got two WUs today.






