Rosetta min 1.03

Message boards : RALPH@home bug list : Rosetta min 1.03

Barraud Denis

Joined: 5 Apr 07
Posts: 3
Credit: 84,809
RAC: 0
Message 3607 - Posted: 14 Jan 2008, 7:02:24 UTC

This program is probably bugged. When these WUs start, the BOINC Manager freezes and stops working. The two BOINC processes stay in memory; they start calculating the WUs and stop immediately. The BOINC Manager is completely blocked and cannot be used, and no work gets done.

The only solution I found to recover BOINC is:
to kill the two BOINC processes in the XP Task Manager, and
to modify the file client_state.xml, adding in the <project> section of the ralph and rosetta projects, after <send_job_log>0</send_job_log>,
the line <suspended_via_gui/>
to stop the project.

The right fix would be to stop sending Rosetta mini 1.03 WUs, or to wait for the new version 1.04.
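
For anyone who prefers not to edit the file by hand, the steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes client_state.xml parses as well-formed XML, that each <project> block carries its URL in a <master_url> tag, and that matching on the keywords "ralph" and "rosetta" is good enough. The function name suspend_projects is made up for illustration. BOINC has to be fully stopped before the file is changed, and keeping a backup copy is a good idea.

```python
import xml.etree.ElementTree as ET


def suspend_projects(state_xml: str, url_keywords=("ralph", "rosetta")) -> str:
    """Return the state XML with <suspended_via_gui/> added to matching projects.

    Assumption: projects are identified by a <master_url> child element.
    """
    root = ET.fromstring(state_xml)
    for project in root.iter("project"):
        url = (project.findtext("master_url") or "").lower()
        if not any(keyword in url for keyword in url_keywords):
            continue
        if project.find("suspended_via_gui") is not None:
            continue  # already suspended, nothing to add
        children = list(project)
        log = project.find("send_job_log")
        # Place the flag right after <send_job_log>, or at the end if it is missing.
        index = children.index(log) + 1 if log is not None else len(children)
        project.insert(index, ET.Element("suspended_via_gui"))
    return ET.tostring(root, encoding="unicode")
```

Reading the file, calling the function on its text, and writing the result back is left out on purpose, since the location of the BOINC folder differs per install.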
ID: 3607
Dr Who Fan
Joined: 2 Sep 06
Posts: 76
Credit: 107,857
RAC: 0
Message 3610 - Posted: 14 Jan 2008, 8:13:54 UTC

I saw the same thing happen on one of my machines and had to use Task Manager to kill BOINC and all its subprocesses.
The strange thing is that the Rosetta min 1.03 task did not produce an error log in BOINC and seems to have restarted and be running OK. I'll keep watching my machines to see what happens when or if the tasks complete.
ID: 3610
Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 3613 - Posted: 14 Jan 2008, 14:39:19 UTC

Shortly after my Rosetta Mini 1.03 started, Kaspersky Antivirus began consuming 100% CPU for more than 2 minutes (no idea whether Rosetta Mini triggered it or something else, but I'd bet it did), approximately until Rosetta Mini exited. A bit later all other Boinc (5.10.35) applications exited too (possibly a missed heartbeat?), and Boinc Manager and the client were left unresponsive; the client did not respond to GUI RPCs. I tried to start another Boinc Manager, but it stayed somewhere in memory and its GUI did not appear.

After being left untouched for 3/4 of an hour, both Managers finally appeared and the client started the apps again.

Last lines from stdoutdae.txt:

.....
12:42:27 [Einstein@Home] [task_debug] result h1_0703.70_S5R2__151_S5R3a_0 checkpointed
12:43:00 [SETI@home] [task_debug] Process for 13dc06ab.1691.2117.6.6.33_0 exited
12:43:00 [SETI@home] [task_debug] task_state=EXITED for 13dc06ab.1691.2117.6.6.33_0 from handle_exited_app
12:43:00 [SETI@home] Computation for task 13dc06ab.1691.2117.6.6.33_0 finished
12:43:00 [SETI@home] [task_debug] result state=FILES_UPLOADING for 13dc06ab.1691.2117.6.6.33_0 from CS::app_finished
12:43:00 [ralph@home] Starting mini_abrelax-1c8cA-test_james_2821_6_2
12:43:00 [ralph@home] [cpu_sched] Starting mini_abrelax-1c8cA-test_james_2821_6_2 (initial)
12:43:01 [ralph@home] [task_debug] task_state=EXECUTING for mini_abrelax-1c8cA-test_james_2821_6_2 from start
12:43:01 [ralph@home] Starting task mini_abrelax-1c8cA-test_james_2821_6_2 using minirosetta version 103
12:45:00 [DepSpid] [task_debug] result spider_153965_0 checkpointed
12:48:41 [ralph@home] [task_debug] Process for mini_abrelax-1c8cA-test_james_2821_6_2 exited
12:48:41 [ralph@home] [task_debug] task_state=EXITED for mini_abrelax-1c8cA-test_james_2821_6_2 from handle_exited_app
12:48:41 [ralph@home] [sched_op_debug] Deferring communication for 20 min 1 sec
12:48:41 [ralph@home] [sched_op_debug] Reason: Unrecoverable error for result mini_abrelax-1c8cA-test_james_2821_6_2 ( - exit code -1073741819 (0xc0000005))
12:48:41 [ralph@home] [task_debug] result state=COMPUTE_ERROR for mini_abrelax-1c8cA-test_james_2821_6_2 from CS::report_result_error
12:48:41 [ralph@home] [task_debug] Process for mini_abrelax-1c8cA-test_james_2821_6_2 exited
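
A side note on the exit code in the log above: -1073741819 and 0xc0000005 are the same 32-bit value, read once as signed and once as unsigned. On Windows, 0xC0000005 is the access violation status, so the app died on a bad memory access. A small Python sketch of the conversion (the helper name is made up for illustration):

```python
def as_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(as_ntstatus(-1073741819))  # prints 0xc0000005
```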


Because I left Boinc running, it could finally flush its internal buffers after the long delay and output some additional error lines. These came after the resurrection:

12:48:41 [ralph@home] [task_debug] exit code -1073741819 (0xc0000005):
13:12:48 [ralph@home] Computation for task mini_abrelax-1c8cA-test_james_2821_6_2 finished
13:12:48 [ralph@home] Output file mini_abrelax-1c8cA-test_james_2821_6_2_0 for task mini_abrelax-1c8cA-test_james_2821_6_2 absent
13:12:48 [ralph@home] [task_debug] result state=COMPUTE_ERROR for mini_abrelax-1c8cA-test_james_2821_6_2 from CS::app_finished
13:36:41 [SETI@home] Starting 13dc06ab.1691.2117.6.6.189_0
13:36:41 [SETI@home] [cpu_sched] Starting 13dc06ab.1691.2117.6.6.189_0 (initial)
13:36:41 [SETI@home] [task_debug] task_state=EXECUTING for 13dc06ab.1691.2117.6.6.189_0 from start
13:36:41 [SETI@home] Starting task 13dc06ab.1691.2117.6.6.189_0 using setiathome_enhanced version 527
....


All other results from the WU 647168 crashed, with lots of error lines like:

00307e28 0202f099 0202f09a 0202f09b 0202f09c 0202f09d minirosetta_1.03_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '0202f098'
00307e2c 0202f09a 0202f09b 0202f09c 0202f09d 0202f09e minirosetta_1.03_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '0202f099'
00307e30 0202f09b 0202f09c 0202f09d 0202f09e 0202f09f minirosetta_1.03_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '0202f09a'


My result 733546 produced BOINC Windows Runtime Debugger output instead.

Peter
ID: 3613
[B^S] sTrey
Joined: 15 Feb 06
Posts: 58
Credit: 15,430
RAC: 0
Message 3614 - Posted: 15 Jan 2008, 5:13:25 UTC

Other experiences in another thread.
This app did nasty things to 2/3 of my hosts, so Ralph is set to NNW. Could the project people please say when we can allow work again without the risk of running the toxic "mini" app? Thanks
ID: 3614
Dr Who Fan
Joined: 2 Sep 06
Posts: 76
Credit: 107,857
RAC: 0
Message 3615 - Posted: 15 Jan 2008, 11:57:57 UTC

I have had it with the RALPH@home project!
The "Rosetta min 1.03" app crashed my whole PC - I have just spent 5+ hours rebuilding it from install disks and backups.
I have detached all 3 of my machines from the project as a result and will not reattach.



ID: 3615
Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 3616 - Posted: 15 Jan 2008, 15:54:46 UTC

My Ralph's "slots\2" contained a folder "minirosetta_database" with another 69 subfolders and 240 files, 634 kB in total. After startup, the Boinc client seemed to be busy with these files for a very long time. One minute after startup and several unsuccessful attempts to connect, the Manager asked whether to try to connect again...

And the client was still checking Ralph's files. For an hour...

It was enough to remove the "..\Boinc\slots\2\minirosetta_database" tree; everything was then fine.

Peter
ID: 3616
dekim
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3618 - Posted: 15 Jan 2008, 16:58:37 UTC

We are currently looking into this bug. In order to debug this new app, we will have to send out more work units. Sorry for all the troubles.
ID: 3618
Luuklag
Joined: 5 Jan 08
Posts: 15
Credit: 80
RAC: 0
Message 3621 - Posted: 15 Jan 2008, 18:11:29 UTC

I set my runtime to 2 hours, but Boinc says the expected runtime is 5,30 hours, which is over 2.5 times the enforced runtime. This can't be any good, can it?
ID: 3621
Luuklag
Joined: 5 Jan 08
Posts: 15
Credit: 80
RAC: 0
Message 3623 - Posted: 15 Jan 2008, 21:17:28 UTC

Well, that WU locks up the whole of Boinc. Is there any way I can pause Ralph and continue Rosie until this is fixed?
ID: 3623
Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 3624 - Posted: 15 Jan 2008, 22:49:10 UTC - in response to Message 3618.  

dekim wrote:
We are currently looking into this bug.

Actually a harmless issue with remarkable consequences :-)

Boinc devs wrote:
Here's what was happening:

The minirosetta app in Ralph unzipped an archive into its slot directory. This archive (accidentally) included some .svn directories (Subversion stuff) whose contents were flagged as read-only.

When the job finished or was aborted, and the BOINC client tried to clean out the slot directory, the delete of each of these files would fail. The client would wait for 5 seconds and try again (it does this because sometimes files are locked temporarily by virus checkers or other disk-scan apps).
This would happen for each file, resulting in a 10-20 minute period during which the client and Manager appear hung.
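
A rough back-of-the-envelope check of that 10-20 minute figure, assuming exactly one 5-second wait per undeletable file (the number of retries is not stated in the explanation, so that is an assumption). The 240 is the total file count reported earlier in this thread for one minirosetta_database folder, used here only as an upper bound for the number of read-only files:

```python
RETRY_WAIT_SECONDS = 5  # wait between delete attempts, per the explanation above


def hang_minutes(read_only_files: int, waits_per_file: int = 1) -> float:
    """Estimate how long the client appears hung while cleaning a slot directory."""
    return read_only_files * waits_per_file * RETRY_WAIT_SECONDS / 60


print(hang_minutes(120))  # prints 10.0
print(hang_minutes(240))  # prints 20.0
```

So somewhere between 120 and 240 stuck files would be enough to explain the reported hang, if the one-wait-per-file assumption holds.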


dekim wrote:
In order to debug this new app, we will have to send out more work units. Sorry for all the troubles.

I think... for whoever is willing to continue testing Rosetta mini: if Boinc hangs next time, go to "..\Boinc\slots\*\minirosetta_database" and delete all read-only files in the subdirectories. (Actually just those in the "..\.svn" subdirectories, but it may amount to the same if just one Rosetta mini WU was running.) Boinc will immediately continue its work.

Peter
ID: 3624
dekim
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3627 - Posted: 16 Jan 2008, 0:20:03 UTC - in response to Message 3624.  

When I said "we", I mainly meant David Anderson. I had to create more workunits for David Anderson to debug the client. Once he was able to get a work unit, he responded right away with the cause and checked in a fix. There are some non-boinc client related issues with mini also.

If you are experiencing this problem, manually remove the minirosetta_database directory in the slots directory.


This is David Anderson's reply:

"Here's what was happening:

The minirosetta app in Ralph unzipped an archive into its slot directory.
This archive (accidentally) included some .svn directories
(Subversion stuff) whose contents were flagged as read-only.

When the job finished or was aborted,
and the BOINC client tried to clean out the slot directory,
the delete of each of these files would fail.
The client would wait for 5 seconds and try again
(it does this because sometimes files are locked temporarily
by virus checkers or other disk-scan apps).
This would happen for each file, resulting in a 10-20 minute period
during which the client and Manager appear hung.

I made two changes that fix this (and hopefully avoid similar
problems in the future) as follows:

1) if a file delete fails with error ERROR_ACCESS_DENIED,
use SetFileAttributes() to clear the read-only flag, then try again.
2) Don't use the 5-second retry mechanism when clearing out
slot directories. These can contain unbounded numbers of files,
and this can lead to long periods where the client appears hung.

Rom, please back-port this to 5.10

-- DPA "
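
The two changes described in the quoted reply could look roughly like this in Python. This is not the BOINC client code (which, per the reply, calls SetFileAttributes() on Windows); it is only a sketch of the same idea, with os.chmod standing in for clearing the read-only flag, and the function names are invented for illustration:

```python
import os
import stat


def delete_file(path: str) -> None:
    """Delete a file; on access denied, clear the read-only flag and retry once."""
    try:
        os.remove(path)
    except PermissionError:
        # Change 1: make the file writable, then try again right away.
        os.chmod(path, stat.S_IWRITE)
        os.remove(path)


def clean_slot_dir(slot_dir: str) -> None:
    """Empty a slot directory bottom-up, without any sleep-and-retry loop."""
    # Change 2: no 5-second waits here, since the file count is unbounded.
    for dirpath, dirnames, filenames in os.walk(slot_dir, topdown=False):
        for name in filenames:
            delete_file(os.path.join(dirpath, name))
        for name in dirnames:
            os.rmdir(os.path.join(dirpath, name))
```

The slot directory itself is left in place; only its contents are removed.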



ID: 3627
Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 3631 - Posted: 16 Jan 2008, 8:13:18 UTC - in response to Message 3627.  

If you are experiencing this problem, manually remove the minirosetta_database directory in the slots directory.

Is it safe to say that, as soon as the WU starts, it is enough to clear the read-only flag on all files in the minirosetta_database folder tree? The Boinc client should then be able to handle them correctly.
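
For those who want to try clearing the flags instead of deleting the folder, here is a minimal Python sketch, assuming it is pointed at the minirosetta_database folder inside the slot directory. The function name is hypothetical, and whether doing this while the WU runs is safe is exactly the open question above:

```python
import os
import stat


def clear_read_only(root: str) -> int:
    """Make every file under root writable; return how many files were changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if not mode & stat.S_IWRITE:
                # Add the owner-write bit; on Windows this clears the read-only flag.
                os.chmod(path, mode | stat.S_IWRITE)
                changed += 1
    return changed
```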

Peter
ID: 3631
Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 3632 - Posted: 16 Jan 2008, 9:26:19 UTC

After a complete failure with the minirosetta, I deleted the database as advised. I am now back in control of my computer and haven't lost the Rosetta jobs I was working on.

I thought something was going wrong when there were no graphics and it failed after 8%.

I hope next time is less exciting!!
ID: 3632
Jack Shaftoe
Joined: 8 Aug 06
Posts: 13
Credit: 105,163
RAC: 0
Message 3635 - Posted: 16 Jan 2008, 11:44:16 UTC - in response to Message 3624.  

if Boinc will hang next time, go to "..\Boinc\slots\*\minirosetta_database" and delete all read-only files in the subdirectories. (Actually just in "..\.svn" subdirectories, but maybe it is the same, if just one Rosetta mini WU was running.) Boinc will immediately continue its work.

Peter


Worked for me on my 2 machines with this problem. Thanks,
ID: 3635
Luuklag
Joined: 5 Jan 08
Posts: 15
Credit: 80
RAC: 0
Message 3636 - Posted: 16 Jan 2008, 15:34:10 UTC

ok :D
My Boinc is running again, after manually exiting it with the good old Ctrl+Alt+Del :)
I removed the library, which my virus scanner didn't really like, but I got that bypassed and now it's running again :)

Ralph ain't a project for newbies, that's for sure.
ID: 3636
BigMike
Joined: 23 Feb 06
Posts: 63
Credit: 58,730
RAC: 0
Message 3642 - Posted: 17 Jan 2008, 8:30:31 UTC

Yeow ... talk about ugly. I had to abort the two mini WU's I had (sorry) to keep everything from locking up. But I couldn't have done it without Barraud Denis's advice (thanks!):
add in the <project> section of ralph and rosetta project after <send_job_log>0</send_job_log>
the line <suspended_via_gui/>
to stop the project


And I guess some people don't understand that Alpha means Alpha:
I have detached all 3 of my machines from the project as a result and will not reattach

Too bad.

However, it's nice to know we're a fairly competent group of people and can get around the occasional challenge!

==Mike



Don't believe everything you think.
ID: 3642
Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 3645 - Posted: 18 Jan 2008, 10:59:17 UTC
Last modified: 18 Jan 2008, 11:01:24 UTC

Those 'mini's stuffed my computer. I have spent over two hours trying to get things going again.
I tried restarting Boinc Manager, this failed.
I tried to reboot and then start BM, this failed (it rebooted twice before Windows started).
I tried to reinstall Boinc 5.10.35 but this failed.
I tried to install Boinc 5.8.16 and this worked until I added back my projects, then it failed.
I totally removed and reinstalled Boinc (various versions); all attempts failed.
Removed again and downloaded 5.10.38, amazingly this worked and all files kept going and are still running.

I will check those other suggestions above to get rid of ralph mini.

This is possibly the worst Ralph release to date.

WU 735674
WU 735537
ID: 3645
Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 3646 - Posted: 18 Jan 2008, 11:08:20 UTC - in response to Message 3645.  

Removed again and downloaded 5.10.38, amazingly this worked and all files kept going and are still running.

I will check those other suggestions above to get rid of ralph mini.

With 5.10.38 you are done.

Peter
ID: 3646
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3647 - Posted: 18 Jan 2008, 14:34:00 UTC

...and without 5.10.38, you just have to follow Pepo's tip to remove the read-only attributes.

It "stuffed" my computer as well. I run Windows. Symptom was that it seemed to hang BOINC Manager, it was no longer communicating and nothing was running. A restart didn't clear up the situation (because the slots directories were in an invalid state).
ID: 3647
zombie67 [MM]
Joined: 8 Aug 06
Posts: 75
Credit: 2,396,363
RAC: 6,299
Message 3648 - Posted: 18 Jan 2008, 21:55:31 UTC

The minirosetta app in Ralph unzipped an archive into its slot directory.
This archive (accidentally) included some .svn directories
(Subversion stuff) whose contents were flagged as read-only.


Has this been fixed in the mini tasks and/or application yet?

Was this ever a problem with the OS X version?
Reno, NV
Team: SETI.USA
ID: 3648

©2024 University of Washington
http://www.bakerlab.org