rosetta_beta_4.83_i686-pc-linux-gnu -> frozen

Message boards : Number crunching : rosetta_beta_4.83_i686-pc-linux-gnu -> frozen

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 181 - Posted: 18 Feb 2006, 3:11:51 UTC
Last modified: 18 Feb 2006, 3:24:35 UTC

[root@crobertp root]# top
1:06am up 4 days, 6:46, 3 users, load average: 0.03, 0.05, 0.34
121 processes: 119 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: 0.0% user, 0.3% system, 0.0% nice, 99.6% idle
Mem: 248164K av, 242340K used, 5824K free, 0K shrd, 26016K buff
180816K actv, 0K in_d, 4772K in_c, 44280K target
Swap: 1020088K av, 94304K used, 925784K free 61204K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM CTIME COMMAND
1968 boinc 9 0 3624 3528 1756 S 0.0 1.4 5656m ./boinc -redirectio -allow_remote_gui_rpc
27682 boinc 19 19 1240 1240 944 S N 0.0 0.4 0:00 /bin/bash ./yasuc.sh
28607 boinc 19 19 62128 43M 5116 S N 0.0 18.1 29:55 rosetta_beta_4.83_i686-pc-linux-gnu cc 1tig _ -relax -stringen
28608 boinc 19 19 62128 43M 5116 S N 0.0 18.1 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1tig _ -relax -stringen
28609 boinc 19 19 62128 43M 5116 S N 0.0 18.1 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1tig _ -relax -stringen
29298 boinc 9 0 2484 2436 2212 S 0.0 0.9 0:00 /usr/sbin/sshd
29300 boinc 9 0 2352 2348 1200 S 0.0 0.9 0:01 -bash
29864 boinc 19 19 624 624 548 S N 0.0 0.2 0:00 sleep 600

crobertp [/home/boinc/BOINC] > cat /proc/version
Linux version 2.4.21-31301U90_4cl (andreas@buildmaster.distro.conectiva) (gcc version 3.2.2) #1 Qui Jun 26 01:44:43 BRT 2003

When will the ice melt?
When will this WU be done?
I am thinking of aborting it ... is that OK?
What else can I do?

11) -----------
name: BARCODE_30_1tig__NATIVE_210_39_0
WU name: BARCODE_30_1tig__NATIVE_210_39
project URL: http://ralph.bakerlab.org/
report deadline: Fri Feb 24 17:44:01 2006
ready to report: no
got server ack: no
final CPU time: 0.000000
state: 2
scheduler state: 2
exit_status: 0
signal: 0
suspended via GUI: no
aborted via GUI: no
active_task_state: 1
stderr_out:
app version num: 483
checkpoint CPU time: 1439.690000
current CPU time: 1763.480000
fraction done: 0.199956
VM usage: 0.000000
resident set size: 0.000000
estimated CPU time remaining: 9214.806238
supports graphics: no
12) -----------
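
One way to confirm that a task like this is frozen rather than just slow is to sample its accumulated CPU time twice and check that it has not advanced. A minimal sketch, assuming a Linux /proc filesystem; the helper names `cpu_ticks`, `read_ticks` and `is_frozen` and the sample stat line are made up for illustration and do not come from BOINC or from this machine:

```python
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the last ')' before counting fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is the state (field 3); utime is field 14, stime is field 15.
    return int(rest[11]) + int(rest[12])


def read_ticks(pid):
    """Read the current CPU tick count of a live process."""
    with open(f"/proc/{pid}/stat") as f:
        return cpu_ticks(f.read())


def is_frozen(pid, interval=10.0):
    """True if the process used no CPU time during `interval` seconds."""
    before = read_ticks(pid)
    time.sleep(interval)
    return read_ticks(pid) == before


# Hypothetical stat line, only to show the parsing.
sample = "28607 (rosetta_beta_4.) S 1968 1968 1968 0 -1 64 100 0 5 0 179500 50 0 0 19 19"
print(cpu_ticks(sample))
```

A process that is legitimately waiting on something will also show no change, so a True result only says the task is not computing, not why.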

Carlos_Pfitzner
Message 184 - Posted: 18 Feb 2006, 3:56:46 UTC
Last modified: 18 Feb 2006, 4:02:35 UTC

Well, I changed this setting from yes to no:

Leave applications in memory while preempted?

The other PCs at location "work" are running only 1 project,
so they should not be affected by this setting :-)

On this PC I am running ralph & boincsimap, and the simap app exits anyway
every time it is swapped out
--> so it should not perform any worse with no than with yes.

Then I refreshed the ralph preferences, and afterwards killed boinc & restarted it.

Let's wait the next couple of hours to see what happens.

Carlos_Pfitzner
Message 193 - Posted: 18 Feb 2006, 6:28:29 UTC
Last modified: 18 Feb 2006, 6:33:54 UTC

The job crunched, but after some time it errored!

2006-02-18 02:41:22 [boincsimap] Finished download of 200601277.032211
2006-02-18 02:41:22 [boincsimap] Throughput 28987 bytes/sec
2006-02-18 02:41:23 [---] request_reschedule_cpus: files downloaded
2006-02-18 02:41:23 [ralph@home] Restarting result BARCODE_30_1tig__NATIVE_210_39_0 using rosetta_beta version 483
2006-02-18 02:41:23 [boincsimap] Pausing result 200601277.029660_1 (removed from memory)
4 of 4 test sequences read
3026 of 3026 database sequences read
2429 of 2429 query sequences read
2006-02-18 02:41:24 [---] request_reschedule_cpus: process exited
2006-02-18 03:29:56 [ralph@home] Unrecoverable error for result BARCODE_30_1tig__NATIVE_210_39_0 (process exited with code 131 (0x83))
2006-02-18 03:29:56 [---] request_reschedule_cpus: process exited
2006-02-18 03:29:56 [ralph@home] Computation for result BARCODE_30_1tig__NATIVE_210_39_0 finished
2006-02-18 03:29:56 [boincsimap] Restarting result 200601277.029660_1 using simap version 507
2006-02-18 03:30:56 [ralph@home] Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
2006-02-18 03:30:56 [ralph@home] Reason: To report results
2006-02-18 03:30:56 [ralph@home] Reporting 1 results
2006-02-18 03:31:16 [ralph@home] Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded

------------------------------

Exit status 131 (0x83)
Computer ID 459
Report deadline 24 Feb 2006 20:44:01 UTC
CPU time 4296.17
stderr out <core_client_version>5.2.14</core_client_version>
<message>process exited with code 131 (0x83)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 3997032
[0x8749543]
[0x8761a7c]
[0x87c79c8]
[0x87e23ac]
[0x87e3c7d]
[0x87b2947]
[0x87b43e1]
[0x85eba8f]
[0x84237c0]
[0x84242bf]
[0x8424ee3]
[0x8432f1d]
[0x8434df2]
[0x86e7593]
[0x85f2184]
[0x85f3808]
[0x83ee90e]
[0x83f130b]
[0x87c0d74]
[0x8048121]
# cpu_run_time_pref: 7200
[0x8749543]
[0x8761a7c]
[0x87c79c8]
[0x87422b8]
[0x856a1bf]
[0x845982c]
[0x8441ff8]
[0x8445190]
[0x86e7091]
[0x85f22c5]
[0x85f3808]
[0x83ee90e]
[0x83f130b]
[0x87c0d74]
[0x8048121]
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violationStack trace (15 frames):

Exiting...

</stderr_txt>


Validate state Invalid
Claimed credit 12.7700238041793
Granted credit 0
application version 4.83
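
The exit code can be taken apart numerically. A minimal sketch, assuming the value follows the raw Unix wait-status layout (low 7 bits = signal number, bit 0x80 = core-dump flag). Whether BOINC builds code 131 this way is an assumption, and the stderr above mentions SIGSEGV and a missing heartbeat, so the decoding should be read as a hint only. `decode_status` is a name made up here:

```python
import signal


def decode_status(code):
    """Split a status byte into (signal number, signal name, core-dump flag)."""
    signum = code & 0x7F          # low 7 bits: signal number
    core = bool(code & 0x80)      # high bit: core-dump flag
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        name = "unknown"
    return signum, name, core


print(hex(131), decode_status(131))
```

131 is 128 + 3, and on Linux signal 3 is SIGQUIT, which does not match the SIGSEGV line in the stderr, so the code alone does not settle what killed the task.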


Carlos_Pfitzner
Message 200 - Posted: 18 Feb 2006, 9:51:34 UTC
Last modified: 18 Feb 2006, 10:00:02 UTC

Another WU frozen

crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 27682 0.0 0.4 2616 1036 ? SN Feb17 0:00 /bin/bash ./yasuc.sh
boinc 30171 0.0 1.3 5892 3248 ? S 01:50 0:07 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 31123 41.3 20.9 148776 51944 ? SN 05:00 70:37 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 31124 0.0 20.9 148776 51944 ? SN 05:00 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 31125 0.0 20.9 148776 51944 ? SN 05:00 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 31679 0.0 0.8 7200 2136 ? S 07:47 0:00 /usr/sbin/sshd
boinc 31680 0.1 0.9 3480 2336 pts/4 S 07:47 0:00 -bash
boinc 31729 0.0 0.2 2084 624 ? SN 07:48 0:00 sleep 600
boinc 31744 0.0 0.2 2544 668 pts/4 R 07:51 0:00 ps xu
crobertp [/home/boinc/BOINC] > free
total used free shared buffers cached
Mem: 248164 244772 3392 0 24364 67016
-/+ buffers/cache: 153392 94772
Swap: 1020088 77832 942256
crobertp [/home/boinc/BOINC] > w
7:52am up 4 days, 13:33, 2 users, load average: 0.00, 0.00, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
saigam pts/1 matrix.cp3 11:18pm 7:23m 0.17s 0.17s -bash
boinc pts/4 200.216.141.84 7:47am 0.00s 0.27s 0.01s w
crobertp [/home/boinc/BOINC] >
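
The `free` output above is easier to read once buffers and page cache are separated out, since the kernel can reclaim them when applications need memory. A small worked example using the numbers from that output (values in KB); the function name `summarize_free` is made up for illustration:

```python
def summarize_free(used, free, buffers, cached):
    """Reproduce the '-/+ buffers/cache' line of `free` from the 'Mem:' line."""
    used_by_apps = used - buffers - cached      # memory held by processes
    reclaimable_or_free = free + buffers + cached
    return used_by_apps, reclaimable_or_free


# Values from the 'Mem:' line: used, free, buffers, cached.
print(summarize_free(244772, 3392, 24364, 67016))  # → (153392, 94772)
```

The result matches the "-/+ buffers/cache" line, so roughly 92 MB of the 242 MB total was still available to applications at that moment, with about 76 MB already in swap.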

What should I do?
This one is at 54.99% done!
*Note the load average --> the whole system is doing nothing.
*Maybe the 1% bug has advanced to become the 54.99% bug?

Carlos_Pfitzner
Message 207 - Posted: 18 Feb 2006, 11:08:27 UTC

After 1 hour with ralph doing nothing, boinc switched to the next app (simap).
Note that the system is now using CPU.

crobertp [/home/boinc/BOINC] > w
8:59am up 4 days, 14:40, 2 users, load average: 1.00, 1.00, 0.93
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
saigam pts/1 matrix.cp3 11:18pm 8:29m 0.17s 0.17s -bash
boinc pts/4 200.216.141.84 7:47am 0.00s 0.30s 0.00s w
crobertp [/home/boinc/BOINC] >

boinc log
2006-02-18 07:02:10 [---] request_reschedule_cpus: process exited
2006-02-18 08:02:11 [ralph@home] Pausing result BARCODE_30_1shfA_NATIVE_210_41_0 (removed from memory)
2006-02-18 08:02:11 [boincsimap] Restarting result 200601277.029753_0 using simap version 507

However, it was *not* removed from memory!!!
ps xu
showed rosetta still in RAM.
When boinc said (removed from memory) for simap,
the simap app did not appear in ps xu afterwards.

I believe that this (removed from memory) -> but *not* removed is what
is causing the 54.99% bug!

The confirmation of this theory is earlier in this thread: the previous WU
returned to crunching after I killed boinc and rosetta was no longer
occupying RAM.
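
The check described above (does the app still show up in `ps xu` after boinc reports "removed from memory"?) can be scripted. A minimal sketch that parses `ps xu` text; `matching_processes` is a made-up helper, and the sample rows are copied from the earlier `ps xu` listing in this thread:

```python
def matching_processes(ps_output, name):
    """Return (pid, %cpu) for each `ps xu` row whose command contains name."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 10)  # the 11th column is the full command
        if len(parts) == 11 and name in parts[10]:
            rows.append((int(parts[1]), float(parts[2])))
    return rows


# Rows copied from the earlier `ps xu` listing in this thread.
ps_output = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 30171 0.0 1.3 5892 3248 ? S 01:50 0:07 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 31123 41.3 20.9 148776 51944 ? SN 05:00 70:37 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 31124 0.0 20.9 148776 51944 ? SN 05:00 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 31125 0.0 20.9 148776 51944 ? SN 05:00 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 31729 0.0 0.2 2084 624 ? SN 07:48 0:00 sleep 600
"""
print(matching_processes(ps_output, "rosetta"))
# → [(31123, 41.3), (31124, 0.0), (31125, 0.0)]
```

An empty list after a "removed from memory" message would mean the app really exited. Note that the %CPU column of `ps` is an average over the lifetime of the process, not a current reading, so the 41.3 shown for PID 31123 does not contradict a task that is frozen right now.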


Carlos_Pfitzner
Message 209 - Posted: 18 Feb 2006, 11:38:52 UTC

I did a pkill rosetta
and the result was:
2006-02-18 09:34:49 [ralph@home] Result BARCODE_30_1shfA_NATIVE_210_41_0 exited with zero status but no 'finished' file
2006-02-18 09:34:49 [ralph@home] If this happens repeatedly you may need to reset the project.
2006-02-18 09:34:49 [---] request_reschedule_cpus: process exited
2006-02-18 09:34:49 [ralph@home] Restarting result BARCODE_30_1shfA_NATIVE_210_41_0 using rosetta_beta version 483
crobertp [/home/boinc/BOINC] > w
9:39am up 4 days, 15:20, 2 users, load average: 0.99, 0.61, 0.32
*And afterwards rosetta returned to crunching -> see the load average.

*It seems that rosetta must be removed from RAM when switching apps.
If for some reason it remains in RAM -> it freezes at 0.0% CPU.
Maybe the cause is that I only have 256 MB of RAM, and the system moves inactive
pages to swap. Later, when returning from swap, rosetta does not behave well.

Dimitris Hatzopoulos

Joined: 16 Feb 06
Posts: 31
Credit: 2,308
RAC: 0
Message 267 - Posted: 19 Feb 2006, 0:06:43 UTC
Last modified: 19 Feb 2006, 0:08:37 UTC

It could be a coincidence, but in my case too, the problem of the Rosetta v4.80 process being stuck "frozen" (not consuming any CPU time) appears on a Linux (Debian Sarge, kernel 2.4.27) machine with only 256MB RAM (but plenty of virtual memory).

No other science app had a problem on that machine (I tried 6 projects), including WCG/HPF, which is running an older version of Rosetta (v4.21).

PS: I haven't attached this (underspec'ed) machine to RALPH.



©2018 University of Washington
http://www.bakerlab.org