Report "stuck at 1%" bugs here

dekim
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 1 - Posted: 10 Feb 2006, 19:40:03 UTC

Please include a link to the work unit, result, and host.

HollyXYZ
Joined: 15 Feb 06
Posts: 2
Credit: 759
RAC: 0
Message 11 - Posted: 15 Feb 2006, 23:33:09 UTC
Last modified: 15 Feb 2006, 23:35:56 UTC

Hi, this WU always stays at 1% after step 340000!



Result


WUID





BYE H (°!°)


dekim
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 12 - Posted: 15 Feb 2006, 23:41:57 UTC - in response to Message 11.  

Hi, this WU always stays at 1% after step 340000!


Let it continue to run for at least an hour or so.

HollyXYZ
Joined: 15 Feb 06
Posts: 2
Credit: 759
RAC: 0
Message 14 - Posted: 15 Feb 2006, 23:51:53 UTC

Sorry, I have already aborted it. The next WU is already at 4.59%!




BYE H (°!°)

[B^S] ThatGuy
Joined: 16 Feb 06
Posts: 3
Credit: 2,322
RAC: 0
Message 199 - Posted: 18 Feb 2006, 9:20:35 UTC
Last modified: 18 Feb 2006, 9:23:32 UTC

I just noticed a WU that has been running for 7 hours and is still at 1%: HBLR_1.0_2reb_206_38
Host
Result

This WU is slotted to run with 4.83; I have some BARCODE WUs coming up that are 4.84. I restarted BOINC to see if it would straighten itself out. It has now been a little over 2.5 hours since restarting, and it still says 1%. Oh, and I still have mine set to leave the exes in memory.

What SHOULD we do to best help figure out what the problem is? Should I abort it now that I've posted the related links? Is there any other info that I should post about my environment? Try something else? Turn on super-secret-debugging mode?

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 224 - Posted: 18 Feb 2006, 17:18:31 UTC
Last modified: 18 Feb 2006, 17:21:48 UTC

After a number of pkill rosetta +++ plus some pkill boinc
Previous WU finally ended OK
see it after upload -:)
https://ralph.bakerlab.org/result.php?resultid=3932

Now, next WU is stuck at 1%

crobertp [/home/boinc/BOINC] > w
3:00pm up 4 days, 20:41, 2 users, load average: 0.00, 0.00, 0.08 <-see
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
saigam pts/1 matrix.cp3 Fri11pm 14:31m 0.17s 0.17s -bash
boinc pts/4 200.216.141.84 7:47am 0.00s 10:02 0.01s w
crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 27682 0.0 0.4 2616 1036 ? SN Feb17 0:00 /bin/bash ./yasuc.sh
boinc 31679 0.0 0.8 7220 2132 ? S 07:47 0:00 /usr/sbin/sshd
boinc 31680 0.0 0.8 3480 2152 pts/4 S 07:47 0:00 -bash
boinc 725 0.0 1.3 5744 3468 pts/4 S 14:02 0:00 ./boinc -redirectio -allow_remote_gui_rpc
boinc 727 17.0 19.9 96076 49528 pts/4 SN 14:02 10:01 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 728 0.0 19.9 96076 49528 pts/4 SN 14:02 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 729 0.0 19.9 96076 49528 pts/4 SN 14:02 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen
boinc 903 0.0 0.2 2084 624 ? SN 14:52 0:00 sleep 600

However, because this WU is *not* using CPU and the whole system is IDLE,
this 1% BUG is different from the other 1% BUG - really not a 1% bug.
*The previous WU got caught by the 54.99% BUG.

From now on, I will keep monitoring each WU on this remote PC,
which is one hour (by meridian) away from me,
and will take the standard actions needed to keep the WU running
every time the CPU temperature of the remote PC goes down to ambient temperature.

Those actions are:
1) pkill rosetta
wait ...
if problem continues ... do
2) pkill boinc
./boinc -redirectio -allow_rem... &
and on and on ... until the WU Finishes OK.

If any problem other than a frozen/stuck WU occurs, I will post it here.
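
The two-stage routine described in this post (kill the science application first, and only kill and restart the client if the WU is still stuck) could be sketched in Python roughly as follows. This is an illustrative sketch, not a script anyone in the thread ran: the function name recover, the is_stuck check, the restart_client callback and the 600-second default wait are all assumptions. Only the two pkill commands come from the post.

```python
import time
from typing import Callable, Sequence

# The two kill commands are taken from the post above.
KILL_APP = ["pkill", "rosetta"]
KILL_CLIENT = ["pkill", "boinc"]


def recover(is_stuck: Callable[[], bool],
            run: Callable[[Sequence[str]], None],
            restart_client: Callable[[], None],
            wait_seconds: float = 600.0,   # assumed wait; the post only says "wait ..."
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Apply the two-stage recovery once and report which stage was needed.

    is_stuck and restart_client are hypothetical hooks supplied by the caller;
    the post does not say how the check or the restart were automated.
    """
    if not is_stuck():
        return "healthy"
    run(KILL_APP)            # stage 1: kill only the science application
    sleep(wait_seconds)      # give the WU time to start moving again
    if not is_stuck():
        return "app-restart"
    run(KILL_CLIENT)         # stage 2: kill the client as well ...
    restart_client()         # ... and start it again
    return "client-restart"
```

In real use, run could be something like lambda cmd: subprocess.run(cmd, check=False), and restart_client would launch the client with whatever command line the host normally uses. Passing the hooks in keeps the escalation logic testable without killing anything.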

Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 299 - Posted: 19 Feb 2006, 6:23:33 UTC

I have moved discussions of the 1% hang issue to here to keep this thread focused on reports of the problem.

Moderator9

David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 300 - Posted: 19 Feb 2006, 7:22:38 UTC
Last modified: 19 Feb 2006, 7:23:47 UTC

I have a RALPH WU stuck at 1% after 37 minutes of CPU time. I have currently suspended the project so it is left in memory.

Is there anything the devs would like me to do to check this out further or should I just abort it and post the link to the result file?




David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 302 - Posted: 19 Feb 2006, 7:54:17 UTC - in response to Message 301.  
Last modified: 19 Feb 2006, 8:09:18 UTC

I'd be curious to know the model# and Step number it's frozen at, but don't want you to lose the possibility of them asking you to do something first. This data is on the graphic.
Is it a 4.83, 4.85??
What's your switch between projects time?
Are you doing more than one project?
Is this a Hyperthreading host?
CPU type?


WU is BARCODE_30_256bA_NATIVE_210_24_0
Application is rosetta_beta 4.84
3 projects RALPH and SETI active (+Rosetta suspended)
Switch interval 60 minutes
No hyperthreading unfortunately
CPU: Pentium 4 2.5GHz
OS is Windows XP Pro SP2


Full Proc specs:

Intel(R) Processor Frequency ID Utility
Version: 5.5.20030402
Time Stamp: 2006/02/19 07:58:58
Number of processors in system: 1
Current processor: #1
Processor Name: Intel(R) Pentium(R) 4 CPU 2.53GHz
Type: 0
Family: F
Model: 2
Stepping: 4
Revision: 1E
L1 Trace Cache: 12 Kµops
L1 Data Cache: 8 KB
L2 Cache: 512 KB
Packaging: FC-PGA2
MMX(TM): Yes
SIMD: Yes
SIMD2: Yes
NetBurst(TM) Microarchitecture: Yes
Expected Processor Frequency: 2.53 GHz
Reported Processor Frequency: 2.53 GHz
Expected System Bus Frequency: 533 MHz
Reported System Bus Frequency: 533 MHz


David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 303 - Posted: 19 Feb 2006, 9:49:09 UTC
Last modified: 19 Feb 2006, 9:53:21 UTC

Sorry, I have just seen your request for step number etc.

There is a screen shot of the graphic at

http://mercury.walagata.com/w/appetiser/ralph.gif

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 304 - Posted: 19 Feb 2006, 13:21:22 UTC
Last modified: 19 Feb 2006, 13:41:06 UTC

I got a true 1% bug!
https://ralph.bakerlab.org/result.php?resultid=5090
Clicking the link above does not say much, except for CPU type, OS and the like.

I read the info below off the screen saver, and I am typing it here carefully

CPU time 4 hours 2 minutes 36 seconds (this number is increasing at each second)

*all the rest of the screen is absolutely frozen

1% complete
Stage: Ab initio
model: 1 step : 2001
Accepted RMSD: 6.134
Accepted Energy: -11.31033

The "Native" can be moved by holding the left mouse button and moving the mouse

my cpu speed: 5000 MHz
copper cooler, base argentum, 80 mm fan, 3000 rpm, actual cpu temperature, 57 C

I am now suspending this job by manual command, awaiting further instructions
*It will be left in RAM until the next reboot occurs, which should not happen (I have a nobreak, i.e. a UPS)

app rosetta_beta 4.85
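
The symptom reported here (CPU time keeps climbing while model and step stay frozen) differs from the earlier Linux report, where the process used no CPU at all. A minimal sketch of telling the two apart from two successive readings is shown below. The Snapshot type, the classify function and the three labels are hypothetical names chosen for illustration; nothing like this is part of BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One reading taken from the graphics window (hypothetical structure)."""
    cpu_seconds: float
    model: int
    step: int


def classify(earlier: Snapshot, later: Snapshot) -> str:
    """Compare two readings of the same WU taken some time apart."""
    progressed = (later.model, later.step) != (earlier.model, earlier.step)
    cpu_advanced = later.cpu_seconds > earlier.cpu_seconds
    if progressed:
        return "running"
    if cpu_advanced:
        return "spinning"   # CPU time climbs, model/step frozen: the case in this post
    return "idle"           # nothing moves at all: the case in the earlier Linux report


# 4 h 2 min 36 s of CPU time, model 1, step 2001, as read off the screen saver
first = Snapshot(cpu_seconds=4 * 3600 + 2 * 60 + 36, model=1, step=2001)
# a hypothetical second reading one minute later with the step unchanged
second = Snapshot(cpu_seconds=first.cpu_seconds + 60, model=1, step=2001)
print(classify(first, second))
```

Reporting which of the two cases applies, together with the model and step numbers, would make hang reports easier to compare.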

David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 306 - Posted: 19 Feb 2006, 15:01:53 UTC - in response to Message 305.  
Last modified: 19 Feb 2006, 15:07:11 UTC

The people with hung work units in memory waiting for instructions:

I will send a note to David Kim to get his attention to this thread and provide you further instructions on what to do. As I write this it is 7:00 am Sunday on the West coast, so assuming he checks his mail on Sunday mornings he should get back to you soon. The information you can provide him is valuable so please hang in there till he gets back to you.



Many thanks for the update. I just checked and for some reason the WU has dropped out of memory. Even though the project was suspended and BOINC manager shows the work unit still as preempted, it is no longer in Windows Task Manager, and in the BOINC Manager log there is this info:

19/02/2006 10:59:19|ralph@home|Result BARCODE_30_256bA_NATIVE_210_24_0 exited with zero status but no 'finished' file
19/02/2006 10:59:19|ralph@home|If this happens repeatedly you may need to reset the project.
19/02/2006 10:59:19||request_reschedule_cpus: process exited


Why after it was happily suspended for several hours it did this is not clear. The other project was not doing anything other than crunch its work unit at this time so it was not a side effect of the other project.

My understanding is that the CC will retry this WU once again when I unsuspend the client. I will wait to hear from the devs before doing this.

Edit >> Hmmm, interesting, I just checked something... the last Antispyware scan I ran was at around 11:00. maybe Windows defender kicked the binary in memory which caused it to fail as above.
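
The BOINC Manager log lines quoted in this post follow a simple timestamp|project|message layout, so reports like this one can be checked mechanically. The sketch below assumes that layout and the day/month/year timestamp seen in the post (the format may differ on hosts with other locale settings). The names LogLine, parse_line and exited_without_finish are made up for this example.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class LogLine(NamedTuple):
    when: datetime
    project: str     # empty string for client-level messages
    message: str


def parse_line(line: str) -> LogLine:
    """Split one 'timestamp|project|message' line as it appears in the post."""
    stamp, project, message = line.split("|", 2)
    return LogLine(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message)


def exited_without_finish(lines: Iterable[str]) -> List[LogLine]:
    """Return the entries reporting an exit with no 'finished' file."""
    return [entry for entry in map(parse_line, lines)
            if "no 'finished' file" in entry.message]


# The three lines quoted in the post
SAMPLE = [
    "19/02/2006 10:59:19|ralph@home|Result BARCODE_30_256bA_NATIVE_210_24_0 "
    "exited with zero status but no 'finished' file",
    "19/02/2006 10:59:19|ralph@home|If this happens repeatedly you may need to reset the project.",
    "19/02/2006 10:59:19||request_reschedule_cpus: process exited",
]

for entry in exited_without_finish(SAMPLE):
    print(entry.when.isoformat(), entry.project, entry.message)
```

Comparing the timestamp of such an entry with other events on the host (for example a scheduled scan) is one way to test a suspicion like the one in the edit above.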



Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 308 - Posted: 19 Feb 2006, 15:12:39 UTC - in response to Message 306.  
Last modified: 19 Feb 2006, 15:39:26 UTC

The people with hung work units in memory waiting for instructions:

I will send a note to David Kim to get his attention to this thread and provide you further instructions on what to do. As I write this it is 7:00 am Sunday on the West coast, so assuming he checks his mail on Sunday mornings he should get back to you soon. The information you can provide him is valuable so please hang in there till he gets back to you.



Many thanks for the update. I just checked and for some reason the WU has dropped out of memory. Even though the project was suspended and BOINC manager shows the work unit still as preempted, it is no longer in Windows Task Manager, and in the BOINC Manager log there is this info:

19/02/2006 10:59:19|ralph@home|Result BARCODE_30_256bA_NATIVE_210_24_0 exited with zero status but no 'finished' file
19/02/2006 10:59:19|ralph@home|If this happens repeatedly you may need to reset the project.
19/02/2006 10:59:19||request_reschedule_cpus: process exited


Why after it was happily suspended for several hours it did this is not clear. The other project was not doing anything other than crunch its work unit at this time so it was not a side effect of the other project.

My understanding is that the CC will retry this WU once again when I unsuspend the client. I will wait to hear from the devs before doing this.


The BOINC software probably unloaded it because the project was suspended. If you get another one of these try suspending just that Work Unit from the Work tab, thereby keeping the project active but trapping the work unit.

I have notified Dr(s) Kim and Baker to take a look at this thread and provide you with additional instructions.

It is possible that when you restart the project on your system, that it will pick up your formerly hung WU and run it to completion. While many people at Rosetta simply abort hung WUs, usually you can get them going again by just stopping and restarting BOINC. However for Ralph the project team really needs to capture some of these in a "bottle" like you are trying to do to get a good look at them.

Keep up the good work folks, together we can kill this bug.

Moderator9

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 317 - Posted: 19 Feb 2006, 16:37:54 UTC - in response to Message 314.  
Last modified: 19 Feb 2006, 16:48:52 UTC

I got a true 1% bug!
https://ralph.bakerlab.org/result.php?resultid=5090
Clicking the link above does not say much, except for CPU type, OS and the like.

I read the info below off the screen saver, and I am typing it here carefully

CPU time 4 hours 2 minutes 36 seconds (this number is increasing at each second)

*all the rest of the screen is absolutely frozen

1% complete
Stage: Ab initio
model: 1 step : 2001
Accepted RMSD: 6.134
Accepted Energy: -11.31033....


Carlos:

I just looked at the two Work Units that failed to download. They seem to have failed on the same file. This is probably something on the server, like a dropped connection or bad file. I have brought a possible cause to the attention of the server administrator. I suspect you will not see this again, but please report it if you do. Just keep the WU that is hung warm until Dr Kim can post some instructions on what to do with it. Thanks for your help.

I am posting some additional information that I think may be of interest

D:\boinc\BOINC\projects\ralph.bakerlab.org>head -200 BARCODE_30_1elwA_212_3_0_0
[REAL OPT]Default value for [-cpu_frac] 10
[REAL OPT]Default value for [-frame_rate] 10
[INT OPT]New value for [-cpu_run_time] 3600
command executed: projects/ralph.bakerlab.org/rosetta_beta_4.85_windows_intelx86.exe cc 1elw A -abrelax -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -new_centroid_packing -barcode_from_fragments_length 30 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 -paths ccfrags200.txt -relax_score_filter -filter1 -240 -filter2 -255 -cpu_run_time 3600 -constant_seed -jran 3995108
[STR OPT]New value for [-paths] ccfrags200.txt.
[T/F OPT]Default FALSE value for [-unix_paths]
[T/F OPT]Default FALSE value for [-version]
[T/F OPT]Default FALSE value for [-score]
[T/F OPT]Default FALSE value for [-abinitio]
[T/F OPT]Default FALSE value for [-refine]
[T/F OPT]Default FALSE value for [-assemble]
[T/F OPT]Default FALSE value for [-idealize]
[T/F OPT]Default FALSE value for [-relax]
[T/F OPT]New TRUE value for [-abrelax]
[T/F OPT]Default FALSE value for [-sim_aneal]
[T/F OPT]Default FALSE value for [-cenrlx]
[T/F OPT]Default FALSE value for [-force_expand]
[T/F OPT]Default FALSE value for [-minimize]
Rosetta mode: abrelax
[T/F OPT]Default FALSE value for [-chain]
[T/F OPT]Default FALSE value for [-protein]
[T/F OPT]Default FALSE value for [-series]
series_code = cc :: protein_name is 1elw:: chain_id is A.
[INT OPT]New value for [-nstruct] 10
[T/F OPT]Default FALSE value for [-use_pdbseq]
[T/F OPT]Default FALSE value for [-read_all_chains]
[T/F OPT]Default FALSE value for [-use_pdb_numbering]
[T/F OPT]Default FALSE value for [-fa_input]
[T/F OPT]Default FALSE value for [-overwrite]
[T/F OPT]Default FALSE value for [-output_pdb_gz]
[T/F OPT]New TRUE value for [-output_silent_gz]
[T/F OPT]Default FALSE value for [-output_scorefile_gz]
[T/F OPT]Default FALSE value for [-termini]
[T/F OPT]Default FALSE value for [-Nterminus]
[T/F OPT]Default FALSE value for [-Cterminus]
[T/F OPT]Default FALSE value for [-use_trie]
[T/F OPT]Default FALSE value for [-no_trie]
[T/F OPT]Default FALSE value for [-trials_trie]
[T/F OPT]Default FALSE value for [-no_trials_trie]
[T/F OPT]Default FALSE value for [-read_interaction_graph]
[T/F OPT]Default FALSE value for [-write_interaction_graph]
[STR OPT]Default value for [-ig_file] .
[T/F OPT]Default FALSE value for [-silent_input]
[T/F OPT]Default FALSE value for [-timer]
[T/F OPT]Default FALSE value for [-count_attempts]
[T/F OPT]Default FALSE value for [-status]
[T/F OPT]Default FALSE value for [-ise_movie]
[T/F OPT]Default FALSE value for [-output_all]
[T/F OPT]New TRUE value for [-output_chi_silent]
[T/F OPT]Default FALSE value for [-skip_missing_residues]
[STR OPT]Default value for [-cst] cst.
[STR OPT]Default value for [-dpl] dpl.
[STR OPT]Default value for [-resfile] none.
[STR OPT]Default value for [-equiv_resfile] none.
[T/F OPT]Default FALSE value for [-auto_resfile]
[T/F OPT]Default FALSE value for [-chain_inc]
[T/F OPT]Default FALSE value for [-full_filename]
[T/F OPT]Default FALSE value for [-map_sequence]
[INT OPT]Default value for [-max_frags] 200
[STR OPT]Default value for [-protein_name_prefix] .
[STR OPT]Default value for [-frags_name_prefix] .
[T/F OPT]Default FALSE value for [-enable_dna]
[T/F OPT]Default FALSE value for [-phospho_ser]
[T/F OPT]Default FALSE value for [-loops]
[T/F OPT]Default FALSE value for [-taboo]
[T/F OPT]Default FALSE value for [-multi_chain]
[T/F OPT]New TRUE value for [-ex1]
[T/F OPT]New TRUE value for [-ex2]
[T/F OPT]Default FALSE value for [-ex3]
[T/F OPT]Default FALSE value for [-ex4]
[T/F OPT]Default FALSE value for [-ex1aro]
[T/F OPT]Default FALSE value for [-ex1aro_half]
[T/F OPT]Default FALSE value for [-ex2aro_only]
[INT OPT]Default value for [-extrachi_cutoff] 18
[T/F OPT]Default FALSE value for [-rot_pert]
[T/F OPT]Default FALSE value for [-rot_pert_input]
[T/F OPT]Default FALSE value for [-exdb]
[T/F OPT]Default FALSE value for [-use_electrostatic_repulsion]
[T/F OPT]Default FALSE value for [-explicit_h2o]
[T/F OPT]Default FALSE value for [-solvate]
[T/F OPT]Default FALSE value for [-pH]
[T/F OPT]Default FALSE value for [-try_both_his_tautomers]
[T/F OPT]Default FALSE value for [-minimize_rot]
[T/F OPT]Default FALSE value for [-read_hetero_h2o]
[T/F OPT]Default FALSE value for [-Wint_score_only]
[T/F OPT]Default FALSE value for [-Wint_repack_only]
[T/F OPT]Default FALSE value for [-ligand]
[T/F OPT]Default FALSE value for [-enzyme_design]
[T/F OPT]Default FALSE value for [-score_contact_flag]
[T/F OPT]Default FALSE value for [-score_contact_weight]
[T/F OPT]Default FALSE value for [-score_contact_threshold]
[T/F OPT]Default FALSE value for [-scorefxn]
default centroid scorefxn: 4
default fullatom scorefxn: 12
[INT OPT]Default value for [-run_level] 0
[T/F OPT]New TRUE value for [-silent]
run level: -4
[T/F OPT]Default FALSE value for [-benchmark]
[T/F OPT]Default FALSE value for [-debug]
[STR OPT]Default value for [-s] none.
[STR OPT]Default value for [-l] none.
Reading .Rama_smooth_dyn.dat_ss_6.4.gz
Reading .phi.theta.36.HS.resmooth.gz
Reading .phi.theta.36.SS.resmooth.gz
[STR OPT]Default value for [-atom_vdw_set] default.
[T/F OPT]Default FALSE value for [-IUPAC]
Atom_mode set to all
Reading .paircutoffs.gz
[T/F OPT]Default FALSE value for [-decoystats]
set_decoystats_flag: from,to F F
[T/F OPT]Default FALSE value for [-decoyfeatures]
[T/F OPT]Default FALSE value for [-evolution]
[T/F OPT]Default FALSE value for [-evol_recomb]
BOINC :: [2006-02-19 05:37:18] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: pct_complete: 0.01
Searching for dat file: .1elw.dat
Searching for dat file: .1elw.dat
WARNING!! .dat file not found!
Looking for fasta file: .1elwA.fasta
[T/F OPT]Default FALSE value for [-find_disulf]
[T/F OPT]Default FALSE value for [-fix_disulf]
Looking for psipred file: .1elwA.psipred_ss2
Protein type: all alpha Fraction beta: 0nbeta 0
disabling sheet filter
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: .1elwA.cst
Running without distance constraints
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
Searched for: .1elwA.dpl
Dipolar constraints will not be used
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment file: .cc1elwA03_05.200_v1_3.gz
Total Residue 117
frag size: 3 frags/residue: 200
fragment file: .cc1elwA09_05.200_v1_3.gz
Total Residue 117
frag size: 9 frags/residue: 200
generating 1mer library from 3mer library
[T/F OPT]New TRUE value for [-ssblocks]
1 0.714999974 0 -1.81021962E+01011
2 0.845000029 0 -457391011
3 1 0 -5.84474728E+02211
4 1 0 0.009290844211
5 0.985000014 0 0.014999999711
6 0.964999974 0.0299999993 0.0049999998911
7 0.964999974 0 9303.766611
8 0.995000005 0 -8.6274389E+02611
9 0.985000014 0.00999999978 1.27419852E+02611
10 0.970000029 0 -237.30987511
11 1 0 -3.57050048E-02111
12 1 0 -2.10856042E-03911
13 0.995000005 0 -8.15923702E+02411
14 0.930000007 0 0.070000000311
15 0.800000012 0 4.05845883E+01911
16 0.469999999 0.0299999993 1.86625286E+02600
17 0.300000012 0 0.69999998800
18 0.38499999 0.0149999997 0.60000002400
19 0.995000005 0 0.032656710611
20 1 0 5.50368345E-01311
21 0.995000005 0 0.0049999998911
22 1 0 1.64160455E+01311
23 1 0 6.48693289E-04011
24 0.99000001 0 -6.31571591E+01611
25 0.995000005 0 0.0049999998911
26 1 0 7893327.511
27 1 0 6.10060667E+02511
28 0.995000005 0 0.0049999998911
29 0.985000014 0.00999999978 2.53535253E+02211
30 0.920000017 0.0199999996 0.059999998711
31 0.670000017 0.075000003 0.25499999500
32 0.270000011 0.0900000036 0.64012342700
33 0 0.00999999978 -1.84238144E+02000
34 0.0500000007 0.00999999978 0.93999999800
35 0.230000004 0 -1.09468657E+01500
36 0.550000012 0.00499999989 -4.3622533E+02500
37 0.985000014 0 -1.9613201E+03511
38 0.99000001 0 0.0099999997811
39 0.985000014 0.00499999989 0.0076843472211
40 1 0 -1.8080512E+00911
41 0.980000019 0 -1.13549524E+02411
42 0.99000001 0 0.010001833611
43 1 0 6.42855928E-03811
44 1 0 6.28685685E+01811
45 1 0 3.61228439E+01611
46 0.995000005 0 -2.59030907E+03411
47 1 0 5.11612367E+02611
48 0.985000014 0 0.014999999711
49 0.870000005 0 0.12999999511
50 0.435000002 0.0500000007 41141.554700
51 0.200000003 0 0.80000001200
52 0.460000008 0.00499999989 2.92289013E+03500
53 0.995000005 0 0.0049999998911
54 1 0 2.34173571E+01311
55 1 0 8.93829884E-03811
56 0.99000001 0 -1.96550475E+01311
57 1 0 138633511
58 1 0 -39566345611
59 0.930000007 0.00999999978 0.059999998711

D:\boinc\BOINC\projects\ralph.bakerlab.org>

PS: the first 200 lines of this WU's output only show the first 59 steps; the WU hangs at step 2001.
Do you want me to post all lines from 201 through the end?
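
Most of the output above consists of "Default ... value" lines; the handful of "New ... value" lines are the options this WU actually overrides. A small sketch for pulling those out of such a dump follows. The regular expression is inferred only from the lines shown in this post, and the name overridden_options is made up for illustration; it is not part of Rosetta.

```python
import re
from typing import Dict, Iterable

# Matches lines such as "[INT OPT]New value for [-cpu_run_time] 3600"
# and "[T/F OPT]New TRUE value for [-abrelax]".
OPT_RE = re.compile(
    r"^\[(?P<kind>[^\]]+) OPT\]New (?:(?P<bool>TRUE|FALSE) )?value for "
    r"\[(?P<flag>-[^\]]+)\]\s*(?P<value>.*)$"
)


def overridden_options(lines: Iterable[str]) -> Dict[str, str]:
    """Map each overridden flag to its value exactly as printed in the output."""
    found = {}
    for line in lines:
        match = OPT_RE.match(line.strip())
        if match:
            # Boolean options carry their value before the word "value"
            found[match.group("flag")] = match.group("bool") or match.group("value")
    return found


# A few lines copied from the dump above
SAMPLE = [
    "[REAL OPT]Default value for [-cpu_frac] 10",
    "[INT OPT]New value for [-cpu_run_time] 3600",
    "[STR OPT]New value for [-paths] ccfrags200.txt.",
    "[T/F OPT]Default FALSE value for [-unix_paths]",
    "[T/F OPT]New TRUE value for [-abrelax]",
]

print(overridden_options(SAMPLE))
```

Values are kept as printed, including the trailing period that the output appends to string options.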

dekim
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 328 - Posted: 19 Feb 2006, 19:16:55 UTC

Can you restart boinc and see if it continues on?

David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 329 - Posted: 19 Feb 2006, 19:42:10 UTC - in response to Message 328.  
Last modified: 19 Feb 2006, 19:44:02 UTC

Can you restart boinc and see if it continues on?



Restarted BOINC. The WU appears to have gone back to the start: according to the graphic it is at Model 1 step 78 (and incrementing). If it gets stuck again, is there anything we can do to help pin this down?

From the log:

19/02/2006 19:42:20||request_reschedule_cpus: project op
19/02/2006 19:42:40|ralph@home|Restarting result BARCODE_30_256bA_NATIVE_210_24_0 using rosetta_beta version 4.84
19/02/2006 19:42:40|SETI@home|Pausing result 14au00aa.7506.496.234660.1.92_2 (left in memory)



David@home
Joined: 16 Feb 06
Posts: 24
Credit: 409
RAC: 0
Message 332 - Posted: 19 Feb 2006, 20:15:55 UTC - in response to Message 329.  
Last modified: 19 Feb 2006, 20:24:26 UTC

Can you restart boinc and see if it continues on?



Restarted BOINC. The WU appears to have gone back to the start: according to the graphic it is at Model 1 step 78 (and incrementing). If it gets stuck again, is there anything we can do to help pin this down?

From the log:

19/02/2006 19:42:20||request_reschedule_cpus: project op
19/02/2006 19:42:40|ralph@home|Restarting result BARCODE_30_256bA_NATIVE_210_24_0 using rosetta_beta version 4.84
19/02/2006 19:42:40|SETI@home|Pausing result 14au00aa.7506.496.234660.1.92_2 (left in memory)





OK, this time it went through to completion and credit was granted:

https://ralph.bakerlab.org/workunit.php?wuid=3325

Interestingly this is about 30 mins of CPU time, it had done 30 minutes previously before it hung. These test WUs typically take just under an hour on my PC. It is as if it has only claimed credit for the second 30 mins of CPU but carried on the calculations from where it got stuck. Should it have claimed the credit for both periods of CPU activity?

Any comments on the credit and the fact that it did not hang the second time?

It would be interesting to hear any ideas. Thanks.




Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 340 - Posted: 19 Feb 2006, 23:38:20 UTC - in response to Message 337.  

I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended into RAM.
Anything to do ?


Hi,

There is a request from dekim a few posts down in this thread:

https://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.


I am not sure!
I prefer to wait some more rather than risk removing this job from RAM.
*In case I am wrong, he will post again, following this one post.

btw: the english name of the roman (archaic latin) name "argentum" is silver


The post from Dr. Kim was for both of you. He wants you to restart the Work Unit and see if it will run.


YES, the job ran -:)

However,
*the only thing, that changed is:
CPU time: 5 hr 14 min 50 sec and increasing ... more and more ...

*all other things on the screen completely frozen (as before)

1% complete
Stage: Ab initio
model: 1 step : 2001
Accepted RMSD: 6.134
Accepted Energy: -11.31033

*Suspending job again!

PS: Does Dr. Kim really hope to get out of an endless loop by brute force?

*The job now has more than 5 hours of CPU time @ 5000 MHz and is still @ 1% !!!

My jobs are set to 8 hours of run time ... maybe the test he wants
is whether the job will crash after exceeding the 8-hour CPU time limit @ 1%?

*Isn't it better that I abort it now?

*What I expected from him was instructions on how to do an interactive trace
of the run, step by step

-or- using Dr. Watson to get a memory dump of my 512M of RAM and e-mailing him
that dump

*Never a brute-force test -:(

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 346 - Posted: 20 Feb 2006, 1:43:16 UTC - in response to Message 342.  
Last modified: 20 Feb 2006, 2:00:35 UTC


*Isn't it better that I abort it now?

*What I expected from him was instructions on how to do an interactive trace
of the run, step by step

-or- using Dr. Watson to get a memory dump of my 512M of RAM and e-mailing him
that dump

*Never a brute-force test -:(

Carlos, you have a winner there. Please don't abort it; keep it in memory, as you may have the WU we testers need to fix this. I'd wait until instructed what to do next. Remember it's Sunday. Leaving Ralph or that WU suspended is important to Ralph and is the whole reason Ralph even exists.

I wish I had what you have, I really do.

tony


Calm down, I have a nobreak (UPS); unless a power failure lasting more than half an hour
occurs, this WU will be kept in RAM ad infinitum.
*If my diesel generator weren't being repaired, I could even guarantee that!

ps: I posted the "random seed" in this thread, as well as other
technical details for this job.
*for example:
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: .1elwA.cst
Running without distance constraints
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
Searched for: .1elwA.dpl
Dipolar constraints will not be used

*thus, even if a system reboot occurs,
I am fairly sure I will be able to reproduce the problem again -:)
-constant_seed -jran 3995108

John McLeod VII
Joined: 16 Feb 06
Posts: 8
Credit: 39,560
RAC: 0
Message 347 - Posted: 20 Feb 2006, 2:25:48 UTC

I have one that is stuck: WU, Computer. It has been going for 2 days, 20 hours, 58 minutes and 4 seconds of CPU time. This machine is currently estimating 8 hours for completion of other results.

Awaiting further instructions.

jm7





©2024 University of Washington
http://www.bakerlab.org