Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here
Author | Message |
---|---|
dekim Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Please include a link to the work unit, result, and host. |
HollyXYZ Send message Joined: 15 Feb 06 Posts: 2 Credit: 759 RAC: 0 |
|
dekim Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Hi, this WU always remains at 1% after Step 340000! Let it continue to run for at least an hour or so. |
HollyXYZ Send message Joined: 15 Feb 06 Posts: 2 Credit: 759 RAC: 0 |
I have already aborted them, sorry. The next WU is already at 4.59%! BYE H (°!°) |
[B^S] ThatGuy Send message Joined: 16 Feb 06 Posts: 3 Credit: 2,322 RAC: 0 |
I just had a WU that I noticed has been running for 7 hours and is at 1% HBLR_1.0_2reb_206_38 Host Result This WU is slotted to run with 4.83, I have some of the BARCODE WUs that are coming up that are 4.84. I restarted BOINC to see if it would straighten itself out. It has gotten a little over 2.5 hours after restarting, and it still says 1%. Oh, and I still have mine set for leaving the exes in memory. What SHOULD we do to best help figure out what the problem is? Should I abort it now that I've posted the related links? Is there any other info that I should post about my environment? Try something else? Turn on super-secret-debugging mode? |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
After a number of pkill rosetta, plus some pkill boinc, the previous WU finally ended OK. See it after upload -:) https://ralph.bakerlab.org/result.php?resultid=3932 Now the next WU is stuck at 1% crobertp [/home/boinc/BOINC] > w 3:00pm up 4 days, 20:41, 2 users, load average: 0.00, 0.00, 0.08 <-see USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT saigam pts/1 matrix.cp3 Fri11pm 14:31m 0.17s 0.17s -bash boinc pts/4 200.216.141.84 7:47am 0.00s 10:02 0.01s w crobertp [/home/boinc/BOINC] > ps xu USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND boinc 27682 0.0 0.4 2616 1036 ? SN Feb17 0:00 /bin/bash ./yasuc.sh boinc 31679 0.0 0.8 7220 2132 ? S 07:47 0:00 /usr/sbin/sshd boinc 31680 0.0 0.8 3480 2152 pts/4 S 07:47 0:00 -bash boinc 725 0.0 1.3 5744 3468 pts/4 S 14:02 0:00 ./boinc -redirectio -allow_remote_gui_rpc boinc 727 17.0 19.9 96076 49528 pts/4 SN 14:02 10:01 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen boinc 728 0.0 19.9 96076 49528 pts/4 SN 14:02 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen boinc 729 0.0 19.9 96076 49528 pts/4 SN 14:02 0:00 rosetta_beta_4.83_i686-pc-linux-gnu cc 1shf A -relax -stringen boinc 903 0.0 0.2 2084 624 ? SN 14:52 0:00 sleep 600 However, because this WU is *not* using CPU and the whole system is IDLE, this 1% BUG is different from the other 1% BUG. It is really not a 1% bug. *The previous WU got caught by the 54.99% BUG. From now on, I will keep monitoring each WU on this remote PC, which is one time zone (one hour) away from me, and will take the standard actions needed to keep the WU running every time the CPU temperature of the remote PC drops to ambient temperature. Those actions are: 1) pkill rosetta, wait ... if the problem continues ... do 2) pkill boinc, then ./boinc -redirectio -allow_rem... & and so on ... until the WU finishes OK. If any problem other than a WU freeze/stuck occurs, I will post it here. |
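The manual routine in the post above (kill the Rosetta process first, restart the BOINC client only if the work unit stays idle) could be sketched as a small decision function. This is only an illustration: the function name, the 1% idle threshold, and the idea of passing in CPU-usage samples are assumptions made for the example, not part of BOINC or Rosetta.

```python
from typing import Sequence

# Assumed threshold: below this %CPU the science app is treated as idle.
IDLE_CPU_PERCENT = 1.0


def next_action(cpu_samples: Sequence[float], rosetta_kills: int) -> str:
    """Pick the next manual step for a possibly frozen work unit.

    cpu_samples: recent %CPU readings of the rosetta process (e.g. read from `ps xu`).
    rosetta_kills: how many times `pkill rosetta` was already tried for this WU.
    """
    if not cpu_samples:
        return "wait"  # nothing measured yet
    if any(sample > IDLE_CPU_PERCENT for sample in cpu_samples):
        return "wait"  # still crunching, leave it alone
    if rosetta_kills == 0:
        return "pkill rosetta"  # step 1: let the client restart the app
    return "pkill boinc, then restart boinc"  # step 2: restart the whole client


if __name__ == "__main__":
    print(next_action([0.0, 0.0], rosetta_kills=0))
```

The function only decides; actually running `pkill` is left to the operator, as in the post.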
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
I have moved discussions of the 1% hang issue to here to keep this thread focused on reports of the problem. Moderator9 RALPH@home FAQs RALPH@home Guidelines Moderator Contact |
David@home Send message Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
I have a RALPH WU stuck at 1% after 37 minutes of CPU time. I have currently suspended the project so it is left in memory. Is there anything the devs would like me to do to check this out further or should I just abort it and post the link to the result file? |
David@home Send message Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
I'd be curious to know the model# and Step number it's frozen at, but don't want you to lose the possibility of them asking you to do something first. This data is on the graphic. WU is BARCODE_30_256bA_NATIVE_210_24_0 Application is rosetta_beta 4.84 3 projects RALPH and SETI active (+Rosetta suspended) Switch interval 60 minutes No hyperthreading unfortunately CPU: Pentium 4 2.5GHz OS is Windows XP Pro SP2 Full Proc specs: Intel(R) Processor Frequency ID Utility Version: 5.5.20030402 Time Stamp: 2006/02/19 07:58:58 Number of processors in system: 1 Current processor: #1 Processor Name: Intel(R) Pentium(R) 4 CPU 2.53GHz Type: 0 Family: F Model: 2 Stepping: 4 Revision: 1E L1 Trace Cache: 12 Kµops L1 Data Cache: 8 KB L2 Cache: 512 KB Packaging: FC-PGA2 MMX(TM): Yes SIMD: Yes SIMD2: Yes NetBurst(TM) Microarchitecture: Yes Expected Processor Frequency: 2.53 GHz Reported Processor Frequency: 2.53 GHz Expected System Bus Frequency: 533 MHz Reported System Bus Frequency: 533 MHz |
David@home Send message Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
Sorry, I have just seen your request for the step number etc. There is a screenshot of the graphic at http://mercury.walagata.com/w/appetiser/ralph.gif |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
I have a true 1% bug! https://ralph.bakerlab.org/result.php?resultid=5090 Clicking the link above does not say much, apart from the CPU type, OS and the like. The info below is what I read on the screen saver, typed here carefully: CPU time 4 hours 2 minutes 36 seconds (this number increases every second) *all the rest of the screen is absolutely frozen 1% complete Stage: Ab initio model: 1 step : 2001 Accepted RMSD: 6.134 Accepted Energy: -11.31033 The "Native" can be moved by holding the left mouse button and moving the mouse. My CPU speed: 5000 MHz, copper cooler, silver base, 80 mm fan, 3000 rpm, current CPU temperature 57 C. I am now suspending this job by manual command, awaiting further instructions. *It will be left in RAM until the next reboot, which should not happen (the machine is on a UPS). App: rosetta_beta 4.85 |
David@home Send message Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
The people with hung work units in memory waiting for instructions: Many thanks for the update. I just checked and for some reason the WU has dropped out of memory. Even though the project was suspended and BOINC Manager still shows the work unit as preempted, it is no longer in Windows Task Manager, and the BOINC Manager log has this info: 19/02/2006 10:59:19|ralph@home|Result BARCODE_30_256bA_NATIVE_210_24_0 exited with zero status but no 'finished' file 19/02/2006 10:59:19|ralph@home|If this happens repeatedly you may need to reset the project. 19/02/2006 10:59:19||request_reschedule_cpus: process exited Why it did this after being happily suspended for several hours is not clear. The other project was doing nothing other than crunching its work unit at the time, so it was not a side effect of the other project. My understanding is that the CC will retry this WU once again when I unsuspend the client. I will wait to hear from the devs before doing this. Edit >> Hmmm, interesting, I just checked something... the last antispyware scan I ran was at around 11:00. Maybe Windows Defender kicked the binary out of memory, which caused it to fail as above. |
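For anyone collecting these reports, the log lines quoted above follow a `date time|project|message` layout, so they can be split mechanically. The sketch below assumes that layout (day/month/year order, as in the quoted lines) and uses illustrative names; it is not a BOINC tool.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    when: datetime
    project: str  # empty string for client-wide messages
    message: str


def parse_log_line(line: str) -> LogEntry:
    """Split one 'DD/MM/YYYY HH:MM:SS|project|message' line into its fields."""
    stamp, project, message = line.split("|", 2)
    return LogEntry(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message)


def exited_without_finished_file(entry: LogEntry) -> bool:
    """True for the message reported above when the WU dropped out of memory."""
    return "exited with zero status but no 'finished' file" in entry.message


if __name__ == "__main__":
    sample = ("19/02/2006 10:59:19|ralph@home|Result BARCODE_30_256bA_NATIVE_210_24_0 "
              "exited with zero status but no 'finished' file")
    entry = parse_log_line(sample)
    print(entry.project, exited_without_finished_file(entry))
```

Splitting with a limit of 2 keeps any `|` characters inside the message text intact.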
Moderator9 Volunteer moderator Send message Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
The people with hung work units in memory waiting for instructions: The BOINC software probably unloaded it because the project was suspended. If you get another one of these try suspending just that Work Unit from the Work tab, thereby keeping the project active but trapping the work unit. I have notified Dr(s) Kim and Baker to take a look at this thread and provide you with additional instructions. It is possible that when you restart the project on your system, that it will pick up your formerly hung WU and run it to completion. While many people at Rosetta simply abort hung WUs, usually you can get them going again by just stopping and restarting BOINC. However for Ralph the project team really needs to capture some of these in a "bottle" like you are trying to do to get a good look at them. Keep up the good work folks, together we can kill this bug. Moderator9 RALPH@home FAQs RALPH@home Guidelines Moderator Contact |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
I have a true 1% bug! I am posting some additional information that I think may be of interest: D:\boinc\BOINC\projects\ralph.bakerlab.org>head -200 BARCODE_30_1elwA_212_3_0_0 [REAL OPT]Default value for [-cpu_frac] 10 [REAL OPT]Default value for [-frame_rate] 10 [INT OPT]New value for [-cpu_run_time] 3600 command executed: projects/ralph.bakerlab.org/rosetta_beta_4.85_windows_intelx86.exe cc 1elw A -abrelax -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -new_centroid_packing -barcode_from_fragments_length 30 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 -paths ccfrags200.txt -relax_score_filter -filter1 -240 -filter2 -255 -cpu_run_time 3600 -constant_seed -jran 3995108 [STR OPT]New value for [-paths] ccfrags200.txt. [T/F OPT]Default FALSE value for [-unix_paths] [T/F OPT]Default FALSE value for [-version] [T/F OPT]Default FALSE value for [-score] [T/F OPT]Default FALSE value for [-abinitio] [T/F OPT]Default FALSE value for [-refine] [T/F OPT]Default FALSE value for [-assemble] [T/F OPT]Default FALSE value for [-idealize] [T/F OPT]Default FALSE value for [-relax] [T/F OPT]New TRUE value for [-abrelax] [T/F OPT]Default FALSE value for [-sim_aneal] [T/F OPT]Default FALSE value for [-cenrlx] [T/F OPT]Default FALSE value for [-force_expand] [T/F OPT]Default FALSE value for [-minimize] Rosetta mode: abrelax [T/F OPT]Default FALSE value for [-chain] [T/F OPT]Default FALSE value for [-protein] [T/F OPT]Default FALSE value for [-series] series_code = cc :: protein_name is 1elw:: chain_id is A. 
[INT OPT]New value for [-nstruct] 10 [T/F OPT]Default FALSE value for [-use_pdbseq] [T/F OPT]Default FALSE value for [-read_all_chains] [T/F OPT]Default FALSE value for [-use_pdb_numbering] [T/F OPT]Default FALSE value for [-fa_input] [T/F OPT]Default FALSE value for [-overwrite] [T/F OPT]Default FALSE value for [-output_pdb_gz] [T/F OPT]New TRUE value for [-output_silent_gz] [T/F OPT]Default FALSE value for [-output_scorefile_gz] [T/F OPT]Default FALSE value for [-termini] [T/F OPT]Default FALSE value for [-Nterminus] [T/F OPT]Default FALSE value for [-Cterminus] [T/F OPT]Default FALSE value for [-use_trie] [T/F OPT]Default FALSE value for [-no_trie] [T/F OPT]Default FALSE value for [-trials_trie] [T/F OPT]Default FALSE value for [-no_trials_trie] [T/F OPT]Default FALSE value for [-read_interaction_graph] [T/F OPT]Default FALSE value for [-write_interaction_graph] [STR OPT]Default value for [-ig_file] . [T/F OPT]Default FALSE value for [-silent_input] [T/F OPT]Default FALSE value for [-timer] [T/F OPT]Default FALSE value for [-count_attempts] [T/F OPT]Default FALSE value for [-status] [T/F OPT]Default FALSE value for [-ise_movie] [T/F OPT]Default FALSE value for [-output_all] [T/F OPT]New TRUE value for [-output_chi_silent] [T/F OPT]Default FALSE value for [-skip_missing_residues] [STR OPT]Default value for [-cst] cst. [STR OPT]Default value for [-dpl] dpl. [STR OPT]Default value for [-resfile] none. [STR OPT]Default value for [-equiv_resfile] none. [T/F OPT]Default FALSE value for [-auto_resfile] [T/F OPT]Default FALSE value for [-chain_inc] [T/F OPT]Default FALSE value for [-full_filename] [T/F OPT]Default FALSE value for [-map_sequence] [INT OPT]Default value for [-max_frags] 200 [STR OPT]Default value for [-protein_name_prefix] . [STR OPT]Default value for [-frags_name_prefix] . 
[T/F OPT]Default FALSE value for [-enable_dna] [T/F OPT]Default FALSE value for [-phospho_ser] [T/F OPT]Default FALSE value for [-loops] [T/F OPT]Default FALSE value for [-taboo] [T/F OPT]Default FALSE value for [-multi_chain] [T/F OPT]New TRUE value for [-ex1] [T/F OPT]New TRUE value for [-ex2] [T/F OPT]Default FALSE value for [-ex3] [T/F OPT]Default FALSE value for [-ex4] [T/F OPT]Default FALSE value for [-ex1aro] [T/F OPT]Default FALSE value for [-ex1aro_half] [T/F OPT]Default FALSE value for [-ex2aro_only] [INT OPT]Default value for [-extrachi_cutoff] 18 [T/F OPT]Default FALSE value for [-rot_pert] [T/F OPT]Default FALSE value for [-rot_pert_input] [T/F OPT]Default FALSE value for [-exdb] [T/F OPT]Default FALSE value for [-use_electrostatic_repulsion] [T/F OPT]Default FALSE value for [-explicit_h2o] [T/F OPT]Default FALSE value for [-solvate] [T/F OPT]Default FALSE value for [-pH] [T/F OPT]Default FALSE value for [-try_both_his_tautomers] [T/F OPT]Default FALSE value for [-minimize_rot] [T/F OPT]Default FALSE value for [-read_hetero_h2o] [T/F OPT]Default FALSE value for [-Wint_score_only] [T/F OPT]Default FALSE value for [-Wint_repack_only] [T/F OPT]Default FALSE value for [-ligand] [T/F OPT]Default FALSE value for [-enzyme_design] [T/F OPT]Default FALSE value for [-score_contact_flag] [T/F OPT]Default FALSE value for [-score_contact_weight] [T/F OPT]Default FALSE value for [-score_contact_threshold] [T/F OPT]Default FALSE value for [-scorefxn] default centroid scorefxn: 4 default fullatom scorefxn: 12 [INT OPT]Default value for [-run_level] 0 [T/F OPT]New TRUE value for [-silent] run level: -4 [T/F OPT]Default FALSE value for [-benchmark] [T/F OPT]Default FALSE value for [-debug] [STR OPT]Default value for [-s] none. [STR OPT]Default value for [-l] none. Reading .Rama_smooth_dyn.dat_ss_6.4.gz Reading .phi.theta.36.HS.resmooth.gz Reading .phi.theta.36.SS.resmooth.gz [STR OPT]Default value for [-atom_vdw_set] default. 
[T/F OPT]Default FALSE value for [-IUPAC] Atom_mode set to all Reading .paircutoffs.gz [T/F OPT]Default FALSE value for [-decoystats] set_decoystats_flag: from,to F F [T/F OPT]Default FALSE value for [-decoyfeatures] [T/F OPT]Default FALSE value for [-evolution] [T/F OPT]Default FALSE value for [-evol_recomb] BOINC :: [2006-02-19 05:37:18] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys : 0 :: pct_complete: 0.01 Searching for dat file: .1elw.dat Searching for dat file: .1elw.dat WARNING!! .dat file not found! Looking for fasta file: .1elwA.fasta [T/F OPT]Default FALSE value for [-find_disulf] [T/F OPT]Default FALSE value for [-fix_disulf] Looking for psipred file: .1elwA.psipred_ss2 Protein type: all alpha Fraction beta: 0nbeta 0 disabling sheet filter XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX WARNING: CONSTRAINT FILE NOT FOUND Searched for: .1elwA.cst Running without distance constraints XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND Searched for: .1elwA.dpl Dipolar constraints will not be used XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX fragment file: .cc1elwA03_05.200_v1_3.gz Total Residue 117 frag size: 3 frags/residue: 200 fragment file: .cc1elwA09_05.200_v1_3.gz Total Residue 117 frag size: 9 frags/residue: 200 generating 1mer library from 3mer library [T/F OPT]New TRUE value for [-ssblocks] 1 0.714999974 0 -1.81021962E+01011 2 0.845000029 0 -457391011 3 1 0 -5.84474728E+02211 4 1 0 0.009290844211 5 0.985000014 0 0.014999999711 6 0.964999974 0.0299999993 0.0049999998911 7 0.964999974 0 9303.766611 8 0.995000005 0 -8.6274389E+02611 9 0.985000014 0.00999999978 1.27419852E+02611 10 0.970000029 0 -237.30987511 11 1 0 -3.57050048E-02111 12 1 0 -2.10856042E-03911 13 0.995000005 0 -8.15923702E+02411 14 0.930000007 0 0.070000000311 15 0.800000012 0 4.05845883E+01911 16 0.469999999 0.0299999993 1.86625286E+02600 17 0.300000012 0 0.69999998800 18 0.38499999 0.0149999997 
0.60000002400 19 0.995000005 0 0.032656710611 20 1 0 5.50368345E-01311 21 0.995000005 0 0.0049999998911 22 1 0 1.64160455E+01311 23 1 0 6.48693289E-04011 24 0.99000001 0 -6.31571591E+01611 25 0.995000005 0 0.0049999998911 26 1 0 7893327.511 27 1 0 6.10060667E+02511 28 0.995000005 0 0.0049999998911 29 0.985000014 0.00999999978 2.53535253E+02211 30 0.920000017 0.0199999996 0.059999998711 31 0.670000017 0.075000003 0.25499999500 32 0.270000011 0.0900000036 0.64012342700 33 0 0.00999999978 -1.84238144E+02000 34 0.0500000007 0.00999999978 0.93999999800 35 0.230000004 0 -1.09468657E+01500 36 0.550000012 0.00499999989 -4.3622533E+02500 37 0.985000014 0 -1.9613201E+03511 38 0.99000001 0 0.0099999997811 39 0.985000014 0.00499999989 0.0076843472211 40 1 0 -1.8080512E+00911 41 0.980000019 0 -1.13549524E+02411 42 0.99000001 0 0.010001833611 43 1 0 6.42855928E-03811 44 1 0 6.28685685E+01811 45 1 0 3.61228439E+01611 46 0.995000005 0 -2.59030907E+03411 47 1 0 5.11612367E+02611 48 0.985000014 0 0.014999999711 49 0.870000005 0 0.12999999511 50 0.435000002 0.0500000007 41141.554700 51 0.200000003 0 0.80000001200 52 0.460000008 0.00499999989 2.92289013E+03500 53 0.995000005 0 0.0049999998911 54 1 0 2.34173571E+01311 55 1 0 8.93829884E-03811 56 0.99000001 0 -1.96550475E+01311 57 1 0 138633511 58 1 0 -39566345611 59 0.930000007 0.00999999978 0.059999998711 D:\boinc\BOINC\projects\ralph.bakerlab.org> PS: the first 200 lines of this WU's output only show the first 59 steps. The WU hung at step 2001. Do you want me to post all lines from 201 through the end? |
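Most of the dump above is options left at their defaults; the few marked "New" are the ones this work unit actually overrides. A rough way to pull those out is sketched below. The regular expression is inferred only from the lines quoted in this post, and treating the trailing period on string values as punctuation added by the logger is an assumption.

```python
import re
from typing import Dict, Union

# Pattern inferred from the quoted output, e.g.
#   [INT OPT]New value for [-cpu_run_time] 3600
#   [T/F OPT]New TRUE value for [-abrelax]
OPTION_RE = re.compile(
    r"\[(?P<kind>REAL|INT|STR|T/F) OPT\]"
    r"(?P<origin>New|Default)(?: (?P<flag>TRUE|FALSE))? value for "
    r"\[(?P<name>-\w+)\](?: (?P<value>[^\s\[]\S*))?"
)


def overridden_options(text: str) -> Dict[str, Union[str, bool]]:
    """Return only the options reported as 'New' (i.e. not left at default)."""
    found: Dict[str, Union[str, bool]] = {}
    for m in OPTION_RE.finditer(text):
        if m.group("origin") != "New":
            continue
        if m.group("kind") == "T/F":
            found[m.group("name")] = m.group("flag") == "TRUE"
            continue
        value = m.group("value") or ""
        if m.group("kind") == "STR" and value.endswith("."):
            value = value[:-1]  # assumed: the final '.' is log punctuation
        found[m.group("name")] = value
    return found


if __name__ == "__main__":
    dump = ("[INT OPT]New value for [-cpu_run_time] 3600 "
            "[T/F OPT]Default FALSE value for [-relax] "
            "[T/F OPT]New TRUE value for [-abrelax]")
    print(overridden_options(dump))
```

Numeric values are kept as strings so nothing is lost to conversion; callers can cast them as needed.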
dekim Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Can you restart boinc and see if it continues on? |
David@home Send message Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
Can you restart boinc and see if it continues on? Restarted BOINC; the WU appears to have gone back to the start. According to the graphic it is at Model 1 step 78 (and incrementing). If it gets stuck again, is there anything we can do to help pin this down? From the log: 19/02/2006 19:42:20||request_reschedule_cpus: project op 19/02/2006 19:42:40|ralph@home|Restarting result BARCODE_30_256bA_NATIVE_210_24_0 using rosetta_beta version 4.84 19/02/2006 19:42:40|SETI@home|Pausing result 14au00aa.7506.496.234660.1.92_2 (left in memory) |
David@home Send message Joined: 16 Feb 06 Posts: 24 Credit: 409 RAC: 0 |
Can you restart boinc and see if it continues on? OK, this time it went through to completion and credit was granted: https://ralph.bakerlab.org/workunit.php?wuid=3325 Interestingly, this is about 30 minutes of CPU time; it had done 30 minutes previously before it hung. These test WUs typically take just under an hour on my PC. It is as if it has only claimed credit for the second 30 minutes of CPU but carried on the calculations from where it got stuck. Should it have claimed the credit for both periods of CPU activity? Any comments on the credit and the fact that it did not hang the second time? It would be interesting to hear any ideas. Thanks. |
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
I was out for about 1 hour. YES, the job ran -:) However, *the only thing that changed is: CPU time: 5 hr 14 min 50 sec and increasing ... more and more ... *all other things on the screen are completely frozen (as before) 1% complete Stage: Ab initio model: 1 step : 2001 Accepted RMSD: 6.134 Accepted Energy: -11.31033 *Suspending job again! PS: Does Dr. Kim really hope to get out of an endless loop by brute force? *The job now has more than 5 hours of CPU time @ 5000 MHz and is still @ 1%!!! My jobs are set to 8 hours of run time ... maybe the test he wants is whether the job will crash after exceeding the 8-hour CPU time limit @ 1%? *Wouldn't it be better for me to abort it now? *What I expected from him was instructions on how to do an interactive trace of the run, step by step, or on using Dr. Watson to get a memory dump of my 512 MB of RAM and e-mailing him that dump. *Never a brute-force test -:( |
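The symptom reported here (CPU time keeps climbing while model and step never change) can be written down as a simple check over readings taken from the graphic. This is a sketch under assumptions: the `Sample` type and function name are made up for the example, and the one-hour default only echoes the "at least an hour or so" advice earlier in the thread.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    cpu_seconds: int  # CPU time shown on the graphic, in seconds
    model: int
    step: int


def looks_stuck(samples: Sequence[Sample], min_cpu_seconds: int = 3600) -> bool:
    """True if CPU time advanced by min_cpu_seconds with model/step unchanged."""
    if len(samples) < 2:
        return False
    first, last = samples[0], samples[-1]
    same_position = all((s.model, s.step) == (first.model, first.step) for s in samples)
    return same_position and last.cpu_seconds - first.cpu_seconds >= min_cpu_seconds


if __name__ == "__main__":
    # The two readings reported in this thread: 4:02:36 and 5:14:50, both at model 1, step 2001.
    readings = [Sample(4 * 3600 + 2 * 60 + 36, 1, 2001),
                Sample(5 * 3600 + 14 * 60 + 50, 1, 2001)]
    print(looks_stuck(readings))
```

A work unit that sits idle without using CPU, as in the earlier Linux report, would not trip this check and needs a separate test.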
Carlos_Pfitzner Send message Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Calm down, I have a UPS (no-break); unless a power failure of more than half an hour occurs, this WU will be kept in RAM ad infinitum. *If my diesel generator weren't being repaired, I could even guarantee that! PS: I posted the "random seed" in this thread, along with other technical details for this job. *For example: WARNING: CONSTRAINT FILE NOT FOUND Searched for: .1elwA.cst Running without distance constraints WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND Searched for: .1elwA.dpl Dipolar constraints will not be used *Thus, even if a system reboot occurs, I am fairly sure I can reproduce the problem again -:) -constant_seed -jran 3995108 |
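If the seed is what makes the run repeatable, it helps to record it together with the result link. The sketch below pulls the value following `-jran` out of a "command executed" line such as the one quoted earlier. The helper name is hypothetical, and whether re-running with the same seed really reproduces the hang is the poster's expectation, not something confirmed in this thread.

```python
import shlex
from typing import Optional


def extract_seed(command_line: str) -> Optional[int]:
    """Return the integer after -jran, or None if no fixed seed is present."""
    tokens = shlex.split(command_line)
    if "-constant_seed" not in tokens or "-jran" not in tokens:
        return None
    index = tokens.index("-jran")
    if index + 1 >= len(tokens):
        return None  # flag given without a value
    try:
        return int(tokens[index + 1])
    except ValueError:
        return None  # value after -jran is not a number


if __name__ == "__main__":
    cmd = "rosetta_beta cc 1elw A -abrelax -cpu_run_time 3600 -constant_seed -jran 3995108"
    print(extract_seed(cmd))
```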
John McLeod VII Send message Joined: 16 Feb 06 Posts: 8 Credit: 39,560 RAC: 0 |
I have one that is stuck WU, Computer. It has been going for 2 days, 20 hours, 58 minutes and 4 seconds of CPU time. This machine is currently estimating 8 hours for completion of other results. Awaiting further instructions. jm7 BOINC WIKI |
©2024 University of Washington
http://www.bakerlab.org