RALPH@home

Bug reports for 5.60-5.62

Profile dekim
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jan 20 06
Posts: 216
ID: 1
Credit: 484,843
RAC: 84
Message 3011 - Posted 26 Apr 2007 19:02:33 UTC

    This update contains the following:

    1. a fix for the percent complete going back to zero after restarts
    2. checkpointing for new pose and jumping jobs
    3. optimization in jumping jobs which were having issues with long and variable run times.

    I'll queue up some test jobs soon.
    ____________

    Mike Gelvin

    Joined: Feb 17 06
    Posts: 50
    ID: 468
    Credit: 55,397
    RAC: 0
    Message 3012 - Posted 26 Apr 2007 20:51:06 UTC

      First workunit I got promptly crashed:

      http://ralph.bakerlab.org/result.php?resultid=496686

      <core_client_version>5.8.15</core_client_version>
      <![CDATA[
      <message>
      Incorrect function. (0x1) - exit code 1 (0x1)
      </message>
      <stderr_txt>
      ERROR:: Unable to obtain total_residue & sequence.
      start pdb file must be provided.
      ERROR:: Exit from: .\input_pdb.cc line: 2944
      # cpu_run_time_pref: 3600

      </stderr_txt>
      ]]>




      with 0 CPU time.... this does not bode well...


      ____________

      Profile dekim
      Forum moderator
      Project administrator
      Project developer
      Project scientist

      Joined: Jan 20 06
      Posts: 216
      ID: 1
      Credit: 484,843
      RAC: 84
      Message 3013 - Posted 26 Apr 2007 21:04:03 UTC

        I accidentally submitted a few bad jobs that will fail. Please ignore these.
        ____________

        Profile feet1st

        Joined: Mar 7 06
        Posts: 312
        ID: 1028
        Credit: 110,522
        RAC: 0
        Message 3014 - Posted 27 Apr 2007 1:41:12 UTC

          I've got a search pairings WU. It was 3 hrs 48 min into the run, on the 19th model, and had just crossed into the second stage.

          I exited BOINC, restarted, and % completed still looks good and I only lost 2.5min of work! So, the checkpointing must be working too!
          ____________

          Profile feet1st

          Joined: Mar 7 06
          Posts: 312
          ID: 1028
          Credit: 110,522
          RAC: 0
          Message 3015 - Posted 27 Apr 2007 13:29:42 UTC

            I did another end and restart of BOINC on the same WU and it ended upon restart when it had 8.5hrs more to go to reach target CPU time. Upon restart it was initializing for 45 seconds or so and then ended. No indication of why in the result and the messages tab just says computation finished like normal. Looks like the 76 models it had completed were preserved and reported though, and credit was granted.
            ____________

            Profile feet1st

            Joined: Mar 7 06
            Posts: 312
            ID: 1028
            Credit: 110,522
            RAC: 0
            Message 3016 - Posted 27 Apr 2007 14:51:19 UTC

              Another search pairings WU doesn't seem to be displaying the sidechains properly in the graphic. Running on Windows XP Pro. They just appear as little dots, and the ribbons of the backbone are almost translucent. Using "C" to change coloration just changes the color of the dots. Never had this occur before.
              ____________

              Profile dekim
              Forum moderator
              Project administrator
              Project developer
              Project scientist

              Joined: Jan 20 06
              Posts: 216
              ID: 1
              Credit: 484,843
              RAC: 84
              Message 3017 - Posted 27 Apr 2007 19:04:53 UTC

                Odd issues.

                For the display, did you try 'b', which changes the backbone display, or 's', which changes the sidechain display? Your description sounds like abnormal behavior.

                I'll have to look into what may have caused the premature ending.

                thanks for your help!
                ____________

                Profile dekim
                Forum moderator
                Project administrator
                Project developer
                Project scientist

                Joined: Jan 20 06
                Posts: 216
                ID: 1
                Credit: 484,843
                RAC: 84
                Message 3018 - Posted 27 Apr 2007 19:17:39 UTC

                  Found the cause of the premature ending and will have to fix it. It's not a bad bug though, since the jobs return successfully. Thanks!
                  ____________

                  Profile feet1st

                  Joined: Mar 7 06
                  Posts: 312
                  ID: 1028
                  Credit: 110,522
                  RAC: 0
                  Message 3019 - Posted 27 Apr 2007 19:32:53 UTC

                    Just did another end and restart and had another end prematurely, so I can't try hitting other characters. Perhaps I had hit one of the other characters by mistake.

                    Here is a screenshot of this WU with the funky sidechains (or lack thereof).
                    ____________

                    Profile dekim
                    Forum moderator
                    Project administrator
                    Project developer
                    Project scientist

                    Joined: Jan 20 06
                    Posts: 216
                    ID: 1
                    Credit: 484,843
                    RAC: 84
                    Message 3020 - Posted 27 Apr 2007 19:49:29 UTC

                      That doesn't look right. Does it happen often? The premature ending can actually be fixed without a new compile. The problem is that, after a restart, the job goes by an nstruct argument we give (one that should never be used) instead of the cpu run time pref.
                      ____________
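To illustrate the premature ending described above, here is a minimal sketch. It assumes the job can stop either on a model-count limit (nstruct) or on the CPU run time preference. The function name, its parameters, and the numbers in the note below are hypothetical; this is not the actual Rosetta code.

```python
def should_stop(models_done, cpu_seconds_used, cpu_run_time_pref, nstruct, use_nstruct):
    """Return True when the job should finish and report its results.

    use_nstruct=True mimics the behaviour described after a restart, where
    the model-count limit is consulted. use_nstruct=False is the intended
    behaviour, where only the CPU run time preference decides.
    All names here are hypothetical.
    """
    if use_nstruct:
        # Stop as soon as the number of finished models reaches nstruct.
        return models_done >= nstruct
    # Stop only once the CPU time used reaches the user's preference.
    return cpu_seconds_used >= cpu_run_time_pref
```

With made-up values (nstruct of 10, 76 models done, and hours still left on the run time preference), the nstruct rule ends the job right away while the run time rule lets it continue.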

                      Profile feet1st

                      Joined: Mar 7 06
                      Posts: 312
                      ID: 1028
                      Credit: 110,522
                      RAC: 0
                      Message 3021 - Posted 27 Apr 2007 19:55:18 UTC

                        Last modified: 27 Apr 2007 19:58:33 UTC

                        I just got this one on the same host, and it is happening there as well. Rosetta is running on the other thread of the HT CPU and looks fine.

                        Does it happen often?? No, these two WUs are the only times I've seen this happen.

                        [edit] I forgot to mention! I only lost 30 seconds of runtime on that last restart! The new checkpointing must be working well.
                        ____________

                        Profile feet1st

                        Joined: Mar 7 06
                        Posts: 312
                        ID: 1028
                        Credit: 110,522
                        RAC: 0
                        Message 3022 - Posted 27 Apr 2007 20:21:35 UTC

                          I got this one on another host and it seems to have the same issue with the graphic.
                          ____________

                          Profile dekim
                          Forum moderator
                          Project administrator
                          Project developer
                          Project scientist

                          Joined: Jan 20 06
                          Posts: 216
                          ID: 1
                          Credit: 484,843
                          RAC: 84
                          Message 3023 - Posted 27 Apr 2007 21:27:45 UTC

                            The checkpoint interval for these test jobs is set at 60 seconds but it is also limited by the disk write interval user preference.
                            ____________
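A rough sketch of how the two settings might combine, assuming "limited by" means the application never checkpoints more often than the user's disk write interval allows. The function and parameter names are hypothetical, and the real client logic may differ.

```python
def effective_checkpoint_interval(app_interval_s, disk_write_interval_s):
    """Return the shortest time between checkpoints, in seconds.

    Assumption: the application asks for app_interval_s (60 for these test
    jobs), and the user's disk write interval preference acts as a lower
    bound on how often a checkpoint may be written.
    """
    return max(app_interval_s, disk_write_interval_s)
```

Under this assumption, a restart would lose at most about one effective interval of work: up to 60 seconds if the disk write preference is 60 seconds or less, and more if the preference is set higher.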

                            Profile Conan

                            Joined: Feb 16 06
                            Posts: 344
                            ID: 145
                            Credit: 1,323,912
                            RAC: 681
                            Message 3024 - Posted 28 Apr 2007 7:07:14 UTC

                              Last modified: 28 Apr 2007 7:09:32 UTC

                              Error

                              <core_client_version>5.8.11</core_client_version>
                              <![CDATA[
                              <message>
                              process exited with code 1 (0x1)
                              </message>
                              <stderr_txt>
                              Graphics are disabled due to configuration...
                              # cpu_run_time_pref: 21600
                              ERROR:: Unable to obtain total_residue & sequence.
                              start pdb file must be provided.
                              ERROR:: Exit from: input_pdb.cc line: 2944

                              http://ralph.bakerlab.org/result.php?resultid=496662
                              http://ralph.bakerlab.org/result.php?resultid=496727

                              Also
                              0
                              stderr out

                              <core_client_version>5.8.11</core_client_version>
                              <![CDATA[
                              <message>
                              process exited with code 1 (0x1)
                              </message>
                              <stderr_txt>
                              Graphics are disabled due to configuration...
                              # cpu_run_time_pref: 21600
                              ERROR:: Unable to determine sequence length from starting structure coordinate file
                              ERROR:: Exit from: input_pdb.cc line: 2962

                              http://ralph.bakerlab.org/result.php?resultid=498608

                              ____________

                              mdettweiler

                              Joined: Apr 4 07
                              Posts: 11
                              ID: 2886
                              Credit: 1,010
                              RAC: 0
                              Message 3025 - Posted 30 Apr 2007 5:10:44 UTC

                                Last modified: 30 Apr 2007 5:11:43 UTC

                                I got this workunit a couple of days ago, and it appears that after shutting down and restarting BOINC, it always picks up from the beginning of the model (which is model 1, since it normally doesn't get to do more in the default run time of 1 hour). This is despite the fact that sometimes it's been running for more than an hour on end--which is telling me that the checkpointing may not be working (although I wouldn't have suspected this at first, because the progress reporting was fine).

                                Does anyone know if the problem is with the checkpoints, or with something else? Should I abort the WU?

                                Edit: I forgot to mention this, but yes, I am sure that this workunit does use the new 5.60 application, so it's not an old holdover.

                                Profile dekim
                                Forum moderator
                                Project administrator
                                Project developer
                                Project scientist

                                Joined: Jan 20 06
                                Posts: 216
                                ID: 1
                                Credit: 484,843
                                RAC: 84
                                Message 3026 - Posted 30 Apr 2007 17:14:35 UTC

                                  Anonymous,

                                  That work unit uses the standard checkpointing method which doesn't have the same resolution as the new checkpointing for pose/jumping jobs. I also noticed that this particular job runs quite long, just over an hour per decoy on average. Unfortunately, the long run time is required for this particular experiment.
                                  ____________

                                  mdettweiler
                                  Avatar

                                  Joined: Apr 4 07
                                  Posts: 11
                                  ID: 2886
                                  Credit: 1,010
                                  RAC: 0
                                  Message 3027 - Posted 30 Apr 2007 19:58:51 UTC - in response to Message 3026.

                                    Last modified: 30 Apr 2007 19:59:18 UTC

                                    Anonymous,

                                    That work unit uses the standard checkpointing method which doesn't have the same resolution as the new checkpointing for pose/jumping jobs. I also noticed that this particular job runs quite long, just over an hour per decoy on average. Unfortunately, the long run time is required for this particular experiment.


                                    So...do you mean that the checkpointing is worthless in this particular task? Or, was that a problem but now fixed, after I already got that workunit?

                                    Profile feet1st

                                    Joined: Mar 7 06
                                    Posts: 312
                                    ID: 1028
                                    Credit: 110,522
                                    RAC: 0
                                    Message 3028 - Posted 30 Apr 2007 21:44:38 UTC

                                      Rosetta has several different modes. Different types of tasks. They have had better checkpointing for some types of tasks for some time. They've now added better checkpointing for some additional specific types of tasks... but they aren't done yet adding checkpointing to all of the different types of work that Rosetta is capable of doing.

                                      David Kim, is the plan to roll out the next release with only the improvements to the pose/jumping checkpointing as it is? Or will you have all task types enhanced prior to a Rosetta release?
                                      ____________

                                      Profile dekim
                                      Forum moderator
                                      Project administrator
                                      Project developer
                                      Project scientist

                                      Joined: Jan 20 06
                                      Posts: 216
                                      ID: 1
                                      Credit: 484,843
                                      RAC: 84
                                      Message 3029 - Posted 1 May 2007 0:29:23 UTC

                                        The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though.
                                        ____________

                                        mdettweiler

                                        Joined: Apr 4 07
                                        Posts: 11
                                        ID: 2886
                                        Credit: 1,010
                                        RAC: 0
                                        Message 3030 - Posted 1 May 2007 4:14:26 UTC - in response to Message 3029.

                                          The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though.


                                          So, are you saying that checkpointing is a no-go for this type of workunit? It sure looks like it from what I'm seeing.

                                          I'm also having a similar problem with a regular Rosetta@Home workunit (a 10 hour one); except that it checkpointed fine until it was, like, 65-70 percent done, and it appears that one of the checkpoints fell through and it started from the beginning. (See the "problems with version 5.59" thread over there for details).

                                          Profile dekim
                                          Forum moderator
                                          Project administrator
                                          Project developer
                                          Project scientist

                                          Joined: Jan 20 06
                                          Posts: 216
                                          ID: 1
                                          Credit: 484,843
                                          RAC: 84
                                          Message 3031 - Posted 1 May 2007 5:34:03 UTC

                                            There is checkpointing for that type of work unit. It just happens less frequently. It may be that your R@h workunit displayed 0% complete after a restart but there is a bug in the % complete display so I would ignore it and go by the cpu run time. This minor bug is fixed in this ralph version.
                                            ____________

                                            mdettweiler

                                            Joined: Apr 4 07
                                            Posts: 11
                                            ID: 2886
                                            Credit: 1,010
                                            RAC: 0
                                            Message 3033 - Posted 1 May 2007 19:04:18 UTC - in response to Message 3031.

                                              There is checkpointing for that type of work unit. It just happens less frequently. It may be that your R@h workunit displayed 0% complete after a restart but there is a bug in the % complete display so I would ignore it and go by the cpu run time. This minor bug is fixed in this ralph version.


                                              Yes, I figured there might be a problem with the progress reporting--but even though the WU was on the 12th model or so before, it had now jumped back to the first one, so I knew a checkpoint must have fallen through.

                                              Profile dekim
                                              Forum moderator
                                              Project administrator
                                              Project developer
                                              Project scientist

                                              Joined: Jan 20 06
                                              Posts: 216
                                              ID: 1
                                              Credit: 484,843
                                              RAC: 84
                                              Message 3034 - Posted 1 May 2007 19:09:39 UTC

                                                Actually, there is another bug in the current R@h app that doesn't report the model number correctly under certain circumstances. This is fixed in the current ralph app being tested.
                                                ____________

                                                Profile dekim
                                                Forum moderator
                                                Project administrator
                                                Project developer
                                                Project scientist

                                                Joined: Jan 20 06
                                                Posts: 216
                                                ID: 1
                                                Credit: 484,843
                                                RAC: 84
                                                Message 3035 - Posted 1 May 2007 20:07:14 UTC

                                                  We're now testing version 5.61. This version has a few science code updates and uses the latest version of the BOINC API.
                                                  ____________

                                                  Profile dekim
                                                  Forum moderator
                                                  Project administrator
                                                  Project developer
                                                  Project scientist

                                                  Joined: Jan 20 06
                                                  Posts: 216
                                                  ID: 1
                                                  Credit: 484,843
                                                  RAC: 84
                                                  Message 3036 - Posted 1 May 2007 23:17:26 UTC

                                                    Version 5.62 has a fix in the stack size adjuster for Macs.
                                                    ____________

                                                    Profile feet1st

                                                    Joined: Mar 7 06
                                                    Posts: 312
                                                    ID: 1028
                                                    Credit: 110,522
                                                    RAC: 0
                                                    Message 3037 - Posted 2 May 2007 2:21:12 UTC - in response to Message 3033.

                                                      Yes, I figured there might be a problem with the progress reporting--but even though the WU was on the 12th model or so before, it had now jumped back to the first one, so I knew a checkpoint must have fallen through.


                                                      Work should be preserved after each model. They don't refer to that save as a checkpoint; it is simply the point at which the application saves its work. So you shouldn't lose previously completed models under any circumstance. I've never seen that happen before. The checkpoints are mid-model saves.
                                                      ____________
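The distinction between the per-model save and the mid-model checkpoint can be sketched as follows. This is only an illustration of the behaviour described above; the function name and the idea of a numbered "step" within a model are hypothetical and not taken from the Rosetta source.

```python
from typing import Optional, Tuple


def resume_point(completed_models: int, last_checkpoint_step: Optional[int]) -> Tuple[int, int]:
    """Return (model number to work on, step to resume from) after a restart.

    Completed models are already saved, so the restart never goes back
    before them. A mid-model checkpoint, if one exists, only decides how far
    into the current model the work resumes; without one, the current model
    starts over from step 0.
    """
    current_model = completed_models + 1
    start_step = last_checkpoint_step if last_checkpoint_step is not None else 0
    return current_model, start_step
```

So with 12 models finished and no mid-model checkpoint, a restart would be expected to begin model 13 from scratch rather than drop back to model 1.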

                                                      Dr Who Fan

                                                      Joined: Sep 2 06
                                                      Posts: 63
                                                      ID: 1787
                                                      Credit: 55,275
                                                      RAC: 520
                                                      Message 3038 - Posted 2 May 2007 6:49:08 UTC

                                                        Client error Compute error
                                                        http://ralph.bakerlab.org/result.php?resultid=496726
                                                        <core_client_version>5.8.11</core_client_version>
                                                        <![CDATA[
                                                        <message>
                                                        Incorrect function. (0x1) - exit code 1 (0x1)
                                                        </message>
                                                        <stderr_txt>
                                                        # cpu_run_time_pref: 7200
                                                        ERROR:: Unable to obtain total_residue & sequence.
                                                        start pdb file must be provided.
                                                        ERROR:: Exit from: .\input_pdb.cc line: 2944

                                                        </stderr_txt>
                                                        ]]>

                                                        and

                                                        Client error Compute error
                                                        http://ralph.bakerlab.org/result.php?resultid=496717
                                                        <core_client_version>5.8.15</core_client_version>
                                                        <![CDATA[
                                                        <message>
                                                        Incorrect function. (0x1) - exit code 1 (0x1)
                                                        </message>
                                                        <stderr_txt>
                                                        # cpu_run_time_pref: 7200
                                                        ERROR:: Unable to obtain total_residue & sequence.
                                                        start pdb file must be provided.
                                                        ERROR:: Exit from: .\input_pdb.cc line: 2944

                                                        </stderr_txt>
                                                        ]]>

                                                        ____________

                                                        Profile feet1st

                                                        Joined: Mar 7 06
                                                        Posts: 312
                                                        ID: 1028
                                                        Credit: 110,522
                                                        RAC: 0
                                                        Message 3039 - Posted 2 May 2007 14:45:57 UTC

                                                          Last modified: 2 May 2007 14:53:21 UTC

                                                          I'm still seeing the translucent backbone and dotted sidechains in the graphic as originally reported here. This is under Windows, with version 5.62, on a search pairings task.

                                                          [edit] Same with this SYMM_FOLD_AND_DOCK_RELAX_BARCODE task. Also 5.62.
                                                          ____________

                                                          Profile feet1st

                                                          Joined: Mar 7 06
                                                          Posts: 312
                                                          ID: 1028
                                                          Credit: 110,522
                                                          RAC: 0
                                                          Message 3045 - Posted 3 May 2007 14:31:34 UTC

                                                            Last modified: 3 May 2007 14:37:42 UTC

                                                            Translucent backbone and dotted sidechains on 1bm8__DIVERSE_ABRELAX__NEWRELAXFLAGS_BCFROMFRAGS_20_frags83__1981

                                                            Same on Rosetta 5.62 task SEARCH_PAIRINGS_-1hz6A-round2_filters_SAVE_ALL_OUT_1694_1263

                                                            ____________


                                                            Copyright © 2017 University of Washington

                                                            Last Modified: 20 Nov 2008 19:41:56 UTC