RALPH@home

Bug reports for 5.90/5.91


Message boards : RALPH@home bug list : Bug reports for 5.90/5.91

Rhiju
Forum moderator
Project developer
Project scientist

Joined: Feb 14 06
Posts: 161
ID: 4
Credit: 3,725
RAC: 0
Message 3554 - Posted 20 Dec 2007 1:24:24 UTC

    Update to 5.90! Keep us posted on weirdness.
    ____________

    Conan

    Joined: Feb 16 06
    Posts: 344
    ID: 145
    Credit: 1,309,534
    RAC: 0
    Message 3556 - Posted 20 Dec 2007 11:04:35 UTC

      I am currently running 3 work units of the "1px8" variety.
      They are behaving very weirdly.
      I checked my computer and saw that only one WU (a CPDN WU) was progressing in time and percentage done; no other counters were moving.
      I have 4 cores, and Boinc Manager said that my 3 other cores were all running my 3 Ralph work units at High Priority.
      None of the CPU Time, Percent Done, or Time Left counters were moving.
      In fact, the CPU Time was at 0.00%, as was Percent Done.
      Suspending and restarting made no difference.
      I stopped Boinc and restarted Boinc Manager, and the RALPH WUs then showed that over an hour had been completed and that they were between 18% and 30% done.
      But the counters still are not moving; only when you stop and restart the manager do you see what progress has been made.

      The WUs are still running at the moment: 715097, 715153, 715237

      I checked my system (Linux) and found all 4 cores are running, so the WUs have not hung and are actually processing; it is just that I can't see what is happening.

      ____________

      Rhiju
      Forum moderator
      Project developer
      Project scientist

      Joined: Feb 14 06
      Posts: 161
      ID: 4
      Credit: 3,725
      RAC: 0
      Message 3558 - Posted 20 Dec 2007 18:38:44 UTC - in response to Message 3556.

        Hi Conan: Thanks, this is interesting and unexpected! I forgot to mention on the home page that we also updated the version of the BOINC API that is used to compile Rosetta@home, in anticipation of a new release of the BOINC client software (6.0). We'll look into whether this causes a problem with the reporting of % complete, etc.


        ____________

        Thomas Leibold

        Joined: Feb 25 07
        Posts: 27
        ID: 2684
        Credit: 77,464
        RAC: 0
        Message 3559 - Posted 21 Dec 2007 6:04:57 UTC

          Rosetta Thread

          I have problems with 5.90 in both Ralph and Rosetta (details posted in the Rosetta thread above). Besides the task display, it also appears that the Ralph task is running indefinitely (with a 4-hour runtime configuration, the workunit has already been running for over 24 hours).

          BigMike

          Joined: Feb 23 06
          Posts: 63
          ID: 738
          Credit: 58,730
          RAC: 0
          Message 3560 - Posted 21 Dec 2007 6:04:58 UTC

            This isn't really a bug, but more of an item that Rosetta@Home users will probably go nuts about. I suspect it has to do with the new BOINC API.

            Recent WUs have an interesting line that says ERRORS: CANCELLED.

            Any idea what it means? It might be useful to put it in the FAQs if it's always going to say that.

            ==Mike
            ____________
            Don't believe everything you think.

            Conan

            Joined: Feb 16 06
            Posts: 344
            ID: 145
            Credit: 1,309,534
            RAC: 0
            Message 3562 - Posted 21 Dec 2007 13:38:59 UTC - in response to Message 3558.

              Thanks Rhiju,
              My next 4 WUs of the "s099" type are also doing the same thing.
              They are running at "High Priority" and showing nothing in Boinc Manager about how they are progressing. This is happening on a different computer.
              ____________

              feet1st

              Joined: Mar 7 06
              Posts: 312
              ID: 1028
              Credit: 110,522
              RAC: 0
              Message 3563 - Posted 21 Dec 2007 15:30:51 UTC

                Last modified: 21 Dec 2007 15:32:40 UTC

                There appears to be a thread on the Linux CPU time issue in the BOINC Projects list:

                (user ID required)
                [boinc_projects] CPU Time not updated under linux
                ____________

                Conan

                Joined: Feb 16 06
                Posts: 344
                ID: 145
                Credit: 1,309,534
                RAC: 0
                Message 3567 - Posted 21 Dec 2007 21:38:38 UTC

                  It would appear that no matter what run time you have set in your preferences, it will not be adhered to.
                  I had 4 Ralph WUs running when I went to bed (no idea how long they had been running, as nothing was showing), all at high priority.
                  They were still running this morning, so I stopped BM and restarted it; this made the WUs immediately report and give their stats.
                  They had run for 19 to 20 hours and probably would have kept running until they were stopped. They produced 11 to 13 decoys.

                  718960
                  719023
                  719030
                  719087
                  ____________

                  Rhiju
                  Forum moderator
                  Project developer
                  Project scientist

                  Joined: Feb 14 06
                  Posts: 161
                  ID: 4
                  Credit: 3,725
                  RAC: 0
                  Message 3568 - Posted 21 Dec 2007 21:50:53 UTC - in response to Message 3567.

                    The Linux executable should behave better now. Thanks to Conan and others for the "front-lines" warning from ralph! Let us know how it's working.


                    ____________

                    Rhiju
                    Forum moderator
                    Project developer
                    Project scientist

                    Joined: Feb 14 06
                    Posts: 161
                    ID: 4
                    Credit: 3,725
                    RAC: 0
                    Message 3569 - Posted 21 Dec 2007 21:51:53 UTC - in response to Message 3560.

                      I didn't see the message in, e.g., the stderr -- where is it showing up?

                      ____________

                      Thomas Leibold

                      Joined: Feb 25 07
                      Posts: 27
                      ID: 2684
                      Credit: 77,464
                      RAC: 0
                      Message 3571 - Posted 22 Dec 2007 6:16:26 UTC - in response to Message 3559.

                        Last modified: 22 Dec 2007 6:17:08 UTC

                        Update: the Ralph 5.90 task has now been running for 48.5 hours (over 2 days). Is there anything that will stop those 5.90 tasks from running indefinitely?

                        What happens when the task/workunit reaches the Report Deadline (other than no longer getting credit for it)? Will that stop those 5.90 workunits, or will they still continue on?

                        Conan

                        Joined: Feb 16 06
                        Posts: 344
                        ID: 145
                        Credit: 1,309,534
                        RAC: 0
                        Message 3572 - Posted 22 Dec 2007 13:07:20 UTC - in response to Message 3571.

                          Hello Thomas,
                          The ones that I had only stopped when I stopped Boinc Manager and then restarted it. The Work Units then stopped and uploaded their results, also giving time taken and other stats.
                          None of mine errored out when I did this.

                          Another post I read in Rosetta said that there is a rule that will stop a Rosetta WU once it has run for 6 times its preference time. I have yet to verify this rule or prove it works.



                          ____________

                          Thomas Leibold

                          Joined: Feb 25 07
                          Posts: 27
                          ID: 2684
                          Credit: 77,464
                          RAC: 0
                          Message 3573 - Posted 22 Dec 2007 20:45:20 UTC - in response to Message 3572.


                            Hello Thomas,
                            The ones that I had only stopped when I stopped Boinc Manager and then restarted it. The Work Units then stopped and uploaded their results, also giving time taken and other stats.

                            Hi Conan, it appears that restarting Boinc is indeed the only way to dislodge one of those stuck 5.90 workunits. Left unattended, those 5.90 workunits will continue indefinitely! Unfortunately, restarting Boinc is not a complete fix either: as soon as Boinc starts up again, it will process the next 5.90 workunit that it still has in its queue.

                            I did not have any of the ones I had error by doing this.

                            Out of around 250 stuck 5.90 workunits, fewer than 10 have made it back to the Rosetta servers on their own (some with 299 or more decoys completed). I don't know why those finished when most others didn't, but most of those that did finish failed in the Rosetta validator (presumably because the reported CPU time, such as 0.0004 seconds, was too small to be credible). On the other hand, two of them did get a large amount of credit for the high number of completed decoys.

                            The two 5.90 workunits on my home server (one Ralph, one Rosetta) both passed the validator and got huge amounts of credit when I restarted Boinc on that machine. More importantly, the amount of CPU time reported back to Rosetta was correct for those two workunits! At least part of the Boinc/Rosetta system must be aware of the correct amount of time spent on the workunit and record it properly, so that the CPU time spent is available on restart.


                            Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time, I have yet to verify this rule or prove it works.

                            I wish that was true, but I have proof to the contrary. That Ralph 5.90 task was running 50 hours which is a lot more than 6 times the preference time of 4 hours.

                            I believe the only options available to clean up this mess for Linux users are:
                            Option A:
                            1.) Where possible, use Boinc Manager to abort all 5.90 workunits that are in the "Ready to Start" state. This is necessary so that they will not become stuck too!
                            2.) Stop and restart the Boinc client. This should immediately send home all those 5.90 workunits that have been in progress and that continued to run beyond the allocated preference time. Those workunits should validate fine and will contribute to the Rosetta science (the time was not wasted).
                            3.) Use Boinc Manager to verify that all remaining 5.90 tasks in the system are now in the "Ready to Report" state. Note: if you have a large Boinc cache and/or run multiple Boinc projects, a currently running 5.90 workunit may not yet have exceeded the preferred run time. In this case you will need to restart Boinc again once that workunit has exceeded its preferred run time.

                            Option B:
                            If you are unable to use the Boinc Manager GUI (perhaps on headless servers in remote locations), you need to reset the Rosetta project. Within the BOINC directory, run "./boinc_cmd --project http://boinc.bakerlab.org/rosetta reset". Note: this will remove all workunits and clients for the Rosetta project. This means all completed but not yet reported work and all work in progress for this project will be lost! As a result of the reset, the Boinc client will perform a fresh load of 5.91 workunits along with the 5.91 client.

                            Hans Sveen

                            Joined: Feb 17 06
                            Posts: 8
                            ID: 390
                            Credit: 269,861
                            RAC: 2
                            Message 3579 - Posted 25 Dec 2007 14:03:09 UTC

                              Hello and Merry Christmas!

                              When I ran Show Graphics, I saw something "funny": it showed only the native folding window. The others were empty! Another peculiar behavior is that the accepted energy line is linear; I have never seen that before!
                              It seems to be well behaved so far, so this is just for your information.

                              Screenshot

                              Thank You!


                              ____________
                              Hans Sveen
                              Oslo, Norway

                              ramostol

                              Joined: Mar 29 07
                              Posts: 24
                              ID: 2840
                              Credit: 31,121
                              RAC: 0
                              Message 3580 - Posted 27 Dec 2007 9:30:17 UTC - in response to Message 3572.

                                Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time, I have yet to verify this rule or prove it works.


                                Actually 4 times its preference time. Observed and proven.

                                Thomas Leibold

                                Joined: Feb 25 07
                                Posts: 27
                                ID: 2684
                                Credit: 77,464
                                RAC: 0
                                Message 3581 - Posted 28 Dec 2007 5:32:47 UTC - in response to Message 3580.



                                  Actually 4 times its preference time. Observed and proven.


                                  Correct, but probably not in the sense you meant it: it has been proven that the 5.90 workunits (with the defective Linux client) will NOT terminate by themselves, regardless of how much they exceed the preferred run time!

                                  While cleaning up the 5.90 mess I found that this is apparently not entirely new either. On one server I found a stuck 5.81 workunit that had not only exceeded its preferred run time over ONE HUNDRED TIMES, but also exceeded the workunit deadline by a MONTH!

                                  It would not surprise me if the failure to terminate those kinds of rogue workunits is in some way related to the longstanding issues with the Ralph/Rosetta watchdog on Linux (e.g. timing dependencies during a watchdog-initiated client shutdown, and corruption of the memory allocation chain by freeing the same block of memory twice, resulting in subsequent segmentation faults which sometimes crash the watchdog so that it fails to terminate the client, IIRC).

                                  Dr Who Fan

                                  Joined: Sep 2 06
                                  Posts: 63
                                  ID: 1787
                                  Credit: 46,809
                                  RAC: 16
                                  Message 3582 - Posted 29 Dec 2007 3:37:04 UTC

                                    Last modified: 29 Dec 2007 3:38:10 UTC

                                    When this one crashed it caused BOINC to stop working and also triggered the Windows runtime debugger/crash reporter.

                                    Name: 1zpy__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1zpy_-native__2756_503_1

                                    It had a CPU time of 10077.23 seconds


                                    Here is part of the dump info from BOINC:
                                    <core_client_version>6.1.0</core_client_version>
                                    <![CDATA[
                                    <message>
                                    - exit code -1073741819 (0xc0000005)
                                    </message>
                                    <stderr_txt>
                                    # cpu_run_time_pref: 7200
                                    # random seed: 1564821

                                    Unhandled Exception Detected...

                                    - Unhandled Exception Record -
                                    Reason: Access Violation (0xc0000005) at address 0x00BA900B read attempt to address 0x12443000

                                    Engaging BOINC Windows Runtime Debugger...
                                    ********************

                                    *** Dump of the Process Statistics: ***

                                    - I/O Operations Counters -
                                    Read: 5007, Write: 0, Other 10312

                                    - I/O Transfers Counters -
                                    Read: 0, Write: 186417, Other 0

                                    - Paged Pool Usage -
                                    QuotaPagedPoolUsage: 28848, QuotaPeakPagedPoolUsage: 35360
                                    QuotaNonPagedPoolUsage: 4328, QuotaPeakNonPagedPoolUsage: 4680

                                    - Virtual Memory Usage -
                                    VirtualSize: 351391744, PeakVirtualSize: 356331520

                                    - Pagefile Usage -
                                    PagefileUsage: 256843776, PeakPagefileUsage: 304713728

                                    - Working Set Size -
                                    WorkingSetSize: 42737664, PeakWorkingSetSize: 195563520, PageFaultCount: 944412

                                    *** Dump of thread ID 2892 (state: Waiting): ***

                                    - Information -
                                    Status: Wait Reason: UserRequest, ,

                                    Unhandled Exception Detected...

                                    - Unhandled Exception Record -
                                    Reason: Access Violation (0xc0000005) at address 0x00C08C14 read attempt to address 0x80000000

                                    Engaging BOINC Windows Runtime Debugger...

                                    </stderr_txt>
                                    ]]>
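The two numbers in the exit-code line of the dump above are the same value written two ways: BOINC prints the Windows exit status as a signed 32-bit integer, followed by the unsigned hexadecimal form, which is the access-violation code also shown in the exception record. A minimal Python sketch of the conversion (the function name `to_ntstatus` is made up for illustration and is not part of BOINC):

```python
def to_ntstatus(exit_code: int) -> str:
    """Render a signed 32-bit exit code as an unsigned 8-digit hex string."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# The exit code reported in the dump above.
print(to_ntstatus(-1073741819))  # 0xc0000005
```

This can be handy when searching for a crash code that a log shows only in its signed decimal form.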



                                    Here is the dump info from the Windows Crash Report:
                                    <?xml version="1.0" encoding="UTF-16"?>
                                    <DATABASE>
                                    <EXE NAME="rosetta_beta_5.90_windows_intelx86.exe" FILTER="GRABMI_FILTER_PRIVACY">
                                    <MATCHING_FILE NAME="rosetta_5.69_windows_intelx86.exe" SIZE="2570240" CHECKSUM="0x57279008" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="08/20/2007 21:12:18" UPTO_LINK_DATE="08/20/2007 21:12:18" />
                                    <MATCHING_FILE NAME="rosetta_beta_5.90_windows_intelx86.exe" SIZE="2615808" CHECKSUM="0x3C8A6BC" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="12/20/2007 00:59:48" UPTO_LINK_DATE="12/20/2007 00:59:48" />
                                    </EXE>
                                    <EXE NAME="kernel32.dll" FILTER="GRABMI_FILTER_THISFILEONLY">
                                    <MATCHING_FILE NAME="kernel32.dll" SIZE="984576" CHECKSUM="0xF0B331F6" BIN_FILE_VERSION="5.1.2600.3119" BIN_PRODUCT_VERSION="5.1.2600.3119" PRODUCT_VERSION="5.1.2600.3119" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.3119 (xpsp_sp2_gdr.070416-1301)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xF9293" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.3119" UPTO_BIN_PRODUCT_VERSION="5.1.2600.3119" LINK_DATE="04/16/2007 15:52:53" UPTO_LINK_DATE="04/16/2007 15:52:53" VER_LANGUAGE="English (United States) [0x409]" />
                                    </EXE>
                                    </DATABASE>

                                    BigMike

                                    Joined: Feb 23 06
                                    Posts: 63
                                    ID: 738
                                    Credit: 58,730
                                    RAC: 0
                                    Message 3583 - Posted 1 Jan 2008 8:00:17 UTC

                                      I don't know if this is a bug or not:

                                      <core_client_version>5.10.30</core_client_version>
                                      <![CDATA[
                                      <stderr_txt>
                                      # cpu_run_time_pref: 3600
                                      # random seed: 1559807
                                      No heartbeat from core client for 31 sec - exiting
                                      # cpu_run_time_pref: 3600
                                      # random seed: 1559807
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      # cpu_run_time_pref: 3600
                                      # random seed: 1559807
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      # cpu_run_time_pref: 3600
                                      # random seed: 1559807
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
                                      ======================================================
                                      DONE :: 1 starting structures 4759.66 cpu seconds
                                      This process generated 1 decoys from 1 attempts
                                      ======================================================


                                      BOINC :: Watchdog shutting down...
                                      BOINC :: BOINC support services shutting down...
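One way to read a log like the one above is to count how often the startup header repeats: each `# cpu_run_time_pref` line appears to mark a fresh start of the application, and the identical random seed suggests the same task was restarted rather than a new one begun. The Python sketch below summarizes an abridged copy of that log. The function name and the returned keys are illustrative assumptions, and the reading of a repeated header as a restart is an inference from this log, not documented BOINC behavior.

```python
import re

# Abridged copy of the stderr output quoted above (repeated warnings trimmed).
SAMPLE = """\
# cpu_run_time_pref: 3600
# random seed: 1559807
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
DONE :: 1 starting structures 4759.66 cpu seconds
"""


def summarize_stderr(text):
    """Collect start count, preferred run time, and final CPU time from a log."""
    prefs = [int(v) for v in re.findall(r"^\s*# cpu_run_time_pref: (\d+)", text, re.MULTILINE)]
    done = re.search(r"DONE ::\s*\d+ starting structures\s+([\d.]+) cpu seconds", text)
    return {
        "starts": len(prefs),  # one header per (re)start of the application
        "preferred_seconds": prefs[0] if prefs else None,
        "heartbeat_lost": "No heartbeat from core client" in text,
        "cpu_seconds": float(done.group(1)) if done else None,
    }


summary = summarize_stderr(SAMPLE)
print(summary)
```

For this log the summary shows four starts, one lost heartbeat, and a final CPU time somewhat above the 3600-second preference.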


                                      ==Mike
                                      ____________
                                      Don't believe everything you think.

                                      Evan

                                      Joined: Dec 23 07
                                      Posts: 75
                                      ID: 3893
                                      Credit: 69,584
                                      RAC: 0
                                      Message 3595 - Posted 4 Jan 2008 15:56:26 UTC

                                        I think that there were problems with this one. Looking at the graphics I did notice that it was stuck/calculating for a long period around the 86% mark.



                                        Z094__BOINC_SYMM_FOLD_AND_DOCK_RELAX_ONLY-Z094_-lowres_dock_-dock_3609__2788_1_0
                                        W


                                        **********************************************************************
                                        Rosetta score is stuck or going too long. Watchdog is ending the run!
                                        CPU time: 17003.9 seconds. Greater than 4X preferred time: 3600 seconds
                                        **********************************************************************
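The arithmetic behind the watchdog message above can be checked with a short Python sketch. The function name and the default factor of 4 are assumptions taken from the wording of the message ("Greater than 4X preferred time"); this is not the actual Rosetta watchdog code, which is not shown in this thread.

```python
def exceeds_watchdog_limit(cpu_seconds, preferred_seconds, factor=4.0):
    """Return True if CPU time is past factor * preferred run time."""
    return cpu_seconds > factor * preferred_seconds


# Values from the watchdog message: 17003.9 s against a 3600 s preference.
# 4 * 3600 = 14400, so the limit is exceeded.
print(exceeds_watchdog_limit(17003.9, 3600))  # True

# The stuck Linux task reported earlier: about 50 hours with a 4-hour preference.
# A working check of this kind would have fired long before that point.
print(exceeds_watchdog_limit(50 * 3600, 4 * 3600))  # True
```

The second call illustrates the point made earlier in the thread: on the affected Linux client the limit was clearly exceeded, yet the task kept running, so the check was evidently not taking effect there.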


                                        Copyright © 2017 University of Washington

                                        Last Modified: 20 Nov 2008 19:41:56 UTC