RALPH@home

Bug reports for Ralph 5.20

Message boards : RALPH@home bug list : Bug reports for Ralph 5.20

dekim
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jan 20 06
Posts: 216
ID: 1
Credit: 484,843
RAC: 52
Message 1751 - Posted 3 Jun 2006 1:11:00 UTC

    This version has some boinc-related fixes in the watchdog and graphics.

    Nikolay A. Saharov

    Joined: Feb 17 06
    Posts: 6
    ID: 380
    Credit: 17,030
    RAC: 89
    Message 1752 - Posted 3 Jun 2006 5:10:38 UTC

      Last modified: 3 Jun 2006 5:37:51 UTC

      Hi,

      I have Ralph WU Result 149188 that is stuck in the BOINC Mgr queue at 100% and time 1:20:42. It has status "Running". But in the Graphics window the result is completed at 67.2% with time 1:20:45.

      CPU usage is 50% and only the other WU is really running. (I have a P4 2.6 GHz HT with 2 logical CPUs.) In other words, 2 Ralph WUs are running, but one uses 50% CPU and the other 0%.

      PS: This result has now completed successfully and was reported with these messages:
      BOINC :: Watchdog shutting down...
      BOINC :: BOINC support services shutting down...

      {Edit} No other problems.
      {Edit 2} Something similar was described in this post.
      ____________

      [B^S] suguruhirahara

      Joined: Mar 5 06
      Posts: 40
      ID: 992
      Credit: 6,001
      RAC: 0
      Message 1753 - Posted 3 Jun 2006 7:37:58 UTC - in response to Message 1751.

        Last modified: 3 Jun 2006 7:38:10 UTC

        This version has some boinc-related fixes in the watchdog and graphics.
        I confirmed graphics has been fixed. It works more smoothly than before.
        ____________

        Niehaus

        Joined: Feb 22 06
        Posts: 10
        ID: 706
        Credit: 2,707
        RAC: 0
        Message 1755 - Posted 3 Jun 2006 14:21:16 UTC

          Last modified: 3 Jun 2006 14:33:34 UTC

          My Ralph calculates the WUs to 100% but doesn't send them; they are still "active" but there is no further calculation, and the program continues with my Rosetta WUs...


          Oh it DID send the WU after some time, sry!!!
          ____________

          Mike Gelvin

          Joined: Feb 17 06
          Posts: 50
          ID: 468
          Credit: 55,397
          RAC: 0
          Message 1756 - Posted 3 Jun 2006 14:53:44 UTC

            Last modified: 3 Jun 2006 14:55:15 UTC

            I ran 3 work units.

            Two actually completed, but I suspect the "dormant" bug is still present, as the first work unit (149120) completed in EXACTLY 1 hour with 36 min of CPU time, and the other one (149885) completed in EXACTLY 2 hours with 81 min of CPU time.

            The third (149194) errored out with:
            Unrecoverable error for result t296__CASP7_ABINITIO_SAVE_ALL_OUT_hom013__614_2_1 (One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003))

            The upload indicated a watchdog shut down.

            Mike
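
For reference, the two numbers in the error message above are the same 32-bit value: -2147483645 is simply 0x80000003 read as a signed integer. A minimal sketch of the conversion in Python follows; the helper names are made up for illustration and are not part of BOINC or Rosetta.

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the signed reading is the value minus 2**32.
    return value - 0x100000000 if value & 0x80000000 else value


exit_code = -2147483645  # exit code quoted in the error message above
print(hex(to_unsigned32(exit_code)))  # prints 0x80000003
print(to_signed32(0x80000003))        # prints -2147483645
```

The same hex code shows up again later in this thread as "Breakpoint Encountered (0x80000003)".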

            ____________

            Fuzzy Hollynoodles

            Joined: Feb 19 06
            Posts: 37
            ID: 585
            Credit: 2,089
            RAC: 0
            Message 1757 - Posted 3 Jun 2006 16:21:32 UTC - in response to Message 1756.

              Last modified: 3 Jun 2006 16:23:53 UTC

              I ran 3 work units.

              Two actually completed, but I suspect the "dormant" bug is still present, as the first work unit (149120) completed in EXACTLY 1 hour with 36 min of CPU time, and the other one (149885) completed in EXACTLY 2 hours with 81 min of CPU time.

              ...

              Mike


              The "dormant" bug was in this one also: http://ralph.bakerlab.org/workunit.php?wuid=131456

              Result: http://ralph.bakerlab.org/result.php?resultid=148875

              And while unmonitored, my computer went into sleep mode, so it only started to upload when I got back to my computer. This means my computer was idle for some time when it could have crunched something else. :-(

              So I aborted the next WU and I have set Ralph to No new work until you have this sorted out. I will not have a computer sitting in sleep mode for a long time until I can get to it again so it can continue crunching. In the worst case it can be for a whole day. :-(


              EDIT: Can't you make a watchdog that activates the WU again after it has been idle for, let's say, 3 minutes? Or 5 minutes? When not crunching, my computer goes into sleep mode after 15 minutes.


              ____________
              "I'm trying to maintain a shred of dignity in this world." - Me

              tralala

              Joined: Apr 12 06
              Posts: 52
              ID: 1266
              Credit: 15,257
              RAC: 0
              Message 1758 - Posted 3 Jun 2006 17:54:54 UTC - in response to Message 1757.


                EDIT: Can't you make a watchdog that activates the WU again after it has been idle for, let's say, 3 minutes? Or 5 minutes? When not crunching, my computer goes into sleep mode after 15 minutes.


                This is a bug which was introduced after 5.16, so I hope they can spot it and fix it completely rather than adding another safety mechanism.
                ____________

                [B^S] sTrey

                Joined: Feb 15 06
                Posts: 58
                ID: 36
                Credit: 15,430
                RAC: 0
                Message 1759 - Posted 3 Jun 2006 20:15:42 UTC - in response to Message 1755.

                  Last modified: 3 Jun 2006 20:18:41 UTC

                  My Ralph calculates the WUs to 100% but doesn't send them; they are still "active" but there is no further calculation, and the program continues with my Rosetta WUs...


                  Oh it DID send the WU after some time, sry!!!


                  I've noticed this too with 5.19 and 5.20. My pref is set to 2 hours and my crunching interval is 2:01. The WUs I've been getting happen to finish early, say 1:45, go to 100% but then pause instead of completing. Nothing else such as downloads has triggered early rescheduling. The next time the WU gets crunch-time it completes immediately and uploads.

                  Not causing any problems but it's definitely different behavior, and after about 5 in a row not counting one that errored out, it doesn't seem coincidental.

                  sample result

                  Honza

                  Joined: Feb 16 06
                  Posts: 9
                  ID: 147
                  Credit: 1,962
                  RAC: 0
                  Message 1761 - Posted 4 Jun 2006 9:20:02 UTC

                    3 WUs went fine, the 4th got stuck at 100% for hours - http://ralph.bakerlab.org/result.php?resultid=150036.
                    3 more to go...
                    ____________

                    Honza

                    Joined: Feb 16 06
                    Posts: 9
                    ID: 147
                    Credit: 1,962
                    RAC: 0
                    Message 1762 - Posted 4 Jun 2006 12:53:26 UTC

                      (too late to edit). Another one sitting idle at 100% - http://ralph.bakerlab.org/result.php?resultid=150039 so 2 of 6 got stuck at the finish in my case.
                      ____________

                      dekim
                      Forum moderator
                      Project administrator
                      Project developer
                      Project scientist

                      Joined: Jan 20 06
                      Posts: 216
                      ID: 1
                      Credit: 484,843
                      RAC: 52
                      Message 1763 - Posted 4 Jun 2006 15:44:20 UTC

                        Rom tells me it is waiting for the watchdog to finish for debugging.

                        Here is his response:

                        "When I added code .... to wait until
                        the thread is finished, it stalls for up to 30 minutes waiting until
                        watchdog makes its next check."

                        I think the watchdog can take up to 2x the cpu run time pref, which may explain the longer stalls.
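
For anyone curious why waiting on the watchdog thread can stall like this, here is a minimal Python sketch of the general pattern. It is not the Rosetta or BOINC code; the function names and the 0.5 second interval (a stand-in for the 30 minute check mentioned in the quote) are assumptions for illustration. A watchdog that sleeps between checks cannot be woken early, so joining it blocks until the sleep ends, whereas one that waits on an event returns as soon as shutdown is requested. The second variant is one common remedy and not necessarily how the stall was actually fixed.

```python
import threading
import time


def sleeping_watchdog(stop_flag: dict, interval: float) -> None:
    # Checks a flag, then sleeps a full interval; it cannot be woken early.
    while not stop_flag["stop"]:
        time.sleep(interval)


def event_watchdog(stop_event: threading.Event, interval: float) -> None:
    # Event.wait() returns True as soon as the event is set,
    # or False once `interval` seconds have passed.
    while not stop_event.wait(timeout=interval):
        pass  # a real watchdog would run its periodic check here


def time_shutdown(use_event: bool, interval: float) -> float:
    """Start a watchdog, request a stop, and return how long join() blocked."""
    if use_event:
        stop_event = threading.Event()
        thread = threading.Thread(target=event_watchdog, args=(stop_event, interval))
        request_stop = stop_event.set
    else:
        stop_flag = {"stop": False}
        thread = threading.Thread(target=sleeping_watchdog, args=(stop_flag, interval))
        request_stop = lambda: stop_flag.update(stop=True)
    thread.start()
    time.sleep(0.05)  # give the watchdog time to enter its wait
    started = time.monotonic()
    request_stop()
    thread.join()  # the "wait until the thread is finished" step
    return time.monotonic() - started


if __name__ == "__main__":
    interval = 0.5  # hypothetical stand-in for the real check interval
    print(f"sleep-based watchdog: join took {time_shutdown(False, interval):.2f}s")
    print(f"event-based watchdog: join took {time_shutdown(True, interval):.2f}s")
```

With the sleep-based loop the join takes close to the whole check interval; with the event-based loop it returns almost immediately.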
                        ____________

                        tralala

                        Joined: Apr 12 06
                        Posts: 52
                        ID: 1266
                        Credit: 15,257
                        RAC: 0
                        Message 1764 - Posted 4 Jun 2006 16:04:37 UTC - in response to Message 1763.

                          Rom tells me it is waiting for the watchdog to finish for debugging.

                          Here is his response:

                          "When I added code .... to wait until
                          the thread is finished, it stalls for up to 30 minutes waiting until
                          watchdog makes its next check."

                          I think the watchdog can take up to 2x the cpu run time pref, which may explain the longer stalls.


                          Does this mean it was intentionally implemented for debugging purposes? You could have saved us some investigation if you had told us. Anyway it's good to know that the reason is known and won't delay any further development.
                          ____________

                          crossworks

                          Joined: May 19 06
                          Posts: 2
                          ID: 1415
                          Credit: 510
                          RAC: 0
                          Message 1765 - Posted 4 Jun 2006 19:28:55 UTC

                            How long before you should abort WUs stuck at 100%? Why does my firewall show a lot of traffic for the BOINC Ralph client even though it's stuck at 100%? I have all other projects suspended to see if the WU will report.
                            ____________

                            NJMHoffmann

                            Joined: Feb 17 06
                            Posts: 8
                            ID: 395
                            Credit: 1,270
                            RAC: 0
                            Message 1766 - Posted 4 Jun 2006 20:48:49 UTC - in response to Message 1765.

                              How long before you should abort WUs stuck at 100%? Why does my firewall show a lot of traffic for the BOINC Ralph client even though it's stuck at 100%? I have all other projects suspended to see if the WU will report.

                              Wild guess: The client is downloading (BIIIG) symbol tables for the debug output??

                              Norbert
                              ____________

                              Carlos_Pfitzner

                              Joined: Feb 16 06
                              Posts: 182
                              ID: 296
                              Credit: 22,792
                              RAC: 0
                              Message 1767 - Posted 4 Jun 2006 22:13:48 UTC

                                Last modified: 4 Jun 2006 22:21:14 UTC

                                Rosetta_betta_5.20 Windows

                                # This process generated 2 decoys from 2 attempts


                                BOINC :: Watchdog shutting down...


                                Unhandled Exception Detected...

                                - Unhandled Exception Record -
                                Reason: Breakpoint Encountered (0x80000003) at address 0x77F9193C

                                Engaging BOINC Windows Runtime Debugger...


                                </stderr_txt>
                                I aborted this result because it was running using 0.0000% of CPU, i.e. STUCK

                                http://ralph.bakerlab.org/result.php?resultid=150083

                                With 5.19 I waited for 6 hours to see what happens, and rebooted too -:(
                                *My preferred runtime for Ralph is 1 hour

                                But I will not do this anymore -:)
                                CPU temperature changes can crack silicon
                                and render my 7 ghz putter inoperable
                                *Also I am crunching for Rosetta too. (CASP7) - in need of more cpu power !

                                Now,
                                If CPU temperature decreases to below 60 C, and the alarm sounds,
                                I immediately act to find the cause -:)

                                So, IF I go to sleep, I stop crunching for Ralph first.
                                *Maybe a stuck condition occurs while I'm asleep
                                Thanks
                                ____________

                                Fuzzy Hollynoodles

                                Joined: Feb 19 06
                                Posts: 37
                                ID: 585
                                Credit: 2,089
                                RAC: 0
                                Message 1769 - Posted 5 Jun 2006 2:50:21 UTC - in response to Message 1763.

                                  Rom tells me it is waiting for the watchdog to finish for debugging.

                                  Here is his response:

                                  "When I added code .... to wait until
                                  the thread is finished, it stalls for up to 30 minutes waiting until
                                  watchdog makes its next check."

                                  I think the watchdog can take up to 2x the cpu run time pref, which may explain the longer stalls.


                                  Yes, but my problem is that my computer goes into sleep mode after 15 minutes, and what then? Then it takes until I get to it and can start it again. And then, if I'm unlucky, I can sit and wait with an idle computer for one hour until the clock triggers the upload.

                                  No, I'm still on No new work here. :-(



                                  ____________
                                  "I'm trying to maintain a shred of dignity in this world." - Me

                                  Rhiju
                                  Forum moderator
                                  Project developer
                                  Project scientist

                                  Joined: Feb 14 06
                                  Posts: 161
                                  ID: 4
                                  Credit: 3,725
                                  RAC: 0
                                  Message 1770 - Posted 5 Jun 2006 2:57:30 UTC - in response to Message 1769.

                                    Hi everybody: Rom and I fixed this silly watchdog thing. I'm sending out work now with ralph 5.21! Thanks for helping us out with this.

                                    Rom tells me it is waiting for the watchdog to finish for debugging.

                                    Here is his response:

                                    "When I added code .... to wait until
                                    the thread is finished, it stalls for up to 30 minutes waiting until
                                    watchdog makes its next check."

                                    I think the watchdog can take up to 2x the cpu run time pref, which may explain the longer stalls.


                                    Yes, but my problem is that my computer goes into sleep mode after 15 minutes, and what then? Then it takes until I get to it and can start it again. And then, if I'm unlucky, I can sit and wait with an idle computer for one hour until the clock triggers the upload.

                                    No, I'm still on No new work here. :-(




                                    ____________

                                    RodEllery

                                    Joined: Feb 20 06
                                    Posts: 5
                                    ID: 648
                                    Credit: 8,820
                                    RAC: 0
                                    Message 1772 - Posted 5 Jun 2006 15:29:47 UTC

                                      Had 4-5 computing errors over the weekend with 5.20.

                                      All with a similar error. See below.

                                      WU: 132525
                                      Outcome: Client error
                                      Client state: Computing
                                      Exit status: 1 (0x1)
                                      Computer ID: 913
                                      Report deadline: 8 Jun 2006 23:40:23 UTC
                                      CPU time: 0.550792
                                      stderr out:
                                      <core_client_version>5.4.9</core_client_version>
                                      <message>
                                      Incorrect function. (0x1) - exit code 1 (0x1)
                                      </message>
                                      <stderr_txt>
                                      ERROR:: Exit at: .\fragments.cc line:767

                                      </stderr_txt>


                                      Validate state: Invalid

                                      --
                                      RodEllery

                                      crossworks

                                      Joined: May 19 06
                                      Posts: 2
                                      ID: 1415
                                      Credit: 510
                                      RAC: 0
                                      Message 1773 - Posted 5 Jun 2006 16:00:20 UTC - in response to Message 1772.

                                        Last modified: 5 Jun 2006 16:00:44 UTC

                                        Had 4-5 computing errors over the weekend with 5.20.

                                        All with a similar error. See below.

                                        WU: 132525
                                        Outcome: Client error
                                        Client state: Computing
                                        Exit status: 1 (0x1)
                                        Computer ID: 913
                                        Report deadline: 8 Jun 2006 23:40:23 UTC
                                        CPU time: 0.550792
                                        stderr out:
                                        <core_client_version>5.4.9</core_client_version>
                                        <message>
                                        Incorrect function. (0x1) - exit code 1 (0x1)
                                        </message>
                                        <stderr_txt>
                                        ERROR:: Exit at: .\fragments.cc line:767

                                        </stderr_txt>


                                        Validate state: Invalid

                                        --
                                        RodEllery

                                        I got that error when I killed 5.20.exe in Windows Task Manager. I thought it was stuck. For the next unit I waited about 2 hours after it reached 100% and it reported.

                                        ____________

                                        Fuzzy Hollynoodles

                                        Joined: Feb 19 06
                                        Posts: 37
                                        ID: 585
                                        Credit: 2,089
                                        RAC: 0
                                        Message 1775 - Posted 5 Jun 2006 19:55:19 UTC - in response to Message 1770.

                                          Hi everybody: Rom and I fixed this silly watchdog thing. I'm sending out work now with ralph 5.21! Thanks for helping us out with this.



                                          Ok, let me give it a try again.



                                          ____________
                                          "I'm trying to maintain a shred of dignity in this world." - Me

                                          dekim
                                          Forum moderator
                                          Project administrator
                                          Project developer
                                          Project scientist

                                          Joined: Jan 20 06
                                          Posts: 216
                                          ID: 1
                                          Credit: 484,843
                                          RAC: 52
                                          Message 1776 - Posted 5 Jun 2006 20:27:10 UTC - in response to Message 1764.

                                            Last modified: 6 Jun 2006 20:47:12 UTC

                                            You could have saved us some investigation if you had told us. Anyway it's good to know that the reason is known and won't delay any further development.


                                            It wasn't intentional. We posted as soon as we found the cause.

                                            edit: actually I think it was intentional, to help diagnose issues with the watchdog. I think it was taking a bit longer than expected and has since been fixed by Rhiju and Rom.

                                            ____________

                                            dekim
                                            Forum moderator
                                            Project administrator
                                            Project developer
                                            Project scientist

                                            Joined: Jan 20 06
                                            Posts: 216
                                            ID: 1
                                            Credit: 484,843
                                            RAC: 52
                                            Message 1789 - Posted 6 Jun 2006 20:48:01 UTC - in response to Message 1763.

                                              Rom tells me it is waiting for the watchdog to finish for debugging. edit: to debug the watchdog.

                                              Here is his response:

                                              "When I added code .... to wait until
                                              the thread is finished, it stalls for up to 30 minutes waiting until
                                              watchdog makes its next check."

                                              I think the watchdog can take up to 2x the cpu run time pref, which may explain the longer stalls.


                                              ____________

                                              Copyright © 2017 University of Washington

                                              Last Modified: 20 Nov 2008 19:41:56 UTC