RALPH@home

minirosetta v1.47 bug thread

mtyka
Forum moderator
Project developer
Project scientist

Joined: Mar 19 08
Posts: 79
ID: 4144
Credit: 0
RAC: 0
Message 4410 - Posted 13 Dec 2008 18:42:14 UTC

    This was a quick follow-up update to fix an error that snuck into the update. See this thread for details:
    http://ralph.bakerlab.org/forum_thread.php?id=425

    This should no longer produce this error:

    ERROR: not able to build valid fold-tree in JumpingFoldConstraints::setup_foldtree
    ERROR:: Exit from: src/protocols/abinitio/LoopJumpFoldCst.cc line: 108
    called boinc_finish
    # cpu_run_time_pref: 14400

    Reeltime

    Joined: Nov 1 08
    Posts: 1
    ID: 4871
    Credit: 6,349
    RAC: 0
    Message 4413 - Posted 14 Dec 2008 15:16:31 UTC

      Last modified: 14 Dec 2008 15:17:37 UTC

      Not sure if this counts as a bug or not, but my runtime is set to 1 hr, and most tasks take just over this mark, c. 65-70 mins.

      The 1.47 tasks are taking considerably longer. The current one is at 1 hr 33.

      They run normally up to about 78-80%, then slow down dramatically, then finish somewhere around 90-91%.

      Don't know if this is worth mentioning or not, so I thought I would :-)

      Host: 16239

      If there is anything I need to check file-wise, let me know. I'm still fairly new to alpha testing.

      Quick edit: Mentioned this because it is unusual for this project.

      ramostol

      Joined: Mar 29 07
      Posts: 24
      ID: 2840
      Credit: 31,121
      RAC: 0
      Message 4419 - Posted 16 Dec 2008 10:26:15 UTC

        This start is none too good I'm afraid.

        All cc2_1_8_mammoth-tasks are crashing after about 1 minute of computing.

        An example:

        cc2_1_8_mammoth_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_7_6585_1_0

        <message>
        process exited with code 193 (0xc1, -63)
        </message>
        <stderr_txt>
        minirosetta_1.47_i686-apple-darwin(90916,0xa0538fa0) malloc: *** error for object 0x1747d40: Non-aligned pointer being freed (2)
        *** set a breakpoint in malloc_error_break to debug
        SIGBUS: bus error
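
The three numbers in the `<message>` line above (193, 0xc1, -63) are the same one-byte exit code written three ways. A minimal sketch of that conversion, assuming the code fits in a single byte; the helper name `describe_exit_code` is hypothetical and not part of BOINC or minirosetta:

```python
def describe_exit_code(code: int) -> tuple:
    """Return (unsigned, hex, signed 8-bit) views of a one-byte exit code."""
    unsigned = code & 0xFF                    # keep only the low byte
    signed = unsigned - 256 if unsigned >= 128 else unsigned  # two's complement
    return unsigned, hex(unsigned), signed


print(describe_exit_code(193))
```

For 193 this gives the same triple that the client printed, so the message carries one piece of information, not three.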

        Conan

        Joined: Feb 16 06
        Posts: 344
        ID: 145
        Credit: 1,309,534
        RAC: 0
        Message 4420 - Posted 16 Dec 2008 11:39:20 UTC

          Last modified: 16 Dec 2008 11:41:21 UTC

          This WU and this one that I have also finished seem to take an unusual amount of time.

          Both took over 13 hours for just 1 decoy.

          My preferences are set to 6 hours.

          The long running time is because a single decoy took that long to complete.

          No wonder they are called Mammoth work units.

          Both completed ok (credit very low for the effort put in, but that is normal for both Ralph and Rosetta).

          Phil

          Joined: Jan 28 07
          Posts: 5
          ID: 2588
          Credit: 1,206
          RAC: 0
          Message 4421 - Posted 16 Dec 2008 17:55:15 UTC

            The Graphics in this one show the following:

            Total Credit: -5.6988E-05
            RAC 5.3133E-315
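
A value like 5.3133E-315 is far below the smallest normal double (about 2.2E-308), so it is a subnormal number whose upper 32 bits are all zero. A small sketch that inspects the bit pattern; the helper `double_bits` is hypothetical, and the thread does not establish what the graphics code actually reads, so this only shows what kind of number is being displayed:

```python
import struct
import sys


def double_bits(value: float) -> int:
    """Return the raw 64-bit pattern of an IEEE 754 double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


rac = 5.3133e-315  # value copied from the graphics display reported above
bits = double_bits(rac)

is_subnormal = 0.0 < abs(rac) < sys.float_info.min  # below smallest normal double
high_word = bits >> 32            # sign, exponent and top of the mantissa
low_word = bits & 0xFFFFFFFF      # bottom 32 bits of the mantissa

print(is_subnormal, hex(high_word), hex(low_word))
```

A double whose only non-zero bits sit in the low 32-bit word is consistent with, but does not prove, a 32-bit quantity or uninitialized memory being displayed as a 64-bit double.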

            Phil

            Joined: Jan 28 07
            Posts: 5
            ID: 2588
            Credit: 1,206
            RAC: 0
            Message 4422 - Posted 16 Dec 2008 22:26:32 UTC - in response to Message 4421.

              Last modified: 16 Dec 2008 23:01:29 UTC

              The Graphics in this one show the following:

              Total Credit: -5.6988E-05
              RAC 5.3133E-315



              Interesting, I got a bunch of mammoths now for the same machine but running XP rather than Linux and the display is correct.

              Conan

              Joined: Feb 16 06
              Posts: 344
              ID: 145
              Credit: 1,309,534
              RAC: 0
              Message 4423 - Posted 17 Dec 2008 9:52:18 UTC - in response to Message 4420.

                Have now had This Task run for 58,307.80 seconds or over 16 hours with the generation of just the 1 decoy.

                They are getting longer.

                AdeB

                Joined: Dec 22 07
                Posts: 61
                ID: 3888
                Credit: 104,771
                RAC: 17
                Message 4424 - Posted 17 Dec 2008 19:04:41 UTC

                  Another long task - over 10 hours for 1 decoy

                  What surprises me is that BOINC never switched to another project during those 10 hours. There was work for other projects and [Switch between applications every] is set to 120 minutes. It looks like this task 'hijacked' my PC until it was finished. Should it behave like this?

                  I also saw the strange values for Total Credit and RAC that Phil is describing, also on a Linux PC.

                  AdeB

                  feet1st

                  Joined: Mar 7 06
                  Posts: 312
                  ID: 1028
                  Credit: 110,522
                  RAC: 0
                  Message 4425 - Posted 17 Dec 2008 20:30:47 UTC - in response to Message 4424.

                    Last modified: 17 Dec 2008 20:35:28 UTC

                    It looks like this task 'hijacked' my PC until it was finished. Should it behave like this?


                    Sometimes it can seem that way. Ralph has short (3 day) deadlines, and so can easily find itself running "at high priority" on the BOINC list.

                    The other way this can happen is that BOINC tries to switch projects at checkpoints to preserve all the work possible (even for those not keeping tasks in memory). And some of these long running models do not take checkpoints. So BOINC was sitting there thinking it was just 10 min. from being done, and seeing no checkpoint to cut in on, so it just kept running it.

                    Another way this can happen is if you rack up debt to Ralph when no work is available. BOINC knows it "owes" time to Ralph and so keeps running it.
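
The checkpoint explanation in this post can be illustrated with a toy model. This is only a sketch of the behaviour as described here, not BOINC's actual scheduler; the function `minutes_until_switch` and its parameters are hypothetical, and deadlines and debt are ignored:

```python
def minutes_until_switch(checkpoints, switch_interval, task_length):
    """Toy model of 'switch only at a checkpoint'.

    checkpoints:     minutes (from task start) at which the task checkpoints
    switch_interval: the 'Switch between applications every' setting, in minutes
    task_length:     total run time of the task, in minutes

    Once the interval has elapsed the client wants to switch, but in this
    model it waits for the next checkpoint (or the end of the task).
    """
    for t in sorted(checkpoints):
        if t >= switch_interval:
            return min(t, task_length)
    return task_length  # no usable checkpoint: the task runs to completion


# A task that checkpoints every 30 minutes is preempted on schedule.
print(minutes_until_switch(list(range(30, 601, 30)), 120, 600))

# A 10 hour task that never checkpoints keeps the CPU until it finishes.
print(minutes_until_switch([], 120, 600))
```

Under these assumptions the first call returns 120 and the second returns 600, which matches a task that appears to 'hijack' the machine for its full 10 hours.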

                    Conan

                    Joined: Feb 16 06
                    Posts: 344
                    ID: 145
                    Credit: 1,309,534
                    RAC: 0
                    Message 4426 - Posted 18 Dec 2008 1:18:58 UTC

                      I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system.

                      Why is the credit system that Rosetta changed to, and that Ralph was also changed to 6 months to a year ago, now reverting back to Benchmark?

                      Based on this I am no longer getting due value for the time I spend crunching a work unit.

                      I have seen other systems here on Ralph with huge benchmarks compared to mine getting well over a hundred credits for 3 hours of work (114 was one example I saw, for 13,400 seconds of work). I do 6 or more hours of work and don't get anywhere near as much as they do (from 55 to 90 credits for 4 to 7 hours).

                      Because of this, a number of users don't understand what I complain about when I say credit is low at Ralph and Rosetta, as they are getting up to 30 cr/h. For me it is 10 to 12 cr/h at the moment, down from 14 to 15 a few months ago, which is still low compared to Seti and others.

                      Can this be looked at please ??

                      If I do a 16 hour WU (like the current ones) I get 204 credits, others do a 3 hour WU and get 114, I don't see the fairness in that.
                      My computers and results are easy to access and open to view.
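
The comparison in this post comes down to credits per hour. A short worked calculation using only the figures quoted above (204 credits for a 16 hour WU, 114 credits for 13,400 seconds); the function name `credits_per_hour` is just illustrative:

```python
def credits_per_hour(credits: float, seconds: float) -> float:
    """Convert a granted credit total and a CPU time in seconds to credits/hour."""
    return credits / (seconds / 3600.0)


long_wu = credits_per_hour(204, 16 * 3600)   # the 16 hour mammoth WU
short_wu = credits_per_hour(114, 13_400)     # the 13,400 second example

print(round(long_wu, 2), round(short_wu, 2))
```

These rates line up with the "10 to 12 cr/h" and "up to 30 cr/h" figures mentioned in the post: the second host earns roughly 2.4 times as much per hour.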

                      feet1st

                      Joined: Mar 7 06
                      Posts: 312
                      ID: 1028
                      Credit: 110,522
                      RAC: 0
                      Message 4427 - Posted 18 Dec 2008 2:03:05 UTC

                        Last modified: 18 Dec 2008 2:13:00 UTC

                        My t328 mammoth is still on model 1 step 931,000 after nearly 17hrs ... and, of course, it's time to reboot to install MS fixes! ...wish me luck!

                        [update]
                        Interesting... it restarted on model 1, step 0 (yes, I waited for it to initialize and start incrementing steps) but with 2hr15min of CPU time on it. So, it's like it did take a checkpoint... only it didn't. Should be an interesting output file!

                        Path7

                        Joined: Feb 11 08
                        Posts: 56
                        ID: 4036
                        Credit: 4,974
                        RAC: 0
                        Message 4428 - Posted 18 Dec 2008 19:19:28 UTC

                          Hello all,

                          The next WU ran for about 4 hours, when I had to reboot my PC due to an IE7-update.
                          cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_1_6636_1_0
                          The WU restarted from 0:00 hours runtime and finished within 4656 seconds (1.29 hours),
                          generating 1 valid decoy, nicely within my runtime preference of 7200 seconds.

                          Why did this WU run for more than 4 hours at its first run?

                          Have a nice day,
                          Path7.

                          Stephen

                          Joined: Dec 17 08
                          Posts: 3
                          ID: 5056
                          Credit: 6,566
                          RAC: 0
                          Message 4430 - Posted 19 Dec 2008 0:31:16 UTC

                            I'm getting some odd behavior:

                            * The CPU timer sometimes gets reset.
                            * I suspended all work units, then unsuspended them, and they all completed immediately.

                            Stephen

                            Joined: Dec 17 08
                            Posts: 3
                            ID: 5056
                            Credit: 6,566
                            RAC: 0
                            Message 4431 - Posted 19 Dec 2008 1:48:18 UTC - in response to Message 4430.

                              To elaborate on the problem:

                              A WU will get to around 85% complete, then progress stays the same and time to completion stays at around 10 minutes. If I suspend all tasks and resume them, the "stuck" WUs complete.

                              zombie67 [MM]

                              Joined: Aug 8 06
                              Posts: 70
                              ID: 1666
                              Credit: 1,419,520
                              RAC: 352
                              Message 4432 - Posted 19 Dec 2008 4:04:32 UTC - in response to Message 4426.

                                I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system.

                                Why is the credit system that Rosetta changed to, and that Ralph was also changed to 6 months to a year ago, now reverting back to Benchmark?

                                Based on this I am no longer getting due value for the time I spend crunching a work unit.


                                How so? Your machines claim based on benchmarks. If your benchmarks are not tampered with, then you are getting exactly what you are due. You can't just look at run time. Some machines are faster than others. So a fast machine running 4 hours will have done more work than a slower machine running 4 hours. So the faster machine should be awarded more credits, even though the crunch time is equal.

                                Conan

                                Joined: Feb 16 06
                                Posts: 344
                                ID: 145
                                Credit: 1,309,534
                                RAC: 0
                                Message 4434 - Posted 19 Dec 2008 13:36:48 UTC - in response to Message 4432.

                                  I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system.

                                  Why is the credit system that Rosetta changed to, and that Ralph was also changed to 6 months to a year ago, now reverting back to Benchmark?

                                  Based on this I am no longer getting due value for the time I spend crunching a work unit.


                                  How so? Your machines claim based on benchmarks. If your benchmarks are not tampered with, then you are getting exactly what you are due. You can't just look at run time. Some machines are faster than others. So a fast machine running 4 hours will have done more work than a slower machine running 4 hours. So the faster machine should be awarded more credits, even though the crunch time is equal.


                                  What I am referring to is not the fact that I am getting granted a benchmark score (and no they are not tampered with as you can tell by the low figures on my computers), it is the fact that the crediting system on Ralph and Rosetta was no longer based on the Boinc Benchmark value and therefore I should not be getting the same as claimed.

                                  The crediting system is supposed to be based on the number of decoys generated, as well as when the result is returned and the length of processing. The first result returned in a batch gets what it claims, and each one after that gets some form of averaging to arrive at the final amount.

                                  At the moment it would appear that all results are getting what they claim, which is not how the Rosetta/Ralph fixed-type crediting system was meant to work,
                                  unless of course I am somehow returning all my work before anyone else in my batch. I don't believe that, given my 6 hour run time preference.

                                  Conan

                                  Joined: Feb 16 06
                                  Posts: 344
                                  ID: 145
                                  Credit: 1,309,534
                                  RAC: 0
                                  Message 4435 - Posted 19 Dec 2008 13:41:15 UTC - in response to Message 4431.

                                    To elaborate on the problem:

                                    A WU will get to around 85% complete, then progress stays the same and time to completion stays at around 10 minutes. If I suspend all tasks and resume them, the "stuck" WUs complete.


                                    With these current 'mammoth' work units I too have noticed that they get to a point with around 10 minutes to go and sit there for quite some time.

                                    The work units appear to be compiling all data generated before then finishing the task.
                                    I have had them run for over 16 hours for just 1 decoy, and they have finished OK with a valid result.

                                    zombie67 [MM]

                                    Joined: Aug 8 06
                                    Posts: 70
                                    ID: 1666
                                    Credit: 1,419,520
                                    RAC: 352
                                    Message 4436 - Posted 19 Dec 2008 15:59:31 UTC - in response to Message 4434.

                                      Last modified: 19 Dec 2008 16:02:07 UTC

                                      Yes, I understand that the credit system changed back to pure benchmark. I noticed that too. But the unique method that used to be used here (and still used on Rosetta) is also benchmark based. It just averages with all the previous claims for that particular test. So in theory, as long as we don't mess with the benchmarks, the awarded credits should be about the same either way.

                                      Edit: I'm guessing the method changed back to the default when the server upgrade happened.
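
The averaging scheme discussed in the last few posts can be sketched as a running mean of claims. This is a toy reading of the description in the thread (first result gets its own claim, later results get an average), not the project's actual server code; the function `grant_credits`, the choice to include the current claim in the mean, and the sample numbers are all assumptions:

```python
def grant_credits(claims):
    """Grant each result the mean of all claims received so far (itself included).

    claims: claimed credit per decoy, in the order results are returned.
    """
    granted = []
    total = 0.0
    for n, claim in enumerate(claims, start=1):
        total += claim
        granted.append(total / n)  # first result simply gets its own claim
    return granted


# Hypothetical per-decoy claims from three hosts with different benchmarks.
print(grant_credits([30.0, 12.0, 18.0]))
```

With these made-up claims the grants are 30.0, 21.0 and 20.0, so granted and claimed credit differ for every result after the first. Granted equal to claimed on every result, as reported in this thread, would then point to plain benchmark crediting.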

                                      Klimax

                                      Joined: Nov 7 07
                                      Posts: 9
                                      ID: 3773
                                      Credit: 10,317
                                      RAC: 0
                                      Message 4440 - Posted 27 Dec 2008 6:13:54 UTC

                                        Hello,
                                        I have had three lr6_score12_... WUs fail:
                                        http://ralph.bakerlab.org/result.php?resultid=1241954
                                        http://ralph.bakerlab.org/result.php?resultid=1241953
                                        http://ralph.bakerlab.org/result.php?resultid=1241939

                                        Apparently some sort of crash
                                        (maybe a bug?)

                                        Klimax

                                        Joined: Nov 7 07
                                        Posts: 9
                                        ID: 3773
                                        Credit: 10,317
                                        RAC: 0
                                        Message 4441 - Posted 27 Dec 2008 12:18:19 UTC - in response to Message 4440.

                                          Hello,
                                          I have had three lr6_score12_... WUs fail:
                                          http://ralph.bakerlab.org/result.php?resultid=1241954
                                          http://ralph.bakerlab.org/result.php?resultid=1241953
                                          http://ralph.bakerlab.org/result.php?resultid=1241939

                                          Apparently some sort of crash
                                          (maybe a bug?)


                                          Another three (all crashing in the same function):

                                          http://ralph.bakerlab.org/result.php?resultid=1241948
                                          http://ralph.bakerlab.org/result.php?resultid=1241947
                                          http://ralph.bakerlab.org/result.php?resultid=1241936

                                          sslickerson

                                          Joined: Feb 15 06
                                          Posts: 17
                                          ID: 37
                                          Credit: 4,006
                                          RAC: 0
                                          Message 4447 - Posted 14 Jan 2009 17:14:07 UTC

                                            Hi there,

                                            I am reattaching to RALPH to try and figure out why my Windows Vista 64-bit laptop errors out on most minirosetta WUs. Will there be any more WUs coming up?

                                            Thanks,

                                            Timothy


                                            sslickerson

                                            Joined: Feb 15 06
                                            Posts: 17
                                            ID: 37
                                            Credit: 4,006
                                            RAC: 0
                                            Message 4448 - Posted 16 Jan 2009 14:57:58 UTC

                                              OK, so I did get 2 WUs overnight that both failed as usual on my laptop with exit status -1073741819 (0xffffffffc0000005). This is Windows Vista 64-bit.

                                              The failed WUs are:

                                              1244381
                                              1244380

                                              This is the message in BOINC Manager:

                                              1/16/2009 5:27:12 AM|ralph@home|Output file _CAPRI17_T38_2.sjf_br_docking.protocol__6705_4_1_0 for task _CAPRI17_T38_2.sjf_br_docking.protocol__6705_4_1 absent

                                              This is typical for all tasks of this type that fail.

                                              Let me know if there is anything else that is needed at this point.

                                              Timothy
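
The exit status in this post is a 32-bit Windows status code printed as a signed number. A minimal sketch of decoding it; the helper `to_ntstatus` and the one-entry lookup table are illustrative, not part of BOINC:

```python
# Only the code seen in this thread is listed here.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus(exit_status: int) -> int:
    """Reduce a signed or sign-extended exit status to its unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


status = to_ntstatus(-1073741819)
print(hex(status), KNOWN_STATUS.get(status, "unknown"))
```

Both forms quoted in the post, -1073741819 and 0xffffffffc0000005, reduce to 0xC0000005, the Windows access violation code. That says the application touched invalid memory, but not where or why.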


                                              Copyright © 2017 University of Washington
