RALPH@home

Bug reports for 5.56-5.59

Rhiju
Forum moderator
Project developer
Project scientist

Joined: Feb 14 06
Posts: 161
ID: 4
Credit: 3,725
RAC: 0
Message 2929 - Posted 29 Mar 2007 5:43:19 UTC

    There are several issues that we're trying to resolve with this update:

    (1) The new FOLD_AND_DOCK workunits were crashing on Windows machines for a pretty subtle reason, hopefully fixed in this update.

    (2) Some Macs have been having consistent problems running Rosetta after recent updates. This update attempts to fix potential stack overflows on Macs, and hopefully the problem computers (we now have a couple attached to RALPH) will be happier.

    (3) Pct complete should remain sane (i.e. never infinity or some huge number).

    If you have problems with preempting and resuming apps on Macs, please post here too!

    We'll probably do another RALPH update Thursday or Friday, based on how things go with this one.


    ____________

    anders n

    Joined: Feb 16 06
    Posts: 166
    ID: 91
    Credit: 131,419
    RAC: 0
    Message 2932 - Posted 29 Mar 2007 9:36:34 UTC

      Last modified: 29 Mar 2007 9:36:55 UTC

      The issue of hanging after preempting did start in the middle of a version.
      I think there was some kind of security update on the OS about that time.
      Just a thought.
      Anders n
      ____________

      ashriel

      Joined: Mar 3 07
      Posts: 11
      ID: 2714
      Credit: 648
      RAC: 0
      Message 2933 - Posted 29 Mar 2007 9:43:49 UTC

        Last modified: 29 Mar 2007 10:36:23 UTC

        The percentage of the WU 1kka__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1kka_-_1877_20 increased normally until 84%, then jumped to 100% and finished (after 57 minutes).

        At WU 1xjr__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1xjr_-_1877_30 I realised that the percentage shows 1% for the first 3 minutes, then increases continuously, but with a remaining time that would make the total 2 hours.
        (Maybe it happened to the WU mentioned above, too; I didn't watch that one from the beginning.)
        After 34 minutes it showed about 16%, then jumped to 100% and finished.
        ____________

        ramostol

        Joined: Mar 29 07
        Posts: 24
        ID: 2840
        Credit: 31,121
        RAC: 0
        Message 2935 - Posted 29 Mar 2007 12:12:57 UTC

          Still problems on Macs, see report

          -- R. A. Mostol

          feet1st

          Joined: Mar 7 06
          Posts: 312
          ID: 1028
          Credit: 110,522
          RAC: 0
          Message 2939 - Posted 29 Mar 2007 15:29:17 UTC

            If you could let us know what approach you have taken to doing the estimated time to completion, that would be helpful in describing our observations and making suggestions on further improvement. Which number(s) do you actually have control over? And which are computed by the BOINC Manager? You just provide the % complete to BOINC?
            ____________

            tralala

            Joined: Apr 12 06
            Posts: 52
            ID: 1266
            Credit: 15,257
            RAC: 0
            Message 2941 - Posted 29 Mar 2007 16:10:28 UTC

              Got a strange "Maximum disk usage exceeded" error. Runtime was 4 hours and the disk has 70 GB free; BOINC is allowed up to 50 GB and 99% disk usage. I guess the output file was too big or some other WU-specific settings were not correct.

              http://ralph.bakerlab.org/result.php?resultid=475000
              ____________

              feet1st

              Joined: Mar 7 06
              Posts: 312
              ID: 1028
              Credit: 110,522
              RAC: 0
              Message 2942 - Posted 29 Mar 2007 16:43:05 UTC

                There is a maximum that the project sets somewhere that prevents things from consuming all your disk if a loop should occur or something. So this may be a problem with the WU.
                ____________

                feet1st

                Joined: Mar 7 06
                Posts: 312
                ID: 1028
                Credit: 110,522
                RAC: 0
                Message 2943 - Posted 29 Mar 2007 16:49:06 UTC

                  With my 24hr runtime pref. it seems the time to completion still ticks up (until the end of a model). The "progress" does increase, though, so that should give users a warm fuzzy that they are making "progress".

                  In my case, I don't see the time to completion actually tick down at all. And based on the jump in % completed between models, it appears that if you could tick up the % completed at about double the current rate, then you'd be right on. And, I'm not sure how BOINC works exactly, but perhaps that would result in my completion time actually ticking down instead. At present, completion time seems to tick up roughly 1 second for every 2 seconds of CPU used. I presume this ratio will differ as I get deeper into my 24hr runtime though.

                  If doubling the increases in % completed isn't how things work... then if you could recalibrate the number you DO control, based on model 1, then at least remaining models would be much closer.
                  ____________

                  tralala

                  Joined: Apr 12 06
                  Posts: 52
                  ID: 1266
                  Credit: 15,257
                  RAC: 0
                  Message 2944 - Posted 29 Mar 2007 17:40:31 UTC - in response to Message 2941.

                    Got a strange "Maximum disk usage exceeded" error. Runtime was 4 hours and the disk has 70 GB free; BOINC is allowed up to 50 GB and 99% disk usage. I guess the output file was too big or some other WU-specific settings were not correct.

                    http://ralph.bakerlab.org/result.php?resultid=475000


                    I guess stdout.txt reached the 100MB limit. I have a similar WU now and stdout is at 50MB after two hours.
                    ____________

                    Rhiju
                    Forum moderator
                    Project developer
                    Project scientist

                    Joined: Feb 14 06
                    Posts: 161
                    ID: 4
                    Credit: 3,725
                    RAC: 0
                    Message 2945 - Posted 29 Mar 2007 18:54:58 UTC - in response to Message 2943.

                      Checking now on the Max Disk usage.

                      Feet1st -- in the past, did the "Time to completion" also increase during the run? Or did this start happening with the % complete "fix"? I'm a little puzzled by this behavior.

                      With my 24hr runtime pref. it seems the time to completion still ticks up (until the end of a model). The "progress" does increase, though, so that should give users a warm fuzzy that they are making "progress".

                      In my case, I don't see the time to completion actually tick down at all. And based on the jump in % completed between models, it appears that if you could tick up the % completed at about double the current rate, then you'd be right on. And, I'm not sure how BOINC works exactly, but perhaps that would result in my completion time actually ticking down instead. At present, completion time seems to tick up roughly 1 second for every 2 seconds of CPU used. I presume this ratio will differ as I get deeper into my 24hr runtime though.

                      If doubling the increases in % completed isn't how things work... then if you could recalibrate the number you DO control, based on model 1, then at least remaining models would be much closer.


                      ____________

                      feet1st

                      Joined: Mar 7 06
                      Posts: 312
                      ID: 1028
                      Credit: 110,522
                      RAC: 0
                      Message 2946 - Posted 29 Mar 2007 19:03:46 UTC - in response to Message 2945.

                        Last modified: 29 Mar 2007 19:05:39 UTC

                        Feet1st -- in the past, did the "Time to completion" also increase during the run? Or did this start happening with the % complete "fix"? I'm a little puzzled by this behavior.


                        ...of course, has always increased. I had just assumed that with the increasing % completed that the result of a decreasing time to completion could be achieved. See question here.
                        ____________

                        UBT - Janea

                        Joined: Dec 17 06
                        Posts: 1
                        ID: 2397
                        Credit: 1,673
                        RAC: 0
                        Message 2947 - Posted 29 Mar 2007 19:20:01 UTC

                          I've got rather a strange bug. The WU I'm running appears to be continuing to run, even though it is showing as "Waiting to run". It is also showing that it's 208% complete and counting. I tried doing an update to the server but that hasn't stopped it. I would cut and paste the WU details, but it won't let me. I'm running it on BOINC 5.8.15 on Windows XP.

                          ____________



                          feet1st

                          Joined: Mar 7 06
                          Posts: 312
                          ID: 1028
                          Credit: 110,522
                          RAC: 0
                          Message 2949 - Posted 29 Mar 2007 21:04:54 UTC

                            Unable to rotate v5.56 graphic, unless full screen... or if you move window to upper left, just as reported by Teppo here.
                            ____________

                            Rhiju
                            Forum moderator
                            Project developer
                            Project scientist

                            Joined: Feb 14 06
                            Posts: 161
                            ID: 4
                            Credit: 3,725
                            RAC: 0
                            Message 2950 - Posted 30 Mar 2007 1:30:43 UTC - in response to Message 2949.

                              I had this issue too on Windows. Still don't know what causes it -- if there are any GLUT experts out there who use Visual Studio, please let me know if you have any insight. I need to know where the mouse starts clicking, and it's like those coordinates (which actually come from a BOINC API) are messed up. This might actually be an issue with the BOINC/GLUT interface.

                              Graphics do work perfectly on Macs...

                              Unable to rotate v5.56 graphic, unless full screen... or if you move window to upper left, just as reported by Teppo here.


                              ____________

                              Rhiju
                              Forum moderator
                              Project developer
                              Project scientist

                              Joined: Feb 14 06
                              Posts: 161
                              ID: 4
                              Credit: 3,725
                              RAC: 0
                              Message 2951 - Posted 30 Mar 2007 1:41:43 UTC

                                Last modified: 30 Mar 2007 1:41:59 UTC

                                New stuff in 5.57
                                In principle this should be a new thread, but I started getting confused by all the simultaneous discussions!

                                1. I've rebuilt the apps with the latest BOINC API. Cross your fingers -- let's see if this takes care of the "Process not found" problems upon preempting, and the fraction of Macs that can't seem to run anything properly.

                                Small note for aficionados: in previous apps, I was putting a call into the BOINC API code to increase the default stack size for Rosetta on Macs because they kept giving overflows on workunits with large RNAs. I removed this stack-size "fix" to check whether it might be causing any of the issues we've been seeing; however, as a result, some of the RNA workunits will error out on Macs (both PowerPC and Intel). This is temporary; I'll put the fix back in after this test with 5.57.

                                2. Cured the "maximum disk space exceeded" problem for RNA workunits.

                                3. Fixed a pretty subtle bug that was crashing symmetric FOLD_AND_DOCK workunits once in a while; the fix may also help further reduce the error rates on "normal" workunits.

                                4. Percentage complete is updated differently. Thanks for all your input on this:

                                Historically, when a decoy has been completed, the % complete jumps up to

                                fraction complete = current_cpu_time/user_preferred_cpu_time.

                                Now the same simple formula is used every five seconds. If this works properly (I have high hopes!), the estimated time to completion should make sense, dropping by five seconds every five seconds. There may be some issues with BOINC trying to make this estimation in a "smart" way.

                                One more thing: once the estimated time to completion drops to 10 minutes, it won't go below 10 minutes. From then on:

                                fraction complete = current_cpu_time/(current_cpu_time + 10 minutes)

                                The idea is that there are certain runs that go a little overtime due to variance in how long it takes to make a decoy. In those cases, we don't want to artificially run up against 100%. What you'll see instead is that % complete will asymptotically approach 100% (but not get there until a decoy is completed), and estimated time to completion should stay around 10 minutes. Not perfect behavior, but it's better than before!
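For readers who want to play with the numbers, here is a minimal Python sketch of the update rule described above. It is an illustration only, not the actual Rosetta code: the function names, the `GRACE_SECONDS` constant, and the simple proportional time-remaining estimate are all assumptions (BOINC's real estimate may well be computed differently).

```python
GRACE_SECONDS = 600.0  # assumed name for the 10-minute floor described in the post


def fraction_complete(cpu_seconds, preferred_seconds, grace=GRACE_SECONDS):
    """Progress fraction following the two formulas quoted in the post."""
    if cpu_seconds <= 0:
        return 0.0
    if preferred_seconds - cpu_seconds > grace:
        # Normal case: progress is linear in CPU time.
        return cpu_seconds / preferred_seconds
    # Within 10 minutes of the preferred time (or past it): approach 1.0
    # asymptotically without ever reaching it.
    return cpu_seconds / (cpu_seconds + grace)


def implied_remaining(cpu_seconds, fraction):
    """Hypothetical proportional estimate: remaining = elapsed * (1 - f) / f."""
    return cpu_seconds * (1.0 - fraction) / fraction


if __name__ == "__main__":
    preferred = 4 * 3600  # a 4-hour run time preference
    for cpu in (3600, 2 * 3600, 4 * 3600, 6 * 3600):
        f = fraction_complete(cpu, preferred)
        print(cpu, round(f, 4), round(implied_remaining(cpu, f)))
```

Under that proportional assumption, the second formula always implies exactly 600 seconds remaining, which is consistent with the estimate staying around 10 minutes.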




                                ____________

                                Rhiju
                                Forum moderator
                                Project developer
                                Project scientist

                                Joined: Feb 14 06
                                Posts: 161
                                ID: 4
                                Credit: 3,725
                                RAC: 0
                                Message 2952 - Posted 30 Mar 2007 1:46:56 UTC - in response to Message 2932.

                                  Anders n, I've been able to reproduce this preemption problem on my machine. It's completely puzzling to me. I'm very intrigued by your idea that the problem is correlated with the OS X security update -- the new OS might get rid of processes that haven't been active for a while. We'll look into it!

                                  The issue of hanging after preempting did start in the middle of a version.
                                  I think there was some kind of security update on the OS about that time.
                                  Just a thought.
                                  Anders n


                                  ____________

                                  feet1st

                                  Joined: Mar 7 06
                                  Posts: 312
                                  ID: 1028
                                  Credit: 110,522
                                  RAC: 0
                                  Message 2953 - Posted 30 Mar 2007 4:00:22 UTC

                                    Last modified: 30 Mar 2007 4:21:26 UTC

                                    You mean:
                                    fraction complete = current_cpu_time/(user_preferred_cpu_time + 10 minutes)

                                    don't you?

                                    No... you said *IF* the estimated time to completion becomes 10 minutes... so
                                    if:
                                    F = fraction complete = current_cpu_time/user_preferred_cpu_time
                                    if (1 - F) * user_preferred_cpu_time < 600 seconds
                                    then F = current_cpu_time/(current_cpu_time + 600 seconds)
                                    so once we hit that magic 600 seconds we might see a gap in % complete there.
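A quick numeric check of the rule as restated here (a sketch with hypothetical function names, not project code): at the moment the remaining time reaches 600 seconds, both expressions give the same value, so the switch itself should not produce a jump in % complete.

```python
def linear(cpu, preferred):
    # F = current_cpu_time / user_preferred_cpu_time
    return cpu / preferred


def capped(cpu, grace=600.0):
    # F = current_cpu_time / (current_cpu_time + 600 seconds)
    return cpu / (cpu + grace)


preferred = 24 * 3600.0          # a 24-hour run time preference
switch = preferred - 600.0       # CPU time at which 600 seconds remain
print(linear(switch, preferred), capped(switch))
```

Both values agree because at that point current_cpu_time + 600 equals user_preferred_cpu_time.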

                                    If that's the formula, then why does the % complete gap up at the end of a model? It should just be another 5 seconds of runtime.

                                    And why does % complete begin at 1%? This doesn't follow the formula.
                                    [edit] I now see the 5.57 version does NOT start at 1% anymore.

                                    And beyond that % complete figure, BOINC does some twiddling with my historical time per task? Perhaps this is why my estimated time to completion didn't count down? I intentionally left my RT pref at my normal 24hr figure that BOINC is accustomed to.

                                    So, if I have a rather slow machine, let's say it will take 6hrs to complete a single model (in fact, I had one the other day taking ~5hrs per model on a 3 GHz machine), and a runtime pref. of just 1hr (I don't for the life of me know WHY people do that), such a task would show 90+% completed for like 5hrs? As you say, at least it's moving, and counting UP... but still has room to improve.

                                    [edit] I see with this new formula that my estimated runtime now only increases for 5 seconds at a time. So, it appears you've successfully addressed my earlier question about the estimate increasing clear through to the end of the model... and I take it that once I complete a model here I will NOT see the gap up that I saw in the prior release.

                                    Very nice. Now I will soon be ready to begin phase II of progress % testing (...sinister laugh here). Thanks for all the effort on improving the user experience.
                                    ____________

                                    Rhiju
                                    Forum moderator
                                    Project developer
                                    Project scientist

                                    Joined: Feb 14 06
                                    Posts: 161
                                    ID: 4
                                    Credit: 3,725
                                    RAC: 0
                                    Message 2954 - Posted 30 Mar 2007 8:34:31 UTC - in response to Message 2953.

                                      Feet1st, thanks much for your help so far with this and other bugs (or shall we call them "features"?). I've been trying to put out about a dozen fires over the last three days, with the hopes of sending out some new science on Rosetta@home soon, and it would be impossible without user feedback. I think for the first time we have an app that has >98% success rate on all platforms. Sweet!

                                      Regarding your extreme example of the 6 hour per decoy workunit, actually the watchdog would kill it at 4 hours (if that's the CPU run time preference). So for the first hour everything would look fine, and then the users would probably get annoyed for the next three hours... of course, our solution to this problem is to be careful -- we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!

                                      You mean:
                                      fraction complete = current_cpu_time/(user_preferred_cpu_time + 10 minutes)

                                      don't you?

                                      No... you said *IF* the estimated time to completion becomes 10 minutes... so
                                      if:
                                      F = fraction complete = current_cpu_time/user_preferred_cpu_time
                                      if (1 - F) * user_preferred_cpu_time < 600 seconds
                                      then F = current_cpu_time/(current_cpu_time + 600 seconds)
                                      so once we hit that magic 600 seconds we might see a gap in % complete there.

                                      If that's the formula, then why does the % complete gap up at the end of a model? It should just be another 5 seconds of runtime.

                                      And why does % complete begin at 1%? This doesn't follow the formula.
                                      [edit] I now see the 5.57 version does NOT start at 1% anymore.

                                      And beyond that % complete figure, BOINC does some twiddling with my historical time per task? Perhaps this is why my estimated time to completion didn't count down? I intentionally left my RT pref at my normal 24hr figure that BOINC is accustomed to.

                                      So, if I have a rather slow machine, let's say it will take 6hrs to complete a single model (in fact, I had one the other day taking ~5hrs per model on a 3 GHz machine), and a runtime pref. of just 1hr (I don't for the life of me know WHY people do that), such a task would show 90+% completed for like 5hrs? As you say, at least it's moving, and counting UP... but still has room to improve.

                                      [edit] I see with this new formula that my estimated runtime now only increases for 5 seconds at a time. So, it appears you've successfully addressed my earlier question about the estimate increasing clear through to the end of the model... and I take it that once I complete a model here I will NOT see the gap up that I saw in the prior release.

                                      Very nice. Now I will soon be ready to begin phase II of progress % testing (...sinister laugh here). Thanks for all the effort on improving the user experience.


                                      ____________

                                      ramostol

                                      Joined: Mar 29 07
                                      Posts: 24
                                      ID: 2840
                                      Credit: 31,121
                                      RAC: 0
                                      Message 2955 - Posted 30 Mar 2007 9:05:49 UTC

                                        Congratulations (Rosetta 5.57), for the first time in a week I have had Rosetta keeping a wu for 17 minutes, still working.

                                        -- R. A. Mostol

                                        [B^S] sTrey

                                        Joined: Feb 15 06
                                        Posts: 58
                                        ID: 36
                                        Credit: 15,430
                                        RAC: 0
                                        Message 2956 - Posted 30 Mar 2007 9:17:30 UTC

                                          Last modified: 30 Mar 2007 9:18:04 UTC

                                          Ditto, I have two 5.57 WUs whose reported progress looks right and that are halfway through their 4 hour preferred time without aborting; altogether they look much better than .56 and .55. Thanks!

                                          ashriel

                                          Joined: Mar 3 07
                                          Posts: 11
                                          ID: 2714
                                          Credit: 648
                                          RAC: 0
                                          Message 2957 - Posted 30 Mar 2007 10:01:37 UTC

                                            Last modified: 30 Mar 2007 10:01:58 UTC

                                            Running 5.57, default: 1 hour, WU 1zih__BOINC_SMOOTH_INCREASE_CYCLES10_RNA_ABINITIO-1zih_-_1882_35:

                                            Time: 30 Minutes - Percentage: 50 - Time left: 35 Minutes
                                            Time: 45 Minutes - Percentage: 75 - Time left: 16 Minutes
                                            Time: 59 Minutes - Percentage:100 - Time left: -

                                            Nice :D
                                            ____________

                                            Conan

                                            Joined: Feb 16 06
                                            Posts: 344
                                            ID: 145
                                            Credit: 1,309,534
                                            RAC: 0
                                            Message 2958 - Posted 30 Mar 2007 10:36:46 UTC

                                              Work unit http://ralph.bakerlab.org/result.php?resultid=474413
                                              gives ERROR EXIT CODE 131. SIGSEGV ERROR.

                                              Was also posting the following Maximum Disk space usage ERROR from 5.55 and 5.56, but it may now be fixed in 5.57?

                                              http://ralph.bakerlab.org/result.php?resultid=474639
                                              http://ralph.bakerlab.org/result.php?resultid=474622
                                              http://ralph.bakerlab.org/result.php?resultid=475356
                                              http://ralph.bakerlab.org/result.php?resultid=475357.


                                              ____________

                                              anders n

                                              Joined: Feb 16 06
                                              Posts: 166
                                              ID: 91
                                              Credit: 131,419
                                              RAC: 0
                                              Message 2961 - Posted 30 Mar 2007 14:21:06 UTC

                                                % issue
                                                I have a Wu that was at 40% and had started model no 4.

                                                I restarted Boinc and the Wu restarted at model no 4 but with 0% and
                                                started counting up.

                                                Anders n
                                                ____________

                                                feet1st

                                                Joined: Mar 7 06
                                                Posts: 312
                                                ID: 1028
                                                Credit: 110,522
                                                RAC: 0
                                                Message 2962 - Posted 30 Mar 2007 14:31:01 UTC - in response to Message 2954.

                                                  ...we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!


                                                  ...well, THAT would certainly be one approach to solving the problem :)

                                                  ...but... um... "an hour" on how fast of a machine? On the minimum required for the project, a 500MHz machine?

                                                  I just note that while the science is obviously improving and the well-seasoned types of runs are generally 10-15min. per model... there always seem to be NEW types of runs as well (Docking, RNA, now FOLD_AND_DOCK) which always seem to take significantly longer than normal. Assuming that trend continues, you will always have some new type of work that has very long crunchtime per model.

                                                  One possible way to address that would be if you could pick a mid-model point at which you define yourself to be x% done. Say pick three points, one near 25%, one near 50% and one near 75%. Then you could "know" in advance that a given single model will exceed the RT pref. and show a more linear progression on % completed. ...but, as you say, you've got lots of other fish to fry. I think the current progress indication is a VAST improvement and will avoid a lot of confusion with new users.
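To make the milestone idea concrete, here is a small Python sketch. Everything in it is hypothetical: the function names are invented for illustration, and it assumes the application can tell when a model has reached a known fraction (for example 25%) of its work, which the thread does not confirm.

```python
def projected_model_seconds(elapsed_in_model, milestone_fraction):
    """Project the total time for one model from a mid-model milestone."""
    if not 0.0 < milestone_fraction <= 1.0:
        raise ValueError("milestone_fraction must be in (0, 1]")
    return elapsed_in_model / milestone_fraction


def model_will_exceed(elapsed_in_model, milestone_fraction, preferred_seconds):
    """True if the projected single-model time is longer than the run time preference."""
    projected = projected_model_seconds(elapsed_in_model, milestone_fraction)
    return projected > preferred_seconds


if __name__ == "__main__":
    # 90 minutes to reach the 25% milestone, with a 1-hour preference.
    print(projected_model_seconds(5400.0, 0.25))
    print(model_will_exceed(5400.0, 0.25, 3600.0))
```

With such a projection in hand, the progress fraction for a long model could be scaled against the projected model time rather than the preference alone.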

                                                  ...now... about that checkpointing?? (we always gotta ask for more, it's our job!). Of course, if you change the models to average <1hr, this minimizes the need for more checkpoints as well.
                                                  ____________

                                                  ashriel

                                                  Joined: Mar 3 07
                                                  Posts: 11
                                                  ID: 2714
                                                  Credit: 648
                                                  RAC: 0
                                                  Message 2963 - Posted 30 Mar 2007 15:09:49 UTC - in response to Message 2957.

                                                    Last modified: 30 Mar 2007 15:12:22 UTC

                                                    Running 5.57, default: 1 hour, WU 1fna__BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1881_7:

                                                    Time: 30 Minutes - Percentage: 50
                                                    Time: 42 Minutes - Percentage:100

                                                    Why do some WUs run shorter than they should (1 hour)? The percentage then doesn't work, of course.

                                                    ____________

                                                    feet1st

                                                    Joined: Mar 7 06
                                                    Posts: 312
                                                    ID: 1028
                                                    Credit: 110,522
                                                    RAC: 0
                                                    Message 2964 - Posted 30 Mar 2007 15:16:19 UTC - in response to Message 2961.

                                                      Last modified: 30 Mar 2007 15:54:18 UTC

                                                      % issue
                                                      I have a Wu that was at 40% and had started model no 4.

                                                      I restarted Boinc and the Wu restarted at model no 4 but with 0% and
                                                      started counting up.

                                                      Anders n


                                                      So, I decided to try that as well, end BOINC, restart, had two WUs running.

                                                      Upon restart this one went to 100% immediately.
                                                      Says: Completed 30 RNA decoys above the report that 62 decoys were generated.

                                                      This one is a "FOLD_AND_DOCK" and it went to zero % after a second or two, and then according to the msgs, 20 seconds later, it went to 100% as well. No indication of why it didn't continue to crunch; it completed 63 decoys and 159 nstructs (the graphic showed the 63 as the "model").
                                                      ____________

                                                      Profile feet1st

                                                      Joined: Mar 7 06
                                                      Posts: 312
                                                      ID: 1028
                                                      Credit: 110,522
                                                      RAC: 0
                                                      Message 2965 - Posted 30 Mar 2007 15:18:29 UTC - in response to Message 2963.

                                                        Last modified: 30 Mar 2007 15:20:50 UTC

                                                        Why do some WUs run shorter than they should (1 hour) - percentage then doesn't work, of course.

                                                        You have to look at how many models you completed. Ralph estimated that it would take longer than an hour if it began another model, so it had to end a little early rather than keep you later than your preference. So, to the nearest model, your preference is met. When models take significant time, the difference from your expectation can become even more noticeable.

                                                        In your case, you crunched two models in 2546 seconds. So the client should estimate that a third model would take about 1275 seconds more, which would bring the total to roughly 3820 seconds and exceed your one-hour (3600-second) preference.
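
The rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Rosetta code: the function name `should_start_another_model` is made up, and using the average time of the completed models as the estimate for the next one is an assumption.

```python
def should_start_another_model(elapsed_s, models_done, target_s):
    """Return True if one more model is expected to fit in the run-time preference.

    Hypothetical helper: the cost of the next model is estimated as the
    average time of the models completed so far.
    """
    if models_done == 0:
        # Nothing to average yet, so the first model always has to be attempted.
        return True
    avg_per_model = elapsed_s / models_done
    return elapsed_s + avg_per_model <= target_s


# Numbers from the post: two models in 2546 s with a one-hour preference.
# 2546 + 2546 / 2 = 3819 s, which is more than 3600 s, so no third model.
print(should_start_another_model(2546, 2, 3600))
```

With an estimate like this, the longer a single model takes, the further the actual run time can fall short of the preference.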
                                                        ____________

                                                        Profile ashriel

                                                        Joined: Mar 3 07
                                                        Posts: 11
                                                        ID: 2714
                                                        Credit: 648
                                                        RAC: 0
                                                        Message 2966 - Posted 30 Mar 2007 15:25:05 UTC - in response to Message 2965.

                                                          thx for that explanation :)
                                                          ____________

                                                          Profile anders n

                                                          Joined: Feb 16 06
                                                          Posts: 166
                                                          ID: 91
                                                          Credit: 131,419
                                                          RAC: 0
                                                          Message 2967 - Posted 30 Mar 2007 15:48:44 UTC

                                                            MAC
                                                            I tried to get Ralph to "hang" again by pausing and resuming, then
                                                            by manually getting it to switch between Einstein and Ralph... no success. :)
                                                            I'll let it run by itself, hopefully starting to switch on its own, to see if
                                                            it still works as it should.
                                                            Anders n
                                                            ____________

                                                            Rhiju
                                                            Forum moderator
                                                            Project developer
                                                            Project scientist

                                                            Joined: Feb 14 06
                                                            Posts: 161
                                                            ID: 4
                                                            Credit: 3,725
                                                            RAC: 0
                                                            Message 2968 - Posted 30 Mar 2007 18:10:41 UTC - in response to Message 2962.

                                                              Last modified: 30 Mar 2007 18:12:51 UTC

                                                              Hi everybody:

                                                              For the first time in a long time, we have outstanding rates of success:

                                                              Version  OS       Total Results  Pass Rate (%)  Fail Rate (%)
                                                              557      Darwin             282          98.58           1.42
                                                              557      Linux               72          98.61           1.39
                                                              557      Unknown             17         100.00           0.00
                                                              557      Windows           1517          96.97           2.70

                                                              (Sorry for the formatting.)

                                                              Many of the failures are due to download errors, so the true error rate is quite low.

                                                              There's still an issue on Mac of preempted workunits not returning to memory -- thanks Anders n for pointing this out. Another developer (David K) and I can both reproduce this on our Mac laptops. We wonder if it's an OS X issue; it may also be a BOINC client issue. We'll keep you posted. The good news is that the Macs that were having consistent problems with all workunits are running again!

                                                              Feet1st, I agree that checkpointing is the best solution for potentially long workunits. I've figured out a way to easily put in checkpoints for most of our code, but it's going to take a couple of weeks of development to write the appropriate helper code and test it. Stay tuned!

                                                              Finally, you can expect a couple more Ralph updates today and this weekend. I would like to try out a stack overflow fix for Macs, which allows them to carry out work on larger RNAs (interestingly, Windows works fine, as do Macs when compiled without graphics). So I'll give that a shot today, and if it doesn't work, take out the code tomorrow. Anyway, it looks like we're on our way to a Rosetta@home update with some cool new science and several useful bug fixes around Sunday or Monday.

                                                              Thanks to everybody!


                                                              ...we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!


                                                              ...well, THAT would certainly be one approach to solving the problem :)

                                                              ...but... um... "an hour" on how fast of a machine? On the minimum required for the project, a 500MHz machine?

                                                              I just note that while the science is obviously improving and the well-seasoned types of runs are generally 10-15 min. per model... there always seem to be NEW types of runs as well (Docking, RNA, now FOLD_AND_DOCK) which always seem to take significantly longer than normal. Assuming that trend continues, you will always have some new type of work that has a very long crunch time per model.

                                                              One possible way to address that would be if you could pick a mid-model point at which you define yourself to be x% done. Say, pick three points: one near 25%, one near 50% and one near 75%. Then you could "know" in advance that a given single model will exceed the RT pref. and show a more linear progression on % completed. ...but, as you say, you've got lots of other fish to fry. I think the current progress indication is a VAST improvement and will avoid a lot of confusion with new users.

                                                              ...now... about that checkpointing?? (we always gotta ask for more, it's our job!). Of course, if you change the models to average <1hr, this minimizes the need for more checkpoints as well.


                                                              ____________

                                                              Rhiju
                                                              Forum moderator
                                                              Project developer
                                                              Project scientist

                                                              Joined: Feb 14 06
                                                              Posts: 161
                                                              ID: 4
                                                              Credit: 3,725
                                                              RAC: 0
                                                              Message 2969 - Posted 30 Mar 2007 21:08:29 UTC

                                                                Update to 5.58
                                                                This is basically the same app as 5.57, except for a small science fix to symmetric and docking (those workunits have been running beautifully otherwise) and a change in the Macs.

                                                                I'm now trying to set stack sizes based on the maximum allowed by your system (typically 32-64 MB). This is necessary for the large RNA jobs, as well as for future work that involves, e.g., designs of transcription factors that bind DNA and could be used for gene therapy. I'm also reporting the stack sizes in stderr.txt, which is returned from your clients to our server, so I can get some info. This may crash some Macs, in which case I'll revert the change and test again.
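
For anyone curious what "set stack sizes based on the maximum allowed by your system" can look like, here is a small sketch. It is an assumption-laden illustration, not the Rosetta code: it uses Python's `resource` module (Unix only), which wraps the POSIX `getrlimit`/`setrlimit` calls, and the helper names and the 64 MB default are made up for the example.

```python
import resource
import sys

MB = 1024 * 1024


def choose_stack_limit(soft, hard, wanted):
    """Pick a new soft stack limit: as close to `wanted` as the hard limit allows.

    Hypothetical helper. It never lowers the current soft limit.
    """
    inf = resource.RLIM_INFINITY
    if soft == inf:
        return soft  # already unlimited, nothing to raise
    target = wanted if hard == inf else min(wanted, hard)
    return max(soft, target)


def raise_stack_limit(wanted=64 * MB):
    """Try to raise the soft stack limit and report the values on stderr."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    new_soft = choose_stack_limit(soft, hard, wanted)
    if new_soft != soft:
        try:
            resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
        except (ValueError, OSError):
            new_soft = soft  # the system refused the change; keep the old limit
    # Printing to stderr means the values land in a log such as stderr.txt.
    print(f"stack limit: soft={soft} hard={hard} new_soft={new_soft}", file=sys.stderr)
    return new_soft
```

Keeping the decision in `choose_stack_limit` separate from the system call makes it easy to check the logic without touching the limits of the running process.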

                                                                ____________

                                                                Profile feet1st

                                                                Joined: Mar 7 06
                                                                Posts: 312
                                                                ID: 1028
                                                                Credit: 110,522
                                                                RAC: 0
                                                                Message 2970 - Posted 30 Mar 2007 22:15:43 UTC - in response to Message 2954.

                                                                  Last modified: 30 Mar 2007 22:16:21 UTC

                                                                  ...our solution to this problem is to be careful -- we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy!

                                                                  It's funny you should say that RIGHT when I've got 4 Ralph WUs and all 4 are taking more than an hour for their first model. (But TRUE, you said "on Rosetta", not "on Ralph".)

                                                                  I ended and started BOINC again (I've got 5.57). This time they didn't end and report in the way they did before... but the completion % of the two tasks is the same, even though one had completed two models and the other lost all of its work on model 1. So, right now one has 2:20:xx of CPU and is showing the same % completed as the one that just restarted 5 min ago. So it isn't taking total CPU time for the WU into account, just CPU time since the last BOINC startup.

                                                                  See also anders n's post where they observed similar behavior.
                                                                  ____________

                                                                  Rhiju
                                                                  Forum moderator
                                                                  Project developer
                                                                  Project scientist

                                                                  Joined: Feb 14 06
                                                                  Posts: 161
                                                                  ID: 4
                                                                  Credit: 3,725
                                                                  RAC: 0
                                                                  Message 2971 - Posted 31 Mar 2007 3:20:54 UTC - in response to Message 2970.

                                                                    Yes, I did send out some massively long workunits -- just testing out the system!

                                                                    Hmm, I hadn't carefully thought about what would happen if two models were completed on the first pass. Let me see if I can figure out a fix...


                                                                    ____________

                                                                    Rhiju
                                                                    Forum moderator
                                                                    Project developer
                                                                    Project scientist

                                                                    Joined: Feb 14 06
                                                                    Posts: 161
                                                                    ID: 4
                                                                    Credit: 3,725
                                                                    RAC: 0
                                                                    Message 2972 - Posted 31 Mar 2007 5:18:58 UTC - in response to Message 2970.

                                                                      OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent.

                                                                      But if you shut BOINC down and restart, that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of CPU time spent so far, but that might be a can of worms. Worth the time? I think it's a better use of our time to figure out what's going wrong with the Macs' preempt/resume so that most users will not need to shut down BOINC and restart like Anders n has been doing! And we'll spend time getting in those checkpoints...
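
To make the "more self-sufficient" idea concrete, here is a rough sketch of an app saving its own running total of CPU time to a small file, so that percent complete can survive a full shutdown and restart. Everything in it is an assumption made for illustration (the JSON file layout, the function names, and the time-based progress formula); it is written in Python and is not how Rosetta or the BOINC API actually does it.

```python
import json
import os
import tempfile


def load_cpu_seconds(path):
    """Return the CPU seconds saved by an earlier session, or 0.0 if none."""
    try:
        with open(path) as f:
            return float(json.load(f)["cpu_seconds"])
    except (OSError, ValueError, KeyError, TypeError):
        return 0.0


def save_cpu_seconds(path, total_seconds):
    """Write the running total via a temp file so a crash cannot leave it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"cpu_seconds": total_seconds}, f)
    os.replace(tmp, path)  # atomic rename over the old file


def percent_complete(saved_seconds, session_seconds, target_seconds):
    """Progress based on all CPU time spent on the workunit, capped at 100%."""
    total = saved_seconds + session_seconds
    return min(100.0, 100.0 * total / target_seconds)
```

On startup the app would call `load_cpu_seconds`, add the time spent in the current session when reporting progress, and call `save_cpu_seconds` at each checkpoint. The can of worms is visible even here: any time spent after the last save is lost on restart.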



                                                                      ____________

                                                                      alexpoon

                                                                      Joined: Sep 9 06
                                                                      Posts: 4
                                                                      ID: 1824
                                                                      Credit: 87
                                                                      RAC: 0
                                                                      Message 2973 - Posted 31 Mar 2007 9:11:15 UTC

                                                                        I found that after suspending the WU (with applications not left in memory), if I start it again,
                                                                        the % finished is counted again from the start, but the work itself continues where it left off (for example, it starts at model 5).

                                                                        Profile anders n

                                                                        Joined: Feb 16 06
                                                                        Posts: 166
                                                                        ID: 91
                                                                        Credit: 131,419
                                                                        RAC: 0
                                                                        Message 2974 - Posted 31 Mar 2007 10:52:30 UTC

                                                                          Update on my Mac
                                                                          Ralph 5.57 and Einstein have been switching all night without any errors.

                                                                          On to 5.58 :)
                                                                          Anders n
                                                                          ____________

                                                                          Profile anders n

                                                                          Joined: Feb 16 06
                                                                          Posts: 166
                                                                          ID: 91
                                                                          Credit: 131,419
                                                                          RAC: 0
                                                                          Message 2975 - Posted 31 Mar 2007 14:24:23 UTC - in response to Message 2972.

                                                                            OK, just talked to David K about this. Right now we keep track of time crunched based on a call to the BOINC API ... i.e. the BOINC manager keeps track of how much time was spent on each workunit. If you preempt after an hour and resume later, the BOINC manager will tell Rosetta about the hour already spent.

                                                                            But if you shut BOINC down and restart that could cause a problem in a lot of estimates... we can try to make the Rosetta app more self-sufficient, keeping track of cpu time spent so far, but that might be a can of worms. Worth the time?


                                                                            Just so we have all the facts right: when a Ralph 5.58 WU is resumed after a preempt, the % done goes back to 0 and the time to complete goes very high.
                                                                            I just had one that was preempted at 2 h, and when it restarted, the time to complete was
                                                                            at nearly 6 h (rapidly going down as the % was going up). I have a 4 h setting for
                                                                            Ralph on that computer.

                                                                            Anders n

                                                                            ____________

                                                                            Rhiju
                                                                            Forum moderator
                                                                            Project developer
                                                                            Project scientist

                                                                            Joined: Feb 14 06
                                                                            Posts: 161
                                                                            ID: 4
                                                                            Credit: 3,725
                                                                            RAC: 0
                                                                            Message 2976 - Posted 1 Apr 2007 0:04:18 UTC - in response to Message 2975.

                                                                              Last modified: 1 Apr 2007 0:06:29 UTC

                                                                              Anders n, I think the behavior you observe is partly due to an additional "correction" that the BOINC API applies when estimating time to completion -- it should never really be over 4 hours, right? We really don't have any control over that extra "correction".

                                                                              But we do have control over percent complete, and that shouldn't go to zero upon resuming Ralph! So I'm still worried. On my Intel Mac, I just tried suspending a Ralph WU and ran Einstein@home for a few minutes; then I suspended the Einstein@home workunit and resumed the Ralph WU. Everything was fine (pct complete never dropped to zero)... when you try this, does pct complete drop to zero?

                                                                              [edit]
                                                                              Another question: you posted that 5.57 was fine; are you seeing an issue only with 5.58? If so, this is totally puzzling, since the small change I made to the Mac app shouldn't affect the behavior of pct complete.



                                                                              ____________

                                                                              Beans and Pulses

                                                                              Joined: Feb 16 06
                                                                              Posts: 1
                                                                              ID: 92
                                                                              Credit: 200,001
                                                                              RAC: 0
                                                                              Message 2977 - Posted 1 Apr 2007 5:17:00 UTC

                                                                                Last modified: 1 Apr 2007 5:17:39 UTC

                                                                                On an AMD 2000XP running WinXP, symm_fold WUs (Rosetta 5.57) stick at 97.672%; I aborted them after 7+ hours of run time. Other types of WUs have run OK on this box. Run time is set in preferences to 2 hours. For example:

                                                                                http://ralph.bakerlab.org/result.php?resultid=479508
                                                                                ____________

                                                                                Profile anders n

                                                                                Joined: Feb 16 06
                                                                                Posts: 166
                                                                                ID: 91
                                                                                Credit: 131,419
                                                                                RAC: 0
                                                                                Message 2978 - Posted 1 Apr 2007 6:01:42 UTC - in response to Message 2976.

                                                                                  Last modified: 1 Apr 2007 6:07:42 UTC

                                                                                  Oops, sorry, I should have said that it was on a Windows XP computer that
                                                                                  the % went to 0. It works OK on the Mac.
                                                                                  As a side note, it does not happen when I suspend and resume in the middle of a model; it only happens at a model switch by BOINC itself.
                                                                                  (I have "Leave applications in memory while suspended" set to yes)

                                                                                  [edit] The Mac has done one more night of switching with Einstein without any trouble, now with 5.58 [/edit]

                                                                                  Anders n
                                                                                  ____________

                                                                                  Profile feet1st

                                                                                  Joined: Mar 7 06
                                                                                  Posts: 312
                                                                                  ID: 1028
                                                                                  Credit: 110,522
                                                                                  RAC: 0
                                                                                  Message 2979 - Posted 1 Apr 2007 23:10:36 UTC

                                                                                    My observations of % complete resetting to zero upon restart are from Windows as well. You have to remove the task from memory. I did so by ending BOINC completely rather than changing my settings. Crunch 2 models, then end BOINC and restart.
                                                                                    ____________

                                                                                    Rhiju
                                                                                    Forum moderator
                                                                                    Project developer
                                                                                    Project scientist

                                                                                    Joined: Feb 14 06
                                                                                    Posts: 161
                                                                                    ID: 4
                                                                                    Credit: 3,725
                                                                                    RAC: 0
                                                                                    Message 2980 - Posted 2 Apr 2007 1:19:52 UTC

                                                                                      Last modified: 2 Apr 2007 1:21:36 UTC

                                                                                      Updates in 5.59
                                                                                      I think this is the last update. Everything ran pretty smoothly in 5.58. This just has
                                                                                      some small updates in the science, to get back some useful scores for each decoy and
                                                                                      a small set of fixes for the symmetric FOLD_AND_DOCK workunits.

                                                                                      ____________

                                                                                      Profile ashriel

                                                                                      Joined: Mar 3 07
                                                                                      Posts: 11
                                                                                      ID: 2714
                                                                                      Credit: 648
                                                                                      RAC: 0
                                                                                      Message 2982 - Posted 2 Apr 2007 14:05:59 UTC

                                                                                        Last modified: 2 Apr 2007 14:58:13 UTC

                                                                                        5.59, default: 1 hour, WU s029__BOINC_SYMM_FOLD_AND_DOCK_RELAX-s029_-truncate_hom014__1906_17, Win2000

                                                                                        Time: 06 Minutes - Percentage: 10 - Time left: 4h 16m
                                                                                        Time: 30 Minutes - Percentage: 50 - Time left: 1h 33m
                                                                                        Time: 50 Minutes - Percentage: 83 - Time left: 0h 17m
                                                                                        Time: 60 Minutes - Percentage: 86 - Time left: 0h 15m (Model 1, Step 67622)
                                                                                        Time: 75 Minutes - Percentage: 88 - Time left: 0h 15m (Model 1, Step 67717)
                                                                                        Time: 80 Minutes - Percentage:100 - Time left: - (Model ?, Step ?)

                                                                                        a) The remaining time is strange - it was mostly ok in 5.57/5.58.
                                                                                        b) The steps are very slow (sorry, started to watch them after 60 minutes only)
                                                                                        c) Model 1 takes very long

                                                                                        Profile ashriel

                                                                                        Joined: Mar 3 07
                                                                                        Posts: 11
                                                                                        ID: 2714
                                                                                        Credit: 648
                                                                                        RAC: 0
                                                                                        Message 2983 - Posted 2 Apr 2007 14:06:00 UTC

                                                                                          Last modified: 2 Apr 2007 14:55:30 UTC

                                                                                          5.59, default: 1 hour, WU 1fkaA_BOINC_INCREASECYCLES10_RNA_ABINITIO-1fkaA-chunk005__1901_4, Win2000

                                                                                          Time: 15 Minutes - Percentage: 25 - Time left: 2h 55m (Model 1, Step 271.000)
                                                                                          Time: 40 Minutes - Percentage: 67 - Time left: 0h 45m (Model 2, Step 235.000)
                                                                                          Time: 50 Minutes - Percentage: 83 - Time left: 0h 17m (Model 2, Step 409.000)
                                                                                          Time: 55 Minutes - Percentage:100 - Time left: - (Model ?, Step ?)

                                                                                          <1h and more than 1 model, but the remaining time is strange
                                                                                          ____________

                                                                                          Profile feet1st

                                                                                          Joined: Mar 7 06
                                                                                          Posts: 312
                                                                                          ID: 1028
                                                                                          Credit: 110,522
                                                                                          RAC: 0
                                                                                          Message 2985 - Posted 2 Apr 2007 17:02:01 UTC

                                                                                            Last modified: 2 Apr 2007 17:32:10 UTC

                                                                                            Maion, I believe your time remaining is working just the way Rhiju intended it to. Once the remaining time estimate gets below 10 min, time starts moving slower. This is to avoid exceeding 100%. So, basically, once you get below a 10 minute estimated time remaining, the estimate is not on track anymore. The client is unsure exactly when it will finish, but in each case the 15 and 17 minute estimates were not far from right.

                                                                                            ...But Rhiju assures us they won't be sending WUs which take more than an hour per model on Rosetta. And so on Rosetta, with shorter WUs, the estimates should appear better. The 1hr time preference is always going to be the toughest to provide a good estimate on, as it is the time preference that will see the most variation (in percentage terms) between the actual time and the preference.
                                                                                            ____________

                                                                                            Rhiju
                                                                                            Forum moderator
                                                                                            Project developer
                                                                                            Project scientist

                                                                                            Joined: Feb 14 06
                                                                                            Posts: 161
                                                                                            ID: 4
                                                                                            Credit: 3,725
                                                                                            RAC: 0
                                                                                            Message 2986 - Posted 2 Apr 2007 18:22:50 UTC - in response to Message 2985.

                                                                                              Thanks, Feet1st, that's a great explanation. We indeed try to keep the avg time per model at less than one hour; actually our ralph runs help us calibrate this!

                                                                                              Maion, I believe your time remaining is working just the way Rhiju intended it to. Once the remaining time estimate gets below 10 min, time starts moving slower. This is to avoid exceeding 100%. So, basically, once you get below a 10 minute estimated time remaining, the estimate is not on track anymore. The client is unsure exactly when it will finish, but in each case the 15 and 17 minute estimates were not far from right.

                                                                                              ...But Rhiju assures us they won't be sending WUs which take more than an hour per model on Rosetta. And so on Rosetta, with shorter WUs, the estimates should appear better. The 1hr time preference is always going to be the toughest to provide a good estimate on, as it is the time preference that will see the most variation (in percentage terms) between the actual time and the preference.


                                                                                              ____________

                                                                                              Profile feet1st

                                                                                              Joined: Mar 7 06
                                                                                              Posts: 312
                                                                                              ID: 1028
                                                                                              Credit: 110,522
                                                                                              RAC: 0
                                                                                              Message 2987 - Posted 2 Apr 2007 18:37:16 UTC

                                                                                                Last modified: 2 Apr 2007 18:38:59 UTC

                                                                                                Still seems to end WUs prematurely. If you restart an RNA task that's already completed 30 models... then it will end, regardless of preferred runtime.

                                                                                                This is the additional "Completed 30 RNA decoys." message I've been mentioning.

                                                                                                Here's a v5.58 example.
                                                                                                ____________

                                                                                                Profile anders n

                                                                                                Joined: Feb 16 06
                                                                                                Posts: 166
                                                                                                ID: 91
                                                                                                Credit: 131,419
                                                                                                RAC: 0
                                                                                                Message 2988 - Posted 2 Apr 2007 18:41:52 UTC

                                                                                                  Last modified: 2 Apr 2007 18:42:22 UTC

                                                                                                  Can anyone explain the new text on MAC results?

                                                                                                  It looks like this

                                                                                                  Rosetta@home Macintosh Stack Size checker.
                                                                                                  Original size: 8388608.
                                                                                                  Maximum size: 0.
                                                                                                  RLIM_INFINITY 67108864

                                                                                                  Anders n
                                                                                                  ____________

                                                                                                  Profile feet1st

                                                                                                  Joined: Mar 7 06
                                                                                                  Posts: 312
                                                                                                  ID: 1028
                                                                                                  Credit: 110,522
                                                                                                  RAC: 0
                                                                                                  Message 2989 - Posted 2 Apr 2007 18:43:58 UTC

                                                                                                    I did confirm this morning that even when the % completed resets to zero, when you restart the task it does seem to know to move time ahead quicker. I had some 10 or 12 hrs into a task on my 24hr preference, and every 5 second tick it was subtracting 15 seconds from the estimated time remaining. So, even though the estimate went to 24+10hrs, if you study it for a minute you can see that it knows better than that. This must be due to the BOINC correction factor applied to the % completed and the current CPU time into the task.
                                                                                                    ____________

                                                                                                    Profile feet1st

                                                                                                    Joined: Mar 7 06
                                                                                                    Posts: 312
                                                                                                    ID: 1028
                                                                                                    Credit: 110,522
                                                                                                    RAC: 0
                                                                                                    Message 2990 - Posted 2 Apr 2007 19:06:56 UTC - in response to Message 2988.

                                                                                                      Last modified: 2 Apr 2007 19:08:21 UTC

                                                                                                      Can anyone explain the new text on MAC results?

                                                                                                      It looks like this

                                                                                                      Rosetta@home Macintosh Stack Size checker.
                                                                                                      Original size: 8388608.
                                                                                                      Maximum size: 0.
                                                                                                      RLIM_INFINITY 67108864

                                                                                                      Anders n


                                                                                                      Rhiju explained:
                                                                                                      ...I'm also reporting the stack sizes in stderr.txt which is returned from your clients to our server, so I can get some info.


                                                                                                      I think he was trying to determine if stack size had any correlation to Mac failures.
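For anyone curious what such a stack size check might look like, here is a minimal sketch. It is not the Rosetta code (that is a compiled app); it only assumes the checker reads the process stack limits through the standard Unix `getrlimit` call, shown here via Python's `resource` module. The helper names `stack_limits` and `describe` are made up for illustration, and the printed labels do not try to match the stderr text quoted above.

```python
import resource  # Unix-only standard library module


def stack_limits():
    """Return the (soft, hard) stack size limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_STACK)


def describe(limit):
    """Render a limit as text, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print("Soft stack limit:", describe(soft))
    print("Hard stack limit:", describe(hard))
```

The soft limit is what the process actually runs with; the hard limit is the ceiling it may raise the soft limit to.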
                                                                                                      ____________

                                                                                                      Profile anders n

                                                                                                      Joined: Feb 16 06
                                                                                                      Posts: 166
                                                                                                      ID: 91
                                                                                                      Credit: 131,419
                                                                                                      RAC: 0
                                                                                                      Message 2991 - Posted 2 Apr 2007 19:15:00 UTC

                                                                                                        @feet1st thanks :)
                                                                                                        ____________

                                                                                                        Profile ashriel

                                                                                                        Joined: Mar 3 07
                                                                                                        Posts: 11
                                                                                                        ID: 2714
                                                                                                        Credit: 648
                                                                                                        RAC: 0
                                                                                                        Message 2992 - Posted 2 Apr 2007 21:09:21 UTC

                                                                                                          Last modified: 2 Apr 2007 21:10:26 UTC

                                                                                                          Because I don't really know what information could be helpful, I post it in this much detail.
                                                                                                          But I suspect it is of no real help.
                                                                                                          ____________

                                                                                                          Profile feet1st

                                                                                                          Joined: Mar 7 06
                                                                                                          Posts: 312
                                                                                                          ID: 1028
                                                                                                          Credit: 110,522
                                                                                                          RAC: 0
                                                                                                          Message 2993 - Posted 2 Apr 2007 21:19:54 UTC

                                                                                                            Rhiju, on this issue of the % completed resetting when a task is restarted after being kicked out of memory...

                                                                                                            I'm puzzled. Before the progress % changes, a restart would not have impacted the calculations. Why does it now? I mean, it seems Rosetta used to know the correct total CPU time spent so far when it recomputed progress at the end of each model. So... where did it get that number? ...and isn't THAT the number to use now, rather than the one that resets upon restart?
                                                                                                            ____________

                                                                                                            Profile UBT - Terry

                                                                                                            Joined: Nov 13 06
                                                                                                            Posts: 2
                                                                                                            ID: 2219
                                                                                                            Credit: 68,467
                                                                                                            RAC: 0
                                                                                                            Message 2997 - Posted 6 Apr 2007 12:27:56 UTC

                                                                                                              Got this error message for this WU: 06/04/2007 11:15:05|ralph@home|Reason: Unrecoverable error for result 1mhk__BOINC_RNA_ABINITIO-1mhk_-_1918_27_1 ( - exit code -1073741819 (0xc0000005))
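The two numbers in that message are the same 32-bit value written two ways: -1073741819 is the signed reading and 0xc0000005 the unsigned hex reading, which on Windows is the access violation status code. A small sketch of the conversion (the function name `to_unsigned_hex` is just illustrative):

```python
def to_unsigned_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x{:08x}".format(exit_code & 0xFFFFFFFF)


if __name__ == "__main__":
    print(to_unsigned_hex(-1073741819))  # prints 0xc0000005
```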

                                                                                                              ____________

                                                                                                              Profile Conan

                                                                                                              Joined: Feb 16 06
                                                                                                              Posts: 344
                                                                                                              ID: 145
                                                                                                              Credit: 1,309,534
                                                                                                              RAC: 0
                                                                                                              Message 2998 - Posted 6 Apr 2007 15:50:08 UTC

                                                                                                                Had 2 WUs fail with

                                                                                                                stderr out
                                                                                                                <core_client_version>5.8.15</core_client_version>
                                                                                                                <![CDATA[
                                                                                                                <message>
                                                                                                                Incorrect function. (0x1) - exit code 1 (0x1)
                                                                                                                </message>
                                                                                                                <stderr_txt>
                                                                                                                # cpu_run_time_pref: 21600
                                                                                                                # random seed: 2693807
                                                                                                                ERROR:: Exit at: .\loop_relax.cc line:1688

                                                                                                                </stderr_txt>
                                                                                                                ]]>

                                                                                                                http://ralph.bakerlab.org/result.php?resultid=486602
                                                                                                                http://ralph.bakerlab.org/result.php?resultid=486603

                                                                                                                Both only ran for 16 minutes, BAK workunit type.
                                                                                                                ____________

                                                                                                                Profile UBT - Terry
                                                                                                                Avatar

                                                                                                                Joined: Nov 13 06
                                                                                                                Posts: 2
                                                                                                                ID: 2219
                                                                                                                Credit: 68,467
                                                                                                                RAC: 0
                                                                                                                Message 2999 - Posted 6 Apr 2007 18:27:03 UTC

                                                                                                                  Last modified: 6 Apr 2007 18:32:41 UTC

                                                                                                                  I've also had a couple like this one: 06/04/2007 19:19:53|ralph@home|Computation for task te00_1_NMRREF_1_te00_1_idid_model_06_core_0001IGNORE_THE_REST_idl_1917_44_0 finished
                                                                                                                  They jump from 53% or thereabouts up to 100%, finishing in only 38 mins???
                                                                                                                  Not sure if this is a bug or if it's meant to do that.
                                                                                                                  I'm running at 1.86 GHz using BOINC 5.8.15 on Win XP.
                                                                                                                  ____________

                                                                                                                  Profile feet1st

                                                                                                                  Joined: Mar 7 06
                                                                                                                  Posts: 312
                                                                                                                  ID: 1028
                                                                                                                  Credit: 110,522
                                                                                                                  RAC: 0
                                                                                                                  Message 3000 - Posted 6 Apr 2007 22:59:18 UTC

                                                                                                                    Terry, looks like you have a 1 hour runtime preference??

                                                                                                                    You are completing the first model in something over 30 minutes, and so your % complete shows the fraction, say 35min/60min preference = 58% complete... and then it hits the end of the model and determines that you don't have time to start a second one, so it completes the task.

                                                                                                                    In short, the estimate doesn't predict if you will cut out early, and until you complete model 1, it really doesn't have any way to know if you are likely to or not.
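To make the arithmetic concrete, here is a rough sketch of the behaviour described above. It is not the actual Rosetta logic: the function names and the rule "start another model only if the average model time so far still fits in the preference" are assumptions used for illustration.

```python
def percent_complete(elapsed_s, runtime_pref_s):
    """Share of the runtime preference used so far, capped at 100%."""
    return min(100.0, 100.0 * elapsed_s / runtime_pref_s)


def should_start_another_model(elapsed_s, runtime_pref_s, models_done):
    """Assumed rule: begin a new model only if an average-length model still fits."""
    if models_done == 0:
        return True  # nothing measured yet, so the first model always starts
    avg_model_s = elapsed_s / models_done
    return elapsed_s + avg_model_s <= runtime_pref_s


if __name__ == "__main__":
    pref = 60 * 60      # 1 hour runtime preference
    elapsed = 35 * 60   # first model finished after 35 minutes
    print(round(percent_complete(elapsed, pref)))        # prints 58
    print(should_start_another_model(elapsed, pref, 1))  # prints False
```

With a 35 minute first model and a 60 minute preference, progress reads about 58% right up to the end of the model, and then the task finishes because a second 35 minute model would not fit.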
                                                                                                                    ____________

                                                                                                                    Thomas Leibold

                                                                                                                    Joined: Feb 25 07
                                                                                                                    Posts: 27
                                                                                                                    ID: 2684
                                                                                                                    Credit: 77,464
                                                                                                                    RAC: 0
                                                                                                                    Message 3001 - Posted 7 Apr 2007 3:20:38 UTC - in response to Message 2998.

                                                                                                                      Last modified: 7 Apr 2007 3:22:09 UTC

                                                                                                                      Had 2 WU's fail ...


                                                                                                                      Got one of those too:

                                                                                                                      <core_client_version>5.8.15</core_client_version>
                                                                                                                      <![CDATA[
                                                                                                                      <message>
                                                                                                                      process exited with code 1 (0x1)
                                                                                                                      </message>
                                                                                                                      <stderr_txt>
                                                                                                                      Graphics are disabled due to configuration...
                                                                                                                      # cpu_run_time_pref: 14400
                                                                                                                      # random seed: 2693814
                                                                                                                      ERROR:: Exit at: loop_relax.cc line:1688

                                                                                                                      </stderr_txt>
                                                                                                                      ]]>

                                                                                                                      Workunit 430740 on Linux Server.

                                                                                                                      Thomas Leibold

                                                                                                                      Joined: Feb 25 07
                                                                                                                      Posts: 27
                                                                                                                      ID: 2684
                                                                                                                      Credit: 77,464
                                                                                                                      RAC: 0
                                                                                                                      Message 3002 - Posted 7 Apr 2007 3:29:13 UTC

                                                                                                                        Workunit 431258 had problems with downloading two of its parts:

                                                                                                                        Fri 06 Apr 2007 04:33:04 AM PDT|ralph@home|[file_xfer] Started download of file 1mhk_.fasta.gz
                                                                                                                        Fri 06 Apr 2007 04:33:04 AM PDT|ralph@home|[file_xfer] Started download of file 1mhk__1ffk.fragments.gz
                                                                                                                        Fri 06 Apr 2007 04:33:06 AM PDT|ralph@home|Incomplete read of 66.000000 < 5KB for 1mhk_.fasta.gz - truncating
                                                                                                                        Fri 06 Apr 2007 04:33:06 AM PDT|ralph@home|[file_xfer] Finished download of file 1mhk_.fasta.gz
                                                                                                                        Fri 06 Apr 2007 04:33:06 AM PDT|ralph@home|[file_xfer] Throughput 623 bytes/sec
                                                                                                                        Fri 06 Apr 2007 04:33:06 AM PDT|ralph@home|[file_xfer] Started download of file 1mhk_RNA.pdb.gz
                                                                                                                        Fri 06 Apr 2007 04:33:06 AM PDT|ralph@home|[error] Checksum or signature error for 1mhk_.fasta.gz
                                                                                                                        Fri 06 Apr 2007 04:33:12 AM PDT|ralph@home|[file_xfer] Finished download of file 1mhk_RNA.pdb.gz
                                                                                                                        Fri 06 Apr 2007 04:33:12 AM PDT|ralph@home|[file_xfer] Throughput 3265 bytes/sec
                                                                                                                        Fri 06 Apr 2007 04:33:12 AM PDT|ralph@home|[file_xfer] Started download of file 1mhk__pairing.pdat.gz
                                                                                                                        Fri 06 Apr 2007 04:33:12 AM PDT|ralph@home|[error] MD5 check failed for 1mhk_RNA.pdb.gz
                                                                                                                        Fri 06 Apr 2007 04:33:12 AM PDT|ralph@home|[error] expected 43fa6b24e2ed0b12d7d949aaa6952085, got 398a6a6e30c8d9493c75a549173bcd93
                                                                                                                        Fri 06 Apr 2007 04:33:12 AM PDT|ralph@home|[error] Checksum or signature error for 1mhk_RNA.pdb.gz
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[file_xfer] Finished download of file 1mhk__1ffk.fragments.gz
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[file_xfer] Throughput 159807 bytes/sec
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[file_xfer] Finished download of file 1mhk__pairing.pdat.gz
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[file_xfer] Throughput 162 bytes/sec
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[error] Checksum or signature error for 1mhk__1ffk.fragments.gz
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[error] MD5 check failed for 1mhk__pairing.pdat.gz
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[error] expected 6a8599df2728416df250dcde0449ece6, got 4b92756f68af0bf0c557fdb008fb878c
                                                                                                                        Fri 06 Apr 2007 04:33:13 AM PDT|ralph@home|[error] Checksum or signature error for 1mhk__pairing.pdat.gz
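
For readers unfamiliar with these messages: an "MD5 check failed ... expected X, got Y" line means the digest computed from the downloaded bytes differs from the digest listed for that file. A minimal Python sketch of that kind of comparison follows. It is not the BOINC client's code, and the function names are hypothetical.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the MD5 digest of a file as a lowercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large input files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected_md5):
    """Compare the file's digest with the expected one.

    Returns (ok, actual_digest) so a caller can log both values on failure.
    """
    actual = md5_hex(path)
    return actual == expected_md5.lower(), actual
```

A truncated or corrupted download, like the 66-byte read reported for 1mhk_.fasta.gz above, produces a different digest, so a check of this kind fails.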


                                                                                                                        <core_client_version>5.8.15</core_client_version>
                                                                                                                        <![CDATA[
                                                                                                                        <message>
                                                                                                                        WU download error: couldn't get input files:
                                                                                                                        <file_xfer_error>
                                                                                                                        <file_name>1mhk_.fasta.gz</file_name>
                                                                                                                        <error_code>-200</error_code>
                                                                                                                        </file_xfer_error>

                                                                                                                        </message>
                                                                                                                        ]]>

                                                                                                                        Profile Inais

                                                                                                                        Joined: Jul 30 06
                                                                                                                        Posts: 12
                                                                                                                        ID: 1634
                                                                                                                        Credit: 13,115
                                                                                                                        RAC: 0
                                                                                                                        Message 3003 - Posted 10 Apr 2007 7:30:08 UTC

                                                                                                                          Same problem on 4 WU's

                                                                                                                          491751 431342 10 Apr 2007 6:23:00 UTC 10 Apr 2007 6:35:14 UTC Over Client error Downloading 0.00 0.00 ---
                                                                                                                          491750 431340 10 Apr 2007 6:23:00 UTC 10 Apr 2007 6:35:14 UTC Over Client error Downloading 0.00 0.00 ---
                                                                                                                          491745 431276 10 Apr 2007 6:18:50 UTC 10 Apr 2007 6:23:00 UTC Over Client error Downloading 0.00 0.00 ---
                                                                                                                          491743 431275 10 Apr 2007 6:18:50 UTC 10 Apr 2007 6:23:00 UTC Over Client error Downloading 0.00 0.00 ---

                                                                                                                          ____________
                                                                                                                          I wish I can fly like a bird in the sky

                                                                                                                          Profile Conan

                                                                                                                          Joined: Feb 16 06
                                                                                                                          Posts: 344
                                                                                                                          ID: 145
                                                                                                                          Credit: 1,309,534
                                                                                                                          RAC: 0
                                                                                                                          Message 3005 - Posted 17 Apr 2007 5:21:02 UTC

                                                                                                                            Last modified: 17 Apr 2007 5:23:50 UTC

                                                                                                                            >> My first 2 WU's returned from the current run have both errored out with the same error on an AMD Opteron 285 running Linux Fedora Core 6.

                                                                                                                            http://ralph.bakerlab.org/result.php?resultid=493286
                                                                                                                            http://ralph.bakerlab.org/result.php?resultid=493287

                                                                                                                            stderr out

                                                                                                                            <core_client_version>5.8.11</core_client_version>
                                                                                                                            <![CDATA[
                                                                                                                            <stderr_txt>
                                                                                                                            Graphics are disabled due to configuration...
                                                                                                                            # cpu_run_time_pref: 21600
                                                                                                                            # random seed: 2687106
                                                                                                                            ======================================================
                                                                                                                            DONE :: 1 starting structures built 5 (nstruct) times
                                                                                                                            This process generated 0 decoys from 0 attempts
                                                                                                                            ======================================================


                                                                                                                            BOINC :: Watchdog shutting down...
                                                                                                                            BOINC :: BOINC support services shutting down...

                                                                                                                            </stderr_txt>
                                                                                                                            <message>
                                                                                                                            <file_xfer_error>
                                                                                                                            <file_name>BAK_1e0s_GTPase_1945_6_0_0</file_name>
                                                                                                                            <error_code>-161</error_code>
                                                                                                                            </file_xfer_error>

                                                                                                                            </message>
                                                                                                                            ]]>

                                                                                                                            Validate state Invalid
                                                                                                                            Claimed credit 87.5107890336075
                                                                                                                            Granted credit 0
                                                                                                                            application version 5.59


                                                                                                                            ____________

                                                                                                                            Profile Conan

                                                                                                                            Joined: Feb 16 06
                                                                                                                            Posts: 344
                                                                                                                            ID: 145
                                                                                                                            Credit: 1,309,534
                                                                                                                            RAC: 0
                                                                                                                            Message 3006 - Posted 22 Apr 2007 2:55:23 UTC

                                                                                                                              Last modified: 22 Apr 2007 2:56:21 UTC

                                                                                                                              >> 2 errors with the JK_DOCKFUNNEL workunits

                                                                                                                              http://ralph.bakerlab.org/result.php?resultid=496387
                                                                                                                              http://ralph.bakerlab.org/result.php?resultid=496388

                                                                                                                              stderr out

                                                                                                                              <core_client_version>5.8.15</core_client_version>
                                                                                                                              <![CDATA[
                                                                                                                              <message>
                                                                                                                              Incorrect function. (0x1) - exit code 1 (0x1)
                                                                                                                              </message>
                                                                                                                              <stderr_txt>
                                                                                                                              # cpu_run_time_pref: 21600
                                                                                                                              ERROR:: Exit at: .\input_pdb.cc line:2998



                                                                                                                              ____________

                                                                                                                              Thomas Leibold

                                                                                                                              Joined: Feb 25 07
                                                                                                                              Posts: 27
                                                                                                                              ID: 2684
                                                                                                                              Credit: 77,464
                                                                                                                              RAC: 0
                                                                                                                              Message 3007 - Posted 22 Apr 2007 18:38:15 UTC

                                                                                                                                Last modified: 22 Apr 2007 18:38:52 UTC

                                                                                                                                Compute error on workunit 439946 (JK_DOCKFUNNEL_DE_NOVO_INTERFACE_1958_1):


                                                                                                                                <core_client_version>5.8.15</core_client_version>
                                                                                                                                <![CDATA[
                                                                                                                                <message>
                                                                                                                                process exited with code 1 (0x1)
                                                                                                                                </message>
                                                                                                                                <stderr_txt>
                                                                                                                                Graphics are disabled due to configuration...
                                                                                                                                ERROR:: Exit at: input_pdb.cc line:2998
                                                                                                                                # cpu_run_time_pref: 14400

                                                                                                                                </stderr_txt>
                                                                                                                                ]]>


                                                                                                                                Has failed for others as well, not just on my system.

                                                                                                                                Profile Trog Dog

                                                                                                                                Joined: Aug 8 06
                                                                                                                                Posts: 38
                                                                                                                                ID: 1670
                                                                                                                                Credit: 41,996
                                                                                                                                RAC: 0
                                                                                                                                Message 3009 - Posted 23 Apr 2007 13:37:24 UTC

                                                                                                                                  Seems there is a problem with this WU.
                                                                                                                                  ____________

                                                                                                                                  Pepo

                                                                                                                                  Joined: Sep 8 06
                                                                                                                                  Posts: 104
                                                                                                                                  ID: 1812
                                                                                                                                  Credit: 36,890
                                                                                                                                  RAC: 0
                                                                                                                                  Message 3010 - Posted 23 Apr 2007 21:21:59 UTC - in response to Message 3009.

                                                                                                                                    Last modified: 23 Apr 2007 21:24:00 UTC

                                                                                                                                    Seems there is a problem with this WU.

                                                                                                                                    It's WU BAK_4hvp_sym_model1d5w_FixJ_loop_model_1961_6:

                                                                                                                                    (on Linux)
                                                                                                                                    <message>
                                                                                                                                    process exited with code 1 (0x1)
                                                                                                                                    </message>
                                                                                                                                    <stderr_txt>
                                                                                                                                    ERROR:: Exit at: pose.cc line:761
                                                                                                                                    </stderr_txt>


                                                                                                                                    (on Windows)
                                                                                                                                    <message>
                                                                                                                                    Mindestens ein Argument ist ungültig. (0x80000003) - exit code -2147483645 (0x80000003)
                                                                                                                                    </message>
                                                                                                                                    <stderr_txt>
                                                                                                                                    Invalid parameter detected in function (null). File: (null) Line: 0 Expression: (null)
                                                                                                                                    Unhandled Exception Detected...
                                                                                                                                    - Unhandled Exception Record -
                                                                                                                                    Reason: Breakpoint Encountered (0x80000003) at address 0x7C911230


                                                                                                                                    ---

                                                                                                                                    The same is happening with BAK_4hvp_sym_model1d5w_FixJ_loop_model_1961_7.

                                                                                                                                    Peter


                                                                                                                                    Copyright © 2017 University of Washington

                                                                                                                                    Last Modified: 20 Nov 2008 19:41:56 UTC