RALPH@home

Bug reports for Ralph 5.05 and higher


Rhiju
Forum moderator
Project developer
Project scientist

Joined: Feb 14 06
Posts: 161
ID: 4
Credit: 3,725
RAC: 0
Message 1367 - Posted 26 Apr 2006 5:55:54 UTC

    Last modified: 27 Apr 2006 3:51:57 UTC

    Really, we think this is the last one before Rosetta@home is updated! This is mainly to fix a silly, small bug that got introduced with the latest checkpointing.

    For those interested, the watchdog is still not exiting gracefully every time -- if no data was created, there's still a file transfer error. We're trying to figure out why, but will likely need help from the BOINC team to fix it. Fortunately, the jobs that give these errors are rare -- and produce no data anyway. Of course, we will continue to grant credit every week for errored jobs when the app gets updated on Rosetta@home.
    ____________

    tralala

    Joined: Apr 12 06
    Posts: 52
    ID: 1266
    Credit: 15,257
    RAC: 0
    Message 1368 - Posted 26 Apr 2006 7:45:06 UTC - in response to Message 1367.

      If I try to fetch work I get a "Project is down" message.
      ____________

      tralala

      Joined: Apr 12 06
      Posts: 52
      ID: 1266
      Credit: 15,257
      RAC: 0
      Message 1369 - Posted 26 Apr 2006 8:29:55 UTC

        Now it's:

        26/04/2006 10:48:22|ralph@home|Message from server: Server has software problem
        26/04/2006 10:48:22|ralph@home|Project is down

        ____________

        Carlos_Pfitzner

        Joined: Feb 16 06
        Posts: 182
        ID: 296
        Credit: 22,792
        RAC: 0
        Message 1375 - Posted 26 Apr 2006 11:38:41 UTC

          Alpha testers: abort any 5.04 WUs that may be sitting in your cache/queue,
          so you can start testing 5.05 ASAP -:)

          rbpeake

          Joined: Feb 16 06
          Posts: 19
          ID: 218
          Credit: 3,370
          RAC: 0
          Message 1376 - Posted 26 Apr 2006 11:41:15 UTC - in response to Message 1367.

            Really, we think this is the last one before Rosetta@home is updated!

            Hey, take as much time as you need! You are so close now, might as well wrap it up in style with a bulletproof application! :)
            ____________

            tralala

            Joined: Apr 12 06
            Posts: 52
            ID: 1266
            Credit: 15,257
            RAC: 0
            Message 1377 - Posted 26 Apr 2006 13:04:09 UTC

              Last modified: 26 Apr 2006 13:05:03 UTC

              Both WUs I tried finished as valid, but both results show a warning:

              WARNING! error deleting file .\aah002.out

              However, no such file is present on my computer any longer.

              http://ralph.bakerlab.org/results.php?userid=1266
              ____________

              feet1st

              Joined: Mar 7 06
              Posts: 312
              ID: 1028
              Credit: 110,522
              RAC: 0
              Message 1379 - Posted 26 Apr 2006 14:11:32 UTC

                WUs failed in under 1 minute... and I'll tell you why...

                I was playing around suspending R@H WUs, trying to prevent downloads of more, getting Ralph to fetch some new WUs, killing those of the previous version, etc. Suffice it to say I had about 8 WUs suspended and left in memory. This caused Windows to extend its paging file, and two other WUs failed immediately; the failures attempted to bring in debug code, which further increased the memory requirements.

                97682
                97672
                97670

                Here's the msg you see in Windows:

                ____________

                tralala

                Joined: Apr 12 06
                Posts: 52
                ID: 1266
                Credit: 15,257
                RAC: 0
                Message 1382 - Posted 26 Apr 2006 17:07:23 UTC - in response to Message 1379.

                  Last modified: 26 Apr 2006 17:48:19 UTC


                  The watchdog aborted this although overall runtime was only a couple of minutes. After a few minutes of runtime it was preempted by another WU for a few hours, and after resuming, the watchdog probably assumed it had run for over an hour with no progress. It seems the watchdog only compares two points in time without checking what happened in between.

                  04/26/06 18:59:19||Rescheduling CPU: application exited
                  04/26/06 18:59:19|ralph@home|Computation for task AB_CASP6_u272__444_4_0 finished


                  335.453125
                  stderr out

                  <core_client_version>5.4.6</core_client_version>
                  <stderr_txt>
                  # cpu_run_time_pref: 10800
                  # random seed: 3882530
                  **********************************************************************
                  Rosetta score is stuck or going too long. Watchdog is killing the run!
                  Stuck at score 33.7964 for 3600 seconds
                  **********************************************************************
                  GZIP SILENT FILE: .\xxu272.out
                  WARNING! error deleting file .\xxu272.out

                  </stderr_txt>
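
The difference between comparing two wall-clock timestamps and counting only the time the task actually ran can be sketched as below. This is an illustration of the hypothesis in this post, not Rosetta's actual watchdog code: the class name `StallWatchdog`, the function `wall_clock_stalled`, and the idea of feeding in accumulated CPU seconds are all assumptions made for the example. Only the 3600-second limit and the score value come from the stderr output above.

```python
from dataclasses import dataclass
from typing import Optional


def wall_clock_stalled(last_change_wall: float, now_wall: float,
                       limit: float = 3600.0) -> bool:
    """Naive check: compares two wall-clock timestamps only.

    Time spent preempted (suspended, left in memory) counts as 'no progress'.
    """
    return now_wall - last_change_wall >= limit


@dataclass
class StallWatchdog:
    """Hypothetical watchdog that measures stalls in CPU seconds.

    CPU time does not advance while the task is preempted, so a long
    pause cannot trigger a false 'stuck' report.
    """
    limit: float = 3600.0
    last_score: Optional[float] = None
    cpu_at_last_change: float = 0.0

    def check(self, score: float, cpu_seconds: float) -> bool:
        """Return True if the score has been unchanged for `limit` CPU seconds."""
        if self.last_score is None or score != self.last_score:
            # Score moved (or first sample): restart the stall timer.
            self.last_score = score
            self.cpu_at_last_change = cpu_seconds
            return False
        return cpu_seconds - self.cpu_at_last_change >= self.limit


# A task runs briefly, is preempted for three hours, then resumes.
wd = StallWatchdog(limit=3600.0)
first = wd.check(33.7964, cpu_seconds=100.0)          # first sample
after_resume = wd.check(33.7964, cpu_seconds=160.0)   # only 60 CPU s later
naive = wall_clock_stalled(100.0, 100.0 + 10800.0 + 60.0)
```

In this scenario the naive wall-clock check reports a stall while the CPU-time check does not, which matches the behaviour described for a task that was paused and kept in memory.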

                  ____________

                  Carlos_Pfitzner

                  Joined: Feb 16 06
                  Posts: 182
                  ID: 296
                  Credit: 22,792
                  RAC: 0
                  Message 1385 - Posted 26 Apr 2006 19:28:16 UTC

                    Linux, 256 MB RAM, plenty of swap space: WU froze at 100% done!

                    Wed Apr 26 16:41:36 BRT 2006
                    crobertp [/home/boinc/BOINC] > cat stdoutdae.txt | grep CASP6
                    2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
                    2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
                    2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
                    2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)
                    2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505
                    Wed Apr 26 16:44:41 BRT 2006

                    CPU usage 0.0000%

                    What should I do next?

                    Perhaps kill the app with some special signal to force a core dump,
                    and then e-mail that core dump to ???
                    Thanks
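
For anyone who wants to check how long a task sat paused between "Pausing result" and "Resuming result" entries in a log like the one above, here is a minimal sketch. The function name `paused_seconds` is made up for this example, and it assumes every line starts with a 19-character `YYYY-MM-DD HH:MM:SS` timestamp, as the grep output above does.

```python
from datetime import datetime


def paused_seconds(lines):
    """Sum the wall-clock time between each 'Pausing' and the next 'Resuming'."""
    total = 0.0
    paused_at = None
    for line in lines:
        # Assumption: the first 19 characters are the timestamp.
        stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        if "Pausing result" in line:
            paused_at = stamp
        elif "Resuming result" in line and paused_at is not None:
            total += (stamp - paused_at).total_seconds()
            paused_at = None
    return total


SAMPLE = [
    "2006-04-26 12:31:54 [ralph@home] Starting result FA_CASP6_v272__435_19_0 using rosetta_beta version 505",
    "2006-04-26 12:34:07 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)",
    "2006-04-26 12:48:20 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505",
    "2006-04-26 13:55:38 [ralph@home] Pausing result FA_CASP6_v272__435_19_0 (left in memory)",
    "2006-04-26 15:46:56 [ralph@home] Resuming result FA_CASP6_v272__435_19_0 using rosetta_beta version 505",
]

total_paused = paused_seconds(SAMPLE)
print(total_paused)  # 7531.0 (853 s + 6678 s)
```

For the log above this gives a little over two hours of paused time, which is the kind of gap a wall-clock-only stall check could mistake for a stuck run.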
                    ____________

                    Jose

                    Joined: Apr 25 06
                    Posts: 7
                    ID: 1317
                    Credit: 77
                    RAC: 0
                    Message 1386 - Posted 26 Apr 2006 20:35:09 UTC

                      This unit was aborted after less than one hour of running (my time preference is 2 hours).

                      http://ralph.bakerlab.org/result.php?resultid=97305

                      AB_CASP6_t216__438_3_0

                      Workunit 86138

                      CPU time 3180.21875
                      stderr out <core_client_version>5.2.13</core_client_version>
                      <stderr_txt>
                      # random seed: 3882811
                      # cpu_run_time_pref: 7200
                      **********************************************************************
                      Rosetta score is stuck or going too long. Watchdog is killing the run!
                      Stuck at score 71.0875 for 3600 seconds
                      **********************************************************************
                      GZIP SILENT FILE: .\xxt216.out
                      WARNING! attempt to gzip file .\xxt216.out failed: file does not exist.

                      </stderr_txt>
                      <message><file_xfer_error>
                      <file_name>AB_CASP6_t216__438_3_0_0</file_name>
                      <error_code>-161</error_code>
                      <error_message></error_message>
                      </file_xfer_error>

                      </message>
                      Validate state Invalid
                      Claimed credit 11.1502124855248
                      Granted credit 0
                      application version 5.05

                      ____________

                      feet1st

                      Joined: Mar 7 06
                      Posts: 312
                      ID: 1028
                      Credit: 110,522
                      RAC: 0
                      Message 1391 - Posted 27 Apr 2006 2:34:12 UTC

                        Just got home to see how my PC did, I think the watchdog ralphed all over my work. This is on my same host as I posted about earlier where I was short on VM swap space and lost 3 WUs. Problems all over, watchdog kicking in (I've never had any hung WUs before, so seems unlikely it was required), failing after <1 min. I did not abort any v5.05 WUs. Very few successes.

                        Now my Ralph WUs are completed, and I'm crunching R@H again. Got a WU with FAST in the name, the thing has ripped 607 models in under 14hrs (I have 24hr preference, the output file is 4.6MB now... might come close to 10 once we're done!). I didn't reboot or restart BOINC since the memory issues about 13hrs ago.
                        ____________

                        Rhiju
                        Forum moderator
                        Project developer
                        Project scientist

                        Joined: Feb 14 06
                        Posts: 161
                        ID: 4
                        Credit: 3,725
                        RAC: 0
                        Message 1392 - Posted 27 Apr 2006 3:47:55 UTC - in response to Message 1385.

                          This almost looks like a BOINC manager problem. Go ahead and abort it; then can you post a link to the result here? Thanks. No need to send us the core dump.


                          ____________

                          Rhiju
                          Forum moderator
                          Project developer
                          Project scientist

                          Joined: Feb 14 06
                          Posts: 161
                          ID: 4
                          Credit: 3,725
                          RAC: 0
                          Message 1393 - Posted 27 Apr 2006 3:50:11 UTC - in response to Message 1391.

                            Last modified: 27 Apr 2006 3:53:12 UTC

                            Feet1st, whoa, you've got a really fast client....

                            As for ralph, I share your concerns. I'm trying another fix in 5.06; can you attach to ralph now? If the watchdog is still too aggressive, we'll have to keep it off for rosetta@home until we avoid this error. Two questions for you:

                            How often do you switch between apps (if at all)?
                            When ralph is pre-empted, do you "Keep in Memory"?

                            My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.



                            ____________

                            simpe73

                            Joined: Feb 20 06
                            Posts: 2
                            ID: 619
                            Credit: 36,752
                            RAC: 0
                            Message 1394 - Posted 27 Apr 2006 4:05:05 UTC

                              5.05 works fine, but... In every result I've checked there is "WARNING! error deleting file .\xxv272.out". The file name varies, but it is always an .out file.
                              ____________

                              [B^S] sTrey

                              Joined: Feb 15 06
                              Posts: 58
                              ID: 36
                              Credit: 15,430
                              RAC: 0
                              Message 1397 - Posted 27 Apr 2006 6:41:18 UTC

                                Last modified: 27 Apr 2006 6:52:38 UTC

                                WU 79004 was killed by the watchdog just after 2 hrs' runtime (Stuck at score -115.914 for 3600 seconds)
                                (Sorry, this was 5.05; I can't get any 5.06 WUs)

                                tralala

                                Joined: Apr 12 06
                                Posts: 52
                                ID: 1266
                                Credit: 15,257
                                RAC: 0
                                Message 1398 - Posted 27 Apr 2006 9:32:05 UTC

                                  I just finished one WU with several deliberate switches in between (long and short), and it seems 5.06 has solved the issue. The warning message about the file deletion error is also gone. :-)

                                  http://ralph.bakerlab.org/result.php?resultid=98261

                                  Now Rhiju please have a look at this and this.

                                  ____________

                                  Jose

                                  Joined: Apr 25 06
                                  Posts: 7
                                  ID: 1317
                                  Credit: 77
                                  RAC: 0
                                  Message 1400 - Posted 27 Apr 2006 12:20:19 UTC

                                    Okies I have been running the following RALPH Work Unit:
                                    ID 6204
                                    Name AB_CASP6_t198__438_5_0

                                    It ran for around 57 minutes and was then preempted (with the CPU time kept in my work record), and the corresponding Rosetta work unit restarted. Once the Rosetta work unit stopped, the client switched back to RALPH and work unit ID 6204 restarted. It started de novo, that is, from a CPU time of 0 and not from the 57+ minutes it had accumulated when it was preempted. My preferences are set so that work is kept in memory, but that did not happen in this case.

                                    So, to make the story short: the 57+ minutes of CPU time for the work unit that should have been stored in memory disappeared into the big void in the sky. :)

                                    ____________

                                    feet1st

                                    Joined: Mar 7 06
                                    Posts: 312
                                    ID: 1028
                                    Credit: 110,522
                                    RAC: 0
                                    Message 1401 - Posted 27 Apr 2006 14:07:52 UTC - in response to Message 1393.

                                      How often do you switch between apps (if at all)?
                                      When ralph is pre-empted, do you "Keep in Memory"?

                                      My prediction is that the clients that have been having trouble with the watchdog switch occasionally between apps, and keep in memory. That was the case for tralala below, and I've put in the fix for that case. Let me know.

                                      Sorry, I'm not positive. I changed my settings to actually try and stress Ralph a little bit, but I am not certain if they took effect on THAT PC or not; it depends on when it updated. I believe it had a 360 min (6hrs) switch between projects, and leave in memory, at the time of the failures. My other Ralph host was updated to have a 20min switch time, and remove from memory, and it seems to be going well, but it hasn't had other projects interrupting it either.

                                      ____________

                                      Rhiju
                                      Forum moderator
                                      Project developer
                                      Project scientist

                                      Joined: Feb 14 06
                                      Posts: 161
                                      ID: 4
                                      Credit: 3,725
                                      RAC: 0
                                      Message 1403 - Posted 27 Apr 2006 17:48:22 UTC - in response to Message 1401.

                                        Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.

                                        A few quick replies:

                                        I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.

                                        I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release.



                                        ____________

                                        Rhiju
                                        Forum moderator
                                        Project developer
                                        Project scientist

                                        Joined: Feb 14 06
                                        Posts: 161
                                        ID: 4
                                        Credit: 3,725
                                        RAC: 0
                                        Message 1404 - Posted 27 Apr 2006 17:53:19 UTC - in response to Message 1385.

                                          A quick reply to Carlos... it seems like all your ralph jobs have been erroring out. The error message we're seeing is something about a lost heartbeat from the core client. That doesn't sound good. Have you had this issue with any workunits from rosetta@home?

                                          Also, do you have a new version of the BOINC app? Thanks.


                                          ____________

                                          tralala

                                          Joined: Apr 12 06
                                          Posts: 52
                                          ID: 1266
                                          Credit: 15,257
                                          RAC: 0
                                          Message 1405 - Posted 27 Apr 2006 17:59:06 UTC - in response to Message 1403.

                                            Last modified: 27 Apr 2006 18:05:14 UTC

                                            Thanks for all the advice. I think we've largely killed the watchdog timer problem and are ready to release. (Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?) We haven't seen any evidence for jobs being aborted prematurely by the watchdog, except for the tests where we forced an infinite loop.


                                            So you are going to release it on Rosetta today? Good luck! ;-)


                                            A few quick replies:

                                            I'll bring the debate about shorter/longer deadlines (or a mix) to the attention of the other project scientists.

                                            I really do like Feet1st's idea to ask ralph users to lower the fraction of time their client spends on ralph. That will distribute the jobs to as many different cpus as possible. I can make a note of it on the news page next time we release.


                                            Asking is one thing; making sure jobs are distributed in the most useful manner is another. I really don't think one needs to rely on aware testers for that. Just lower the quota and shorten the deadlines and you get what you want. Probably a one-week deadline and a quota of 10 WUs is a first step and a compromise.

                                            You can even make the WU/day quota editable by the participants. At least I saw it editable in one project; I'm not sure if this is still possible with the latest BOINC version. If you can, I'd recommend setting the quota to 3/day and making it editable for those who want to test more than three WUs per day. That will prevent unaware users, who just join the project with their usual 3-day cache, from hijacking the WUs by loading 20 at once (and returning them after 10 days or so).

                                            ____________

                                            Rhiju
                                            Forum moderator
                                            Project developer
                                            Project scientist

                                            Joined: Feb 14 06
                                            Posts: 161
                                            ID: 4
                                            Credit: 3,725
                                            RAC: 0
                                            Message 1407 - Posted 27 Apr 2006 19:32:58 UTC - in response to Message 1405.

                                              Tralala, nice advice. I'm lowering the RALPH deadline from 14 days to 4 days. We still value results that come back after two or three days, but you're right that it's ridiculous to get back a job as ancient as two weeks old.

                                              At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference?




                                              ____________

                                              Carlos_Pfitzner

                                              Joined: Feb 16 06
                                              Posts: 182
                                              ID: 296
                                              Credit: 22,792
                                              RAC: 0
                                              Message 1408 - Posted 27 Apr 2006 19:53:22 UTC

                                                Last modified: 27 Apr 2006 20:04:35 UTC

                                                I don't think 5.06 is good for Linux; for Windows, however, 5.06 is OK.

                                                Maybe you can trap that signal 11 to make it exit with 0 but without a finished file?
                                                Then BOINC would restart those WUs ... and possibly finish OK.

                                                These signal 11s are caused by a timing problem ... not by an unallocated array.
                                                No heartbeat from core client for 31 sec - exiting
                                                *too much network traffic ! 127.0.0.1 unserviced!

                                                2006-04-27 11:58:14 [ralph@home] Finished download of 1tul__alltopologycodes.bar
                                                2006-04-27 11:58:14 [ralph@home] Throughput 21465 bytes/sec
                                                2006-04-27 11:58:15 [ralph@home] Starting result FACONTACTS_NOFILTERS_1tul__381_3_1 using rosetta_beta version 506
                                                2006-04-27 12:02:48 [ralph@home] Pausing result FACONTACTS_NOFILTERS_1tul__381_3_1 (left in memory)
                                                2006-04-27 12:02:49 [ralph@home] Unrecoverable error for result FACONTACTS_NOFILTERS_1tul__381_3_1 (process exited with code 131 (0x83))
                                                2006-04-27 12:02:49 [ralph@home] Computation for result FACONTACTS_NOFILTERS_1tul__381_3_1 finished
                                                2006-04-27 12:03:50 [ralph@home] Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
                                                2006-04-27 12:03:50 [ralph@home] Reason: To report results
                                                2006-04-27 12:03:50 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
                                                2006-04-27 12:04:00 [ralph@home] Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded
                                                2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_03_05.200_v1_3.gz
                                                2006-04-27 12:04:02 [ralph@home] Started download of casp6_aat216_09_05.200_v1_3.gz
                                                2006-04-27 12:07:47 [ralph@home] Finished download of casp6_aat216_03_05.200_v1_3.gz
                                                2006-04-27 12:07:47 [ralph@home] Throughput 17163 bytes/sec
                                                2006-04-27 12:07:47 [ralph@home] Started download of casp6_t216_.fasta.gz
                                                2006-04-27 12:07:48 [ralph@home] Finished download of casp6_t216_.fasta.gz
                                                2006-04-27 12:07:48 [ralph@home] Throughput 548 bytes/sec
                                                2006-04-27 12:07:48 [ralph@home] Started download of casp6_t216.pdb.gz
                                                2006-04-27 12:07:51 [ralph@home] Finished download of casp6_t216.pdb.gz
                                                2006-04-27 12:07:51 [ralph@home] Throughput 21820 bytes/sec
                                                2006-04-27 12:07:51 [ralph@home] Started download of casp6_t216_.psipred_ss2.gz
                                                2006-04-27 12:07:52 [ralph@home] Finished download of casp6_t216_.psipred_ss2.gz
                                                2006-04-27 12:07:52 [ralph@home] Throughput 8188 bytes/sec
                                                2006-04-27 12:11:04 [ralph@home] Finished download of casp6_aat216_09_05.200_v1_3.gz
                                                2006-04-27 12:11:04 [ralph@home] Throughput 26233 bytes/sec
                                                2006-04-27 12:11:06 [ralph@home] Starting result FA_CASP6_t216__451_30_0 using rosetta_beta version 506
                                                2006-04-27 12:12:41 [ralph@home] Pausing result FA_CASP6_t216__451_30_0 (left in memory)
                                                2006-04-27 12:12:42 [ralph@home] Unrecoverable error for result FA_CASP6_t216__451_30_0 (process exited with code 131 (0x83))
                                                2006-04-27 12:12:42 [ralph@home] Computation for result FA_CASP6_t216__451_30_0 finished
                                                2006-04-27 12:13:42 [ralph@home] Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
                                                2006-04-27 12:13:42 [ralph@home] Reason: To report results
                                                2006-04-27 12:13:42 [ralph@home] Requesting 0.864 seconds of new work, and reporting 1 results
                                                2006-04-27 12:13:48 [ralph@home] Scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi succeeded
                                                2006-04-27 12:13:48 [ralph@home] Message from server: No work sent
                                                2006-04-27 12:13:48 [ralph@home] Message from server: (reached daily quota of 6 results)
                                                http://ralph.bakerlab.org/result.php?resultid=98808
                                                http://ralph.bakerlab.org/result.php?resultid=98790
                                                http://ralph.bakerlab.org/result.php?resultid=98787
                                                http://ralph.bakerlab.org/result.php?resultid=98747
                                                http://ralph.bakerlab.org/result.php?resultid=98658
                                                http://ralph.bakerlab.org/result.php?resultid=98658
                                                http://ralph.bakerlab.org/result.php?resultid=98613


                                                And there is still the problem of WUs freezing at 100% done,
                                                and at other percentages too, without using any CPU. I asked here what to do to help fix the problem,

                                                but got no answer, so I aborted these WUs.


                                                ____________
                                                Click signature for global team stats

                                                tralala

                                                Joined: Apr 12 06
                                                Posts: 52
                                                ID: 1266
                                                Credit: 15,257
                                                RAC: 0
                                                Message 1409 - Posted 27 Apr 2006 19:58:56 UTC - in response to Message 1407.

                                                  Last modified: 27 Apr 2006 20:03:10 UTC


                                                  At least with the current BOINC system, I can't seem to set the max WU sent to a client per day. Can you post here which project allowed you to set that as a preference?


                                                  I think I saw that over at CPDN, but it's no longer settable there either. Perhaps I remembered it wrong, or perhaps it has been disabled in more recent BOINC releases.

                                                  I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches.
                                                  ____________

                                                  Profile Carlos_Pfitzner
                                                  Avatar

                                                  Joined: Feb 16 06
                                                  Posts: 182
                                                  ID: 296
                                                  Credit: 22,792
                                                  RAC: 0
                                                  Message 1410 - Posted 27 Apr 2006 20:16:47 UTC

                                                    Last modified: 27 Apr 2006 20:22:37 UTC

                                                    I'd still think about 10 WU/day is sufficient and this will further prevent people from building up big caches.


                                                    I usually abort all WUs of the previous version when I notice a new version.
                                                    I know not everyone does this ...

                                                    However, what is wrong is the BOINC concept of limiting WUs per day.

                                                    What should be limited is the "cache" of unreturned WUs ... maybe to 2.

                                                    Once a client exceeds the quota of 2, it does not get more WUs;
                                                    however, if it returns 1 it can download 1 more, even after the quota is exceeded.
                                                    *Oops, I forgot that a project reset does not return any WUs ...
                                                    So whoever does a project reset without aborting WUs first will have to wait until the next day.

                                                    ____________
                                                    Click signature for global team stats

                                                    [B^S] sTrey
                                                    Avatar

                                                    Joined: Feb 15 06
                                                    Posts: 58
                                                    ID: 36
                                                    Credit: 15,430
                                                    RAC: 0
                                                    Message 1411 - Posted 27 Apr 2006 20:29:43 UTC

                                                      4 days? Ouch, I'd hoped for 6 or 7, and definitely with smaller quotas.

                                                      seti beta has a painfully small return rate due to huge quotas. Shorter deadlines aren't as direct, well I guess they are here but if you're running a quorum of more than 1, short deadlines drag things out having to resend after earlier results time out...

                                                      Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects. With my current mix I can probably just make 4 days. Of course when you want really fast returns those hit-and-quit wu's you've been sending, do the job.

                                                      So I'm wondering is there little value for testing longer time settings here? Easy enough to drop back to 2 or 4 hour runtimes.

                                                      p.s.
                                                      If this discussion continues, maybe it's better moved out of the bug-report thread?

                                                      rbpeake

                                                      Joined: Feb 16 06
                                                      Posts: 19
                                                      ID: 218
                                                      Credit: 3,370
                                                      RAC: 0
                                                      Message 1412 - Posted 27 Apr 2006 20:36:14 UTC - in response to Message 1411.

                                                        Last modified: 27 Apr 2006 20:37:05 UTC

                                                        Meanwhile I've preferred to test with 16-hr runtimes, and I do run other projects....

                                                        So I'm wondering is there little value for testing longer time settings here? Easy enough to drop back to 2 or 4 hour runtimes.


                                                        I wonder this, too. Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc.

                                                        In this way we can more directly assist you in achieving your testing objectives.

                                                        Thanks!

                                                        ____________

                                                        Profile Carlos_Pfitzner
                                                        Avatar

                                                        Joined: Feb 16 06
                                                        Posts: 182
                                                        ID: 296
                                                        Credit: 22,792
                                                        RAC: 0
                                                        Message 1413 - Posted 27 Apr 2006 22:22:31 UTC

                                                          Maximum disk usage exceeded (Linux)
                                                          http://ralph.bakerlab.org/result.php?resultid=98187

                                                          Maybe it is difficult to wipe the previous version's files from disk
                                                          before sending out a new version to test?

                                                          Thanks
                                                          ____________
                                                          Click signature for global team stats

                                                          Yeti
                                                          Avatar

                                                          Joined: Feb 19 06
                                                          Posts: 30
                                                          ID: 581
                                                          Credit: 49,557
                                                          RAC: 0
                                                          Message 1414 - Posted 28 Apr 2006 1:00:46 UTC

                                                            Last modified: 28 Apr 2006 1:05:51 UTC

                                                            Back to possible bugs:

                                                            Rosetta 5.06

                                                            using 161 MB of memory, 542 MB of virtual memory

                                                            The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

                                                            I guess, it will never finish :-(

                                                            Oh, my setting for RALPH Target CPU time is 4 hours ...

                                                            This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911

                                                            This is the result: http://ralph.bakerlab.org/result.php?resultid=98748

                                                            Abort or stay a little bit longer ?
                                                            ____________


                                                            Supporting BOINC, a great concept !

                                                            Robert Everly

                                                            Joined: Feb 16 06
                                                            Posts: 10
                                                            ID: 276
                                                            Credit: 2,333
                                                            RAC: 0
                                                            Message 1415 - Posted 28 Apr 2006 1:28:05 UTC

                                                              Just my 2 cents worth here. I've said this on other project boards as well.

                                                              There should be two cache settings in each project.

                                                              1) Max WU/CPU/day (current cache)

                                                              2) Max outstanding WU/CPU.

                                                              I'd love to see #2 added. Personally I find it silly that some people and systems download hundreds of WUs only to return a portion of them. Just look at host 3755 on seti beta. Yes, the daily quota is down to 1 per day, but there were over 1000 outstanding WUs on the machine.

                                                              My thought for #2 would be this. The project defines how many outstanding WUs/CPU are acceptable. You can download up to this amount over any number of days with #1. Once you hit the limit in #2, the server refuses to send you more work until you return work.

                                                              Why keep sending work to hosts that are not returning work?
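
To make the two limits concrete, here is a minimal sketch of how a scheduler could check both before sending a WU. It is only an illustration of the idea in this post, not BOINC code: `HostState`, `send_work`, and `report_result` are hypothetical names, and the limits are tracked per host rather than per CPU to keep it short.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical per-host bookkeeping; not a real BOINC structure."""
    sent_today: int = 0    # WUs sent since the daily counter was last reset
    outstanding: int = 0   # WUs sent but not yet returned


def send_work(host, max_per_day, max_outstanding):
    """Send one WU only if both limits (#1 and #2) still leave room."""
    if host.sent_today >= max_per_day:
        return False       # limit #1: daily quota reached
    if host.outstanding >= max_outstanding:
        return False       # limit #2: too many unreturned WUs
    host.sent_today += 1
    host.outstanding += 1
    return True


def report_result(host):
    """Returning a result frees one slot under limit #2."""
    if host.outstanding > 0:
        host.outstanding -= 1
```

With `max_per_day=10` and `max_outstanding=2`, a host gets two WUs, is refused a third, and can fetch another as soon as it reports one back.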
                                                              ____________

                                                              casio7131

                                                              Joined: Mar 20 06
                                                              Posts: 15
                                                              ID: 1151
                                                              Credit: 12,660
                                                              RAC: 0
                                                              Message 1416 - Posted 28 Apr 2006 2:47:34 UTC

                                                                28/04/2006 10:47:13 AM|ralph@home|Resuming task FA_CASP6_t198__435_26_1 using rosetta_beta version 505
                                                                http://ralph.bakerlab.org/result.php?resultid=97816

                                                                last night i quit boinc when this workunit was at about 1.0427% after 2h45m15s (model=1, step=340905, full atom relax). i've now restarted boinc today, and it's now at 1.0424% after 2h10m10s (model=1, step=340558, full atom relax) and still running. so it has started redoing the same work that it had already done last night.

                                                                it seems that the new checkpointing didn't work (since it was redone today). or, did it just not reach a "checkpointable stage" last night (since this seems like a rather large structure)?
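
A toy sketch of the second possibility, assuming (purely for illustration) that state is saved every fixed number of steps. This is not how Rosetta is known to checkpoint; `CheckpointedJob` and `checkpoint_every` are made-up names. It only shows that a restart resumes from the last saved step, so a job that never reached a save point starts over.

```python
class CheckpointedJob:
    """Toy model of a job that saves its state every `checkpoint_every` steps."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.step = 0         # progress held in memory
        self.saved_step = 0   # progress that would survive a restart

    def run(self, n_steps):
        for _ in range(n_steps):
            self.step += 1
            if self.step % self.checkpoint_every == 0:
                self.saved_step = self.step   # "write" a checkpoint

    def restart(self):
        """Simulate quitting and restarting; return the number of steps lost."""
        lost = self.step - self.saved_step
        self.step = self.saved_step
        return lost
```

For example, a job with `checkpoint_every=1000` that is stopped at step 2500 resumes at step 2000, while one with `checkpoint_every=10000` resumes at step 0 and repeats everything.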
                                                                ____________

                                                                tralala

                                                                Joined: Apr 12 06
                                                                Posts: 52
                                                                ID: 1266
                                                                Credit: 15,257
                                                                RAC: 0
                                                                Message 1417 - Posted 28 Apr 2006 7:50:57 UTC

                                                                  I think there is little value in crunching Ralph WUs for more than 8 hours. I would suggest deactivating this feature in Ralph, sending out WUs with fixed runtimes, and sending out the mix most appropriate for the tested app/WU. But maybe Rhiju can give his opinion on this. Nevertheless, if one can only crunch one WU in 4 days due to the resource share of Ralph and the runtime preference, that is okay. I think the goal of Ralph is not throughput but diversity. It is better to have 10 hosts trying 1 WU than 1 host trying 10. But perhaps Rhiju can give his opinion on that and post some advice in the news section (at least not to download 20 WUs at once).

                                                                  "Max outstanding WU/CPU"

                                                                  This would be a cool feature but that is something BOINC has to implement. It would certainly enable much better distribution of WU without restricting hosts on the maximum wu per day.
                                                                  ____________

                                                                  tralala

                                                                  Joined: Apr 12 06
                                                                  Posts: 52
                                                                  ID: 1266
                                                                  Credit: 15,257
                                                                  RAC: 0
                                                                  Message 1418 - Posted 28 Apr 2006 7:54:31 UTC - in response to Message 1414.

                                                                    Last modified: 28 Apr 2006 8:01:27 UTC

                                                                    Back to possible bugs:

                                                                    Rosetta 5.06

                                                                    using 161 MB of memory, 542 MB of virtual memory

                                                                    The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

                                                                    I guess, it will never finish :-(

                                                                    Oh, my setting for RALPH Target CPU time is 4 hours ...

                                                                    This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911

                                                                    This is the result: http://ralph.bakerlab.org/result.php?resultid=98748

                                                                    Abort or stay a little bit longer ?


                                                                    This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.

                                                                    ____________

                                                                    Profile Carlos_Pfitzner
                                                                    Avatar

                                                                    Joined: Feb 16 06
                                                                    Posts: 182
                                                                    ID: 296
                                                                    Credit: 22,792
                                                                    RAC: 0
                                                                    Message 1419 - Posted 28 Apr 2006 11:30:32 UTC

                                                                      Last modified: 28 Apr 2006 11:35:08 UTC

                                                                      Rosetta beta 5.06 Linux Success
                                                                      http://ralph.bakerlab.org/result.php?resultid=98212

                                                                      I succeeded in completing the above job on a Linux PC with 256 MB RAM.

                                                                      *All other jobs on the above PC got some sort of error!

                                                                      What I did ...

                                                                      1) suspended all other projects running on this pc, left only ralph
                                                                      2) opened some disk space by deleting some old stuff
                                                                      3) shutdown one of my 10 mbps Internet links, and the load balancing stuff
                                                                      4) crunched after midnight, while the majority of my users are asleep

                                                                      So 5.06 must be OK for Linux too.
                                                                      However, it is fragile ... any disturbance ... such as heavy network traffic,
                                                                      or running multiple projects (even when kept in RAM) causes the job, oops, WU, to fail.

                                                                      suggestion:
                                                                      *Signal 11 needs to be trapped so the app exits with 0 instead of 183.
                                                                      Then the job will exit with 0, but with no finished file,
                                                                      and BOINC will restart it until it finishes ...
                                                                      ____________
                                                                      Click signature for global team stats

                                                                      Yeti
                                                                      Avatar

                                                                      Joined: Feb 19 06
                                                                      Posts: 30
                                                                      ID: 581
                                                                      Credit: 49,557
                                                                      RAC: 0
                                                                      Message 1420 - Posted 28 Apr 2006 12:01:04 UTC - in response to Message 1418.

                                                                        Back to possible bugs:

                                                                        Rosetta 5.06

                                                                        using 161 MB of memory, 542 MB of virtual memory

                                                                        The box is a very old one, the WU has run 11 hours now, sitting with 1,04 %

                                                                        I guess, it will never finish :-(

                                                                        Oh, my setting for RALPH Target CPU time is 4 hours ...

                                                                        This is the box: http://ralph.bakerlab.org/show_host_detail.php?hostid=1911

                                                                        This is the result: http://ralph.bakerlab.org/result.php?resultid=98748

                                                                        Abort or stay a little bit longer ?


                                                                        This t216 protein is really big. It used up to 250 MB on my box and needed over an hour for the first model to finish (on AMD 64 @ 2400 MHz). So I suggest not to abort but to see whether it will finish on your old machine.


                                                                        okay, it seems, as if it finished without error :-)
                                                                        ____________


                                                                        Supporting BOINC, a great concept !

                                                                        [B^S] suguruhirahara

                                                                        Joined: Mar 5 06
                                                                        Posts: 40
                                                                        ID: 992
                                                                        Credit: 6,001
                                                                        RAC: 0
                                                                        Message 1421 - Posted 28 Apr 2006 13:22:18 UTC

                                                                          Last modified: 28 Apr 2006 13:22:51 UTC

                                                                          Workunits also complete fine on Windows XP x64 Edition (Pentium D 2.8 GHz, 1 GB RAM), using 129 MB and 75 MB of it.

                                                                          With this version, my computer no longer experiences the error that crashed workunits when graphics are shown on screen. Very nice.

                                                                          But the completion time is not estimated well. For example, before a workunit started, the time to completion was "01:51:20". But it is "01:57:00" even after 35% of the work was already done. I didn't notice such a big difference with the former version.
                                                                          ____________

                                                                          Mike Gelvin
                                                                          Avatar

                                                                          Joined: Feb 17 06
                                                                          Posts: 50
                                                                          ID: 468
                                                                          Credit: 55,397
                                                                          RAC: 0
                                                                          Message 1422 - Posted 28 Apr 2006 14:49:47 UTC

                                                                            Last modified: 28 Apr 2006 14:51:31 UTC

                                                                            4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
                                                                            4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
                                                                            4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
                                                                            4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 (<file_xfer_error> <file_name>WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)


                                                                            result: http://ralph.bakerlab.org/result.php?resultid=97709

                                                                            Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM


                                                                            There is an additional message in the result about a non-existent file:
                                                                            GZIP SILENT FILE: .\xx1enh.out
                                                                            WARNING! attempt to gzip file .\xx1enh.out failed: file does not exist.
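
For anyone collecting these from their own message logs, a small sketch that pulls the file name and error code out of a `<file_xfer_error>` message like the one above. The function name `parse_xfer_error` is made up, and the pattern only assumes the tag layout shown in this post; it says nothing about what a given error code means.

```python
import re

# Matches the <file_xfer_error> fragment as it appears in the client message.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*</file_xfer_error>"
)


def parse_xfer_error(line):
    """Return (file_name, error_code) if the line has a transfer error, else None."""
    match = XFER_ERROR.search(line)
    if match is None:
        return None
    return match.group("name"), int(match.group("code"))
```

Lines without a `<file_xfer_error>` fragment simply return `None`, so the function can be run over a whole log.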

                                                                            ____________

                                                                            Profile feet1st

                                                                            Joined: Mar 7 06
                                                                            Posts: 312
                                                                            ID: 1028
                                                                            Credit: 110,522
                                                                            RAC: 0
                                                                            Message 1424 - Posted 28 Apr 2006 17:19:02 UTC - in response to Message 1412.

                                                                              Maybe for each run, Rhiju, you could advise us testers what settings you would like us to use to achieve your goals for that particular run. In other words, what runtime setting would you like, would you also like us to run other projects at the same time, or just run Ralph by itself to get some results back really quickly, etc., etc.

                                                                              In this way we can more directly assist you in achieving your testing objectives.


                                                                              If they are adding checkpointing and want more frequent switching between jobs, then that makes sense... once we're over these hurdles, I think the best test would be for everyone to have their Ralph preferences match their R@H preferences... and the randomness of how we all have these set is the best beta test, the most similar to the user base of Rosetta.

                                                                              I guess what I'm saying is, if necessary, instruct us on preference changes you'd like to see... but then let's test the same version another couple (several) days back on our normal settings.
                                                                              ____________

                                                                              Profile feet1st

                                                                              Joined: Mar 7 06
                                                                              Posts: 312
                                                                              ID: 1028
                                                                              Credit: 110,522
                                                                              RAC: 0
                                                                              Message 1425 - Posted 28 Apr 2006 17:23:25 UTC - in response to Message 1403.

                                                                                Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?

                                                                                I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort 5.05 WUs.

                                                                                This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work, I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU.

                                                                                ____________

                                                                                Profile [AF>France>Est>Lorraine]Le Zam
                                                                                Avatar

                                                                                Joined: Mar 2 06
                                                                                Posts: 9
                                                                                ID: 929
                                                                                Credit: 3,278
                                                                                RAC: 0
                                                                                Message 1430 - Posted 29 Apr 2006 10:49:09 UTC

                                                                                  Hello, I have a problem with this 5.06 WU:
                                                                                  FA_CASP6_t216__444_2_2
                                                                                  50 minutes for 1.02%
                                                                                  So I aborted it
                                                                                  Bye and go on...

                                                                                  ____________

                                                                                  Profile anders n

                                                                                  Joined: Feb 16 06
                                                                                  Posts: 166
                                                                                  ID: 91
                                                                                  Credit: 131,419
                                                                                  RAC: 0
                                                                                  Message 1431 - Posted 29 Apr 2006 10:54:13 UTC - in response to Message 1430.

                                                                                    Hello, I have a problem with this 5.06 WU:
                                                                                    FA_CASP6_t216__444_2_2
                                                                                    50 minutes for 1.02%
                                                                                    So I aborted it
                                                                                    Bye and go on...



                                                                                    They are so big that it takes more than 1 hour on a fast computer to complete 1 decoy.

                                                                                    Anders n
                                                                                    ____________

                                                                                    Dotsch
                                                                                    Avatar

                                                                                    Joined: Mar 4 06
                                                                                    Posts: 11
                                                                                    ID: 975
                                                                                    Credit: 9,825
                                                                                    RAC: 0
                                                                                    Message 1434 - Posted 29 Apr 2006 21:10:15 UTC

                                                                                      I have some problems with 5.06 on Windows 98:


                                                                                      <core_client_version>5.2.13</core_client_version>
                                                                                      <message> - exit code -164 (0xffffff5c)
                                                                                      </message>
                                                                                      <stderr_txt>
                                                                                      LoadLibraryA( dbghelp95.dll ): GetLastError = 1157
                                                                                      LoadLibraryA( dbghelp.dll ): GetLastError = 1157

                                                                                      </stderr_txt>


                                                                                      Result ID : http://ralph.bakerlab.org/result.php?resultid=100666
                                                                                      ____________

                                                                                      Rhiju
                                                                                      Forum moderator
                                                                                      Project developer
                                                                                      Project scientist

                                                                                      Joined: Feb 14 06
                                                                                      Posts: 161
                                                                                      ID: 4
                                                                                      Credit: 3,725
                                                                                      RAC: 0
                                                                                      Message 1436 - Posted 30 Apr 2006 2:30:46 UTC - in response to Message 1425.

                                                                                        Hi Feet1st, these are great suggestions, as usual! We've come to expect them.
                                                                                        I'm about to post 5.08, and I'll ask that ralph users use similar preferences to their r@h preferences, as you suggest. I think the checkpointing and watchdog issues have largely been resolved, thankfully, and we've moved on to testing real science.

                                                                                        As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again. In the future, if we can get trickle-messages implemented, we could send out a purge request. Still, I hear you ... I'll keep sending out work and ask others to do the same.

                                                                                        Feet1st, you noticed how bad the problem was with 5.05; has your client tried any 5.06?

                                                                                        I'm not positive, but I believe (irony is cruel sometimes) all the 5.06 WUs were gone by the time I got home to that PC to notice and abort 5.05 WUs.

                                                                                        This is ironic for two reasons. One, I've been discussing the merits of getting WUs to more hosts by limiting WUs per day or resource share, or other means of assuring some WUs remain available for at least 24hrs. Two, I asked why no application version shows on an unreturned WU on the website, and was told it's because it's flexible, so from work, I can't see if the WUs on my PC at home are for 5.05 or 5.06 :) Even though we all know that the Work tab of that PC has a specific version associated with the WU.


                                                                                        ____________

                                                                                        Rhiju
                                                                                        Forum moderator
                                                                                        Project developer
                                                                                        Project scientist

                                                                                        Joined: Feb 14 06
                                                                                        Posts: 161
                                                                                        ID: 4
                                                                                        Credit: 3,725
                                                                                        RAC: 0
                                                                                        Message 1437 - Posted 30 Apr 2006 2:33:32 UTC - in response to Message 1422.

                                                                                          Hi Mike: this is a silly thing that we haven't quite been able to fix, but should happen rarely on rosetta@home. That ralph workunit was a test that our watchdog timer properly aborts really long running jobs. So we're very glad to see it worked on your computer! If you ever run into similar super-long workunits on Rosetta@home (hopefully not!), you'll eventually get credit granted to it, because that's our policy. Thanks for posting!

                                                                                          4/28/2006 12:53:48 AM||Rescheduling CPU: files downloaded
                                                                                          4/28/2006 3:15:49 AM||Rescheduling CPU: application exited
                                                                                          4/28/2006 3:15:49 AM|ralph@home|Computation for task WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 finished
                                                                                          4/28/2006 3:15:50 AM|ralph@home|Unrecoverable error for result WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2 (<file_xfer_error> <file_name>WATCHDOG_KILL_VERY_LONG_JOBS_424_9_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)


                                                                                          result: http://ralph.bakerlab.org/result.php?resultid=97709

                                                                                          Win 2000 SP4 Intel Pentium 4 @ 2.4GHz w/ 512Meg RAM


                                                                                          There was an additional message in the result about a non-existent file:
                                                                                          GZIP SILENT FILE: .\xx1enh.out
                                                                                          WARNING! attempt to gzip file .\xx1enh.out failed: file does not exist.


                                                                                          ____________

                                                                                          tralala

                                                                                          Joined: Apr 12 06
                                                                                          Posts: 52
                                                                                          ID: 1266
                                                                                          Credit: 15,257
                                                                                          RAC: 0
                                                                                          Message 1439 - Posted 30 Apr 2006 7:19:42 UTC - in response to Message 1436.

                                                                                            Last modified: 30 Apr 2006 7:26:39 UTC

                                                                                            As for keeping work on ralph, we haven't quite got that figured out. We'd like to have jobs go out instantly to clients when we post the new app or test a new scientific mode on ralph, so that we get feedback ASAP. The problem is that if we've flooded the clients with jobs with the previous app or previous jobs, there's typically a wait for those clients to free up again.


                                                                                            That's easy to solve: limit the daily quota to five or less. That means clients grab new jobs instantly but can't pile up big caches.
                                                                                            At the moment it works as follows: the first 20 clients pile up 20 WUs each and no more work is available. These hosts are busy with them for several days, so you get your work returned late. With 5 WU/day, the first 80 clients grab 5 WUs each and are busy with them for only a day or less. I'd even say 3 WU/day is a good quota.

                                                                                            Short deadlines have a similar effect but it seems you reset them to match those of Rosetta.
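
                                                                                            As a rough illustration of the quota arithmetic above, here is a minimal Python sketch. It is not project code: the function name is made up for this example, and the batch size of 400 WUs is an assumption inferred from "20 clients pile up 20 WUs each".

```python
def hosts_served(batch_size: int, wus_per_host: int) -> int:
    """Return how many hosts receive work before the batch runs out.

    Assumes every host takes its full allowance (hypothetical model).
    """
    # Ceiling division: the last host may get a partial allowance.
    return -(-batch_size // wus_per_host)


# Assumed batch size: 20 clients x 20 WUs each, as in the post above.
BATCH = 20 * 20

for quota in (20, 5, 3):
    print(f"{quota} WU per host -> {hosts_served(BATCH, quota)} hosts get work")
```

                                                                                            Under these assumptions, a smaller quota spreads the same batch over more hosts, each holding a shorter queue, which is the point being made about faster turnaround.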
                                                                                            ____________

                                                                                            Profile anders n

                                                                                            Joined: Feb 16 06
                                                                                            Posts: 166
                                                                                            ID: 91
                                                                                            Credit: 131,419
                                                                                            RAC: 0
                                                                                            Message 1440 - Posted 30 Apr 2006 8:02:49 UTC

                                                                                              Yes, a quota of 3-5 would keep most of the hosts supplied with work, and if you need fast answers to a test batch, set the return deadline to 1-3 days and they will be crunched first.

                                                                                              Anders n
                                                                                              ____________

                                                                                              Profile JKeck {pirate}
                                                                                              Avatar

                                                                                              Joined: Feb 16 06
                                                                                              Posts: 14
                                                                                              ID: 182
                                                                                              Credit: 131,758
                                                                                              RAC: 7
                                                                                              Message 1441 - Posted 30 Apr 2006 11:16:15 UTC

                                                                                                I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.
                                                                                                ____________
                                                                                                BOINC WIKI

                                                                                                BOINCing since 2002/12/8

                                                                                                tralala

                                                                                                Joined: Apr 12 06
                                                                                                Posts: 52
                                                                                                ID: 1266
                                                                                                Credit: 15,257
                                                                                                RAC: 0
                                                                                                Message 1442 - Posted 30 Apr 2006 12:57:49 UTC - in response to Message 1441.

                                                                                                  I would think for the daily quota 2 would be the minimum and the max 4 or 8. You would want to have a chance at getting multiple tasks running on multi-CPU hosts.


                                                                                                  The daily quota is per CPU. So if you have a dual-core or a Hyperthreading-enabled P4, you get 6 WU/day if the daily quota is 3 WU/day.
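
                                                                                                  A tiny sketch of that per-CPU rule, assuming the quota simply scales with the number of CPUs the client reports (the function name is hypothetical, not a BOINC API):

```python
def daily_host_quota(per_cpu_quota: int, cpus: int) -> int:
    """Daily WU allowance for one host, assuming the quota scales per CPU."""
    return per_cpu_quota * cpus


# A dual-core or Hyperthreading P4 counts as 2 CPUs in this model.
print(daily_host_quota(3, 2))
```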
                                                                                                  ____________

                                                                                                  Mike Gelvin
                                                                                                  Avatar

                                                                                                  Joined: Feb 17 06
                                                                                                  Posts: 50
                                                                                                  ID: 468
                                                                                                  Credit: 55,397
                                                                                                  RAC: 0
                                                                                                  Message 1467 - Posted 3 May 2006 20:11:30 UTC

                                                                                                    ROM,
                                                                                                    I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?

                                                                                                    It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3

                                                                                                    I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours

                                                                                                    Running on Win2000 SP4, leave in memory is set.

                                                                                                    ____________

                                                                                                    Profile feet1st

                                                                                                    Joined: Mar 7 06
                                                                                                    Posts: 312
                                                                                                    ID: 1028
                                                                                                    Credit: 110,522
                                                                                                    RAC: 0
                                                                                                    Message 1468 - Posted 3 May 2006 22:35:50 UTC - in response to Message 1467.

                                                                                                      It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3

                                                                                                      I've seen other posts that this WU was specially designed to TEST the watchdog. It is INTENDED to have the watchdog step in and end it for you. So if you abort, you essentially leave the watchdog less proven. He'll get it!

                                                                                                      But that SHOULD be the reason why the others "failed".

                                                                                                      ____________

                                                                                                      Mike Gelvin
                                                                                                      Avatar

                                                                                                      Joined: Feb 17 06
                                                                                                      Posts: 50
                                                                                                      ID: 468
                                                                                                      Credit: 55,397
                                                                                                      RAC: 0
                                                                                                      Message 1469 - Posted 4 May 2006 5:53:44 UTC - in response to Message 1467.

                                                                                                        Last modified: 4 May 2006 5:54:14 UTC

                                                                                                        ROM,
                                                                                                        I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?

                                                                                                        It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3

                                                                                                        I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours

                                                                                                        Running on Win2000 SP4, leave in memory is set.


                                                                                                        http://ralph.bakerlab.org/workunit.php?wuid=83793

                                                                                                        Now at 24 hours and still stuck at 1.04%.
                                                                                                        ____________

                                                                                                        Profile William Senn
                                                                                                        Avatar

                                                                                                        Joined: Feb 16 06
                                                                                                        Posts: 4
                                                                                                        ID: 183
                                                                                                        Credit: 30,895
                                                                                                        RAC: 0
                                                                                                        Message 1470 - Posted 4 May 2006 10:46:03 UTC

                                                                                                          Hi,

                                                                                                          Got two erroneous results, but did not report them here, yet, sorry for being so late....

                                                                                                          resultid=98902
                                                                                                          resultid=99919

                                                                                                          App version 5.06 (both)...

                                                                                                          The other 2 earlier workunits completed successfully....

                                                                                                          greetings,

                                                                                                          William Senn...


                                                                                                          ____________

                                                                                                          Mike Gelvin
                                                                                                          Avatar

                                                                                                          Joined: Feb 17 06
                                                                                                          Posts: 50
                                                                                                          ID: 468
                                                                                                          Credit: 55,397
                                                                                                          RAC: 0
                                                                                                          Message 1471 - Posted 4 May 2006 18:45:25 UTC - in response to Message 1469.

                                                                                                            ROM,
                                                                                                            I currently have a rosetta_beta_5.06 that has been running 14 hours+ with 1.04% for progress. I have debug capability on this computer, any suggestions, or just Abort?

                                                                                                            It's labeled: WATCHDOG_KILL_VERY_LONG_JOBS_414_3

                                                                                                            I notice that 2 others ran this unit and it died at 1.5 hours and 1.8 hours

                                                                                                            Running on Win2000 SP4, leave in memory is set.


                                                                                                            http://ralph.bakerlab.org/workunit.php?wuid=83793

                                                                                                            Now at 24 hours and still stuck at 1.04%.


                                                                                                            36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?

                                                                                                            ____________

                                                                                                            Profile anders n

                                                                                                            Joined: Feb 16 06
                                                                                                            Posts: 166
                                                                                                            ID: 91
                                                                                                            Credit: 131,419
                                                                                                            RAC: 0
                                                                                                            Message 1472 - Posted 4 May 2006 19:02:41 UTC - in response to Message 1471.

                                                                                                              36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?


                                                                                                              Hi Mike

                                                                                                              Have you checked the graphics to see if the steps or % have changed?

                                                                                                              The % there should show as 1.04?? (with more digits), not just 1.04 as in BOINC Manager.

                                                                                                              Anders n

                                                                                                              ____________

                                                                                                              Mike Gelvin
                                                                                                              Avatar

                                                                                                              Joined: Feb 17 06
                                                                                                              Posts: 50
                                                                                                              ID: 468
                                                                                                              Credit: 55,397
                                                                                                              RAC: 0
                                                                                                              Message 1473 - Posted 4 May 2006 19:20:30 UTC - in response to Message 1472.

                                                                                                                36 hours and still stuck at 1.04%... the watchdog is NOT working... is anyone out there?


                                                                                                                Hi Mike

                                                                                                                Have you checked the graphics to see if the steps or % have changed?

                                                                                                                The % there should show with extra digits, like 1.04??, and not as in BOINC Manager with only 1.04.

                                                                                                                Anders n


                                                                                                                This computer is headless. Remote access only. Hence no screensaver.



                                                                                                                ____________

                                                                                                                Profile feet1st

                                                                                                                Joined: Mar 7 06
                                                                                                                Posts: 312
                                                                                                                ID: 1028
                                                                                                                Credit: 110,522
                                                                                                                RAC: 0
                                                                                                                Message 1474 - Posted 4 May 2006 19:32:08 UTC

                                                                                                                  Last modified: 4 May 2006 19:34:48 UTC

                                                                                                                  Looks like your normal WUs are the 4-hour default... so we're now well past the 4x preference guideline I've seen posted elsewhere... so it is time to abort. Since we're here on Ralph, the diagnostic info should prove useful for study. Hopefully it's something they fixed in the versions after 5.06.

                                                                                                                  Ironic... given your photo, that your computer is "headless" :):)
                                                                                                                  ____________

                                                                                                                  Profile Astro

                                                                                                                  Joined: Feb 16 06
                                                                                                                  Posts: 141
                                                                                                                  ID: 48
                                                                                                                  Credit: 32,977
                                                                                                                  RAC: 0
                                                                                                                  Message 1475 - Posted 4 May 2006 21:57:41 UTC - in response to Message 1473.

                                                                                                                    Last modified: 4 May 2006 21:59:17 UTC

                                                                                                                    This computer is headless. Remote access only. Hence no screensaver.

                                                                                                                    Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.

                                                                                                                    tony

                                                                                                                    Mike Gelvin
                                                                                                                    Avatar

                                                                                                                    Joined: Feb 17 06
                                                                                                                    Posts: 50
                                                                                                                    ID: 468
                                                                                                                    Credit: 55,397
                                                                                                                    RAC: 0
                                                                                                                    Message 1476 - Posted 4 May 2006 22:13:47 UTC - in response to Message 1475.

                                                                                                                      Last modified: 4 May 2006 22:21:10 UTC

                                                                                                                      This computer is headless. Remote access only. Hence no screensaver.

                                                                                                                      Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.

                                                                                                                      tony


                                                                                                                      It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage: Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.

                                                                                                                      ____________

                                                                                                                      Moderator9
                                                                                                                      Forum moderator

                                                                                                                      Joined: Feb 16 06
                                                                                                                      Posts: 251
                                                                                                                      ID: 210
                                                                                                                      Credit: 0
                                                                                                                      RAC: 0
                                                                                                                      Message 1478 - Posted 5 May 2006 2:17:59 UTC - in response to Message 1476.

                                                                                                                        Last modified: 5 May 2006 2:27:12 UTC

                                                                                                                        This computer is headless. Remote access only. Hence no screensaver.

                                                                                                                        Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.

                                                                                                                        tony


                                                                                                                        It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage: Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.


                                                                                                                        If it is a BIG protein you may have to wait for some time to see the steps advance, but you may be able to detect the slightest motion in the searching window image. If you see either the steps counting up or the movement in the searching window, it is still processing. On some of the large Work Units, it is possible for them to run very long times past your time setting. I would note however that yours is running way too long over the time setting. I have had a few lately that went 14 hours with a time setting of 2 hours.

                                                                                                                        The point being this: unless the Workunit is either swapped out for project switching, or BOINC is turned off and on four times, the watchdog will never wake up and abort the work unit. Failing that, the work unit will be aborted when it hits a limit preset by the project, which SHOULD be 24 hours of CPU time.

                                                                                                                        My understanding is that it is designed to look at the Work unit each time it starts to process and determine if progress has been made since the last time it started up. This presupposes that the process was stopped for some reason. It does not just sit there checking the work unit all the time. If it never stops processing the workunit, it will not check it. With luck Rhiju will chime in here and correct me if I am wrong about this, but I am going on the last explanation I had for all this.

                                                                                                                        Now let me add a caution here. If you restart BOINC before the workunit reaches a percent complete of greater than 2%, the Work unit WILL START OVER FROM THE BEGINNING AND THE CPU TIME WILL RESET TO ZERO!

                                                                                                                        So if you are going to play with starting and stopping, you should have "keep in memory" set to yes, and then suspend the Work unit or start another project long enough for another process to run for a while.

                                                                                                                        The watchdog is supposed to do 4 of these checks which show no progress before it will abort the workunit. That is part of how they worked out the "four times your time setting" concept for manual aborts.

                                                                                                                        So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug.
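
For readers following along, the restart-driven check described in this post can be sketched roughly as below. This is only an illustration of the behavior as explained in the thread, not the actual Rosetta or BOINC code. The names `RestartWatchdog`, `on_restart`, and `MAX_STALLED_RESTARTS` are hypothetical, and the rule that the counter resets whenever progress is seen is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the post: four no-progress checks before an abort.
MAX_STALLED_RESTARTS = 4


@dataclass
class RestartWatchdog:
    """Hypothetical model of a watchdog that only runs when the work unit (re)starts."""

    last_progress: Optional[float] = None  # percent complete seen at the previous start
    stalled_restarts: int = 0              # consecutive starts that showed no progress

    def on_restart(self, progress: float) -> bool:
        """Record one start-up check; return True if the work unit should be aborted."""
        if self.last_progress is not None and progress <= self.last_progress:
            self.stalled_restarts += 1
        else:
            # First start, or progress was made since the last start (assumed reset).
            self.stalled_restarts = 0
        self.last_progress = progress
        return self.stalled_restarts >= MAX_STALLED_RESTARTS


if __name__ == "__main__":
    wd = RestartWatchdog()
    # A job stuck at 1.04% that is suspended and resumed repeatedly.
    print([wd.on_restart(1.04) for _ in range(5)])
```

Under these assumptions a job that is never stopped and restarted never calls `on_restart`, so it is never flagged, which is the situation described above for an uninterrupted work unit.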

                                                                                                                        ____________
                                                                                                                        Moderator9
                                                                                                                        RALPH@home FAQs
                                                                                                                        RALPH@home Guidelines
                                                                                                                        Moderator Contact

                                                                                                                        Rhiju
                                                                                                                        Forum moderator
                                                                                                                        Project developer
                                                                                                                        Project scientist

                                                                                                                        Joined: Feb 14 06
                                                                                                                        Posts: 161
                                                                                                                        ID: 4
                                                                                                                        Credit: 3,725
                                                                                                                        RAC: 0
                                                                                                                        Message 1480 - Posted 5 May 2006 2:34:00 UTC - in response to Message 1478.

                                                                                                                          Hi Mike: thanks very much for posting. This sounds weird. The job should have been killed by the watchdog. In fact we sent out these workunits to test that infinite loops are aborted by the watchdog, and they've been "successful" in that they've mostly returned without keeping computers in infinite loops. For now, please either abort or follow mod9's suggestion of suspending and restarting a few times. If this occurs again, please post!



                                                                                                                          ____________

                                                                                                                          Moderator9
                                                                                                                          Forum moderator

                                                                                                                          Joined: Feb 16 06
                                                                                                                          Posts: 251
                                                                                                                          ID: 210
                                                                                                                          Credit: 0
                                                                                                                          RAC: 0
                                                                                                                          Message 1482 - Posted 5 May 2006 3:00:28 UTC

                                                                                                                            Last modified: 5 May 2006 11:39:15 UTC

                                                                                                                            Version 5.09 has been released. If you have errors in Version 5.09 please report them in the 5.09 Bug thread.
                                                                                                                            ____________
                                                                                                                            Moderator9
                                                                                                                            RALPH@home FAQs
                                                                                                                            RALPH@home Guidelines
                                                                                                                            Moderator Contact

                                                                                                                            Mike Gelvin
                                                                                                                            Avatar

                                                                                                                            Joined: Feb 17 06
                                                                                                                            Posts: 50
                                                                                                                            ID: 468
                                                                                                                            Credit: 55,397
                                                                                                                            RAC: 0
                                                                                                                            Message 1484 - Posted 5 May 2006 3:42:28 UTC - in response to Message 1476.

                                                                                                                              This computer is headless. Remote access only. Hence no screensaver.

                                                                                                                              Mike, I use VNC to see the graphics on my remote monitorless, keyboardless, and mouseless puter. I click on the WU from the task tab and then view graphics. No screensaver here either. If it's a service install you're hosed.

                                                                                                                              tony


                                                                                                                              It is a service install. I forgot about the "View Graphics" button. I do VNC into this computer. OK... 1.041% complete after 40 hours. Stage: Full atom relax, Mode 1, Step 100, Accepted RMSD 50.36, Accepted Energy -19.40622, whatever this all means.


                                                                                                                              Starting and stopping did indeed reset the time to 0 (I had to reboot for other reasons). I am going to allow it to build back up... once it is over 24 hours I will report back. It's the Max Time Setting (24 hrs) that appears to not be working.

                                                                                                                              ____________

                                                                                                                              Mike Gelvin
                                                                                                                              Avatar

                                                                                                                              Joined: Feb 17 06
                                                                                                                              Posts: 50
                                                                                                                              ID: 468
                                                                                                                              Credit: 55,397
                                                                                                                              RAC: 0
                                                                                                                              Message 1487 - Posted 5 May 2006 14:52:26 UTC - in response to Message 1480.

                                                                                                                                So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug.



                                                                                                                                Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.

                                                                                                                                May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.
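
The suggestion above, running the watchdog check at every checkpoint request instead of only at restart, could be sketched roughly like this. It is a minimal illustration, not how Rosetta is implemented. The function name `first_abort_check` and the threshold of four stalled checks are assumptions carried over from the discussion, and the caller is assumed to supply one progress reading per checkpoint request.

```python
from typing import Iterable, Optional


def first_abort_check(progress_at_checkpoints: Iterable[float],
                      max_stalled_checks: int = 4) -> Optional[int]:
    """Return the 0-based index of the checkpoint request at which a stalled
    job would be aborted, or None if it never stalls for long enough."""
    last: Optional[float] = None
    stalled = 0
    for i, progress in enumerate(progress_at_checkpoints):
        if last is not None and progress <= last:
            stalled += 1   # no progress since the previous checkpoint request
        else:
            stalled = 0    # first reading, or progress was made
        last = progress
        if stalled >= max_stalled_checks:
            return i
    return None


if __name__ == "__main__":
    # A job stuck at 1.04% across 40 checkpoint requests, never swapped out.
    print(first_abort_check([1.04] * 40))
```

With this shape the check no longer depends on the work unit being swapped out, so a machine running a single job without interruption would still be covered.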



                                                                                                                                ____________

                                                                                                                                Moderator9
                                                                                                                                Forum moderator

                                                                                                                                Joined: Feb 16 06
                                                                                                                                Posts: 251
                                                                                                                                ID: 210
                                                                                                                                Credit: 0
                                                                                                                                RAC: 0
                                                                                                                                Message 1490 - Posted 5 May 2006 17:56:09 UTC - in response to Message 1487.

                                                                                                                                  So the short of this is, if the workunit is simply running uninterrupted, it could run forever, or until it hits the Max time setting. This is the risk of running a single project setup. If you don't see movement in the graphic, try suspending the Work unit and letting the system run a different one for 5 min. Then restart the first Work unit again for 5 min. Repeat this process 4-5 times and it should abort the workunit if it was stuck. If it is not stuck it should let it keep running. Either that or we have a watchdog bug.



                                                                                                                                  Does the "Max time" get checked even if the app is not swapped out? That could be it, as my computer was running in EDF mode, hence it NEVER got swapped.

                                                                                                                                  May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.

                                                                                                                                  Well, it is really two separate functions that are fallbacks to one another. If the watchdog never has the opportunity to work (i.e. the work unit is never stopped and started for the check to occur) then the Work Unit will hit a wall for maximum time to process. The Max time function is independent of the watchdog and works on a different set of criteria and variables. The Max time is hard coded by the project before the Work unit is sent out.

                                                                                                                                  Right now that max time on Rosetta is 24 hours. I think it is the same for Ralph but Rhiju would have to verify that, because it could be different for each set of Work Units.

                                                                                                                                  In any case you are correct. If your system was in EDF mode, the watchdog would not likely have kicked in. Perhaps that is a good reason to revisit how the checking is done.
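
To make the "two independent fallbacks" idea concrete, here is a small sketch of how the two limits relate. It is an illustration under stated assumptions, not project code: `abort_reason` is a hypothetical helper, the 24-hour figure is the value quoted in this thread for Rosetta (it may differ per batch), and the threshold of four stalled restarts is the watchdog rule described earlier.

```python
from typing import Optional

MAX_CPU_HOURS = 24.0       # limit quoted in the thread; may differ per set of work units
MAX_STALLED_RESTARTS = 4   # watchdog threshold as described earlier in the thread


def abort_reason(cpu_hours: float, stalled_restarts: int) -> Optional[str]:
    """Return which safeguard would stop the job, or None if neither applies."""
    if stalled_restarts >= MAX_STALLED_RESTARTS:
        return "watchdog"   # restart-driven check saw no progress often enough
    if cpu_hours >= MAX_CPU_HOURS:
        return "max_time"   # hard limit, independent of the watchdog
    return None


if __name__ == "__main__":
    # The case reported above: 36 hours of uninterrupted running, no restarts.
    print(abort_reason(36.0, 0))
```

In this model a job that runs for 36 hours without ever being restarted should be stopped by the time limit alone, which is why the report above points at the Max time check rather than the watchdog.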

                                                                                                                                  ____________
                                                                                                                                  Moderator9

                                                                                                                                  Profile feet1st

                                                                                                                                  Joined: Mar 7 06
                                                                                                                                  Posts: 312
                                                                                                                                  ID: 1028
                                                                                                                                  Credit: 110,522
                                                                                                                                  RAC: 0
                                                                                                                                  Message 1495 - Posted 5 May 2006 22:16:17 UTC - in response to Message 1487.

                                                                                                                                    May I suggest that these items, (flavors of the watchdogs) get checked whenever BOINC requests a checkpoint? I understand this is every hour or so. I realize that Rosetta doesn’t perform the checkpoint, but it could process watchdog duties.


                                                                                                                                     Perhaps he's on to something there. Could the watchdog code be evaluated at the points in the model where checkpoints are possible? Or is that part of the problem: we don't reach the checkpointable stage in the model?

                                                                                                                                     I just wanted to point out that BOINC doesn't request checkpoints. It is up to the application to do so when appropriate. Rosetta now checkpoints about every 20 minutes or so. So it is not that Rosetta was previously ignoring requests from BOINC. The architecture of BOINC is such that the manager cannot signal the application to do a checkpoint; indeed, most applications have to complete a certain phase of processing before they can do so, and in that sense Rosetta is no different. With the new changes, they have actually created new places in their crunching where checkpoints may be performed... and performed efficiently. You don't want to waste time doing too much checkpointing either, so it's a balance.

                                                                                                                                     What happens every hour or so is BOINC reevaluating whether the application being run should be switched (60 min is the default "switch between applications every..." time). And if, at the point of that switch, the application is removed from memory, then the work done since the last checkpoint is all lost. This is how BOINC works. This is why the more frequent checkpointing was such a great thing for productivity. And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!
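
A rough way to put numbers on that, as a sketch only: assume checkpoints land at evenly spaced intervals from the start of a run slice and that the task is removed from memory exactly at the switch. The function name `lost_minutes` is made up for illustration, and real checkpoint spacing in Rosetta will not be this regular.

```python
def lost_minutes(run_minutes: float, checkpoint_interval: float) -> float:
    """Work lost if the task leaves memory after run_minutes of crunching.

    Assumes evenly spaced checkpoints starting from the beginning of the
    slice, so everything after the most recent one is thrown away.
    """
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    return run_minutes % checkpoint_interval
```

With the 60 minute default switch time, a 20 minute checkpoint interval loses nothing under these assumptions, a 25 minute interval loses 10 minutes, and a task that would need 90 minutes to reach its first checkpoint loses the whole hour.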
                                                                                                                                    ____________

                                                                                                                                    Profile Astro

                                                                                                                                    Joined: Feb 16 06
                                                                                                                                    Posts: 141
                                                                                                                                    ID: 48
                                                                                                                                    Credit: 32,977
                                                                                                                                    RAC: 0
                                                                                                                                    Message 1497 - Posted 5 May 2006 23:15:28 UTC - in response to Message 1495.

                                                                                                                                      Last modified: 5 May 2006 23:15:53 UTC

                                                                                                                                       And now if we can just get BOINC to ONLY preempt an application after it does a checkpoint, then we'll REALLY be cruisin'!

                                                                                                                                      This was posted to the boinc alpha mail list yesterday by JM7 (the creator of the scheduler)

                                                                                                                                      John.McLeod@xxxxxxxxxx.com to boinc_dev
                                                                                                                                       May 4 (1 day ago)

                                                                                                                                      I have been working on the CPU scheduler to see what I can do to make it
                                                                                                                                      work as the doc says it should.

                                                                                                                                      What I have at the moment:

                                                                                                                                      The CPU scheduler checks the necessity to preempt:
                                                                                                                                      1) If one of the events that could cause entry to EDF occurs.
                                                                                                                                      (Checkpoint after process swap time, files downloaded, task exit, ...).
                                                                                                                                      2) At least once every 10 minutes. (Just to be safe). What should this
                                                                                                                                      frequency be? 10 minutes? an hour? the time between allowed checkpoints?

                                                                                                                                       The CPU scheduler selects tasks to run if:
                                                                                                                                      1) There are not enough runnable tasks scheduled to meet 1 per CPU
                                                                                                                                      allowed. (Startup / task complete / running task suspended ...).
                                                                                                                                      2) A checkpoint has been reached after the process swap time.
                                                                                                                                      3) One or more results has recently entered the state of requiring EDF.

                                                                                                                                      Enforcement is immediate. If a result has reached its checkpoint after
                                                                                                                                      process swap time, and the CPU scheduler has scheduled it for another
                                                                                                                                      process time, then it gets the full time allotted to it (default another
                                                                                                                                      hour + time to checkpoint).

                                                                                                                                      AND

                                                                                                                                      John.McLeod@xxxxxxxxx.com to elst93, boinc_dev
                                                                                                                                       May 4 (1 day ago)

                                                                                                                                      How often to check to see if pre-emption is needed may not want to be user
                                                                                                                                      configurable because someone is going to set the number to way too large.

                                                                                                                                       If the process doesn't checkpoint, it will either complete (and the system
                                                                                                                                       will fall under 1 - not enough runnable results running) OR another process
                                                                                                                                      will require attention in order to meet deadline in which case, that
                                                                                                                                      process will start running.

                                                                                                                                      One further note, if a process does actually make it to a checkpoint, it
                                                                                                                                      will then be removed from memory when it suspends - this suspend will
                                                                                                                                      happen within a second or two of the checkpoint.

                                                                                                                                      jm7

                                                                                                                                       Seems from this, it's already being looked into.
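
For anyone who finds the two lists in JM7's first mail easier to read as code, here is a paraphrase in Python. This is only a sketch of the rules as written in the mail, not the BOINC client source: the function and parameter names are invented, and the 10 minute figure is the one JM7 himself marks as undecided.

```python
def must_check_preemption(edf_event: bool,
                          seconds_since_last_check: float,
                          max_gap_seconds: float = 600.0) -> bool:
    """When the scheduler checks whether preemption is necessary.

    1) An event that could cause entry to EDF occurred, or
    2) at least once every max_gap_seconds (10 minutes in the mail).
    """
    return edf_event or seconds_since_last_check >= max_gap_seconds


def should_select_tasks(running_tasks: int,
                        cpus_allowed: int,
                        checkpoint_after_swap_time: bool,
                        result_entered_edf: bool) -> bool:
    """When the scheduler picks tasks to run.

    1) Fewer runnable tasks are scheduled than one per allowed CPU, or
    2) a checkpoint was reached after the process swap time, or
    3) one or more results recently entered the state of requiring EDF.
    """
    return (running_tasks < cpus_allowed
            or checkpoint_after_swap_time
            or result_entered_edf)
```

Read this way, a task that never checkpoints keeps its CPU until it finishes or until some other result needs EDF, which is what the second mail says.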

                                                                                                                                      Copyright © 2017 University of Washington

                                                                                                                                      Last Modified: 20 Nov 2008 19:41:56 UTC