Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43
Author | Message |
---|---|
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
We are trying to increase stability in this release... We have turned off mouse rotation and sidechains temporarily. Please let us know if you can force a crash by playing with the "show graphics" option from the boinc manager, or with your screensaver! |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Might I suggest more WUs than average? That way people have time to read your message, perhaps attach to Ralph, and get some work to run. Otherwise I fear those first 1000 WUs were all consumed by the folks who run as a service and don't run the screensaver. Also, I just got a new error updating to the project: 12/11/2006 9:12:36 PM|ralph@home|Scheduler request succeeded 12/11/2006 9:12:36 PM|ralph@home|Message from server: Project encountered internal error: shared memory 12/11/2006 9:12:36 PM|ralph@home|Project is down At the time of this post (just 25min after Rhiju's) there are zero WUs available and just over 1000 in progress. |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
It should be pointed out that if you wish to know with some certainty that a given screensaver problem occurred on Ralph, you're going to have to suspend Rosetta while you run Ralph this time. That way you will KNOW which was driving the screensaver. |
FluffyChicken Joined: 17 Feb 06 Posts: 54 Credit: 710 RAC: 0 |
Maybe you should also note in the front page news, say in bold, that it is graphics you are testing, and ask people to play with or turn on the graphics/screensaver. Hint big time. Also put a request in the Rosetta news asking anyone who uses the screensaver and has seen problems with crashing/failed tasks to attach to Ralph to help in the testing (with a link to Ralph as well). |
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Hi. I queued a ton of WUs, but there's a problem with one of the work daemons. I'm working on it. Thanks for the post! Might I suggest more WUs than average?? So people have time to read your message and perhaps attach to Ralph, and get some work to run? Otherwise I fear those first 1000 WUs all were consumed by the folks that run as a service and don't run the screensaver. |
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
OK, I think I fixed the feeder; work should be sent out smoothly. Hi. I queued a ton of WUs, but there's a problem with one of the work daemons. I'm working on it. Thanks for the post! |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Thanks Rhiju, I got 2 work units about 50 minutes before posting this. I've suspended Rosetta, enabled my screensaver and she'll run all day (we hope). I also enabled the new version in my firewall, just in case it tries to report in with the debugger info. I also wanted to point out that the timestamp shown on the homepage just above the number of WUs seems to be in the future, by as much as 30min. |
Carolyn and Michael Bowen Joined: 8 Dec 06 Posts: 1 Credit: 5,818 RAC: 0 |
For some reason, while 5.42 was downloading for the first time onto my XP SP2 system, my computer froze up for about 60 seconds. I was given back control once the download finished. The message log taken from around this time (posted below) looks normal, and there seem to be no long-lasting side effects; just a slight annoyance at not being able to work for a bit. The communication failure at the beginning of the listing was because my DSL connection was still in the process of setting itself up during Ralph's initial request. Times are U.S. PST. 2006/12/12 06:46:46|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi 2006/12/12 06:46:46|ralph@home|Reason: To fetch work 2006/12/12 06:46:46|ralph@home|Requesting 34560 seconds of new work 2006/12/12 06:46:47||Project communication failed: attempting access to reference site 2006/12/12 06:46:50||Access to reference site failed - check network connection or proxy configuration. 2006/12/12 06:46:52|ralph@home|Scheduler request failed: couldn't resolve host name 2006/12/12 06:46:52|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds 2006/12/12 06:47:53|ralph@home|Fetching scheduler list 2006/12/12 06:47:58|ralph@home|Scheduler list download succeeded 2006/12/12 06:48:05|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi 2006/12/12 06:48:05|ralph@home|Reason: To fetch work 2006/12/12 06:48:05|ralph@home|Requesting 34560 seconds of new work 2006/12/12 06:48:09|ralph@home|Scheduler request succeeded 2006/12/12 06:48:11|ralph@home|Started download of file rosetta_beta_5.42_windows_intelx86.exe 2006/12/12 06:48:11|ralph@home|Started download of file frags83_2chf_.fasta.gz 2006/12/12 06:48:12|ralph@home|Finished download of file frags83_2chf_.fasta.gz 2006/12/12 06:48:12|ralph@home|Throughput 1300 bytes/sec 2006/12/12 06:48:12|ralph@home|Started download of file frags83_2chf_.psipred_ss2.gz 2006/12/12 06:48:13|ralph@home|Finished download of file frags83_2chf_.psipred_ss2.gz 2006/12/12 06:48:13|ralph@home|Throughput 5046 bytes/sec 2006/12/12 06:48:13|ralph@home|Started download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz 2006/12/12 06:48:26|ralph@home|Finished download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz 2006/12/12 06:48:26|ralph@home|Throughput 86148 bytes/sec 2006/12/12 06:48:26|ralph@home|Started download of file boinc_frags83_aa2chf_09_05.200_v1_3.gz 2006/12/12 06:48:31|ralph@home|Finished download of file boinc_frags83_aa2chf_09_05.200_v1_3.gz 2006/12/12 06:48:31|ralph@home|Throughput 73067 bytes/sec 2006/12/12 06:48:31|ralph@home|Started download of file frags83_2chf.pdb.gz 2006/12/12 06:48:33|ralph@home|Finished download of file frags83_2chf.pdb.gz 2006/12/12 06:48:33|ralph@home|Throughput 27867 bytes/sec 2006/12/12 06:48:33|ralph@home|Started download of file casp7.description.shorter.txt 2006/12/12 06:48:35|ralph@home|Finished download of file casp7.description.shorter.txt 2006/12/12 06:48:35|ralph@home|Throughput 398 bytes/sec 2006/12/12 06:49:24|ralph@home|Finished download of file rosetta_beta_5.42_windows_intelx86.exe 2006/12/12 06:49:24|ralph@home|Throughput 133687 bytes/sec 2006/12/12 06:49:25||Rescheduling CPU: files downloaded 2006/12/12 06:49:26|ralph@home|Starting task 2chf__BOINC_ABINITIO_SAVE_ALL_OUT_frags83__1552_21_0 using rosetta_beta version 542 |
sslickerson Joined: 15 Feb 06 Posts: 17 Credit: 4,006 RAC: 0 |
I've put the new app through the paces regarding the screensaver and graphics window. I've tried all fps speeds and all seems well. --Tim |
sslickerson Joined: 15 Feb 06 Posts: 17 Credit: 4,006 RAC: 0 |
I tried the same with Rosetta and had the *graphics window* open when the WU froze completely after about a minute of moving the graphics around. This has been typical of my computer. The WU progressed per usual with the screensaver on and no problems this time (although there have been in the past). I should note that with Rosetta WUs I see more of this happen at 30 fps than at any other number higher or lower than 30. Also, I didn't need to force the WU to error out, as BOINC was visible on the task bar and so I just exited out. It later continued from its last checkpoint. |
genes Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
I got a few WU's on my home machine this morning, and it didn't have any Rosetta, and the Ralph WU's appear to be trouble-free so far, screensaver-wise. When I get home later, I'll see if it crashed. Here's the machine: hostid=2016 |
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Things sound good so far -- I'll wait for a few more replies (e.g. from feet1st) before doing the update on rosetta@home. Thanks to all who have checked in! I tried the same with Rosetta and had the *graphics window* open when the WU froze completely after about a minute of moving the graphics around. This has been typical of my computer. The WU progressed per usual with the screensaver on and no problems this time (although there have been in the past). I should note that with Rosetta WUs I see more of this happen at 30 fps than at any other number higher or lower than 30. Also, I didn't need to force the WU to error out, as BOINC was visible on the task bar and so I just exited out. It later continued from its last checkpoint. |
Bjarke Joined: 25 Feb 06 Posts: 5 Credit: 5,523 RAC: 0 |
I crashed a WU during graphics. It happened either because I showed two RALPH graphics windows at the same time (I have two cores), or because I tried to zoom in/out on the left picture. result |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I haven't been at the proper PC all day, but the WUs still show as not reported yet, so that's a good indication that they are still happily crunching (24hr time preference). Previously I wasn't ever able to crunch ANY WU for the full 24hrs without a hang or failure or watchdog kicking in. ...I suppose it COULD ALSO mean that a squirrel has found a creative new way to combine the electric transformer outside my home with a latent desire to pursue the afterlife... as was the cause of some WUs not reporting YESTERDAY :) |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Well, I'm now back to my problem PC, and was about to blow the train whistle --Whoo hooo!-- when I saw my screensaver MOVING, and it was on model 94. But then I noticed it only crunched the first WU for almost exactly 3hrs. Now that I've updated to project it shows watchdog ended it. And I commonly saw that same symptom on Rosetta once I activated the screensaver on the same host. Because the watchdog ended it, the messages tab just shows that the very long WU name "finished". The second WU I received is crunching happily, now into its 10th hour. I could NEVER have lasted that long on v5.41. So, I can definitely say "more stable". It is supposed to run for 24hrs, and that means if all goes well it will complete just after I leave for work in the AM. I should note that my hyperthreaded CPU is set in BOINC to only use 1 at the moment. That measure did not seem to make Rosetta any more stable. Now I should set it back to two, but the only other work I have is for Rosetta :) and if I have both running, I don't know how BOINC chooses the screensaver, but I think I'll botch my test. The "docking" WUs seemed to be the most graphically intense. Have some of those been queued up for test here? |
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
By a weird coincidence, we also had a ralph WU stuck on our test PC that was shut down by the watchdog. I've checked the results returned so far, though; most of the jobs are coming back fine. Your and my WUs are a little scary; we're going to work hard in the new year to trap these WUs and to track down where the jobs are getting stuck. But they look like the exception, rather than the rule. Thanks to Bjarke as well. I'm guessing that Bjarke's hypothesis about opening two ralph windows simultaneously is correct. Well, I'm now back to my problem PC, and was about to blow the train whistle --Whoo hooo!-- when I saw my screensaver MOVING, and it was on model 94. But then I noticed it only crunched the first WU for almost exactly 3hrs. Now that I've updated to project it shows watchdog ended it. And I commonly saw that same symptom on Rosetta once I activated the screensaver on the same host. |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Rhiju, I know you have some means of running the WUs with the same starting seed value, or heck if you know it died on model X, then start it on that model's starting value ...but if you run the WUs again, do they again require the watchdog on the same model? If so, then there's some better "sniffing" possible for your Rosetta bloodhounds. If not, then there is something about the rest of the environment that causes the failure (like the screensaver, or memory issues, or what have you). I'm still finishing the last Ralph WU to see if it will go 24hrs. And, after all of an hour apparently, it looks like the v5.43 WUs are all gone already, and I didn't get any, so I won't have any feedback on v5.43 for you. I've also been puzzled why the watchdog shutting down the thread always seems to lose the completed models of that task. I thought it would preserve them. |
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Yeah, the good news is that we can rerun the jobs with the seed that was sent out, and thus reproduce the error (and see where the job gets stuck). It will probably take a day or two of debugging, and we'll get that going in the new year. There are a lot of WUs queued for 5.43, but I think I may need to increase the size of the buffer. Thanks for the suggestions! Rhiju, I know you have some means of running the WUs with the same starting seed value, or heck if you know it died on model X, then start it on that model's starting value ...but if you run the WUs again, do they again require the watchdog on the same model? If so, then there's some better "sniffing" possible for your Rosetta bloodhounds. If not, then there is something about the rest of the environment that causes the failure (like the screensaver, or memory issues, or what have you). |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Thanks. I just nabbed two more WUs! I'll suspend the 5.42 task so I can see how 5.43 does overnight... on both threads of my HT CPU. I guess the reason I brought up running on the same seed etc. was that it should be pretty easy to set up a machine to run that one model and see if it needs the watchdog to kill it. ...and if it does not, then it implies there are still some problems with the screensaver (or were at 5.42 anyway). If you find indeed it does invoke the watchdog, then you've got something for your 2007 "to do" list. But it would help assure that it is indeed the random event you mention... that just happened to occur with both our systems, or not. |
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Great, let me know how those 5.43 WUs go. So far no errors. Regarding the stuck WUs, my hypothesis is that they're not graphics related; we'll be able to check this, of course, by running them locally with executables compiled without graphics. I like your idea of checking whether the problems on your machine correlate with the ones on our machines ... I'll see if I can get the seed and workunit for the latest result you posted. Thanks. I just nabbed two more WUs! I'll suspend the 5.42 task so I can see how 5.43 does overnight... on both threads of my HT CPU. |
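
The seed-replay check discussed in the last few posts can be sketched roughly as below. This is only a toy illustration in Python, not Rosetta or BOINC code: `first_stuck_model`, `diagnose`, the `stall_chance` parameter and the seed value are all hypothetical names and numbers made up for the example. The only point it shows is that when every random draw comes from the same seed, a stall that depends on the work itself will recur at the same model, while a stall that does not recur points at the host environment.

```python
import random


def first_stuck_model(seed, n_models=100, stall_chance=0.02):
    """Toy stand-in for a work unit (hypothetical, not Rosetta code).

    Returns the first model number that stalls, or None if none does.
    All randomness comes from a generator seeded with `seed`, so the
    same seed always gives the same answer.
    """
    rng = random.Random(seed)
    for model in range(1, n_models + 1):
        if rng.random() < stall_chance:
            return model  # the point where a watchdog would step in
    return None


def diagnose(seed, stuck_model_on_volunteer_host):
    """Rerun the same seed locally (no graphics) and compare with the report."""
    local = first_stuck_model(seed)
    if local == stuck_model_on_volunteer_host:
        return "reproducible: the stall follows the seed, not the graphics"
    return "not reproducible: suspect the host environment (screensaver, memory, ...)"


if __name__ == "__main__":
    seed = 12345  # assumed example value, not a real work unit seed
    reported = first_stuck_model(seed)  # pretend this came from a volunteer host
    print(diagnose(seed, reported))
```

If the local rerun stalls on the same model, the work unit itself is the thing to debug; if it runs cleanly, the screensaver or something else on the volunteer's machine becomes the more likely suspect.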
©2024 University of Washington
http://www.bakerlab.org