Bug reports for Ralph 5.42 and 5.43

Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2586 - Posted: 12 Dec 2006, 1:53:32 UTC
Last modified: 13 Dec 2006, 3:37:02 UTC

We are trying to increase stability in this release... We have turned off mouse rotation and sidechains temporarily. Please let us know if you can force a crash by playing with the "show graphics" option from the boinc manager, or with your screensaver!
ID: 2586
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2587 - Posted: 12 Dec 2006, 2:19:30 UTC
Last modified: 12 Dec 2006, 3:17:07 UTC

Might I suggest more WUs than average?? So people have time to read your message and perhaps attach to Ralph, and get some work to run? Otherwise I fear those first 1000 WUs were all consumed by the folks that run as a service and don't run the screensaver.

Also, just got a new error updating to the project:

12/11/2006 9:12:36 PM|ralph@home|Scheduler request succeeded
12/11/2006 9:12:36 PM|ralph@home|Message from server: Project encountered internal error: shared memory
12/11/2006 9:12:36 PM|ralph@home|Project is down

At time of this post (just 25min after Rhiju's) there are zero WUs available and just over 1000 in progress.
ID: 2587
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2588 - Posted: 12 Dec 2006, 2:58:19 UTC

It should be pointed out that if you wish to know with some certainty that a given screensaver problem occurred on Ralph, you're going to have to suspend Rosetta while you run Ralph this time. That way you will KNOW which was driving the screensaver.
ID: 2588
FluffyChicken
Joined: 17 Feb 06
Posts: 54
Credit: 710
RAC: 0
Message 2590 - Posted: 12 Dec 2006, 7:44:45 UTC
Last modified: 12 Dec 2006, 8:31:57 UTC

Maybe you should also note in the front-page news, say in bold, that it is graphics you are testing, and ask people to play with or turn on the graphics/screensaver.

Hint big time.

Also put a request in the Rosetta news asking anyone who uses the screensaver and has seen crashes or failed tasks to attach to Ralph to help with the testing (with a link to Ralph as well).
ID: 2590
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2591 - Posted: 12 Dec 2006, 8:34:28 UTC - in response to Message 2587.  

Hi. I queued a ton of WUs, but there's a problem with one of the work daemons. I'm working on it. Thanks for the post!
ID: 2591
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2592 - Posted: 12 Dec 2006, 8:49:30 UTC - in response to Message 2591.  

OK, I think I fixed the feeder; work should be sent out smoothly.
ID: 2592
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2593 - Posted: 12 Dec 2006, 13:28:25 UTC
Last modified: 12 Dec 2006, 14:16:29 UTC

Thanks Rhiju, I got 2 work units about 50 minutes before posting this. I've suspended Rosetta, enabled my screensaver and she'll run all day (we hope).

I also enabled the new version in my firewall, just in case it tries to report in with the debugger info.

Also wanted to point out that the timestamp shown on the homepage just above the number of WUs seems to be in the future, by as much as 30 min.
ID: 2593
Carolyn and Michael Bowen
Joined: 8 Dec 06
Posts: 1
Credit: 5,818
RAC: 0
Message 2594 - Posted: 12 Dec 2006, 14:04:09 UTC

For some reason, while 5.42 was downloading for the first time onto my XP SP2 system, my computer froze up for about 60 seconds. I was given back control once the download finished. The message log taken from around this time (posted below) looks normal, and there seem to be no long-lasting side effects; just a slight annoyance at not being able to work for a bit. The communication failure at the beginning of the listing was because my DSL connection was still in the process of setting itself up during Ralph's initial request. Times are U.S. PST.



2006/12/12 06:46:46|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
2006/12/12 06:46:46|ralph@home|Reason: To fetch work
2006/12/12 06:46:46|ralph@home|Requesting 34560 seconds of new work
2006/12/12 06:46:47||Project communication failed: attempting access to reference site
2006/12/12 06:46:50||Access to reference site failed - check network connection or proxy configuration.
2006/12/12 06:46:52|ralph@home|Scheduler request failed: couldn't resolve host name
2006/12/12 06:46:52|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
2006/12/12 06:47:53|ralph@home|Fetching scheduler list
2006/12/12 06:47:58|ralph@home|Scheduler list download succeeded
2006/12/12 06:48:05|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
2006/12/12 06:48:05|ralph@home|Reason: To fetch work
2006/12/12 06:48:05|ralph@home|Requesting 34560 seconds of new work
2006/12/12 06:48:09|ralph@home|Scheduler request succeeded
2006/12/12 06:48:11|ralph@home|Started download of file rosetta_beta_5.42_windows_intelx86.exe
2006/12/12 06:48:11|ralph@home|Started download of file frags83_2chf_.fasta.gz
2006/12/12 06:48:12|ralph@home|Finished download of file frags83_2chf_.fasta.gz
2006/12/12 06:48:12|ralph@home|Throughput 1300 bytes/sec
2006/12/12 06:48:12|ralph@home|Started download of file frags83_2chf_.psipred_ss2.gz
2006/12/12 06:48:13|ralph@home|Finished download of file frags83_2chf_.psipred_ss2.gz
2006/12/12 06:48:13|ralph@home|Throughput 5046 bytes/sec
2006/12/12 06:48:13|ralph@home|Started download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz
2006/12/12 06:48:26|ralph@home|Finished download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz
2006/12/12 06:48:26|ralph@home|Throughput 86148 bytes/sec
2006/12/12 06:48:26|ralph@home|Started download of file boinc_frags83_aa2chf_09_05.200_v1_3.gz
2006/12/12 06:48:31|ralph@home|Finished download of file boinc_frags83_aa2chf_09_05.200_v1_3.gz
2006/12/12 06:48:31|ralph@home|Throughput 73067 bytes/sec
2006/12/12 06:48:31|ralph@home|Started download of file frags83_2chf.pdb.gz
2006/12/12 06:48:33|ralph@home|Finished download of file frags83_2chf.pdb.gz
2006/12/12 06:48:33|ralph@home|Throughput 27867 bytes/sec
2006/12/12 06:48:33|ralph@home|Started download of file casp7.description.shorter.txt
2006/12/12 06:48:35|ralph@home|Finished download of file casp7.description.shorter.txt
2006/12/12 06:48:35|ralph@home|Throughput 398 bytes/sec
2006/12/12 06:49:24|ralph@home|Finished download of file rosetta_beta_5.42_windows_intelx86.exe
2006/12/12 06:49:24|ralph@home|Throughput 133687 bytes/sec
2006/12/12 06:49:25||Rescheduling CPU: files downloaded
2006/12/12 06:49:26|ralph@home|Starting task 2chf__BOINC_ABINITIO_SAVE_ALL_OUT_frags83__1552_21_0 using rosetta_beta version 542
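For anyone who wants to check how long each download in a message log like this one took, here is a minimal Python sketch. It assumes the pipe-delimited `timestamp|project|message` layout seen in the listing and uses a shortened copy of it; the function name `download_durations` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

# Shortened excerpt of the message log, in the same pipe-delimited layout.
LOG = """\
2006/12/12 06:48:11|ralph@home|Started download of file rosetta_beta_5.42_windows_intelx86.exe
2006/12/12 06:48:13|ralph@home|Started download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz
2006/12/12 06:48:26|ralph@home|Finished download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz
2006/12/12 06:49:24|ralph@home|Finished download of file rosetta_beta_5.42_windows_intelx86.exe
"""

START = "Started download of file "
FINISH = "Finished download of file "


def download_durations(log_text):
    """Return {filename: seconds} for files with both a start and a finish line."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # skip anything that is not timestamp|project|message
        stamp, _project, message = parts
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        if message.startswith(START):
            started[message[len(START):]] = when
        elif message.startswith(FINISH):
            name = message[len(FINISH):]
            if name in started:
                durations[name] = (when - started[name]).total_seconds()
    return durations


for name, seconds in download_durations(LOG).items():
    print(f"{name}: {seconds:.0f} s")
```

On this excerpt the executable download runs from 06:48:11 to 06:49:24, i.e. 73 seconds, which is in the same range as the roughly 60-second freeze described in the post.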
ID: 2594
sslickerson
Joined: 15 Feb 06
Posts: 17
Credit: 4,006
RAC: 0
Message 2595 - Posted: 12 Dec 2006, 15:25:45 UTC

I've put the new app through the paces regarding the screensaver and graphics window. I've tried all fps speeds and all seems well.

--Tim
ID: 2595
sslickerson
Joined: 15 Feb 06
Posts: 17
Credit: 4,006
RAC: 0
Message 2596 - Posted: 12 Dec 2006, 15:50:22 UTC

I tried the same with Rosetta and had the *graphics window* open when the WU froze completely after about a minute of moving the graphics around. This has been typical of my computer. The WU progressed per usual with the screensaver on and no problems this time (although there have been in the past). I should note that with Rosetta WUs I see more of this happen at 30 fps than at any other number higher or lower than 30. Also, I didn't need to force the WU to error out, as BOINC was visible on the task bar and so I just exited out. It later continued from its last checkpoint.
ID: 2596
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 2597 - Posted: 12 Dec 2006, 16:15:55 UTC

I got a few WU's on my home machine this morning, and it didn't have any Rosetta, and the Ralph WU's appear to be trouble-free so far, screensaver-wise. When I get home later, I'll see if it crashed.

Here's the machine: hostid=2016
ID: 2597
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2598 - Posted: 12 Dec 2006, 17:25:40 UTC - in response to Message 2596.  

Things sound good so far -- I'll wait for a few more replies (e.g. from Feet1st) before doing the update on rosetta@home. Thanks to all who have checked in!
ID: 2598
Bjarke
Joined: 25 Feb 06
Posts: 5
Credit: 5,523
RAC: 0
Message 2599 - Posted: 12 Dec 2006, 19:30:11 UTC

I crashed a WU during graphics. It happened either because I showed two RALPH graphics windows at the same time (I have two cores), or because I tried to zoom in/out on the left picture.
result


ID: 2599
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2600 - Posted: 12 Dec 2006, 22:27:16 UTC

I haven't been at the proper PC all day, but the WUs still show as not reported yet, so that's a good indication that they are still happily crunching (24hr time preference). Previously I wasn't ever able to crunch ANY WU for the full 24hrs without a hang or failure or watchdog kicking in.

...I suppose it COULD ALSO mean that a squirrel has found a creative new way to combine the electric transformer outside my home with a latent desire to pursue the afterlife... as was the cause of some WUs not reporting YESTERDAY :)
ID: 2600
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2601 - Posted: 13 Dec 2006, 1:16:27 UTC
Last modified: 13 Dec 2006, 2:10:46 UTC

Well, I'm now back to my problem PC, and was about to blow the train whistle --Whoo hooo!-- when I saw my screensaver MOVING, and it was on model 94. But then I noticed it only crunched the first WU for almost exactly 3hrs. Now that I've updated the project, it shows the watchdog ended it. And I commonly saw that same symptom on Rosetta once I activated the screensaver on the same host.

Because the watchdog ended it, the messages tab just shows that the very long WU name "finished".

The second WU I received is crunching happily, now into its 10th hour. I could NEVER have lasted that long on v5.41. So, I can definitely say "more stable". It is supposed to run for 24hrs, and that means if all goes well it will complete just after I leave for work in the AM.

I should note that my hyperthreaded CPU is set in BOINC to only use 1 at the moment. That measure did not seem to make Rosetta any more stable. Now I should set it back to two, but the only other work I have is for Rosetta :) and if I have both running, I don't know how BOINC chooses the screensaver, but I think I'll botch my test.

The "docking" WUs seemed to be the most graphically intense. Have some of those been queued up for test here?
ID: 2601
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2602 - Posted: 13 Dec 2006, 2:05:24 UTC - in response to Message 2601.  

By a weird coincidence, we also had a ralph WU stuck on our test PC that was shut down by the watchdog. I've checked the results returned so far, though; most of the jobs are coming back fine. Your and my WUs are a little scary; we're going to work hard in the new year to trap these WUs and to track down where the jobs are getting stuck. But they look like the exception, rather than the rule.
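As a rough illustration of what a watchdog does, here is a minimal Python sketch. It assumes a design in which the task reports progress and a monitor flags it when no report arrives within a timeout. The class name, the timeout, and the explicit clock values are all hypothetical; the actual Rosetta watchdog may work quite differently.

```python
class Watchdog:
    """Flags a task as stuck when no progress is reported for `timeout` seconds.

    Hypothetical sketch: times are passed in explicitly so the logic can be
    followed (and tested) without real clocks or threads.
    """

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_progress = now

    def report_progress(self, now):
        # The task calls this whenever it completes a unit of work.
        self.last_progress = now

    def is_stuck(self, now):
        # Stuck means: strictly more than `timeout` seconds since last progress.
        return now - self.last_progress > self.timeout


# Example with made-up numbers: a 10-second timeout, times in seconds.
dog = Watchdog(timeout=10, now=0)
dog.report_progress(now=8)
print(dog.is_stuck(now=18))  # 10 s since last progress: not yet over the limit
print(dog.is_stuck(now=19))  # 11 s since last progress: over the limit
```

A real monitor would typically poll `is_stuck` from a separate thread and end the task when it returns true.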


Thanks to Bjarke as well. I'm guessing that Bjarke's hypothesis about opening two ralph windows simultaneously is correct.
ID: 2602
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2603 - Posted: 13 Dec 2006, 3:04:30 UTC
Last modified: 13 Dec 2006, 3:50:23 UTC

Rhiju, I know you have some means of running the WUs with the same starting seed value, or heck if you know it died on model X, then start it on that model's starting value ...but if you run the WUs again, do they again require the watchdog on the same model? If so, then there's some better "sniffing" possible for your Rosetta blood hounds. If not, then there is something about the rest of the environment that causes the failure (like the screensaver, or memory issues, or what have you).

I'm still finishing the last Ralph WU to see if it will go 24hrs. And, after all of an hour apparently, it looks like the v5.43 WUs are all gone already, and I didn't get any, so I won't have any feedback on v5.43 for you.

I've also been puzzled why the watchdog shutting down the thread always seems to lose the completed models of that task? I thought it would preserve them.
ID: 2603
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2604 - Posted: 13 Dec 2006, 3:49:08 UTC - in response to Message 2603.  

Yea, the good news is that we can rerun the jobs with the seed that was sent out, and thus reproduce the error (and see where the job gets stuck). It will probably take a day or two of debugging, and we'll get that going in the new year.

There are a lot of WU's queued for 5.43; but I think I may need to increase the size of the buffer. Thanks for the suggestions!
ID: 2604
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2605 - Posted: 13 Dec 2006, 4:22:39 UTC
Last modified: 13 Dec 2006, 5:15:18 UTC

Thanks. I just nabbed two more WUs! I'll suspend the 5.42 task so I can see how 5.43 does overnight... on both threads of my HT CPU.

I guess the reason I brought up running on the same seed etc. was that it should be pretty easy to set up a machine to run that one model and see if it needs the watchdog to kill it. ...and if it does not, then it implies there are still some problems with the screensaver (or were at 5.42 anyway). If you find indeed it does invoke the watchdog, then you've got something for your 2007 "to do" list. But it would help assure that it is indeed the random event you mention... that just happened to occur with both our systems, or not.
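The reasoning above can be sketched in Python. This is a toy illustration only: `run_model` is a hypothetical stand-in for a Rosetta trajectory, the stall probability is invented, and nothing here reflects the actual Rosetta code. It shows that a run driven by a seeded random number generator repeats exactly, so a stall that reappears locally with the same seed points at the computation, while one that does not reappear points at the environment.

```python
import random


def run_model(seed, max_steps=1000):
    """Hypothetical model run: return the step where it stalls, or None."""
    rng = random.Random(seed)  # all randomness comes from this seeded generator
    for step in range(max_steps):
        if rng.random() < 0.001:  # invented condition standing in for a stall
            return step
    return None


def classify(seed, stalled_in_field):
    """Re-run a reported seed locally and interpret the outcome."""
    first = run_model(seed)
    second = run_model(seed)
    assert first == second  # same seed gives the same trajectory
    if not stalled_in_field:
        return "no stall reported"
    if first is not None:
        return "reproducible: the stall is in the computation itself"
    return "not reproducible: suspect the environment (screensaver, memory, ...)"


print(classify(2006, stalled_in_field=True))
```

Running the local copy without graphics, as suggested in the thread, is what makes a non-reproducible result informative: the seed is held fixed and only the environment differs.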
ID: 2605
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2606 - Posted: 13 Dec 2006, 5:23:53 UTC - in response to Message 2605.  

Great, let me know how those 5.43 WUs go. So far no errors.

Regarding the stuck WUs, my hypothesis is that they're not graphics related; we'll be able to check this, of course, by running them locally with executables compiled without graphics. I like your idea of checking whether the problems on your machine correlate with the ones on our machines ... I'll see if I can get the seed and workunit for the latest result you posted.
ID: 2606

©2024 University of Washington
http://www.bakerlab.org