Bug reports for Ralph 5.42 and 5.43

Author	Message
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2631 - Posted: 15 Dec 2006, 23:32:16 UTC I'm having good success with 5.43 on my host which previously had problems when screensaver was active. Hyperthreaded, two WUs ran all night with no ss (I forgot! had turned it off when Ralph ran out of work so R@H wouldn't hang), then all day with ss active, one just reported in with over 22hrs crunch time (on 24hr pref), and the avg time per model seems like that was a reasonable time to end. I've received some work on my other two hosts as well, but they do not use ss. All seems well there as well. ID: 2631 · Reply Quote

genes Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20	Message 2632 - Posted: 16 Dec 2006, 3:54:43 UTC - in response to Message 2630. Last modified: 16 Dec 2006, 4:47:34 UTC Yes, that is correct, it did not freeze the computer. Edit: I can usually prevent frozen WUs from being terminated by pressing ctrl-alt-del to get the task manager to be displayed. I will also usually get the taskbar, so I can then close the BOINC manager on the taskbar, then the BOINC CC in the systray. If I do that (in that order), BOINC shuts down all the projects in an orderly manner, and restarting BOINC will then restart all the current WUs from their last checkpoint, even the frozen Rosetta. It will then usually pass the point at which it froze and complete normally. This kind of freezing is usually due to the graphics (running in screensaver mode). Edit: I've had this one fail recently. resultid=363974 It produced a lot of nice debug output. Hi gene, that job just crashed and did not freeze your computer, right? From users' reports and my local test, it looks like if a frozen WU is forced to terminate, it reports the error code as - exit code 1073807364 (0x40010004). If a WU just crashes by itself without freezing the host computer, it reports the error code as -1073741819 (0xc0000005). I had a WU fail today, this message was in the log: 12/14/2006 9:14:07 PM|ralph@home|Unrecoverable error for result 1ten__BOINC_POSE_ABRELAX_VARY_ALL_BOND_ANGLES_VARY_ALL_BOND_DISTANCES_NEWRELAXFLAGS_frags83__1561_15_0 ( - exit code -1073741819 (0xc0000005)) This result: resultid=362757 I came back to the computer and had a Windows error message on the screen "Please tell Microsoft about this problem..." . I don't know if graphics were involved, since I was out, however I do have graphics enabled on this machine, and it is a multiprocessor machine. hostid=2016 ID: 2632 · Reply Quote

fastdude Send message Joined: 13 Dec 06 Posts: 4 Credit: 113 RAC: 0	Message 2633 - Posted: 16 Dec 2006, 7:52:51 UTC No further graphic issues noticed with rosetta 5.43, ralph is suspended & not getting tasks. the previous issue happened when alternating between ralph & rosetta. I will allow ralph to get new tasks again and check it out. btw the box is using a GF4 mx 64MB agp 4x video card. ID: 2633 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2634 - Posted: 16 Dec 2006, 15:18:32 UTC Well... I WAS having better success. Last night I had two ended by the watchdog. The ss this morning was just a BOINC logo, since I had suspended Rosetta, and the Ralph WUs completed. I have another end with Exit status -1073741819. Suggest you place a timestamp in stderr. That would help to see if the watchdog kicked both WUs out at about the same time (i.e. that they both had hung at the same time an hour before). ID: 2634 · Reply Quote
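The timestamp suggestion in the post above could be sketched roughly as follows. This is only an illustration in Python, not code from the Rosetta source; the helper name `log_stderr` and the message text are hypothetical.

```python
import sys
from datetime import datetime, timezone


def log_stderr(message, stream=None, now=None):
    """Write one line to stderr, prefixed with a UTC timestamp.

    `stream` and `now` are optional so the helper can be exercised
    with a fixed time; by default it uses sys.stderr and the current time.
    """
    stream = sys.stderr if stream is None else stream
    now = datetime.now(timezone.utc) if now is None else now
    stamp = now.strftime("%Y-%m-%d %H:%M:%S UTC")
    line = f"[{stamp}] {message}"
    print(line, file=stream, flush=True)
    return line


if __name__ == "__main__":
    # Hypothetical message; the real watchdog text may differ.
    log_stderr("watchdog: no progress detected, ending run")
```

With a stamp on every stderr line, two results that were ended by the watchdog could be compared to see whether they stalled at about the same moment.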

Leffe Send message Joined: 19 Feb 06 Posts: 10 Credit: 3,683 RAC: 0	Message 2635 - Posted: 16 Dec 2006, 17:59:03 UTC Here's 1 comp. error:
16/12/2006 03:27:17|ralph@home|Starting task 2reb__TREEJUMP_ABRELAX__NEWRELAXFLAGS_TOP1__1565_17_0 using rosetta_beta version 543
16/12/2006 03:27:19||Suspending work fetch because computer is overcommitted.
16/12/2006 04:19:16|ralph@home|Unrecoverable error for result 2reb__TREEJUMP_ABRELAX__NEWRELAXFLAGS_TOP1__1565_17_0 ( - exit code -1073741819 (0xc0000005))
16/12/2006 04:19:16|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
16/12/2006 04:19:16||Rescheduling CPU: application exited
16/12/2006 04:19:16|ralph@home|Computation for task 2reb__TREEJUMP_ABRELAX__NEWRELAXFLAGS_TOP1__1565_17_0 finished
Leffe ID: 2635 · Reply Quote

FluffyChicken Send message Joined: 17 Feb 06 Posts: 54 Credit: 710 RAC: 0	Message 2636 - Posted: 17 Dec 2006, 8:42:04 UTC Have you given Jack Schonbrun a ring ? If I remember correctly he did the initial graphics setup, he may know an error or two ... (though David Kim did the rotation if memory serves me right) ID: 2636 · Reply Quote

FluffyChicken Send message Joined: 17 Feb 06 Posts: 54 Credit: 710 RAC: 0	Message 2637 - Posted: 17 Dec 2006, 8:54:04 UTC - in response to Message 2630. it reports error code as - exit code 1073807364 (0x40010004). I believe 0x40010004 means 'task is/was running' which would correspond to it being forced to close (via task manager or the hung program do you wish to kill it question.) Maybe you should give out the graphics code (and calling messages) see if anyone in that field can debug and help you. ID: 2637 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2638 - Posted: 17 Dec 2006, 16:56:44 UTC Last modified: 17 Dec 2006, 17:44:35 UTC My two docking WUs both ended prematurely (watchdog...20 credits). When I awoke this morning the screen saver was not hung, then I moved the mouse to take the ss down and it froze. Task manager shows one task getting 65% of my hyperthreaded CPU, and the other getting 35%. Odd thing is the WU that was getting the 35% was the one the graphic was being displayed for (you could tell by the elapsed time on it as shown in the graphic and in the CPU time shown in task manager). Now that I ended the application that was not responding, I crashed a WU (- exit code 1073807364 ...a positive number?), but it was the one that was getting the 65% of CPU. The threads seem confused about who is doing what. ID: 2638 · Reply Quote
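On the "a positive number?" question: both exit codes are 32-bit Windows status values, and the sign only reflects whether the top bit is set. The sketch below (Python, for illustration only; the function names are made up here) converts the signed decimal form shown in the BOINC log to hex and reads the two severity bits. As far as the Windows headers go, 0xC0000005 is STATUS_ACCESS_VIOLATION and 0x40010004 is DBG_TERMINATE_PROCESS.

```python
# Severity is encoded in the top two bits of a 32-bit NTSTATUS value.
SEVERITY = ("success", "informational", "warning", "error")


def as_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code):
    """Return (hex string, severity name) for a Windows exit code."""
    raw = as_unsigned32(exit_code)
    return f"0x{raw:08X}", SEVERITY[raw >> 30]


if __name__ == "__main__":
    for code in (1073807364, -1073741819):
        print(code, *describe(code))
```

The forced-termination code has severity bits 01 (informational), so it fits in a positive signed integer; the crash code has severity bits 11 (error), so the top bit is set and the log prints it as negative.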

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2639 - Posted: 17 Dec 2006, 19:16:48 UTC I just had that happen again. 65/35 split, with the one being displayed being the one that's getting 35%, and it didn't hang until I tried to use the computer. ID: 2639 · Reply Quote

FluffyChicken Send message Joined: 17 Feb 06 Posts: 54 Credit: 710 RAC: 0	Message 2640 - Posted: 17 Dec 2006, 20:22:32 UTC - in response to Message 2638. My two docking WUs both ended prematurely (watchdog...20 credits). When I awoke this morning the screen saver was not hung, then I moved the mouse to take the ss down and it froze. Task manager shows one task getting 65% of my hyperthreaded CPU, and the other getting 35%. Odd thing is the WU that was getting the 35% was the one the graphic was being displayed for (you could tell by the elapsed time on it as shown in the graphic and in the CPU time shown in task manager). Now that I ended the application that was not responding, I crashed a WU (- exit code 1073807364 ...a positive number?), but it was the one that was getting the 65% of CPU. The threads seem confused about who is doing what. Have you tested it with 5.8.0 yet to see if there is a difference ? ID: 2640 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2642 - Posted: 18 Dec 2006, 13:31:57 UTC Have you tested it with 5.8.0 yet to see if there is a difference ? I'm trying to change as few variables at a time as I can. Right now I feel I'm a pretty reliable litmus test for the straight, common, non-beta ('cepting Ralph) Windows installation. So, even if updating BOINC would help, that isn't the answer we need until it becomes the stable and "recommended" release. With my 24hr work units, and not knowing how much longer they plan to test, I'm never sure if I've got another 2 or 3 days to give feedback or not. And I'd prefer to run a new Rosetta version in that time if they release one, rather than a new BOINC version. But I'd be glad to do it if Chu or Rhiju feel it would be helpful to their study of the problems. ID: 2642 · Reply Quote

Andrew Leaver-Fay Send message Joined: 14 Apr 06 Posts: 2 Credit: 5,272 RAC: 0	Message 2643 - Posted: 18 Dec 2006, 14:08:36 UTC Hi Rhiju, I've noticed that ralph is running on my mac during the day even when I set my BOINC preferences so that it should only run at night (on a single processor between 6pm and 8am). This is a lot like the problems I've had with boinc before, which seemed to disappear when I updated my boinc client following DK's advice. It almost seems it's ralph and not the boinc manager that is the source of the problem; when there were no jobs to run from ralph for a while (and I was running Rosetta@home jobs only) things went smoothly. Ralph is now taking 163% of my processor time (from top -- I have two processors). Top thinks the executable's name is "rosetta_be" but clips the name at the "be" -- probably rosetta_beta. The job has been running for 157 hours. I haven't earned any credits for the past week, though. I ran ps -aux and the job looks like: rosetta_beta_5.43_powerpc-apple-darwin xx 1rnb A -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -pose_abinitio -pose_relax -pose_relax_fragment_moves -out ps clips the command line, so it could have gone on for another page or two. Seems to be a worrying bug: boinc is burning clock cycles, but not dispensing credit. Best, Andrew ID: 2643 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2644 - Posted: 18 Dec 2006, 15:09:10 UTC Andrew, it sounds like you have a task that has slipped away from BOINC's control. We see this happen once in a while on Windows. It is almost like BOINC doesn't realize it is still running. And this may be why it's running during non-scheduled hours, cuz it doesn't respond when BOINC says "whoa!". But you say it has been running for 157hrs. And THAT sounds like a task that has slipped away from the Rosetta (or Ralph) watchdog. Because it appears your run time preference is only an hour. The watchdog should have kicked in some time ago. Ralph shows you've only got one task that is not yet completed and it was sent on Thursday. And by my math, 157hrs have not elapsed since Thursday, so now I'm really confused. In fact, with a Ralph 4 day deadline, times 24hrs... the deadline is 96hrs. I did see a case on Windows where a task managed to tally up more than one second per second of wall-clock time. This was due to the screensaver problems, and a hyperthreaded CPU. Suggest you take note of the process ID at a minimum and that way you can tell for certain if it does end and another starts up. If the task really has run for that long, I'd abort it. Or at least suspend it and resume it again. But, since BOINC doesn't seem in control of that task... I guess I'd end BOINC for 5 minutes, the task should then end. If it doesn't then reboot or end that process. Then restart BOINC. ID: 2644 · Reply Quote
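The arithmetic in the post above can be checked with a few lines of Python. Two things here are assumptions, not facts from the thread: that "Thursday" means 14 Dec 2006 (so the task cannot have been sent before 00:00 UTC that day), and that the 157 hours might be CPU time accumulated across both processors at roughly the 163% that top reported. The second point is only one possible reading, offered because the post itself mentions tasks tallying more than one second per second.

```python
from datetime import datetime

# Deadline quoted in the post: 4 days of 24 hours each.
DEADLINE_DAYS = 4
deadline_hours = DEADLINE_DAYS * 24

# Assumed earliest possible send time (start of Thursday 14 Dec 2006, UTC)
# and the time of the post (Message 2644).
sent_no_earlier_than = datetime(2006, 12, 14)
posted = datetime(2006, 12, 18, 15, 9, 10)
max_wall_hours = (posted - sent_no_earlier_than).total_seconds() / 3600

# Figures reported by Andrew: 157 hours of run time, 163% CPU in top.
reported_hours = 157
cpu_share = 1.63
implied_wall_hours = reported_hours / cpu_share

if __name__ == "__main__":
    print(f"deadline:            {deadline_hours} h")
    print(f"max wall-clock time: {max_wall_hours:.1f} h")
    print(f"157 h at 163% CPU:   {implied_wall_hours:.1f} h wall-clock")
```

Under those assumptions, 157 hours cannot be wall-clock time since Thursday, but it would be in the right range for CPU time summed over two processors.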

FluffyChicken Send message Joined: 17 Feb 06 Posts: 54 Credit: 710 RAC: 0	Message 2645 - Posted: 19 Dec 2006, 8:51:06 UTC - in response to Message 2644. Andrew, it sounds like you have a task that has slipped away from BOINC's control. We see this happen once in a while on Windows. It is almost like BOINC doesn't realize it is still running. And this may be why it's running during non-scheduled hours, cuz it doesn't respond when BOINC says "whoa!". But you say it has been running for 157hrs. And THAT sounds like a task that has slipped away from the Rosetta (or Ralph) watchdog. Because it appears your run time preference is only an hour. The watchdog should have kicked in some time ago. Ralph shows you've only got one task that is not yet completed and it was sent on Thursday. And by my math, 157hrs have not elapsed since Thursday, so now I'm really confused. In fact, with a Ralph 4 day deadline, times 24hrs... the deadline is 96hrs. I did see a case on Windows where a task managed to tally up more than one second per second of wall-clock time. This was due to the screensaver problems, and a hyperthreaded CPU. Suggest you take note of the process ID at a minimum and that way you can tell for certain if it does end and another starts up. If the task really has run for that long, I'd abort it. Or at least suspend it and resume it again. But, since BOINC doesn't seem in control of that task... I guess I'd end BOINC for 5 minutes, the task should then end. If it doesn't then reboot or end that process. Then restart BOINC. Since 5.8.x is planned to be released shortly (shouldn't be too far away, but then they have said that before ;) the code is not going to change much, probably just some simple GUI fixes.
side / anyways, why are you running 24hr tasks on Ralph? I thought they wanted them short over here, would create more application swapping as well. /side Anyways, it was the communication problem (0x40010004), the still-running-and-boinc-getting-confused problem, I was wondering about, and the switching between the screensaver graphics. There were quite a few fixes going on with screensaver/graphics and starting/stopping/stalling of tasks. I thought a quick day's testing on 3hr tasks should show if it is more stable. Seeing the rate you report them and seem to be able to cause it to happen ;-) The other error (0xc0000005) has been in many projects, Rosetta before, Einstein about a year ago, CPDN as well, and was certainly always related to the graphics. Maybe a true test would be to replace the graphics with the default boinc one, see if it still happens . ID: 2645 · Reply Quote

Chu Volunteer moderator Project developer Project scientist Send message Joined: 26 Sep 06 Posts: 61 Credit: 12,545 RAC: 0	Message 2646 - Posted: 19 Dec 2006, 22:26:33 UTC We suspect it is a problem of thread synchronization. Basically, the Rosetta working thread does the simulation, which changes all the atom coordinates (which are saved in shared memory), while the graphic thread tries to read data from that place to draw the graphic or screensaver. Currently there is no locking mechanism to ensure the shared memory is accessed by one thread at a time, and this could generate conflicts or memory corruption and then trigger an error. On one of our local computers, when the screensaver or graphic is turned on, it caught errors at a rate of at least one per day on average, and without any graphics it ran flawlessly. The errors which have been observed include crashing (0xc0000005), hanging (0x40010004) and being stuck (watchdog ending). None of the errors were reproducible with the same random number seeds, and we think that is due to the randomness in the graphic process. Another piece of side evidence is that showing sidechains requires accessing shared memory more often and more intensively, and after turning off sidechains and rotating, the graphic error rate drops, but the problem is not solved completely. There seems to be a correlation between the two. Anyway, our plan is to add a thread locking mechanism in the next release to see if this helps. This will probably happen after the holiday season. I believe the new boinc 5.8.x should also help to reduce the error rate. Thanks to everyone for helping test on this issue. ID: 2646 · Reply Quote
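The locking idea described in the post above can be illustrated with a small sketch. This is Python rather than the language Rosetta is written in, and every name in it (`SharedCoords`, `run_demo`, the atom count) is hypothetical; it only shows the general pattern of one mutex guarding coordinates that a worker thread writes and a graphics thread reads, not the project's actual design.

```python
import threading


class SharedCoords:
    """Atom coordinates shared between a worker and a graphics thread."""

    def __init__(self, n_atoms):
        self._coords = [0.0] * n_atoms
        self._lock = threading.Lock()

    def update(self, value):
        # The worker holds the lock for the whole update, so a reader
        # can never observe a half-written set of coordinates.
        with self._lock:
            for i in range(len(self._coords)):
                self._coords[i] = value

    def snapshot(self):
        # The graphics thread copies the data under the same lock.
        with self._lock:
            return list(self._coords)


def run_demo(n_atoms=1000, steps=200):
    """Run both threads; return the final frame and any torn frames seen."""
    shared = SharedCoords(n_atoms)
    torn = []
    done = threading.Event()

    def worker():
        for step in range(1, steps + 1):
            shared.update(float(step))
        done.set()

    def graphics():
        while not done.is_set():
            frame = shared.snapshot()
            if len(set(frame)) != 1:  # mixed values would mean a torn read
                torn.append(frame)

    threads = [threading.Thread(target=worker), threading.Thread(target=graphics)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.snapshot(), torn


if __name__ == "__main__":
    final, torn = run_demo()
    print("final value:", final[0], "torn frames:", len(torn))
```

The cost of this simple scheme is that each thread waits while the other holds the lock.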

genes Send message Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20	Message 2647 - Posted: 20 Dec 2006, 2:17:57 UTC Got an error on this WU today: resultid=375110 It was an 0xC0000005. Running with a Quad Xeon (Sossaman), XPSP2 and Boinc core 5.8.0. Graphics were enabled. No biggie, just sayin'. Errors seem to be somewhat less with 5.8.0 and 5.43, but they still happen. Not shown, of course, are the ones that I save by stopping and restarting Boinc instead of aborting them. Locking/mutexes are the way to go. ID: 2647 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2648 - Posted: 20 Dec 2006, 3:53:51 UTC Sounds like a lot of the "random failures" will be eliminated!! YIPPIE!!! I take it you can confirm that hyperthreaded CPUs saw more crashes as well? ...but now you're in for the next problem. PERFORMANCE! The crunching thread will basically be on hold until the graphic thread does its thing. If you weren't already doing that, it will be quite a difference. I have no idea if this is feasible or not, but if you can segment memory into classes or front to back of the protein or place some sort of order to it, perhaps you can have a semaphore that is more granular than just "1=crunch", "0=Graphic" access to shared memory. If you can get more granular, then perhaps the graphic can get what it needs to render the first 5 AAs of the protein and sidechains, and then release that memory so the crunch thread can start considering the next position for those specific atoms or AAs. Actually... the crunch thread COULD "consider" the next move, just don't write it back to shared memory yet. Picture it like a segmented worm, looped into a circle. Where the threads would each process in order from head to tail, one segment at a time. Since ya gotta crunch something to draw it, I guess the graphic thread would always trail the crunch thread. Since the graphic is read-only, perhaps that helps too, perhaps the work in the crunch thread that actually updates the shared memory could be delayed until late in the compute cycle, thus allowing read-only access to both threads for a higher % of the time and avoiding lock waits. Or, perhaps you could lock the different views with separate semaphores. So we might be crunching a new... I don't know what ya call the leftmost box, but crunch on that while drawing the "accepted" graphic or the "low energy" graphic. Actually, the only real contention problem is going to be on that left-most box. 
Perhaps the graphic thread is already smart enough not to redraw the low energy shape unless it actually has changed? That would minimize contention there. The other idea is to keep two copies of the info. needed for the graphic. The crunch thread pushes a copy out to the graphic thread, but only often enough to keep up with the frame rate desired. That way it always has access to the crunching memory it has now, and it is only delayed every 1/10th of a second when we have to synch up to do the push. Combine this with the first idea of having a more granular lock and you won't be waiting for the semaphore often at all. ID: 2648 · Reply Quote
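The "more granular" locking idea from the post above (one lock per segment of the protein instead of one lock for everything) might look roughly like this. It is a Python sketch under stated assumptions: the class name `SegmentedCoords`, the segment sizes, and the flat list of floats per segment are all invented for illustration and say nothing about how Rosetta actually lays out its shared memory.

```python
import threading


class SegmentedCoords:
    """Coordinates split into segments, each guarded by its own lock."""

    def __init__(self, n_segments, atoms_per_segment):
        self._segments = [[0.0] * atoms_per_segment for _ in range(n_segments)]
        self._locks = [threading.Lock() for _ in range(n_segments)]

    def __len__(self):
        return len(self._segments)

    def write_segment(self, index, values):
        # The crunch thread locks only the segment it is updating,
        # so the graphics thread can read any other segment meanwhile.
        if len(values) != len(self._segments[index]):
            raise ValueError("segment length mismatch")
        with self._locks[index]:
            self._segments[index][:] = values

    def read_segment(self, index):
        # The graphics thread copies one segment at a time, head to tail.
        with self._locks[index]:
            return list(self._segments[index])


if __name__ == "__main__":
    coords = SegmentedCoords(n_segments=4, atoms_per_segment=3)
    coords.write_segment(0, [1.0, 2.0, 3.0])
    print([coords.read_segment(i) for i in range(len(coords))])
```

One trade-off worth noting: each segment is internally consistent, but a full frame assembled segment by segment may mix coordinates from two different simulation steps.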

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2649 - Posted: 20 Dec 2006, 4:07:47 UTC Last modified: 20 Dec 2006, 4:52:09 UTC By the way, is there a prize for an "accurate prediction" of another sort? ...December 7, can anyone beat that? ID: 2649 · Reply Quote

feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0	Message 2650 - Posted: 20 Dec 2006, 4:33:56 UTC Ok ok, enough of my gloat-of-the-year. ...back to business ...a better idea, just double buffer what is pushed out to the graphic thread. And come up with some way of making a reasonable guess at the refresh rate. I mean take the CPU benchmarks and the time to make one iteration through this loop and rough out how many iterations to make before pushing the next frame out to the open buffer. If I'm running 30 frames per second maybe I have to push every other time through a given loop. If I'm only running 10fps then we can crunch through the loop 6 times before taking the time to push out the info. for the next frame. Lets you crunch all the time, no lock wait on the crunching thread. Just adds the time required to do the push and this is minimized by only pushing memory out when it will actually be needed by the graphic thread. I gotta think that compared to the 110MB typical size of a running WU, that two more copies of the info needed for the graphic thread is pretty minimal. ID: 2650 · Reply Quote
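The double-buffering suggestion in the post above could be sketched as follows, again in Python and purely as an illustration. `DoubleBuffer` and `iterations_per_push` are hypothetical names, and the loop rate of 60 iterations per second is only inferred from the post's own example (push every 2nd pass at 30 fps, every 6th pass at 10 fps). Unlike the ideal described in the post, this sketch still takes a lock, but only for the buffer swap and the reader's copy.

```python
import threading


def iterations_per_push(loop_rate_hz, target_fps):
    """How many crunch iterations to run between frame pushes."""
    return max(1, int(loop_rate_hz // target_fps))


class DoubleBuffer:
    """Two copies of the frame data; assumes exactly one writer thread."""

    def __init__(self, initial):
        self._buffers = [list(initial), list(initial)]
        self._front = 0  # index of the buffer the graphics thread reads
        self._swap_lock = threading.Lock()

    def publish(self, coords):
        # Fill the back buffer without holding the lock: the reader only
        # ever touches the front buffer, so there is no conflict here.
        back = 1 - self._front
        self._buffers[back][:] = coords
        # The swap itself is the only step the writer does under the lock.
        with self._swap_lock:
            self._front = back

    def read(self):
        # The reader copies the front buffer; the lock keeps the writer
        # from swapping (and then overwriting this buffer) mid-copy.
        with self._swap_lock:
            return list(self._buffers[self._front])


if __name__ == "__main__":
    frames = DoubleBuffer([0.0, 0.0, 0.0])
    every = iterations_per_push(loop_rate_hz=60, target_fps=10)
    for step in range(1, 13):
        coords = [float(step)] * 3      # stand-in for one crunch iteration
        if step % every == 0:
            frames.publish(coords)      # push only when a frame is due
    print(every, frames.read())
```

The writer's working memory is never locked; it pays only for the copy into the back buffer and a brief wait if the reader happens to be mid-copy at swap time.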

FluffyChicken Send message Joined: 17 Feb 06 Posts: 54 Credit: 710 RAC: 0	Message 2651 - Posted: 20 Dec 2006, 14:02:33 UTC Chu, could you put that problem summary in the 'technical news' at the Rosetta@home site? It would give people a definite place that explains what the problem is, and it would also mean forum helpers could post a link to the news when the errors are happening. ID: 2651 · Reply Quote