Bug Reports for 5.45

Message boards : RALPH@home bug list : Bug Reports for 5.45

Chu
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 2743 - Posted: 29 Jan 2007, 19:18:43 UTC - in response to Message 2742.  

If there is any apology to make, that should be from us. Thank you for your time and effort helping us.

I see, you were having problems when a Rosetta job was pre-empted and swapped in and out with other BOINC applications. This is consistent with my earlier speculation that your problem is probably not graphics-related. Honestly, I don't know exactly what has gone wrong either, but it could be related to the BOINC API we were using for Rosetta 5.43 (though that does not explain why the problem did not happen universally on all other clients' machines). The current 5.45 being tested on Ralph has been built with the newest version of the BOINC API, and that might help solve your problem. The plan is to put it on Rosetta@Home either later today or tomorrow. So please give it a try when it is upgraded and see if things improve on your side. Again, thank you for your generous contribution to our project.

The error message you got is certainly one of the symptoms related to graphics, but it is definitely not limited to that. May I ask if you have experienced any stability issues with your machine in general?


Hi Chu. Apologies for the long post.

No, I've never had any stability issue with my machine for any applications I run on it, with the sole exception that it doesn't like running the BOINC manager at the same time as I'm ripping DVDs. Other than that, it's rock solid. It's fairly well overclocked - I'm running a Core2Duo E6700 at 3.46 GHz, and my PC6400-rated RAM is actually running as PC8200 - but it's tested completely stable, and several months of running both cores at 100% capacity 24/7 has never generated a single error for any BOINC application WU except Rosetta. Rosetta, though, became very touchy about running. It would inevitably fail a WU that was pre-empted and swapped out to allow something else to run. I had to leave it running all the time on one core.

We certainly do not want to lose users because of application stability, and that is why we are working on improving it. Maybe you can check whether this is improved in 5.45, and if the failure rate goes down significantly, you may consider attaching back to Rosetta@Home.


I was quite puzzled and a bit disturbed at how the failure rate on Rosetta got more and more pronounced over time without any change to my machine's configuration or any other evidence of instability. I kept going for as long as possible because I liked crunching Rosetta and I'd accumulated a very respectable number of WUs. But the failure rate was becoming alarming, and on the 15th-16th January this year some 75-80% of all WUs aborted prematurely. That's when I regretfully had to call a halt. I joined RALPH to see whether the newer versions were more stable with an eye to going back to Rosetta when they're implemented. It's hard to tell, since the fairly irregular availability of work means I don't have a large WU base to draw conclusions from, but both 5.45 and 5.44 before it seem more stable than 5.43 on my machine; for one thing, they can both be swapped in and out to allow other BOINC applications to run without causing problems.

Out of curiosity, since the beta versions seemed more stable, I allowed my BOINC manager to download some new Rosetta workunits under 5.43 on Jan 27th. Sure enough, the first three it tried to run all failed with access violations, here, here and here. The fourth WU succeeded. By that stage, though, I'd had enough again and shut it down.

I have no idea why this is happening, and the 10% failure rate you mention would have been, if anything, an overestimate of the situation during the first few months I was crunching. The problems really seem to stem from the introduction of 5.43; which is puzzling since I don't use the graphics. I'll certainly try Rosetta again when 5.43 is upgraded, but I'd be a lot happier if I knew what was going wrong.



ID: 2743 · Report as offensive    Reply Quote
Profile feet1st

Send message
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2745 - Posted: 30 Jan 2007, 3:13:13 UTC

My previously problematic machine just went 18 hrs, ss active, without a burp. It successfully completed 3 WUs and is still crunching on a fourth. Around the time I started getting these WUs, I had enabled my screen saver, went to take a shower, forgot I had left Rosetta active too, and by the time I got back to this machine it was hung already. ...the Rosetta WU, not Ralph!

...so I'd say things are looking great on Windows.
ID: 2745 · Report as offensive    Reply Quote
Billy

Send message
Joined: 29 Jan 07
Posts: 14
Credit: 7,865
RAC: 0
Message 2746 - Posted: 30 Jan 2007, 14:22:59 UTC

It isn't possible to test this update on my Mac as there are no work units. I did get 2 work units one day, but they ran without my noticing, so I couldn't turn on the graphics.
ID: 2746 · Report as offensive    Reply Quote
Chu
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 2748 - Posted: 30 Jan 2007, 21:05:25 UTC - in response to Message 2746.  

Now it is updated on Rosetta@Home and you will get plenty of WUs to crunch. Just be aware that there is still a minor unsolved problem on Mac platforms. See here
It isn't possible to test this update on my Mac as there are no work units. I did get 2 work units one day, but they ran without my noticing, so I couldn't turn on the graphics.

ID: 2748 · Report as offensive    Reply Quote
Rhiju
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2749 - Posted: 31 Jan 2007, 2:53:59 UTC - in response to Message 2748.  

Work units of the form

s018__CASP7_ASSEMBLE_SAVE_ALL_OUT_hom001__IGNORE_THE_REST_s018__BOINC_LOOP_RELAX__1446_0.clean.out.2

are acting a little wacky -- I'm working on the fix!

Now it is updated on Rosetta@Home and you will get plenty of WUs to crunch. Just be aware that there is still a minor unsolved problem on Mac platforms. See here
It isn't possible to test this update on my Mac as there are no work units. I did get 2 work units one day, but they ran without my noticing, so I couldn't turn on the graphics.



ID: 2749 · Report as offensive    Reply Quote
Profile Conan
Avatar

Send message
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 2750 - Posted: 31 Jan 2007, 10:24:55 UTC

> Had this one fail. I was not at the computer, so I did not operate the BOINC screensaver; I am still using the standard Windows one. All others have progressed with no trouble so far.

http://ralph.bakerlab.org/result.php?resultid=411601

<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# random seed: 2755617


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00681A55 read attempt to address 0x7BCFB090

Engaging BOINC Windows Runtime Debugger...
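
For readers matching the two numbers in the report above: the signed exit code -1073741819 and the hexadecimal 0xc0000005 are the same 32-bit value, the Windows access-violation status code. A minimal Python sketch of the conversion follows; the helper name `exit_code_to_hex` is made up for illustration and is not part of BOINC or Rosetta.

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # prints 0xc0000005
```

This only shows that the two values in the log agree; it says nothing about what caused the access violation.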
ID: 2750 · Report as offensive    Reply Quote
Profile Bober [B@P]

Send message
Joined: 18 Jun 06
Posts: 6
Credit: 15,427
RAC: 0
Message 2751 - Posted: 31 Jan 2007, 11:04:16 UTC - in response to Message 2750.  

I had two WUs with the same error:
result 411470
result 411652

ID: 2751 · Report as offensive    Reply Quote
tallguy-13088
Avatar

Send message
Joined: 17 Feb 06
Posts: 10
Credit: 121,701
RAC: 0
Message 2752 - Posted: 1 Feb 2007, 1:33:58 UTC

Hello,

I just aborted two RALPH work Units. They were:

s018__CASP7_ASSEMBLE_SAVE_ALL_OUT_hom001__IGNORE_THE_REST_s018__BOINC_LOOP_RELAX__1446_0.clean.out.1_1670_3

- and -

s018__CASP7_ASSEMBLE_SAVE_ALL_OUT_hom001__IGNORE_THE_REST_s018__BOINC_LOOP_RELAX__1446_0.clean.out.2_1670_3

Both were at 100%, PRE-EMPTED and still accumulating time while other projects were active. Earlier this evening, both had accumulated 10+ hours apiece. Upon restarting BOINC Manager (v5.4.9), unit #1 dropped back to 5.442% completion (at 49m 16s accumulated time) and the second went back to 10.442% completion (at 45m 09s accumulated time). The graphics stated the second was in "stage assembly" for the process.

I am running W2K Build 2195 Service Pack 4 on dual Xeon 2.8 GHz cores. Ralph@Home was at 5.45. If there is any more info you need, please reply to this post. Thanks!
ID: 2752 · Report as offensive    Reply Quote
=Lupus=

Send message
Joined: 23 Sep 06
Posts: 4
Credit: 35,610
RAC: 0
Message 2754 - Posted: 1 Feb 2007, 7:24:59 UTC

Result 412972 same 0xc0000005 error. I was not even near the "show grafx" button! Good luck in bug-hunting,

=Lupus=
ID: 2754 · Report as offensive    Reply Quote
Profile Conan
Avatar

Send message
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 2755 - Posted: 1 Feb 2007, 7:25:54 UTC

> Got a different one this time.
It had got to 100.00% but the Boinc Manager said it was still running. So I checked in my System Monitor (I am using Linux on this Opteron 275 machine) and it said that 1 of my 4 cpus was at idle and the other 3 at 100%. This then changed with the idle cpu moving from cpu to cpu till all 4 were swapping the idle job around from core to core. It also still held 166 MB of memory.
I had to abort it, after which all cpus ran at 100% again.

This workunit https://ralph.bakerlab.org/workunit.php?wuid=364578
ID: 2755 · Report as offensive    Reply Quote
genes
Avatar

Send message
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 2757 - Posted: 1 Feb 2007, 11:40:19 UTC

I am still having problems when I display the graphics, notably when I enable the screensaver on [url=https://ralph.bakerlab.org/show_host_detail.php?hostid=2016]this computer{/url].

I currently have an ATI x850x graphics card installed, and the installed driver is 7-1_xp_dd_ccc_wdm_enu_40211 (catalyst version). Here is what I saw happen: the BOINC screensaver was running, and over time I saw the CPDN graphics, the QMC graphics, and either Ralph or Rosetta (both of which are 5.45). The last graphics I saw were from Ralph or Rosetta, then I came back and saw that the "VPU recover" feature had activated (display driver resets instead of hanging, and prepares a crash report for ATI). I allowed it to submit the report, and the Rosetta/Ralph WU did not crash, but finished normally, so I can't point you to the bad WU.

Later today I will put back the NVidia card that I also use with this machine (a GeForce FX5950), and see how that behaves.

ID: 2757 · Report as offensive    Reply Quote
genes
Avatar

Send message
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 2758 - Posted: 1 Feb 2007, 11:41:47 UTC

Rats. I typo'ed the link, and I can't edit it. I'll try again. It's this computer
ID: 2758 · Report as offensive    Reply Quote
genes
Avatar

Send message
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 2760 - Posted: 3 Feb 2007, 3:06:32 UTC

I have the NVidia card installed (it's a GeForce Fx5950), and I haven't seen any graphics problems since, either with Ralph or Rosetta. I'm using driver version 93.71. So much for ATI.

ID: 2760 · Report as offensive    Reply Quote
Viromancy

Send message
Joined: 20 Jan 07
Posts: 7
Credit: 1,425
RAC: 0
Message 2761 - Posted: 3 Feb 2007, 20:10:39 UTC

Well, after all the head scratching in the thread above, it seems I've finally managed to crack the Rosetta Weirdness on my machine. And in some respects it's obvious, while in others it's baffling. It seems I managed to pick a totally borderline overclock setting for my 2.66 GHz C2D. Every other application and BOINC client program ran at 3.46 GHz without any problem, and that included all the overclock stress-test applications I ran. Apparently, though, Rosetta from mid 5.43 onwards doesn't.

So after going mad when 5.45 didn't work, I tried dropping the effective clock to 3.40 GHz and Vcore down to 5.125V. Rosetta ver 5.45 now appears to be totally stable for a 1.7% reduction in overclocked processor speed, right up to a 24 hour WU timing. I think I can live with that :-) Bloody peculiar, though. Maybe Rosetta should get prepared for being used as an OC stability check, because nothing else showed any effect; though admittedly I didn't try computing prime numbers for 12 hours...
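
The 1.7% figure can be checked from the two clock speeds quoted above. A quick sketch in Python; the function name `percent_reduction` is hypothetical and used only for this illustration.

```python
def percent_reduction(old_ghz: float, new_ghz: float) -> float:
    """Relative drop from old_ghz to new_ghz, expressed in percent."""
    return (old_ghz - new_ghz) / old_ghz * 100.0


# Clock speeds taken from the post: 3.46 GHz before, 3.40 GHz after.
reduction = percent_reduction(3.46, 3.40)
print(f"{reduction:.1f}%")  # prints 1.7%
```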
ID: 2761 · Report as offensive    Reply Quote
zombie67 [MM]
Avatar

Send message
Joined: 8 Aug 06
Posts: 75
Credit: 2,396,363
RAC: 6,299
Message 2763 - Posted: 5 Feb 2007, 4:32:32 UTC

Please turn on RAC decay.

http://boinc.berkeley.edu/project_tasks.php
Reno, NV
Team: SETI.USA
ID: 2763 · Report as offensive    Reply Quote
Profile Conan
Avatar

Send message
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 2764 - Posted: 5 Feb 2007, 20:36:19 UTC

> What happened to the crediting system?
It is back to "what you get is what you claim". I checked one person's work units and he is getting up to 50 credits an hour (398 credits on an 8-hour WU with not that many decoys done) on the latest batch. Sure beats the 14 to low 20s that I get for my 6 hours of processing per WU.
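
As a rough check of the rates mentioned above, here is a small Python sketch. The function name `credits_per_hour` is an assumption for illustration, and the second call uses only the lower bound (14 credits) of the range quoted in the post.

```python
def credits_per_hour(credits: float, hours: float) -> float:
    """Average credit granted per hour of processing."""
    return credits / hours


# 398 credits on an 8-hour work unit, as quoted in the post.
print(credits_per_hour(398, 8))  # prints 49.75

# Lower bound of the poster's own range: 14 credits for 6 hours.
print(round(credits_per_hour(14, 6), 2))  # prints 2.33
```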
ID: 2764 · Report as offensive    Reply Quote
Profile feet1st

Send message
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2780 - Posted: 7 Feb 2007, 14:38:48 UTC

I just got one of these WUs:
1who__BOINC_ABINITIO_CONTROL2__1749_26_0 using rosetta_beta version 545
the graphic doesn't show the sidechains.
ID: 2780 · Report as offensive    Reply Quote
Profile Conan
Avatar

Send message
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 2782 - Posted: 8 Feb 2007, 0:37:24 UTC

> Just had 4 Work Units fail, all at 1 hour processing time, I am expecting the other 2 to fail as well.
All the work units got 'stuck' and the Watchdog says it ended the run, but this is not the case.
All 4 work units on the Boinc Manager said that they were still running with NO CPU usage but still using up to 308 MB of RAM for each WU. All 4 got to 1 hour (my preferences are for 6 hours) and then said they were 100% complete but the WU did not release the CPU to go to another task.

http://ralph.bakerlab.org/result.php?resultid=420621
http://ralph.bakerlab.org/result.php?resultid=420709
http://ralph.bakerlab.org/result.php?resultid=420761
http://ralph.bakerlab.org/result.php?resultid=420767

Thanks
ID: 2782 · Report as offensive    Reply Quote
Chu
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 2783 - Posted: 8 Feb 2007, 4:56:57 UTC - in response to Message 2782.  

Sounds like some problem interfacing with the BOINC manager. Those WUs themselves are fine, and several of those you killed actually showed that they were stuck at score 0, which means this did not happen in the middle of a simulation. Next time, could you please close the BOINC manager and re-open it to see if any of these WUs finish and get reported? If that does not help, then go ahead and kill them. In addition, it seems to be specific to your Linux hosts, but not Windows, right?

> Just had 4 Work Units fail, all at 1 hour processing time, I am expecting the other 2 to fail as well.
All the work units got 'stuck' and the Watchdog says it ended the run, but this is not the case.
All 4 work units on the Boinc Manager said that they were still running with NO CPU usage but still using up to 308 MB of RAM for each WU. All 4 got to 1 hour (my preferences are for 6 hours) and then said they were 100% complete but the WU did not release the CPU to go to another task.

http://ralph.bakerlab.org/result.php?resultid=420621
http://ralph.bakerlab.org/result.php?resultid=420709
http://ralph.bakerlab.org/result.php?resultid=420761
http://ralph.bakerlab.org/result.php?resultid=420767

Thanks

ID: 2783 · Report as offensive    Reply Quote
Chu
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 2784 - Posted: 8 Feb 2007, 4:59:03 UTC - in response to Message 2780.  

In the early stage of some simulations, we carry out a low-resolution search, and thus sidechains will not be shown. Usually the first box will show either "search backbone" (no sidechains) or "search_full_atom" (with sidechains).
I just got one of these WUs:
1who__BOINC_ABINITIO_CONTROL2__1749_26_0 using rosetta_beta version 545
the graphic doesn't show the sidechains.

ID: 2784 · Report as offensive    Reply Quote



©2024 University of Washington
http://www.bakerlab.org