Bug reports for Ralph 5.42 and 5.43

darknightcl
Joined: 21 Dec 06
Posts: 3
Credit: 36
RAC: 0
Message 2652 - Posted: 21 Dec 2006, 4:00:24 UTC

K, so I've been running Rosetta for over a year now, and only recently started having problems with the project. For instance, roughly 75% of the work units I download crash out with the funny graphics errors.

One thing that I've noticed that I haven't seen anyone else mention is this: sometimes the model displayed seems to "disconnect" at some point along the backbone. For example, it will be a nice continuous C-alpha trace (I assume this is a C-alpha trace of the protein), then it will suddenly have a break in it, like the protein has been cleaved with a peptidase. The two ends will sometimes wave around in a manner that is just not consistent with them still being connected, so it almost looks like there is a problem with the science code, like it is taking liberties (such as introducing breaks where it is convenient to have them) with the sequence, which doesn't seem probable, as that would undermine the science being done. The other possibility might be that there is a problem with the communication between the graphics code and the science code, but I'm not a programmer and cannot make any suggestions with regard to this.

I'm running version 5.43 of the Rosetta application, with CC 5.4.11 (the official release version). I have an AMD64 X2 4200+ with 1GB DDR 400 RAM, GeForce 6150 integrated graphics, the latest DirectX (9c), and Windows XP Pro SP2 (32-bit) with all the necessary patches.

I'm not sure if this helps, but I figured I'd mention it.

darknightcl
Joined: 21 Dec 06
Posts: 3
Credit: 36
RAC: 0
Message 2653 - Posted: 21 Dec 2006, 4:34:42 UTC

Another thought... One of the posts by Chu was mentioning that there is no locking mechanism in place to prevent the science thread and the graphics thread from trying to access the same memory at the same time, which can cause a problem if it occurs (or at least that is what I understand the post to mean).

If it is the case that the current problem is caused by the graphics and science threads conflicting in this manner, wouldn't we have started seeing this problem a long time ago, like when graphics were first introduced? Why has the problem only started cropping up now?

Just a thought...
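
To make the locking idea concrete, here is a minimal sketch in Python (not Rosetta's actual code, which I haven't seen) of a "science" thread and a "graphics" thread sharing one set of coordinates. Every name in it (`SharedModel`, `science_step`, `graphics_snapshot`, `run_demo`) is made up for illustration. The only point is that holding a lock while writing and while copying means the reader always gets a frame from a single step, never a half-updated one.

```python
import threading


class SharedModel:
    """Coordinates shared by a hypothetical science thread and graphics thread."""

    def __init__(self, n_atoms):
        self._lock = threading.Lock()
        self._coords = [(0.0, 0.0, 0.0)] * n_atoms

    def science_step(self, step):
        # Writer: update every atom in place while holding the lock,
        # so a reader can never see a mix of old and new positions.
        s = float(step)
        with self._lock:
            for i in range(len(self._coords)):
                self._coords[i] = (s, s, s)

    def graphics_snapshot(self):
        # Reader: copy under the lock, then draw from the private copy.
        with self._lock:
            return list(self._coords)


def run_demo(n_atoms=100, n_steps=1000):
    """Run both threads; return (number of inconsistent frames, last coordinate)."""
    model = SharedModel(n_atoms)
    torn_frames = 0

    def science():
        for step in range(1, n_steps + 1):
            model.science_step(step)

    worker = threading.Thread(target=science)
    worker.start()
    while worker.is_alive():
        frame = model.graphics_snapshot()
        # In a consistent frame every atom comes from the same step.
        if len(set(frame)) != 1:
            torn_frames += 1
    worker.join()
    return torn_frames, model.graphics_snapshot()[0]


if __name__ == "__main__":
    print(run_demo())
```

If the two `with self._lock:` lines are taken out, the reader may occasionally copy the list while the writer is halfway through its loop. How often that happens depends on timing, which is one way a problem like this could stay hidden for a long time and then show up when the hardware or the workload changes.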

FluffyChicken
Joined: 17 Feb 06
Posts: 54
Credit: 710
RAC: 0
Message 2654 - Posted: 21 Dec 2006, 14:20:12 UTC - in response to Message 2653.  

Another thought... One of the posts by Chu was mentioning that there is no locking mechanism in place to prevent the science thread and the graphics thread from trying to access the same memory at the same time, which can cause a problem if it occurs (or at least that is what I understand the post to mean).

If it is the case that the current problem is caused by the graphics and science threads conflicting in this manner, wouldn't we have started seeing this problem a long time ago, like when graphics were first introduced? Why has the problem only started cropping up now?

Just a thought...


- The more common usage of dual core processors today
- The increased level of complexity in the graphics (the sidechains). This part of the 'theory of graphics crashes' coincides quite happily with the release of the docking program and the increased graphics display.
- More active people reporting things on the forum (due to more R@H members overall)

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2655 - Posted: 21 Dec 2006, 16:27:40 UTC

...so it almost looks like there is a problem with the science code, like it is taking liberties (such as introducing breaks where it is convenient to have them) with the sequence, which doesn't seem probable, as that would undermine the science being done.


Yes! They have a "jumping" algorithm you will see used on some tasks. It does just that: it breaks the chain at what are believed to be pivotal points and then searches around for what the correct reconnection of the two points might be.

See discussion of Dr. Baker's journal and his original journal entry which started the discussion.

I don't understand it all. But basically it is not a symptom of a graphics problem; it is a visual cue that Rosetta's efficiency in finding the structure is improving.

darknightcl
Joined: 21 Dec 06
Posts: 3
Credit: 36
RAC: 0
Message 2656 - Posted: 21 Dec 2006, 16:50:41 UTC - in response to Message 2654.  


- The more common usage of dual core processors today
- The increased level of complexity in the graphics (the sidechains). This part of the 'theory of graphics crashes' coincides quite happily with the release of the docking program and the increased graphics display.
- More active people reporting things on the forum (due to more R@H members overall)


K, but is it only dual core processors that are having this problem? I have noticed a low rate of work unit failure on another computer, which is a single cored 64-bit AMD processor, but I'm never watching closely enough to see if it is this same graphics failure, or some other problem (though I'll admit the problems did seem to stop with 5.43, but I also don't have any work unit history for this computer, it has been concentrating almost exclusively on CPDN for the last week or so, I'll go see if I can crash work units on it later).

Also, previous versions of rosetta were stable on my Athlon X2, or at least the crash rate was low enough that I didn't notice it. I believe the last stable release was 5.37 or something like that. If memory serves you could rotate and zoom on a molecule in 5.37, with no problems. Essentially, you've now reduced the level of the graphics complexity to below that of 5.37, and my computer is still crashing almost all of its work units. Right now I have one which hasn't done anything for about 40 minutes, but the time counter continues to increment, it is like the science code has stalled. I'll leave it to see if the watchdog kicks in. The important point is that I don't think (I'm not certain on this point) the graphics had been displayed at all. I'd noticed that the time remaining estimate was going up, not down, and decided to check on it.

My point is that I didn't start noticing work unit failures until release 5.41, and these failures occur on computers other than my dual-core X2.

In case anyone is curious I have stress tested my computer using Prime95, both cores (separately) with no problems. I've also tested my RAM using Memtestx86, the most recent version, again, no problems.

FluffyChicken
Joined: 17 Feb 06
Posts: 54
Credit: 710
RAC: 0
Message 2657 - Posted: 22 Dec 2006, 10:52:43 UTC - in response to Message 2656.  
Last modified: 22 Dec 2006, 11:38:35 UTC


- The more common usage of dual core processors today
- The increased level of complexity in the graphics (the sidechains). This part of the 'theory of graphics crashes' coincides quite happily with the release of the docking program and the increased graphics display.
- More active people reporting things on the forum (due to more R@H members overall)


K, but is it only dual core processors that are having this problem? I have noticed a low rate of work unit failure on another computer, which is a single cored 64-bit AMD processor, but I'm never watching closely enough to see if it is this same graphics failure, or some other problem (though I'll admit the problems did seem to stop with 5.43, but I also don't have any work unit history for this computer, it has been concentrating almost exclusively on CPDN for the last week or so, I'll go see if I can crash work units on it later).

Also, previous versions of rosetta were stable on my Athlon X2, or at least the crash rate was low enough that I didn't notice it. I believe the last stable release was 5.37 or something like that. If memory serves you could rotate and zoom on a molecule in 5.37, with no problems. Essentially, you've now reduced the level of the graphics complexity to below that of 5.37, and my computer is still crashing almost all of its work units. Right now I have one which hasn't done anything for about 40 minutes, but the time counter continues to increment, it is like the science code has stalled. I'll leave it to see if the watchdog kicks in. The important point is that I don't think (I'm not certain on this point) the graphics had been displayed at all. I'd noticed that the time remaining estimate was going up, not down, and decided to check on it.

My point is that I didn't start noticing work unit failures until release 5.41, and these failures occur on computers other than my dual-core X2.

In case anyone is curious I have stress tested my computer using Prime95, both cores (separately) with no problems. I've also tested my RAM using Memtestx86, the most recent version, again, no problems.



See option number 2,
It started (or was noticed a lot more) when the docking code came into it.

The part about dual core/HT is that it is just more susceptible to the desynchronisation happening. I have also had a rare few fail on my P-M and Athlon64, without graphics open, but it is nothing like what HT/dual-core people who play with graphics are reporting.

If they were really smart about it (they being Rosetta@home), they would put a tick box in the preferences to say 'I do not want graphics', and then they could send the person a version with all the graphics ripped out of it. This often speeds up processing a touch (it does slightly at SETI) and decreases the size of the program along with the running memory requirements.

Personally I would love that option.

Trog Dog
Joined: 8 Aug 06
Posts: 38
Credit: 41,996
RAC: 0
Message 2660 - Posted: 3 Jan 2007, 7:09:16 UTC

Problem wu here

Silver Streak
Joined: 11 Dec 06
Posts: 5
Credit: 216,369
RAC: 0
Message 2665 - Posted: 11 Jan 2007, 3:36:05 UTC

I had over 40 WUs error out in the last 2 hrs!

Pieface
Joined: 16 Feb 06
Posts: 64
Credit: 203,513
RAC: 0
Message 2666 - Posted: 11 Jan 2007, 4:05:18 UTC

Likewise, tons of them, but at least they are going quickly, like in 90 secs or so.

anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2667 - Posted: 11 Jan 2007, 4:09:37 UTC

All seem to fail with this error message

" - exit code -1073741819 (0xc0000005) "

Anders n
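
The two numbers in that message are the same 32-bit value: the exit code is shown as a signed integer, and 0xc0000005 is its unsigned hexadecimal form, which Windows defines as STATUS_ACCESS_VIOLATION. A short Python sketch of the conversion, with helper names made up for illustration:

```python
def exit_code_to_hex(code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x{:08x}".format(code & 0xFFFFFFFF)


def hex_to_exit_code(text):
    """Convert a hex status such as '0xc0000005' back to the signed 32-bit form."""
    value = int(text, 16)
    return value - 0x100000000 if value >= 0x80000000 else value


print(exit_code_to_hex(-1073741819))   # prints 0xc0000005
print(hex_to_exit_code("0xc0000005"))  # prints -1073741819
```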

Papagiorgio
Joined: 2 Nov 06
Posts: 3
Credit: 26,100
RAC: 0
Message 2668 - Posted: 11 Jan 2007, 7:34:24 UTC

This result errored out the same way.

Matthias

Pepo
Joined: 8 Sep 06
Posts: 104
Credit: 36,890
RAC: 0
Message 2669 - Posted: 11 Jan 2007, 17:48:28 UTC

The same on my hosts, like
Windows here: exit code -1073741819 (0xc0000005), Reason: Access Violation (0xc0000005) at address 0x0066C28D read attempt to address 0x0405FF98 (with full BOINC Windows Runtime Debugger symbolic output),
or Linux here: Maximum disk usage exceeded, segmentation violation, with numeric Stack trace (12 frames).

Peter

Chu
Volunteer moderator
Project developer
Project scientist
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 2670 - Posted: 11 Jan 2007, 19:36:19 UTC - in response to Message 2651.  

I just posted it here. Sorry for the delay.
Chu,


Could you put that problem summary in the 'technical news' at the Rosetta@home site?

It would give people a definite place to see what the problem is, and it would also mean forum helpers could post a link to the news when the errors are happening.

Chu
Volunteer moderator
Project developer
Project scientist
Joined: 26 Sep 06
Posts: 61
Credit: 12,545
RAC: 0
Message 2671 - Posted: 11 Jan 2007, 19:40:03 UTC - in response to Message 2669.  

A bad batch, I think, maybe with bad memory management...
The same on my hosts, like
Windows here: exit code -1073741819 (0xc0000005), Reason: Access Violation (0xc0000005) at address 0x0066C28D read attempt to address 0x0405FF98 (with full BOINC Windows Runtime Debugger symbolic output),
or Linux here: Maximum disk usage exceeded, segmentation violation, with numeric Stack trace (12 frames).

Peter

darkpella
Joined: 30 Mar 06
Posts: 4
Credit: 15,691
RAC: 0
Message 2672 - Posted: 12 Jan 2007, 8:37:22 UTC

Hi,

when ralph is suspended by the BOINC core client to let another task run:

11/01/2007 20.38.31|ralph@home|Pausing task 1mkyA_TREEJUMP_ABRELAX__NEWRELAXFLAGS_LARS_TOP2_BARCODE__1607_25_0 (removed from memory)


it is neither preempted nor stopped (i.e. it still runs at full power), hence absorbing lots of CPU cycles from the task that should be running.

I tried forcing the BOINC core client to run ralph (by suspending every other task) and then letting it switch again to another task (simply resuming all other tasks, since it switched to EDF), but that didn't stop "rosetta_beta_5." from crunching at about 70% of the CPU time.

I also tried suspending ralph as a project, but it didn't work either.

I will try rebooting in a while to see what happens and let you know.
Should you need any information before I reboot, let me know ASAP.
I'm running Win 2000 SP4 on a P4 at 2.53 GHz.
BOINC version is 5.4.11.
The rosetta_beta version running is 5.43.

bye

darkpella

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 2673 - Posted: 12 Jan 2007, 12:47:49 UTC

Had 21 WUs fail for various reasons; none should be screensaver related as I no longer run it.

Maximum disk usage exceeded, WU stuck, Incorrect fragment size requested for Phi alignment
https://ralph.bakerlab.org/workunit.php?wu=338885

Maximum disk usage exceeded, WU stuck, SIGSEGV:Segmentation Violation
https://ralph.bakerlab.org/workunit.php?wu=384950

Exited with code 1
ERROR:Exit at:loop_relax.cc line:1798
https://ralph.bakerlab.org/workunit.php?wu=382149
https://ralph.bakerlab.org/workunit.php?wu=382150

Exited with code 1
Incorrect Function, ERROR:Exit at:.read_aa_ss.cc line:559
https://ralph.bakerlab.org/workunit.php?wu=382209

Exit Code -1073741819 Access Violation
https://ralph.bakerlab.org/workunit.php?wu=382799, 382800, 382872, 382875, 383015, 383016, 383146, 383148, 383298, 383352, 383405, 383406, 383459, 383460, 384728.

Computers are Opteron 275 (Linux), Opteron 285 (Linux) and 4800+ (Windows).
Only the Windows machine has a screensaver running, but it is not the BOINC screensaver, so this does not appear to be related to the graphics problem. A faulty batch? Testing what exactly?

[B^S] Dr. Bill Skiba
Joined: 15 Feb 06
Posts: 4
Credit: 6,496
RAC: 0
Message 2674 - Posted: 12 Jan 2007, 17:28:09 UTC
Last modified: 12 Jan 2007, 18:10:14 UTC

https://ralph.bakerlab.org/result.php?resultid=382713

A new one, at least for me.

Work unit would not preempt. Kept on running even though it said it was preempted and boinc manager said another wu from another project was running.

Ralph unit kept counting up both on cpu time and time to finish while saying it was preempted. BM said a uFluids wu was running, but cpu time stayed at zero and task manager showed Ralph using the cycles.

I aborted it.

anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2675 - Posted: 15 Jan 2007, 15:09:39 UTC

Errors to report

https://ralph.bakerlab.org/result.php?resultid=387680
https://ralph.bakerlab.org/result.php?resultid=387679

ERROR:: Exit at: .rotamer_functions.cc line:1441

And

https://ralph.bakerlab.org/result.php?resultid=387637

<file_name>H4H6_1lis_PAIRWISE_DOCK_MCM_1619_4_0_0</file_name>
<error_code>-161</error_code>

Anders n

Silver Streak
Joined: 11 Dec 06
Posts: 5
Credit: 216,369
RAC: 0
Message 2676 - Posted: 15 Jan 2007, 16:32:07 UTC
Last modified: 15 Jan 2007, 17:22:29 UTC

I had a few of these also; they occurred overnight. They seem to have run a normal length of time before the error occurred.

</stderr_txt>
<message>
<file_xfer_error>
<file_name>H1H7_1lis_PAIRWISE_DOCK_MCM_1619_1_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
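
For anyone collecting these from several results, here is a minimal Python sketch for pulling the file name and error code out of a fragment like the one above. It uses a regular expression rather than an XML parser because the pasted text starts with a stray closing tag and is not a complete document. The function name `find_transfer_errors` is hypothetical.

```python
import re

FRAGMENT = """</stderr_txt>
<message>
<file_xfer_error>
<file_name>H1H7_1lis_PAIRWISE_DOCK_MCM_1619_1_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>"""


def find_transfer_errors(text):
    """Return (file_name, error_code) pairs, one per <file_xfer_error> block."""
    pattern = re.compile(
        r"<file_xfer_error>\s*"
        r"<file_name>(.*?)</file_name>\s*"
        r"<error_code>(-?\d+)</error_code>",
        re.DOTALL,
    )
    return [(name, int(code)) for name, code in pattern.findall(text)]


print(find_transfer_errors(FRAGMENT))
# prints [('H1H7_1lis_PAIRWISE_DOCK_MCM_1619_1_0_0', -161)]
```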

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2677 - Posted: 15 Jan 2007, 18:16:47 UTC - in response to Message 2676.  

Great, I think I know what the problem is, and I'm sending them back out with the potential fix.

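I had a few of these also; they occurred overnight. They seem to have run a normal length of time before the error occurred.

</stderr_txt>
<message>
<file_xfer_error>
<file_name>H1H7_1lis_PAIRWISE_DOCK_MCM_1619_1_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>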