Posts by Chu

21) Message boards : RALPH@home bug list : Bug Reports for 5.44 (Message 2687)
Posted 21 Jan 2007 by Chu
Post:
This update adds some new Rosetta applications, such as a preliminary version of the Rosetta protein design protocol and a special Rosetta docking protocol that handles symmetric oligomers. The primary developers of those protocols will post more details about their applications. Please note that we are still working on adding thread synchronization to the Rosetta graphics, and we are sorry that this update DOES NOT fix the graphics-related problem.
22) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2671)
Posted 11 Jan 2007 by Chu
Post:
A bad batch, I think, maybe with bad memory management...
The same on my hosts, like
Windows here: exit code -1073741819 (0xc0000005), Reason: Access Violation (0xc0000005) at address 0x0066C28D read attempt to address 0x0405FF98 (with full BOINC Windows Runtime Debugger symbolic output),
or Linux here: Maximum disk usage exceeded, segmentation violation, with numeric Stack trace (12 frames).

Peter

23) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2670)
Posted 11 Jan 2007 by Chu
Post:
I just posted it here. Sorry for the delay.
Chu,


Could you put that problem summary in the 'technical news' at the Rosetta@home site?

It would give people a definite place to see what the problem is, and it would also mean forum helpers could post a link to the news when the errors are happening.

24) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2646)
Posted 19 Dec 2006 by Chu
Post:
We suspect it is a thread synchronization problem. The Rosetta working thread runs the simulation, which changes all the atom coordinates (saved in shared memory), while the graphics thread reads from the same memory to draw the graphics or screensaver. Currently there is no locking mechanism to ensure that the shared memory is accessed by only one thread at a time, so the two threads can conflict or corrupt memory and trigger an error. On one of our local computers, with the screensaver or graphics turned on, errors occurred at a rate of at least one per day on average; without any graphics, it ran flawlessly. The errors observed include crashing (0xc0000005), hanging (0x40010004) and being stuck (ended by the watchdog). None of the errors were reproducible with the same random number seeds, which we think is due to the randomness in the graphics process. Another piece of indirect evidence is that showing sidechains requires accessing shared memory more often and more intensively: after turning off sidechains and rotation, the graphics error rate drops, although the problem is not solved completely. So there seems to be a correlation between the two.

Anyway, our plan is to add a thread locking mechanism in the next release to see if this helps. This will probably happen after the holiday season. I believe the new BOINC 5.8.x should also help to reduce the error rate. Thanks to everyone for helping test this issue.
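
The kind of locking described above can be sketched as follows. This is only a minimal illustration in Python, not the Rosetta code: the class `SharedCoordinates` and the function `run_demo` are hypothetical names, and a plain list stands in for the shared memory. The worker updates the coordinates atom by atom while holding a lock, and the graphics thread copies them under the same lock before drawing.

```python
import threading


class SharedCoordinates:
    """Hypothetical stand-in for the shared-memory atom coordinates."""

    def __init__(self, n_atoms):
        self._lock = threading.Lock()
        self._coords = [(0.0, 0.0, 0.0)] * n_atoms

    def write_all(self, coords):
        # Worker thread: update atom by atom while holding the lock,
        # so a reader never sees a half-updated structure.
        with self._lock:
            for i, xyz in enumerate(coords):
                self._coords[i] = xyz

    def snapshot(self):
        # Graphics thread: copy under the lock, draw from the copy later.
        with self._lock:
            return list(self._coords)


def run_demo(n_atoms=100, n_steps=200):
    """Run one writer and one reader; return the number of torn frames."""
    shared = SharedCoordinates(n_atoms)
    torn_frames = []

    def worker():
        for step in range(1, n_steps + 1):
            value = float(step)
            shared.write_all([(value, value, value)] * n_atoms)

    def renderer():
        for _ in range(n_steps):
            frame = shared.snapshot()
            # A consistent frame holds coordinates from a single step.
            if len(set(frame)) != 1:
                torn_frames.append(frame)

    threads = [threading.Thread(target=worker),
               threading.Thread(target=renderer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(torn_frames)
```

Copying under the lock and drawing from the copy keeps the lock held only briefly, so the simulation thread is not blocked for the whole time a frame is being drawn.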
25) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2630)
Posted 15 Dec 2006 by Chu
Post:
Hi gene, that job just crashed and did not freeze your computer, right? From users' reports and my local tests, it looks like a frozen WU that is forcibly terminated reports exit code 1073807364 (0x40010004), while a WU that crashes on its own without freezing the host computer reports exit code -1073741819 (0xc0000005).
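
For anyone matching the decimal and hex forms of these codes in their logs, here is a small Python sketch. The function names are made up for illustration, and the descriptions are just the two cases mentioned above. A negative exit code is the same 32-bit value read as a signed number, so masking it to 32 bits gives the hex form.

```python
def to_unsigned32(exit_code):
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code):
    """Return the hex form of an exit code plus the case it matches, if any."""
    code = to_unsigned32(exit_code)
    # Descriptions follow the two cases discussed in this thread only.
    known = {
        0xC0000005: "crash (access violation)",
        0x40010004: "frozen WU that was forcibly terminated",
    }
    return "0x{:08x}: {}".format(code, known.get(code, "unknown"))


print(describe_exit_code(-1073741819))
print(describe_exit_code(1073807364))
```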
I had a WU fail today, this message was in the log:

12/14/2006 9:14:07 PM|ralph@home|Unrecoverable error for result 1ten__BOINC_POSE_ABRELAX_VARY_ALL_BOND_ANGLES_VARY_ALL_BOND_DISTANCES_NEWRELAXFLAGS_frags83__1561_15_0 ( - exit code -1073741819 (0xc0000005))

This result: resultid=362757

I came back to the computer and had a Windows error message on the screen "Please tell Microsoft about this problem..." . I don't know if graphics were involved, since I was out, however I do have graphics enabled on this machine, and it is a multiprocessor machine. hostid=2016

26) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2629)
Posted 15 Dec 2006 by Chu
Post:
Sidechains, zooming and rotating have been disabled in the current application to help us narrow down the cause of the graphics crash. So it is normal that you cannot do anything on the screen, and since Rosetta spends most of its time in high-resolution refinement (moving the backbone a little and refining sidechains), it is also normal to see hardly any changes on the screen. However, the step number and CPU time should change frequently to show that the WU is still alive. If the WU is still working on generating its first model, the progress stays at 1% for a while.

Not sure about the slow graphics updating. Do you have other Windows applications running at the same time that also share CPU, memory and other resources?
I've got a live one!

I can't do anything with the graphic, no rotate, zoom etc. therefore no crash, as per all these "beta / test" WU's.
but
when viewing the graphic, very slow to show the image, black screen for a few sec then very slow to update the graphic.

WU at 1% but has taken 50min of cpu.

then go back to boinc when typing this message, and the WU shows 1.5% complete and 4 min of cpu time.

it didn't crash though.

boinc 5.75
rosetta 5.43
ralph rosetta_beta 5.43

27) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2617)
Posted 14 Dec 2006 by Chu
Post:
Did it freeze (so that you had to kill it manually) or just crash by itself? Thanks.

We are trying to increase stability in this release... We have turned off mouse rotation and sidechains temporarily. Please let us know if you can force a crash by playing with the "show graphics" option from the boinc manager, or with your screensaver!


If I open the show graphics window then minimize and re-open the window is just black and then crashes

28) Message boards : RALPH@home bug list : Bug reports for Ralph 5.42 and 5.43 (Message 2607)
Posted 13 Dec 2006 by Chu
Post:
I just put some docking WUs on Ralph for a graphics stability test; let's see what comes out of that.
Well, I'm now back to my problem PC, and was about to blow the train whistle --Whoo hooo!-- when I saw my screensaver MOVING, and it was on model 94. But then I noticed it only crunched the first WU for almost exactly 3hrs. Now that I've updated to project it shows watchdog ended it. And I commonly saw that same symptom on Rosetta once I activated the screensaver on the same host.

Because the watchdog ended it, the messages tab just shows that the very long WU name "finished".

The second WU I received is crunching happily, now into its 10th hour. I could NEVER have lasted that long on v5.41. So, I can definitely say "more stable". It is supposed to run for 24hrs, and that means if all goes well it will complete just after I leave for work in the AM.

I should note that my hyperthreaded CPU is set in BOINC to only use 1 at the moment. That measure did not seem to make Rosetta any more stable. Now I should set it back to two, but the only other work I have is for Rosetta :) and if I have both running, I don't know how BOINC chooses the screensaver, but I think I'll botch my test.

The "docking" WUs seemed to be the most graphically intense. Have some of those been queued up for test here?

29) Message boards : RALPH@home bug list : Bug Report for Ralph 5.41 (Message 2574)
Posted 3 Dec 2006 by Chu
Post:
I think those errors (exit 161) come from a batch of problematic WUs rather than from a general problem with the application, as we have seen before.
Old error 161 is still with us

http://ralph.bakerlab.org/result.php?resultid=331711

stderr out

<core_client_version>5.2.14</core_client_version>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 2845454
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures built 126 (nstruct) times
This process generated 126 decoys from 126 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message><file_xfer_error>
<file_name>CAPRI_11_t27j_SMALLPERTURBATION_DOCKING_1520_12_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

</message>

Validate state Invalid

30) Message boards : RALPH@home bug list : Bug Report for Ralph 5.41 (Message 2569)
Posted 1 Dec 2006 by Chu
Post:
Hi Krzychu P., thanks for reporting all these errors. If I remember correctly, I have also seen you report similar errors for previous versions of the Ralph applications and for other types of WUs. From the stderr output, it looks like the Rosetta simulations entered some bad conditions and triggered premature exits. Although we are not 100% sure what went wrong, it is most likely due to a corrupted database file. The WUs you have reported problems with seem to run fine on other clients' computers (on both Ralph and BOINC) with a fairly good success rate. In other words, this type of error seems to happen on your computer much more frequently than average, which leads me to suspect that your computer may have an issue handling those input database files. I am not an expert on computer hardware or BOINC setup; I just want to bring this to your attention. Do you also use this computer to run Rosetta@home? Have you seen similar errors for WUs from Rosetta@home? Have you noticed any other signs of a potential hardware or software problem? From the stderr file, I can see that your computer is running a non-English version of the operating system. Could that be the reason that some of the files are not read in correctly? Maybe there are other experts here who have a better idea.

Again, thank you for your contribution and support for our project.
After about 40 minutes of computing:

2006-12-01 08:07:29|ralph@home|Unrecoverable error for result 1mkyA_ETABLE_TEST_ABRELAX_rhh13sm6atrrep__1519_8_0 (Niepoprawna funkcja. (0x1) - exit code 1 (0x1))

<stderr_txt>
fullatom_setup.cc: CHANGING fa_max_dis to 6!!!!!!!!!!!!!
setting hydrogen_interaction_cutoff to: 6
# random seed: 2845518
# cpu_run_time_pref: 3600
fullatom_setup.cc: CHANGING fa_max_dis to 6!!!!!!!!!!!!!
setting hydrogen_interaction_cutoff to: 6
# cpu_run_time_pref: 3600
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] range
ERROR:: Exit at: .fullatom_energy.cc line:2002

31) Message boards : RALPH@home bug list : Bug Report for Ralph 5.41 (Message 2566)
Posted 30 Nov 2006 by Chu
Post:
The command line file was added for the project team. To test a lot of Rosetta parameters without changing the executable, we made them input arguments on the command line. One consequence is that the Rosetta command line becomes longer and longer, difficult to remember and difficult to set up (so more errors can slip through). The file is meant to help with that. In my personal opinion, this is a positive step, though there is still a long way to go, toward a friendlier control interface for Rosetta, such as a graphical interface with pull-down menus in the future.
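
To give a rough idea of what reading arguments from a file looks like, here is a minimal Python sketch. It is not how Rosetta does it: the function name `read_command_file`, the file layout (flags separated by whitespace, `#` starting a comment) and the flag names in the example are all assumptions made for illustration.

```python
import shlex


def read_command_file(text):
    """Turn the text of a command file into a flat argument list.

    Assumed layout: arguments separated by whitespace, possibly spread
    over several lines, with '#' starting a comment.
    """
    args = []
    for line in text.splitlines():
        # shlex handles quoting and drops the comment part of the line.
        args.extend(shlex.split(line, comments=True))
    return args


example = """
-example_flag 10      # hypothetical flag with a value
-other_flag           # hypothetical flag without a value
"""
print(read_command_file(example))
```

The resulting list can then be handled exactly like arguments typed on the command line, which is why a file of this kind is easier to check and reuse than one very long command.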

Sorry for not being clearer about the watchdog issue. The watchdog did stop WUs when a run was found to be stuck or running too long, and it did preserve the models that had been completed. However, the old behavior threw an error if no model had been generated before the run was caught by the watchdog; with the fix, this should no longer happen. The empty result file (not really empty, as it says it comes from a watchdog error) will be returned and recognized by the validator, and credit will be assigned.
Regarding the new feature to run Rosetta from a command file and a more flexible interface for setting up runs on BOINC...

I just wanted to confirm... this will make things more flexible for... the project team, right? Or if users can make use of this in some way, then of course we'll want to know how to utilize it and what it can do. If for the project team, another brief sentence about what this will enable you to do in the future would be nice. It will... ??? allow you to setup the runtime parameters for a set of WUs faster? and in a way that's less prone to error? ...or to produce more WUs on the server with less overhead? Or what?

=============

Also, I wasn't clear on what you fixed in item 3. Was the watchdog not stopping WUs? Or were the completed models not being properly preserved?

32) Message boards : RALPH@home bug list : Bug Report for Ralph 5.41 (Message 2560)
Posted 30 Nov 2006 by Chu
Post:
Ralph has been updated to 5.41. This update fixes several previously found bugs:

1. Checkpointing continued even after Rosetta had finished.
2. An "error" was reported when deleting some intermediate files after they were gzipped.
3. Watchdog failure: when a run is stuck and caught by the watchdog, the results, if there are any, will now be returned and validated. Credit will be assigned accordingly.
4. Some other bugs related to Rosetta science.

A new feature that reads the Rosetta command from an input file has been added, which gives a more flexible interface for setting up runs on BOINC.

Thanks for everyone's support and please report bugs here!
33) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2554)
Posted 23 Nov 2006 by Chu
Post:
Thanks. That WU is problematic.
One more:

2006-11-23 12:14:34|ralph@home|Unrecoverable error for result CAPRI_11_U1H2_GLOBAL_DOCKING_1510_39_2 (Niepoprawna funkcja. (0x1) - exit code 1 (0x1))

<core_client_version>5.6.5</core_client_version>
<![CDATA[
<message>
Niepoprawna funkcja. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Exit at: .docking.cc line:1825

It didn't even begin to compute :(

34) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2553)
Posted 23 Nov 2006 by Chu
Post:
That has been fixed in the current Rosetta code repository and will be included in the next Ralph release.
Pepo wrote:
Also the version 5.36 is checkpointing after reaching 100% (instead of reporting the result) and then being preempted by other apps afterwards (possibly for a longer time, because of negative STD).

Also the version 5.40 is checkpointing after reaching 100% (instead of reporting the result) and then being preempted by other apps afterwards (possibly for a longer time, if there is too negative STD).

Actually it happened on Rosetta, not here, but I assume the responsible code is the same.

Peter

35) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2539)
Posted 16 Nov 2006 by Chu
Post:
We finally found time to track down this harmless but confusing "warning" output. It is caused by a file stream not being closed properly after it is opened. The fix will be included in the next update.
FRA_2rio_RIO2_hom002_6_2rio_6_1a06__IGNORE_THE_REST_10_1499_12_0


WARNING! error deleting file .aa2rio.out

36) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2503)
Posted 8 Nov 2006 by Chu
Post:
Thank you all for the help. We are sorry that the file transfer bug was not completely fixed in 5.39 and that we had several updates in the last few days. The bug is very sneaky: it is only hit by a particular combination of command line flags, and only for some protein targets under certain conditions, which makes local debugging difficult. Anyway, we believe it is completely fixed in 5.40, as shown by our preliminary local tests. We will put more tests on RALPH soon to confirm the fix. Thanks again for your patience and generous support!
37) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2493)
Posted 7 Nov 2006 by Chu
Post:
Sure, I am happy to answer that. A normal output file for a Rosetta model contains, for each atom, the xyz position coordinates and some other necessary information. This file is just too large for a BOINC application, as there will be thousands and thousands of such files to handle. Therefore, Rosetta uses a clever trick to compress the output, called "silent_output". In this mode, only the variables (or degrees of freedom) of the simulation are output, and these are normally the backbone and sidechain torsion angles (phi, psi and chi). In this way, the size of the output file can be reduced at least 30-fold. However, this requires us to reconstruct the normal Rosetta output model files (with xyz positions in them) from these silent output files, and to do so we make a critical assumption: the covalent bond connecting any two atoms has an "ideal" value for its length. Similarly, the angle formed by any three connected atoms has its own ideal value too. Taking these ideal values together with the phi/psi/chi angles, we are able to restore the positions of all the atoms in the protein model, and we often refer to a structure with "ideal" bond lengths and angles as an "idealized structure". For ab initio prediction, the output model is always "idealized", as it is folded with "ideal" geometry and an optimal set of phi/psi/chi angles.
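
To illustrate the reconstruction step, here is a minimal Python sketch of placing one atom from the three atoms before it, given a bond length, a bond angle and a torsion angle. It shows the general geometric idea only and is not the Rosetta implementation; the function name `place_atom` and the numbers in the example are chosen for illustration. Repeating this placement along the chain, with ideal lengths and angles plus the stored torsions, rebuilds all the xyz positions.

```python
import math


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def _unit(p):
    norm = math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2)
    return (p[0] / norm, p[1] / norm, p[2] / norm)


def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d after atoms a, b, c (angles in radians).

    bond_length is the distance c-d, bond_angle is the angle b-c-d and
    torsion is the dihedral angle a-b-c-d.
    """
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))   # normal to the plane of a, b, c
    m = _cross(n, bc)                   # completes the local frame
    # Position of d in the local frame (bc, m, n) centered on c.
    x = -bond_length * math.cos(bond_angle)
    y = bond_length * math.sin(bond_angle) * math.cos(torsion)
    z = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))


# Illustrative numbers only: unit bond length, 90 degree bond angle,
# 180 degree torsion, so d ends up on the opposite side from a.
d = place_atom((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
               1.0, math.radians(90.0), math.radians(180.0))
print(d)
```

Because each atom needs only one length, one angle and one torsion, and the lengths and angles are taken as ideal constants, only the torsions have to be stored, which is where the size reduction comes from.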

However, there are also some other important tests that require starting from experimentally solved protein structures (native structures). Normally the bond geometries in these structures differ a little from the ideal ones (the ideal values are computed as an average over a large distribution of these values from experimental structures). So in order to run these tests on BOINC, we need to add new functions that allow us to reconstruct protein models from non-idealized bond geometries and phi/psi/chi angles. On the client side, almost nothing changes except that the silent output file has one number for each residue indicating whether it has ideal or non-ideal bonds. The increase in file size with this new feature is trivial, but it opens the door for us to do large-scale tests on experimentally solved structures, to better understand the features of these structures and how we can make Rosetta models more like those native structures.

Hopefully this answers your question.
May I ask? I hope a good description can be written for when 5.38 comes to Rosetta... WHY is "...outputting structures with non-ideal backbone and sidechain geometries" an improvement? I know, useful to the science... please explain more, on the surface it sounds to a layperson like a step backwards. Also, what impact will this have on the user experience? Will it mean we'll see larger upload sizes on results?

38) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2478)
Posted 6 Nov 2006 by Chu
Post:
Hi feet1st. That is really helpful. We have seen that error message (in the debugging output) quite a few times but have had no luck finding a clue to its cause. Thanks to your report, we at least know it is somehow related to the graphics, and I think that will help us a lot in investigating the real cause.
V5.38 WU 315376 just crashed on my other machine. I just HAPPENED to be enlarging the native structure shown at the time of failure, so that was the first I'd brought up the graphic for this WU, rotated the lowest energy, then enlarged native and then crash and burn.

exit code -1073741819
24hr RT preference.

39) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2476)
Posted 6 Nov 2006 by Chu
Post:
Thank you all for the help. We have already noticed that there are a lot of failures with error code -161 for the newly updated application 5.38. We are investigating the cause for it now...
40) Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40 (Message 2452)
Posted 4 Nov 2006 by Chu
Post:
If all goes well with this update, we'll probably update the main application on Monday. Look for some interesting new workunits with multiple copies of a protein: these are an attempt to simulate fibril formation.





©2024 University of Washington
http://www.bakerlab.org