Posts by rjs5

1) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6075)
Posted 23 Mar 2016 by rjs5
Post:
They all seem to be failing ... Windows just takes longer.


Linux64 jobs are failing after a minute with a Ralph assert().

example:

Task ID 3789265
Name design5e_ralph_S_7946_buriedtrp_SAVE_ALL_OUT_20317_2727_1
Workunit 3328949


Watchdog active.
std::cerr: Exception was thrown:
chi angle must be between -180 and 180: -nan

</stderr_txt>
]]>


Windows 10 jobs are failing after an hour with an access-violation (SEGV) abort.

example:

Task ID 3788776
Name design5e_ralph_S_7946_buriedtrp_SAVE_ALL_OUT_20317_77_1
Workunit 3326299
Created 23 Mar 2016 13:44:58 UTC


Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x01D9CA02 read attempt to address 0x21380008

Engaging BOINC Windows Runtime Debugger...




2) Message boards : RALPH@home bug list : Win10 3.71 Unhandled Exception: Reason: Out Of Memory (backrub) (Message 6046)
Posted 7 Feb 2016 by rjs5
Post:
dekim said:
Yes, I'm running a test of a new type of job that runs small perturbations of the protein backbone and then does a round of design. The design protocol can use a lot of memory. I realize that this will be problematic and will see if we can distribute these jobs to high memory machines.


Is it time to move to a native 64-bit app?? :-)


I don't think 64-bit needs to be a requirement. I am not sure how much benefit, performance-wise, it would offer either; I have guessed 10% or 20%. I should probably build a couple of 32-bit variants and measure results on the same hardware and compiler.

It is just a memory leak or a memory-management integration problem with the new program code. 4 GB is a lot of memory space. BOINCTASKS says the current 32-bit Rosetta for Windows is taking 460 MB of memory; I doubt the new changes mean the memory requirement has gone up 10x.

If they cannot make Rosetta work in 32-bit, then they will have to look at all the issues that are building pressure to improve their distribution of tasks to machines.

If they cannot get it to work in 4 GB (32 bits):
- 32-bit machines will not work for that version.
- 64-bit machines with <small> GB of memory will not work for that version.
- 64-bit machines with just enough memory PLUS disk paging space will work, BUT with a <many times> slowdown .... they complete the same work but will burn out the disk head-seek mechanism faster ... which I would not run. If my disk light goes on solid, I find out why and stop running the problem program. I just did that with the Linux tracker-miner (I shut down the indexer).


3) Message boards : RALPH@home bug list : Win10 3.71 Unhandled Exception: Reason: Out Of Memory (backrub) (Message 6044)
Posted 6 Feb 2016 by rjs5
Post:
This workload may have run out of its 4 GB 32-bit memory space limit, but my machine still had 18 GB free when it aborted.

Ralphie is taking dumps on my machine. There are a number on this Win10 machine.
http://ralph.bakerlab.org/show_host_detail.php?hostid=35400
started at 2016-02-05 23:31:34, and
took a dump 40 minutes later, at 2016-02-06 00:19:17.

Task ID 3733284
Name 02_2016_3r8b_backrub_design_20294_68_1
Workunit 3289094

minirosetta_3.71_windows_x86_64.exe

http://ralph.bakerlab.org/result.php?resultid=3733284
4) Message boards : Number crunching : Ralph and SSEx (Message 6032)
Posted 31 Jan 2016 by rjs5
Post:
On this topic, it may also be good to mention that modern compilers are sophisticated. Even recent versions of open-source compilers such as gcc and llvm have pretty advanced/sophisticated *auto-vectorization* features:

https://gcc.gnu.org/projects/tree-ssa/vectorization.html
http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf

While that may not produce the most highly tuned code, it is probably an incorrect notion that R@h has no SSEn/AVXn optimizations; the compiler may have embedded some such SSEn/AVXn optimizations on its own.

This may partly explain the somewhat higher performance of R@h on 64-bit Linux versus, say, 64-bit Windows in the statistics: the combination of optimized 64-bit binaries running on 64-bit Linux would most likely have (possibly significantly) better performance than 32-bit (possibly less optimized) binaries running on 64-bit Windows.

i.e. the Windows platform may see (significant) performance gains simply from compiling and releasing 64-bit binaries targeting 64-bit Windows platforms with a modern/recent sophisticated compiler.


It depends on what you mean by "significant" gains. I would guess the gains would be 10% to 20% over the current 32-bit binary.


Here is a link showing how sensitive auto-vectorization is to the source code layout:
http://locklessinc.com/articles/vectorize/

Rosetta code does not have any AVX code but does have scalar SSE code. I think even the 32-bit Windows build contains SSE code. All the applications still carry x87 (387) code. It will take some time and source code changes to generate any vector code.


For those who have time, you can install a Linux guest environment on your Windows machine and test the 32-bit Windows, 32-bit Linux and 64-bit Linux performance on the same hardware.


One of many ways would be to install VirtualBox:
https://www.virtualbox.org/wiki/Downloads

Download a prebuilt Linux image that you are interested in:
http://www.osboxes.org/virtualbox-images/

Install the BOINC package on that guest.

Run the Rosetta application.



5) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 5957)
Posted 5 Jan 2016 by rjs5
Post:
3.68 uses the latest Rosetta source. It includes new protocols, for example, a new cyclic peptide modeling method, and also a newly optimized energy function that improves scientific benchmarks.


I would like to build a binary and test environment somewhat similar to yours and do some performance analysis. I can guess and experiment to find an environment that works or you can guide my efforts to complement the work you guys are doing and would like done.

I am currently using the November source bundle (rosetta_src_2015.39.58186_bundle.tgz) on a RHEL7.2 clone with the standard gcc/g++ 4.8.4.

I have been able to build the "bin" default, debug and release versions, and also with "extras=static". The only problem so far is that the static scripts generate an error in the TEST scripts, caused by a conservative "-Werror" option.

I can upgrade/downgrade software versions easily if you prefer.

If there is no need for a Software Performance Engineer, then I have some time for other tasks too. Testing, validation, bug isolation ... ?





6) Message boards : Number crunching : Ralph and SSEx (Message 5953)
Posted 30 Dec 2015 by rjs5
Post:

It is just an example of the type of problems that accumulate in large mature code over time.


First of all: thanks rjs5 for your posts, they are VERY interesting and accurate.
Second: you have to consider that Rosetta@home's admins are bioinformatics guys, not "pure coders" (like you), so they think "my code has to produce results", not "my code has to be beautiful, optimized, whatever".


will have 10% to 20% head room, if he does it right. Profiling the code (grouping the frequently executed code together and reducing branch miss prediction) typically yields 20% to 30%. Sometimes much higher for looping code (which Rosetta does not appear to be).
The 2x and 3x performance increases will be seen when Rosetta uses vector operations.



Thank you for the compliment. Everyone was beating on the Rosetta developers, and especially David E. K., for ignoring the performance aspects of the project. I really wanted to set reasonable expectations about what to expect from David E. K.'s volunteer option-toggling effort.

I also understand the likely skill set of the Rosetta developers, and that is why I volunteered my time. I would have transferred my findings back to them so they could implement whatever made sense and they wanted to implement.
I have no problem with the Rosetta decision to handle project performance the way they are, and I understand it. I am a professional performance engineer and could likely provide some benefit.


IMO, they would be best served by working on their infrastructure to partition the crunchers by machine capability. They currently know if a machine is Windows, Linux or MAC OS and send different applications to those groups. Rosetta also knows the population of machines in each of those groups that support SSEx, AVXy, FMA, ... They currently only send a 32-bit generic application to all Windows machines. I think the first experiment would be to detect 64-bit Windows machines that support SSEx or AVXy (whatever makes sense) and send a tuned application. That would allow them to build out the harness to support that.

Some numbers: Rosetta currently has 278 Tflops.
10% is 28 Tflops. An i7 4770K (not a bad CPU) makes approximately 132 Gflops with its 8 cores. So that is about 210 CPUs (1680 cores) from a "simple" recompilation (no big changes to the source code).
I think administrators would not mind a little more power :-P


My current plan is to work down the list below (until it becomes clear that developers really have no interest):
..........
I am not optimistic about making any difference. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance.


Thank you, again!!


IMO, 30% for most CPUs is probably pretty easy. I expect a 2x improvement for about 50% of the machine population is reasonable.

My Windows 64-bit 4.0GHz SkyLake 6700K runs the same 32-bit x87 binary as every other Windows 32-bit i386 machine.
https://boinc.bakerlab.org/rosetta/results.php?hostid=2431621


I have absolutely no problem with Rosetta developers passing on performance changes. Rosetta will certainly benefit in the future from CPU core count and frequency improvements.



7) Message boards : Number crunching : Ralph and SSEx (Message 5950)
Posted 29 Dec 2015 by rjs5
Post:
The 64-bit Rosetta binary has a couple functions that use the old MMX instructions to do 64-bit operations. The MMX instructions are "aliased" to the FP registers and if you use them, then you have to reset the FP registers. The program will stall 30 to 50 cycles on the next FP operation.

This is a Rosetta error where the source code references the "__m64" datatype and forces the compiler to use the MMX registers instead of the XMM registers. There is really no reason this 64-bit code even needs to use MMX or XMM. Just use its 64-bit registers.


MMX?? :-O
If I'm not wrong, I should have a Pentium II MMX in the basement....
Seriously, they should fork the code and rewrite it from scratch.

P.S. rjs5, have you tried Doxygen? It could help you with code documentation (and not only that).


A rewrite is not necessary.

The MMX code is ugly but may not even be used during a Rosetta execution. It is just an example of the type of problems that accumulate in large, mature code over time. A developer will not fully understand the program but will find a "choke" point in it. He will then insert his changes: he will save all the variables (or create a new scope "{ }" with local variables), perform his fixes, and then restore the program variables and continue execution. These changes typically interfere with compiler optimizations. This stuff is also pretty easy to comb out and put into a more natural program structure.

Rosetta 64-bit Linux code can be thought of as a "space station" type program with limited system needs. They have statically linked everything needed to run Rosetta as part of the binary. Rosetta makes limited calls to your computer to open, read and write files. Rosetta makes calls to your computer to allocate and free memory.


David E. K's volunteer can toggle optimization switches, recompile, test/measure ... and will have 10% to 20% head room, if he does it right. Profiling the code (grouping the frequently executed code together and reducing branch miss prediction) typically yields 20% to 30%. Sometimes much higher for looping code (which Rosetta does not appear to be).

The 2x and 3x performance increases will be seen when Rosetta uses vector operations.
Scalar SSE execution is not "SSE". PACKED SSE execution is "SSE".
Scalar AVX execution is not "AVX". PACKED AVX execution is "AVX".

Rosetta is structured and compiled as a scalar program.

The true value of SSE -> AVX -> AVX2 -> AVX512 ... is being able to use the width of the machine to perform multiple operations in parallel. The newer instructions are not that much faster: a 1-cycle FP add takes the same number of nanoseconds on a 4.0GHz SandyBridge CPU as it does on a 4.0GHz SkyLake CPU.

The newer instructions add more flexible operations on the packed vector calculations.



My current plan is to work down the list below (until it becomes clear that developers really have no interest):

1. capture an executing Rosetta BOINC "slot" directory and verify that I can execute the Rosetta work unit stand alone and get the Rosetta answer repeatedly.
2. figure out how to build a standard Rosetta binary that I can substitute into that slot directory and get the same answer.
3. determine if Rosetta or the build process can be changed to make a substantial difference in performance.


I am not optimistic about making any difference. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance.









8) Message boards : Number crunching : Ralph and SSEx (Message 5947)
Posted 28 Dec 2015 by rjs5
Post:
My understanding is that Rosetta is made up of modules, so is the sensible thing to do to take one of the smaller/most commonly used modules and have a look at optimising the code there?


I think your understanding is correct. Each developer has added their changes, and Rosetta's code complexity has grown incrementally. Changes accumulate whether they are correct or not .... as long as no errors are detected.


Goofy code gets added. Many times it is hard to see and rarely does anyone have the courage to remove the problem code.


Example:

The 64-bit Rosetta binary has a couple functions that use the old MMX instructions to do 64-bit operations. The MMX instructions are "aliased" to the FP registers and if you use them, then you have to reset the FP registers. The program will stall 30 to 50 cycles on the next FP operation.

This is a Rosetta error where the source code references the "__m64" datatype and forces the compiler to use the MMX registers instead of the XMM registers. There is really no reason this 64-bit code even needs to use MMX or XMM. Just use its 64-bit registers.

The problem is "using the mmx" registers instead of just the general registers ... rax, rbx, rcx, ....

3fcf8c0: 0f 6f 7c d6 f8 movq -0x8(%rsi,%rdx,8),%mm7
3fcf8c5: 0f 6e c9 movd %ecx,%mm1
3fcf8c8: b8 40 00 00 00 mov $0x40,%eax
3fcf8cd: 29 c8 sub %ecx,%eax
3fcf8cf: 0f 6e c0 movd %eax,%mm0
3fcf8d2: 0f 6f df movq %mm7,%mm3
3fcf8d5: 0f d3 f8 psrlq %mm0,%mm7
3fcf8d8: 48 0f 7e f8 movq %mm7,%rax
3fcf8dc: 48 83 ea 02 sub $0x2,%rdx
3fcf8e0: 7c 34 jl 0x3fcf916
3fcf8e2: 66 90 xchg %ax,%ax
3fcf8e4: 0f 6f 34 d6 movq (%rsi,%rdx,8),%mm6
3fcf8e8: 0f 6f d6 movq %mm6,%mm2
3fcf8eb: 0f d3 f0 psrlq %mm0,%mm6
3fcf8ee: 0f f3 d9 psllq %mm1,%mm3
3fcf8f1: 0f eb de por %mm6,%mm3
3fcf8f4: 0f 7f 5c d7 08 movq %mm3,0x8(%rdi,%rdx,8)
3fcf8f9: 74 1e je 0x3fcf919
3fcf8fb: 0f 6f 7c d6 f8 movq -0x8(%rsi,%rdx,8),%mm7
3fcf900: 0f 6f df movq %mm7,%mm3
3fcf903: 0f d3 f8 psrlq %mm0,%mm7
3fcf906: 0f f3 d1 psllq %mm1,%mm2
3fcf909: 0f eb d7 por %mm7,%mm2
3fcf90c: 0f 7f 14 d7 movq %mm2,(%rdi,%rdx,8)
3fcf910: 48 83 ea 02 sub $0x2,%rdx
3fcf914: 7d ce jge 0x3fcf8e4
3fcf916: 0f 6f d3 movq %mm3,%mm2
3fcf919: 0f f3 d1 psllq %mm1,%mm2
3fcf91c: 0f 7f 17 movq %mm2,(%rdi)
3fcf91f: 0f 77 emms
3fcf921: c3 retq










9) Message boards : Number crunching : Ralph and SSEx (Message 5944)
Posted 28 Dec 2015 by rjs5
Post:
@rjs5: This is very interesting, thank you!

Maybe one step forward would be to implement your solution to the "Rosetta screensaver tragedy".

It would alleviate the need for a statically linked binary so that the dynamically linked binary can at least use some optimized libraries...

What are the compiler flags for release mode?



On my Fedora 21 distribution with current updates, running gcc 4.9.2 on kernel 4.1.13, the Python build's compile line typically looks like the one below.

The important options are:
Generate common-denominator code: -march=core2, -mtune=generic
Try to generate fast generic code: -O3, -ffast-math, -funroll-loops, -finline-functions, -finline-limit=20000

Explicit unrolling and inlining only make sense when the benefit outweighs the cost, which is why you have to enable them explicitly. Since the execution profile is flat, they are unlikely to be a win.

Portions are additionally built with -fPIC, even though they are statically linked, which generates slightly slower code.



g++ -o file.o -c -std=c++98 -isystem external/boost_1_55_0/ -isystem external/include/ -isystem external/dbio/ -pipe -ffor-scope -Wall -Wextra -pedantic -Werror -Wno-long-long -Wno-strict-aliasing -march=core2 -mtune=generic -O3 -ffast-math -funroll-loops -finline-functions -finline-limit=20000 -s -Wno-unused-variable -Wno-unused-parameter -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_SYSTEM_NO_DEPRECATED -DPTR_BOOST -DNDEBUG -Isrc -Iexternal/include -Isrc/platform/linux/64/gcc/4.9 -Isrc/platform/linux/64/gcc -Isrc/platform/linux/64 -Isrc/platform/linux -Iexternal/boost_1_55_0 -Iexternal/dbio -I/usr/include -I/usr/local/include file.cc


10) Message boards : Number crunching : Ralph and SSEx (Message 5942)
Posted 27 Dec 2015 by rjs5
Post:
Some news. I was granted a source license and I have started wading in. The documentation is dated and inaccurate (as always). I am looking to hook up with a developer to focus on the configuration they build for this project and feed back findings.

Maybe something will surface.


Well done!!


Any performance improvement will only make progress when someone on the Rosetta team wants to. I have a couple unanswered messages to developers volunteering time and expertise. If anyone has interested contacts, please pass me along to them.

I have successfully run the build and test scripts for the default and debug versions; the test scripts worked without errors. The release scripts failed because they have the -Werror compile option set, which escalates certain warnings into compile errors. The compile fails indicating a possible use of an uninitialized variable.

The latest academic release (November 8, 2015) source decompresses to 3.6 GB (39,282 files), and that grows to 16 GB (42,875 files) when the source is compiled for a Linux release build.

This academic binary release is compiled on Linux 3.10.64 (https://www.kernel.org/), which was released in Dec 2014, with a similarly current gcc 4.8. There is no real need to use a newer version of Linux or gcc: most newer features target the vector instructions, and since Rosetta does not use AVX, there is no pressing need.

Rosetta is compiled and statically linked with ALL the performance modifications (https://goo.gl/G3RfW4) made to it over the years. It is hard to detect when yesterday's big performance boost becomes today's performance bottleneck. Developers tend to just keep adding changes to remove today's hot spot.



Toggling compile-time switches has very limited performance-improvement headroom .... you will gain a couple percent on some machines and lose it on others.

Rosetta's only real value is in generating correct results. I suspect there is little project interest in investing effort and risk to get a 2x, 3x, .... performance improvement. You will know they are serious when you see some serious activity on Ralph.


It would be FAR easier for them to simply inflate credits somewhat to attract and retain more crunchers than to invest in Rosetta performance improvements 8-).






11) Message boards : Number crunching : Ralph and SSEx (Message 5940)
Posted 19 Dec 2015 by rjs5
Post:
No news?


Some news. I was granted a source license and I have started wading in. The documentation is dated and inaccurate (as always). I am looking to hook up with a developer to focus on the configuration they build for this project and feed back findings.

Maybe something will surface.



12) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 5921)
Posted 20 Oct 2015 by rjs5
Post:
I would exit BOINC and restart it in that situation rather than aborting them.


Is that because there is the possibility for them to complete and you get the information about the run and the cruncher gets the credit?

13) Message boards : Number crunching : Ralph and SSEx (Message 5875)
Posted 31 Aug 2015 by rjs5
Post:
I did turn on SSE for the windows build.


Just a curiosity: which version of SSE for win? 2? 3? 4.1???


I think it's only SSE "1" so far. No errors except for a few WUs that failed immediately, but it doesn't appear to be a SSE-related problem.

My WUs with SSE



SSE was introduced in 1999 with the Pentium III CPU; SSE2 was introduced in 2001 with the Pentium 4 CPU and only extended SSE. If something works under SSE, then it will work under SSE2 UNLESS it is running on a Pentium III-era CPU.

The project will get more work done by sacrificing the Pentium-3 cycles (making SSE2 the minimum) and optimizing for SSE2+.

Once you get to SSE2, you will only get minor improvements, probably just a couple percent, by going to the trouble of pushing the SSE/AVX envelope.

Since R@h is compiling and running in SCALAR mode, which crunches only one 64-bit value in the 128-bit (dual 64-bit) XMM registers, there is much more to gain by closely examining the source code and understanding what is preventing the compilers from VECTORIZING it. If you can use BOTH 64-bit fields of the XMM registers, you get a 2x performance increase: you crunch 2, 4, 8, ... floating-point values in the same time as 1.

This is also the reason there is no GPU version, and there can NEVER be a GPU version until this is fixed .... IF the source can be changed to vectorize.


Starting from a generic, crappy 32-bit i386 version, .....
you get 80% of the scalar performance by just generating a 64-bit version.
you get the other 20% of scalar performance by messing with compiler options .... but at a high portability cost.

The next barrier after a 64-bit version should be SSE2.
The next barrier after 64-bit, SSE2 is VECTOR .... NOT .... SSE3, SSE4, AVX, ...






What is the gain in going native 64-bit? I would've thought that going SSE2 would bring a higher gain than 64-bit (I've always associated the 64-bit to better memory addressing, rather than increased computation speed).


All x86_64 CPUs have at least SSE2, so my first sentence above does not make much sense: 64-bit machines necessarily have SSE2 registers.

64-bit mode has 16 registers rather than the 8 of the 386, so there is a substantial reduction in temporary register spills and fills to/from the stack. When you eliminate the traffic that stores/restores data in stack variables, you reduce cycles per instruction. Saving a register to a temporary stack variable requires the WRITE to be pushed out to the L2 cache, which typically takes 5 to 10 cycles; the L1 caches are all write-through.

SSE2 and 64-bit come as a pair.



14) Message boards : Number crunching : Ralph and SSEx (Message 5873)
Posted 29 Aug 2015 by rjs5
Post:
I did turn on SSE for the windows build.


Just a curiosity: which version of SSE for win? 2? 3? 4.1???


I think it's only SSE "1" so far. No errors except for a few WUs that failed immediately, but it doesn't appear to be a SSE-related problem.

My WUs with SSE



SSE was introduced in 1999 with the Pentium III CPU; SSE2 was introduced in 2001 with the Pentium 4 CPU and only extended SSE. If something works under SSE, then it will work under SSE2 UNLESS it is running on a Pentium III-era CPU.

The project will get more work done by sacrificing the Pentium-3 cycles (making SSE2 the minimum) and optimizing for SSE2+.

Once you get to SSE2, you will only get minor improvements, probably just a couple percent, by going to the trouble of pushing the SSE/AVX envelope.

Since R@h is compiling and running in SCALAR mode, which crunches only one 64-bit value in the 128-bit (dual 64-bit) XMM registers, there is much more to gain by closely examining the source code and understanding what is preventing the compilers from VECTORIZING it. If you can use BOTH 64-bit fields of the XMM registers, you get a 2x performance increase: you crunch 2, 4, 8, ... floating-point values in the same time as 1.

This is also the reason there is no GPU version, and there can NEVER be a GPU version until this is fixed .... IF the source can be changed to vectorize.


Starting from a generic, crappy 32-bit i386 version, .....
you get 80% of the scalar performance by just generating a 64-bit version.
you get the other 20% of scalar performance by messing with compiler options .... but at a high portability cost.

The next barrier after a 64-bit version should be SSE2.
The next barrier after 64-bit, SSE2 is VECTOR .... NOT .... SSE3, SSE4, AVX, ...




15) Message boards : RALPH@home bug list : Rosetta mini beta 3.60 (Message 5857)
Posted 15 Jul 2015 by rjs5
Post:
Can you post the /proc/cpuinfo from one of the CPUs?
16) Message boards : Number crunching : Ralph and SSEx (Message 5852)
Posted 5 Jul 2015 by rjs5
Post:
Ok, ok. I joined the project with a Win7 machine and a Fedora 21 Haswell machine. Now if David wants to feed me beta binaries for comment, he can.

Since this is a beta site, he can even send the same workload out repeatedly ... and can even hide it somewhat with different names, and few will even know ... 8-)

Sending the same workload will allow him to test for acceptable results on different machine configurations.

I really hate joining inactive projects.






©2020 University of Washington
http://www.bakerlab.org