Ralph and SSEx

Message boards : Number crunching : Ralph and SSEx

rjs5

Joined: 5 Jul 15
Posts: 22
Credit: 135,787
RAC: 2,494
Message 5942 - Posted: 27 Dec 2015, 17:45:46 UTC - in response to Message 5941.  

Some news. I was granted a source license and I have started wading in. The documentation is dated and inaccurate (as always). I am looking to hook up with a developer to focus on the configuration they build for this project and feed back findings.

Maybe something will surface.


Well done!!


Any performance improvement will only make progress when someone on the Rosetta team wants it to. I have a couple of unanswered messages to developers volunteering time and expertise. If anyone has interested contacts, please pass me along to them.

I have successfully run the build and test scripts for the default and debug versions. The test scripts worked without errors. The release scripts failed because they have the -Werror compile option set, which escalates certain warnings to compile errors. The compile fails, indicating a possible use of an uninitialized variable.
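
For illustration only (this is not the actual Rosetta source, just the kind of pattern that trips the warning), gcc will reject code like this under -Wall -Werror when it cannot prove the variable is always written before use:

// maybe_uninit.cc -- hypothetical example, not taken from Rosetta
// g++ -O2 -Wall -Werror -c maybe_uninit.cc
int pick(bool flag) {
    int x;             // warning: 'x' may be used uninitialized in this function
    if (flag) x = 1;
    return x;          // with -Werror the warning above becomes a compile error
}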

The latest academic release (November 8, 2015) source decompresses to 3.6 GB (39,282 files), and that grows to 16 GB (42,875 files) when the source is compiled for a Linux release.

This academic binary release is compiled with Linux 3.10.64 (https://www.kernel.org/), which was released in December 2014, and with a similarly current gcc 4.8. There is no real need to use a newer version of Linux or gcc; most of the newer features relate to the vector instructions, and since Rosetta does not use AVX, there is no pressing need.

Rosetta is compiled and statically linked with ALL the performance modifications (https://goo.gl/G3RfW4) made to it over the years. It is hard to detect when yesterday's big performance boost becomes today's performance bottleneck; developers tend to just keep adding changes to remove today's hot spot.



Toggling compile-time switches has very limited performance-improvement headroom .... you will get a couple percent on some machines and lose it on others.

Rosetta's only real value is in generating correct results. I suspect there is little project interest in investing effort and risk to get a 2x, 3x, .... performance improvement. You will know they are serious when you see some serious activity on Ralph.


It would be FAR easier for them to simply inflate credits somewhat to attract and retain more crunchers than to invest in Rosetta performance improvements 8-).






ID: 5942
Dr. Merkwürdigliebe

Joined: 12 Jun 15
Posts: 16
Credit: 23,473
RAC: 0
Message 5943 - Posted: 27 Dec 2015, 19:22:47 UTC

@rjs5: This is very interesting, thank you!

Maybe one step forward would be to implement your solution to the "Rosetta screensaver tragedy".

It would alleviate the need for a statically linked binary so that the dynamically linked binary can at least use some optimized libraries...

What are the compiler flags for release mode?
ID: 5943
rjs5

Joined: 5 Jul 15
Posts: 22
Credit: 135,787
RAC: 2,494
Message 5944 - Posted: 28 Dec 2015, 1:30:14 UTC - in response to Message 5943.  

@rjs5: This is very interesting, thank you!

Maybe one step forward would be to implement your solution to the "Rosetta screensaver tragedy".

It would alleviate the need for a statically linked binary so that the dynamically linked binary can at least use some optimized libraries...

What are the compiler flags for release mode?



On my Fedora 21 distribution with current updates, which is running gcc 4.9.2 on kernel 4.1.13, the Python-based build's compile line typically looks like the one below.

The important options are:
Generate common-denominator code: -march=core2, -mtune=generic
Try to generate fast generic code: -O3, -ffast-math, -funroll-loops, -finline-functions, -finline-limit=20000

Explicit unrolling and inlining only make sense when the benefit outweighs the cost; that is why you have to enable them explicitly. Since the execution profile is flat, they are unlikely to be a win.

Portions are additionally built with -fPIC even though the binary is statically linked, which generates slightly slower code.



g++ -o file.o -c -std=c++98 -isystem external/boost_1_55_0/ -isystem external/include/ -isystem external/dbio/ -pipe -ffor-scope -Wall -Wextra -pedantic -Werror -Wno-long-long -Wno-strict-aliasing -march=core2 -mtune=generic -O3 -ffast-math -funroll-loops -finline-functions -finline-limit=20000 -s -Wno-unused-variable -Wno-unused-parameter -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_SYSTEM_NO_DEPRECATED -DPTR_BOOST -DNDEBUG -Isrc -Iexternal/include -Isrc/platform/linux/64/gcc/4.9 -Isrc/platform/linux/64/gcc -Isrc/platform/linux/64 -Isrc/platform/linux -Iexternal/boost_1_55_0 -Iexternal/dbio -I/usr/include -I/usr/local/include file.cc
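
If you want to see which instruction sets a given -march/-mtune choice actually enables, one quick check (a hypothetical helper, not part of the Rosetta tree) is to compile a tiny file with the same flags and test gcc's predefined macros:

// check_isa.cc -- hypothetical helper, not part of the Rosetta build
// g++ -march=core2 -mtune=generic -O3 check_isa.cc -o check_isa && ./check_isa
#include <cstdio>

int main() {
#ifdef __SSE2__
    std::puts("SSE2 enabled");      // on for -march=core2
#endif
#ifdef __SSSE3__
    std::puts("SSSE3 enabled");     // also on for core2
#endif
#ifdef __SSE4_1__
    std::puts("SSE4.1 enabled");    // off for core2
#endif
#ifdef __AVX__
    std::puts("AVX enabled");       // off for core2
#endif
    return 0;
}

With -march=core2 only the first two lines print, which is consistent with the shipped binaries containing no AVX code.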


ID: 5944
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 905
Credit: 1,892,541
RAC: 294
Message 5945 - Posted: 28 Dec 2015, 9:08:18 UTC - in response to Message 5942.  
Last modified: 28 Dec 2015, 9:18:27 UTC

Any performance improvement will only make progress when someone on the Rosetta team wants it to. I have a couple of unanswered messages to developers volunteering time and expertise.

:-(

Rosetta is compiled and statically linked with ALL the performance modifications (https://goo.gl/G3RfW4) made to it over the years. It is hard to detect when yesterday's big performance boost becomes today's performance bottleneck; developers tend to just keep adding changes to remove today's hot spot.

It's a nightmare :-P

you will get a couple percent on some machines and lose it on others.

But if Rosetta loses the Pentium II CPUs, I don't think it's a problem....

Rosetta's only real value is in generating correct results. I suspect there is little project interest in investing effort and risk to get a 2x, 3x, .... performance improvement.

A theoretical 2x would be a LOT of performance improvement. It would be CRAZY not to take it.
Why not test it here on Ralph? If the results are not "satisfactory", they can abandon the optimizations. They did the same thing with the Android client....

It would be FAR easier for them to simply inflate credits somewhat to attract and retain more crunchers than to invest in Rosetta performance improvements 8-).

Sad, but true
ID: 5945
dcdc

Joined: 15 Aug 06
Posts: 27
Credit: 90,652
RAC: 0
Message 5946 - Posted: 28 Dec 2015, 11:38:11 UTC

My understanding is that Rosetta is made up of modules, so is the sensible thing to do to take one of the smaller/most commonly used modules and have a look at optimising the code there?
ID: 5946
rjs5

Joined: 5 Jul 15
Posts: 22
Credit: 135,787
RAC: 2,494
Message 5947 - Posted: 28 Dec 2015, 23:16:27 UTC - in response to Message 5946.  

My understanding is that Rosetta is made up of modules, so is the sensible thing to do to take one of the smaller/most commonly used modules and have a look at optimising the code there?


I think your understanding is correct. Each developer has added their changes and the complexity of the Rosetta code has grown incrementally. Changes accumulate whether they are correct or not .... as long as no errors are detected.


Goofy code gets added. Many times it is hard to see and rarely does anyone have the courage to remove the problem code.


Example:

The 64-bit Rosetta binary has a couple functions that use the old MMX instructions to do 64-bit operations. The MMX instructions are "aliased" to the FP registers and if you use them, then you have to reset the FP registers. The program will stall 30 to 50 cycles on the next FP operation.

This is a Rosetta error where the source code references the "__m64" datatype and forces the compiler to use the MMX registers instead of the XMM registers. There is really no reason this 64-bit code even needs to use MMX or XMM. Just use its 64-bit registers.

The problem is using the MMX registers instead of just the general registers ... rax, rbx, rcx, ....

3fcf8c0: 0f 6f 7c d6 f8 movq -0x8(%rsi,%rdx,8),%mm7
3fcf8c5: 0f 6e c9 movd %ecx,%mm1
3fcf8c8: b8 40 00 00 00 mov $0x40,%eax
3fcf8cd: 29 c8 sub %ecx,%eax
3fcf8cf: 0f 6e c0 movd %eax,%mm0
3fcf8d2: 0f 6f df movq %mm7,%mm3
3fcf8d5: 0f d3 f8 psrlq %mm0,%mm7
3fcf8d8: 48 0f 7e f8 movq %mm7,%rax
3fcf8dc: 48 83 ea 02 sub $0x2,%rdx
3fcf8e0: 7c 34 jl 0x3fcf916
3fcf8e2: 66 90 xchg %ax,%ax
3fcf8e4: 0f 6f 34 d6 movq (%rsi,%rdx,8),%mm6
3fcf8e8: 0f 6f d6 movq %mm6,%mm2
3fcf8eb: 0f d3 f0 psrlq %mm0,%mm6
3fcf8ee: 0f f3 d9 psllq %mm1,%mm3
3fcf8f1: 0f eb de por %mm6,%mm3
3fcf8f4: 0f 7f 5c d7 08 movq %mm3,0x8(%rdi,%rdx,8)
3fcf8f9: 74 1e je 0x3fcf919
3fcf8fb: 0f 6f 7c d6 f8 movq -0x8(%rsi,%rdx,8),%mm7
3fcf900: 0f 6f df movq %mm7,%mm3
3fcf903: 0f d3 f8 psrlq %mm0,%mm7
3fcf906: 0f f3 d1 psllq %mm1,%mm2
3fcf909: 0f eb d7 por %mm7,%mm2
3fcf90c: 0f 7f 14 d7 movq %mm2,(%rdi,%rdx,8)
3fcf910: 48 83 ea 02 sub $0x2,%rdx
3fcf914: 7d ce jge 0x3fcf8e4
3fcf916: 0f 6f d3 movq %mm3,%mm2
3fcf919: 0f f3 d1 psllq %mm1,%mm2
3fcf91c: 0f 7f 17 movq %mm2,(%rdi)
3fcf91f: 0f 77 emms
3fcf921: c3 retq
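
For comparison, the same kind of multi-word shift can be written with plain 64-bit integers and left to the compiler. This is only a hypothetical sketch (not the actual Rosetta routine), but it stays in the general registers and needs no emms:

#include <cstdint>
#include <cstddef>

// Hypothetical sketch: shift an array of 64-bit words right by 'shift' bits
// (1 <= shift <= 63), carrying bits in from the next-higher word.  Plain
// uint64_t arithmetic keeps everything in rax/rbx/rcx/... instead of the
// MMX registers, so there is no x87 aliasing stall and no emms needed.
void shift_right_words(const std::uint64_t* src, std::uint64_t* dst,
                       std::size_t nwords, unsigned shift) {
    const unsigned carry = 64u - shift;
    for (std::size_t i = 0; i + 1 < nwords; ++i)
        dst[i] = (src[i] >> shift) | (src[i + 1] << carry);
    if (nwords != 0)
        dst[nwords - 1] = src[nwords - 1] >> shift;   // top word has no carry-in
}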










ID: 5947
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 905
Credit: 1,892,541
RAC: 294
Message 5948 - Posted: 29 Dec 2015, 9:01:22 UTC - in response to Message 5947.  
Last modified: 29 Dec 2015, 9:09:12 UTC

The 64-bit Rosetta binary has a couple functions that use the old MMX instructions to do 64-bit operations. The MMX instructions are "aliased" to the FP registers and if you use them, then you have to reset the FP registers. The program will stall 30 to 50 cycles on the next FP operation.

This is a Rosetta error where the source code references the "__m64" datatype and forces the compiler to use the MMX registers instead of the XMM registers. There is really no reason this 64-bit code even needs to use MMX or XMM. Just use its 64-bit registers.


MMX?? :-O
If I'm not wrong, I should still have a Pentium II MMX in the basement....
Seriously, they should fork the code and rewrite it from scratch.

P.S. rjs5, have you tried Doxygen? It could help you with code documentation (and not only that).
ID: 5948
Dr. Merkwürdigliebe

Joined: 12 Jun 15
Posts: 16
Credit: 23,473
RAC: 0
Message 5949 - Posted: 29 Dec 2015, 10:20:45 UTC - in response to Message 5948.  


Seriously, they should fork the code and rewrite it from scratch.


Yeah, that's not gonna happen for sure. Maybe instead of starting from scratch, it's more important to identify and optimize just the worst kludges like the one above.

I think rjs5 already profiled the binary to look for those occurrences.

I do understand that there is not much space for compiler optimization otherwise.
ID: 5949
rjs5

Joined: 5 Jul 15
Posts: 22
Credit: 135,787
RAC: 2,494
Message 5950 - Posted: 29 Dec 2015, 14:07:45 UTC - in response to Message 5948.  

The 64-bit Rosetta binary has a couple functions that use the old MMX instructions to do 64-bit operations. The MMX instructions are "aliased" to the FP registers and if you use them, then you have to reset the FP registers. The program will stall 30 to 50 cycles on the next FP operation.

This is a Rosetta error where the source code references the "__m64" datatype and forces the compiler to use the MMX registers instead of the XMM registers. There is really no reason this 64-bit code even needs to use MMX or XMM. Just use its 64-bit registers.


MMX?? :-O
If I'm not wrong, I should still have a Pentium II MMX in the basement....
Seriously, they should fork the code and rewrite it from scratch.

P.S. rjs5, have you tried Doxygen? It could help you with code documentation (and not only that).


A rewrite is not necessary.

The MMX code is ugly but may not even be used during a Rosetta execution. It is just an example of the type of problems that accumulate in large mature code over time. A developer will not fully understand the program but will find a "choke" point in it. He will then insert his changes: he will save all the variables (or create a new scope "{ }" with local variables), perform his fixes, and then restore the program variables and continue execution. These changes typically interfere with compiler optimizations. This stuff is also pretty easy to comb out and put into a more natural program structure.

Rosetta's 64-bit Linux code can be thought of as a "space station" type program with limited system needs. Everything needed to run Rosetta is statically linked into the binary. Rosetta makes only limited calls to your computer: to open, read and write files, and to allocate and free memory.


David E. K's volunteer can toggle optimization switches, recompile, test/measure ... and will have 10% to 20% head room, if he does it right. Profiling the code (grouping the frequently executed code together and reducing branch miss prediction) typically yields 20% to 30%. Sometimes much higher for looping code (which Rosetta does not appear to be).
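
For reference, that kind of profile-driven code layout is what gcc's feedback-directed optimization does. A rough sketch, with illustrative file and input names rather than the project's actual build steps:

# Hypothetical gcc profile-guided build of a single illustrative source file:
g++ -O3 -fprofile-generate score_test.cc -o score_test    # instrumented build
./score_test benchmark_input.pdb                          # run a representative input; writes *.gcda profile data
g++ -O3 -fprofile-use score_test.cc -o score_test         # rebuild: hot code is grouped and branches are laid out
                                                          # according to the measured profile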

The 2x and 3x performance increases will be seen when Rosetta uses vector operations.
Scalar SSE execution is not "SSE". PACKED SSE execution is "SSE".
Scalar AVX execution is not "AVX". PACKED AVX execution is "AVX".

Rosetta is structured and compiled as a scalar program.

The true value of SSE -> AVX -> AVX2 -> AVX512 ... is being able to use the width of the machine to perform multiple operations in parallel. The newer instructions are not that much faster: a 1-cycle FP add takes the same number of nanoseconds on a 4.0GHz SandyBridge CPU as it does on a 4.0GHz SkyLake CPU.

The newer instructions add more flexible operations on the packed vector calculations.
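
To make the scalar-vs-packed distinction concrete, here is a hypothetical sketch (not Rosetta code). The first loop compiles to one scalar addsd per element; the second uses SSE2 intrinsics, so each addpd adds two doubles at once:

#include <emmintrin.h>   // SSE2 intrinsics

// Scalar: without auto-vectorization (e.g. plain -O2 on gcc 4.9) this is
// one scalar addsd per element.
void add_scalar(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Packed: each addpd processes two doubles, i.e. it uses the width of the machine.
void add_packed(const double* a, const double* b, double* c, int n) {
    int i = 0;
    for (; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);            // load 2 doubles
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));    // 2 adds in one instruction
    }
    for (; i < n; ++i)                               // odd tail element
        c[i] = a[i] + b[i];
}

AVX widens the same loop to four doubles per instruction; the gain comes from the width, not from each instruction being individually faster.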



My current plan is to work down the list below (until it becomes clear that developers really have no interest):

1. capture an executing Rosetta BOINC "slot" directory and verify that I can execute the Rosetta work unit stand alone and get the Rosetta answer repeatedly.
2. figure out how to build a standard Rosetta binary that I can substitute into that slot directory and get the same answer.
3. determine if Rosetta or the build process can be changed to make a substantial difference in performance.


I am not optimistic about making any difference. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance.









ID: 5950
Dr. Merkwürdigliebe

Joined: 12 Jun 15
Posts: 16
Credit: 23,473
RAC: 0
Message 5951 - Posted: 29 Dec 2015, 18:48:03 UTC - in response to Message 5950.  

Gloomy outlook but thanks for your efforts!

A little comparison Rosetta / folding@home:

Rosetta: situation described as above

folding@home: excerpt from their blog


New Gromacs core. The Gromacs core continues to be a workhorse core for CPU clients and we are working on releasing an updated core with more broad hardware support, including support for AVX, which will give some donors an automatic performance boost.


Source: Blog
ID: 5951
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 905
Credit: 1,892,541
RAC: 294
Message 5952 - Posted: 30 Dec 2015, 8:00:15 UTC - in response to Message 5950.  
Last modified: 30 Dec 2015, 8:01:36 UTC

It is just an example of the type of problems that accumulate in large mature code over time.


First of all, thanks rjs5 for your posts: they are VERY interesting and accurate.
Second, you have to consider that Rosetta@home's admins are bioinformatics guys, not "pure coders" (like you), so they think "my code has to produce results", not "my code has to be beautiful, optimized, whatever".


will have 10% to 20% head room, if he does it right. Profiling the code (grouping the frequently executed code together and reducing branch miss prediction) typically yields 20% to 30%. Sometimes much higher for looping code (which Rosetta does not appear to be).
The 2x and 3x performance increases will be seen when Rosetta uses vector operations.


Some numbers: right now Rosetta has 278 TFLOPS.
10% is 28 TFLOPS. An i7 4770K (not a bad CPU) makes approximately 132 GFLOPS with its 8 cores, so that is about 210 CPUs (1,680 cores) gained with a "simple" recompilation (no big changes to the source code).
I think the administrators would not mind a little more power :-P


My current plan is to work down the list below (until it becomes clear that developers really have no interest):
..........
I am not optimistic about making any difference. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance.


Thank you, again!!
ID: 5952
rjs5

Joined: 5 Jul 15
Posts: 22
Credit: 135,787
RAC: 2,494
Message 5953 - Posted: 30 Dec 2015, 15:01:24 UTC - in response to Message 5952.  


It is just an example of the type of problems that accumulate in large mature code over time.


First of all, thanks rjs5 for your posts: they are VERY interesting and accurate.
Second, you have to consider that Rosetta@home's admins are bioinformatics guys, not "pure coders" (like you), so they think "my code has to produce results", not "my code has to be beautiful, optimized, whatever".


will have 10% to 20% head room, if he does it right. Profiling the code (grouping the frequently executed code together and reducing branch miss prediction) typically yields 20% to 30%. Sometimes much higher for looping code (which Rosetta does not appear to be).
The 2x and 3x performance increases will be seen when Rosetta uses vector operations.



Thank you for the compliment. Everyone was beating on the Rosetta developers and especially David E. K. for ignoring the performance aspects of the project. I really wanted to set reasonable expectations about what to expect from David E.K.'s volunteer option-toggle effort.

I also understand the likely skill set of the Rosetta developers, and that is why I volunteered my time. I would have transferred my findings back to them so they could implement whatever made sense and whatever they wanted to implement.
I have no problem with a Rosetta decision to handle project performance like they are and understand their decision. I am a professional performance engineer and could likely provide some benefit.


IMO, they would be best served by working on their infrastructure to partition the crunchers by machine capability. They currently know if a machine is Windows, Linux or MAC OS and send different applications to those groups. Rosetta also knows the population of machines in each of those groups that support SSEx, AVXy, FMA, ... They currently only send a 32-bit generic application to all Windows machines. I think the first experiment would be to detect 64-bit Windows machines that support SSEx or AVXy (whatever makes sense) and send a tuned application. That would allow them to build out the harness to support that.
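
The client side of that capability detection is trivial with gcc; a hypothetical sketch (not BOINC or Rosetta code) of the runtime check:

// cpu_caps.cc -- hypothetical sketch, not BOINC or Rosetta code
// g++ cpu_caps.cc -o cpu_caps && ./cpu_caps
#include <cstdio>

int main() {
    __builtin_cpu_init();   // populate gcc's CPU feature data (gcc >= 4.8)
    std::printf("sse2   : %s\n", __builtin_cpu_supports("sse2")   ? "yes" : "no");
    std::printf("sse4.2 : %s\n", __builtin_cpu_supports("sse4.2") ? "yes" : "no");
    std::printf("avx    : %s\n", __builtin_cpu_supports("avx")    ? "yes" : "no");
    std::printf("avx2   : %s\n", __builtin_cpu_supports("avx2")   ? "yes" : "no");
    std::printf("fma    : %s\n", __builtin_cpu_supports("fma")    ? "yes" : "no");
    return 0;
}

Sending a tuned application only to hosts that pass such a check is exactly the partitioning described above.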

Some numbers: right now Rosetta has 278 TFLOPS.
10% is 28 TFLOPS. An i7 4770K (not a bad CPU) makes approximately 132 GFLOPS with its 8 cores, so that is about 210 CPUs (1,680 cores) gained with a "simple" recompilation (no big changes to the source code).
I think the administrators would not mind a little more power :-P


My current plan is to work down the list below (until it becomes clear that developers really have no interest):
..........
I am not optimistic about making any difference. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance.


Thank you, again!!


IMO, I think 30% for most CPU is probably pretty easy. I expect 2x improvement for about 50% of the machine population is reasonable.

My Windows 64-bit 4.0GHz SkyLake 6700K runs the same 32-bit x87 binary as every other Windows 32-bit i386 machine.
https://boinc.bakerlab.org/rosetta/results.php?hostid=2431621


I have absolutely no problem with Rosetta developers passing on performance changes. Rosetta will certainly benefit in the future from CPU core count and frequency improvements.



ID: 5953
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 905
Credit: 1,892,541
RAC: 294
Message 5954 - Posted: 31 Dec 2015, 13:06:18 UTC - in response to Message 5953.  

Everyone was beating on the Rosetta developers and especially David E. K. for ignoring the performance aspects of the project. I really wanted to set reasonable expectations about what to expect from David E.K.'s volunteer option-toggle effort.
I have no problem with a Rosetta decision to handle project performance like they are and understand their decision. I am a professional performance engineer and could likely provide some benefit.


I hope they are on vacation on their snowboards and not thinking about optimization :-P
But after 9 Jan......


IMO, they would be best served by working on their infrastructure to partition the crunchers by machine capability. They currently know if a machine is Windows, Linux or MAC OS and send different applications to those groups. Rosetta also knows the population of machines in each of those groups that support SSEx, AVXy, FMA, ... They currently only send a 32-bit generic application to all Windows machines.


We have already written about the need for a server update....


IMO, I think 30% for most CPU is probably pretty easy. I expect 2x improvement for about 50% of the machine population is reasonable.

I continue to consider anything over 10% a nice benefit!!


I have absolutely no problem with Rosetta developers passing on performance changes.

When I proposed "rewrite the code from scratch" it was a provocation. I know that the Rosetta@home code is "bound" with other projects like the Robetta server, Foldit and others, so it's difficult to make big changes...
ID: 5954
sgaboinc

Joined: 8 Jul 14
Posts: 20
Credit: 4,159
RAC: 0
Message 6024 - Posted: 29 Jan 2016, 13:27:18 UTC - in response to Message 5953.  
Last modified: 29 Jan 2016, 14:00:00 UTC



IMO, they would be best served by working on their infrastructure to partition the crunchers by machine capability. They currently know if a machine is Windows, Linux or MAC OS and send different applications to those groups. Rosetta also knows the population of machines in each of those groups that support SSEx, AVXy, FMA, ... They currently only send a 32-bit generic application to all Windows machines. I think the first experiment would be to detect 64-bit Windows machines that support SSEx or AVXy (whatever makes sense) and send a tuned application. That would allow them to build out the harness to support that.



good point ! :)

I took a look at an 'inside' URL, http://srv2.bakerlab.org/rosetta/download/
and indeed found that
#file minirosetta_3.71_windows_x86_64.exe
minirosetta_3.71_windows_x86_64.exe: PE32 executable (console) Intel 80386, for MS Windows

I'd think that even today, with the existing setup, the servers could distribute 64-bit binaries for Windows. Distributing a 64-bit binary for the 64-bit Windows platform may indeed give perhaps a 10-15% improvement per core, and if there are 4 cores that may well be an 'extra' 40-60% of a single core's performance. There is no need to 'care' about AVXn/SSEn just yet (compilers may attempt to vectorize where optimization is selected), and 64-bit apps would also enable runs that need more than 4 GB of memory.

The 32- vs 64-bit performance gains can be 'quantified', for example:
http://www.roylongbottom.org.uk/linpack%20results.htm#anchorWin64
Of course one of the caveats is that this is the Linpack benchmark ('easily' vectorizable) and that it uses SSE2.
ID: 6024
sgaboinc

Joined: 8 Jul 14
Posts: 20
Credit: 4,159
RAC: 0
Message 6025 - Posted: 29 Jan 2016, 15:18:20 UTC
Last modified: 29 Jan 2016, 15:39:36 UTC

some interesting breakdown based on os:
http://boincstats.com/en/stats/14/host/breakdown/os/
date: 2016-01-29
rank    os                                      num os       total credit     av credit   cred per cpu  av credit/cpu
1  Microsoft Windows XP Professional            573740   6,452,074,082.28    294,408.73      11,245.64       0.51
2  Linux                                        136915   5,484,967,387.64  3,587,575.69      40,061.11      26.20
3  Microsoft Windows 7 Pro x64 Edition           54892   3,611,702,243.18  4,275,172.64      65,796.51      77.88
4  Microsoft Windows 7 Ultimate x64 Edition      93509   2,647,882,818.31  2,440,252.87      28,316.88      26.10
5  Microsoft Windows 7 Home Premium x64 Edn      78359   2,532,615,926.53  1,824,411.40      32,320.68      23.28
6  Microsoft Windows 7 Enterprise x64 Edn        11310   1,321,734,043.18  1,081,802.64     116,864.19      95.65
7  Microsoft Windows 10 Prof x64 Edition          6054   1,080,762,908.70  2,668,200.91     178,520.47     440.73
8  Microsoft Windows XP Home                    100297     883,771,779.68     30,043.58       8,811.55       0.30
9  Microsoft Windows 8.1 Prof x64 Edn            19536     716,854,732.39  1,679,819.20      36,694.04      85.99
10 Microsoft Windows 8.1 Core x64 Edn            27398     564,739,102.16  1,632,035.78      20,612.42      59.57


Most 'glaring' is the average credit per CPU:
the performance difference between
the older OSes (mostly 32-bit, and very likely older CPUs)
vs
the newer OSes (mostly 64-bit, and very likely recent CPUs)

is more than 40 times for the average case, i.e. 20 : 0.5,
and more than 1400 times for the extreme case, i.e. 440 : 0.3.

The other detail is Linux, which is a mixed bag of old and new CPUs but benefits from having a true 64-bit R@h binary; it is fast chasing the top spot and I'd guess it will soon take the #1 position.

And that 'benchmark' is none other than Rosetta@home :o lol
Conclusion? To get good performance running R@h, run 64-bit Linux and get a fast, modern CPU, e.g. the Skylakes for now (and lots of RAM to keep everything in memory) :D
ID: 6025
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 905
Credit: 1,892,541
RAC: 294
Message 6026 - Posted: 29 Jan 2016, 17:21:55 UTC - in response to Message 6024.  
Last modified: 29 Jan 2016, 17:22:09 UTC

I took a look at an 'inside' URL, http://srv2.bakerlab.org/rosetta/download/
and indeed found that
#file minirosetta_3.71_windows_x86_64.exe
minirosetta_3.71_windows_x86_64.exe: PE32 executable (console) Intel 80386, for MS Windows

I'd think that even today, with the existing setup, the servers could distribute 64-bit binaries for Windows.


On all my 64-bit PCs I have
26/01/2016 17:09 44.981.760 minirosetta_3.71_windows_x86_64.exe
26/01/2016 17:09 197.960.494 minirosetta_database_f513f38.zip
26/01/2016 17:07 18.851.840 minirosetta_graphics_3.71_windows_x86_64.exe

So I think I'm crunching the 64-bit app.
ID: 6026
Profile [VENETO] boboviz

Joined: 9 Apr 08
Posts: 905
Credit: 1,892,541
RAC: 294
Message 6027 - Posted: 29 Jan 2016, 17:36:07 UTC - in response to Message 6025.  

some interesting breakdown based on os:
rank    os                                      num os       total credit     av credit   cred per cpu  av credit/cpu
1  Microsoft Windows XP Professional            573740   6,452,074,082.28    294,408.73      11,245.64       0.51
2  Linux                                        136915   5,484,967,387.64  3,587,575.69      40,061.11      26.20
3  Microsoft Windows 7 Pro x64 Edition           54892   3,611,702,243.18  4,275,172.64      65,796.51      77.88
4  Microsoft Windows 7 Ultimate x64 Edition      93509   2,647,882,818.31  2,440,252.87      28,316.88      26.10
5  Microsoft Windows 7 Home Premium x64 Edn      78359   2,532,615,926.53  1,824,411.40      32,320.68      23.28
6  Microsoft Windows 7 Enterprise x64 Edn        11310   1,321,734,043.18  1,081,802.64     116,864.19      95.65
7  Microsoft Windows 10 Prof x64 Edition          6054   1,080,762,908.70  2,668,200.91     178,520.47     440.73
8  Microsoft Windows XP Home                    100297     883,771,779.68     30,043.58       8,811.55       0.30
9  Microsoft Windows 8.1 Prof x64 Edn            19536     716,854,732.39  1,679,819.20      36,694.04      85.99
10 Microsoft Windows 8.1 Core x64 Edn            27398     564,739,102.16  1,632,035.78      20,612.42      59.57


Most 'glaring' is the average credit per CPU:
the performance difference between
the older OSes (mostly 32-bit, and very likely older CPUs)
vs
the newer OSes (mostly 64-bit, and very likely recent CPUs)

is more than 40 times for the average case, i.e. 20 : 0.5,
and more than 1400 times for the extreme case, i.e. 440 : 0.3.


:-O
It's time to abandon 32 bit development
ID: 6027
sgaboinc

Joined: 8 Jul 14
Posts: 20
Credit: 4,159
RAC: 0
Message 6028 - Posted: 29 Jan 2016, 17:37:57 UTC - in response to Message 6026.  



On all my 64-bit PCs I have
26/01/2016 17:09 44.981.760 minirosetta_3.71_windows_x86_64.exe
26/01/2016 17:09 197.960.494 minirosetta_database_f513f38.zip
26/01/2016 17:07 18.851.840 minirosetta_graphics_3.71_windows_x86_64.exe

So I think I'm crunching the 64-bit app.


http://srv2.bakerlab.org/rosetta/download/
minirosetta_graphics_3.71_i686-pc-linux-gnu 20-Jan-2016 15:26 44M
minirosetta_graphics_3.71_windows_intelx86.exe 20-Jan-2016 15:26 18M
minirosetta_graphics_3.71_windows_x86_64.exe 20-Jan-2016 15:26 18M
minirosetta_graphics_3.71_x86_64-pc-linux-gnu 20-Jan-2016 15:26 36M

#file minirosetta_graphics_3.71_windows_*
minirosetta_graphics_3.71_windows_intelx86.exe: PE32 executable (GUI) Intel 80386, for MS Windows
minirosetta_graphics_3.71_windows_x86_64.exe: PE32 executable (GUI) Intel 80386, for MS Windows

#md5sum minirosetta_graphics_3.71_windows_*
0aa3534b9311df4e87abec5ce131c37c minirosetta_graphics_3.71_windows_intelx86.exe
0aa3534b9311df4e87abec5ce131c37c minirosetta_graphics_3.71_windows_x86_64.exe

The commands above were run on Linux, but I'd guess you may have figured out what this means :D (the Windows "x86_64" graphics binary is byte-for-byte identical to the 32-bit one).

The only good thing is that the Windows binaries are half the size of the Linux ones.
ID: 6028
sgaboinc

Joined: 8 Jul 14
Posts: 20
Credit: 4,159
RAC: 0
Message 6029 - Posted: 29 Jan 2016, 18:04:21 UTC
Last modified: 29 Jan 2016, 18:33:08 UTC

On this topic, it may also be good to mention that modern compilers are sophisticated. Even recent versions of the open-source compilers such as gcc and llvm have pretty advanced *auto-vectorization* features:

https://gcc.gnu.org/projects/tree-ssa/vectorization.html
http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf

While that may not produce the most tuned code, it is probably an incorrect notion that R@h has no SSEn/AVXn optimizations at all; the compiler may have embedded some of them.
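
One rough way to check that (a hypothetical spot check, nothing the project publishes) is to count scalar vs. packed SSE arithmetic in the shipped Linux binary with objdump; the binary name below is illustrative:

# hypothetical spot check; binary name illustrative
objdump -d minirosetta_3.71_x86_64-pc-linux-gnu > asm.txt
grep -cE 'adds[sd]|muls[sd]' asm.txt    # scalar SSE: one element per instruction
grep -cE 'addp[sd]|mulp[sd]' asm.txt    # packed SSE: the auto-vectorized part

A large packed count would mean the auto-vectorizer is already doing some of the work.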

This may partly explain the higher performance of R@h on 64-bit Linux vs., say, 64-bit Windows in the statistics: optimized 64-bit binaries running on 64-bit Linux would most likely perform (possibly significantly) better than 32-bit, possibly less optimized, binaries running on 64-bit Windows.

i.e. the Windows platform may see (significant) performance gains just from compiling and releasing 64-bit binaries targeting 64-bit Windows with a modern, sophisticated compiler.
ID: 6029
sgaboinc

Joined: 8 Jul 14
Posts: 20
Credit: 4,159
RAC: 0
Message 6030 - Posted: 29 Jan 2016, 18:23:36 UTC - in response to Message 6027.  
Last modified: 29 Jan 2016, 18:24:58 UTC


Most 'glaring' is the average credit per CPU:
the performance difference between
the older OSes (mostly 32-bit, and very likely older CPUs)
vs
the newer OSes (mostly 64-bit, and very likely recent CPUs)

is more than 40 times for the average case, i.e. 20 : 0.5,
and more than 1400 times for the extreme case, i.e. 440 : 0.3.


:-O
It's time to abandon 32 bit development


Note that the real difference comes from the *CPU*: those 32-bit OSes (e.g. Windows XP) most likely run on old CPUs. Think of it as an extreme case of an 80386 (don't even bother with MMX) vs today's top-of-the-line Skylake CPUs (with their 64-bit OS); it may well be more than a million times of difference in GFLOPS :o :p lol
ID: 6030