benchmark Archives - Microway
https://www.microway.com/tag/benchmark/

2nd Gen AMD EPYC “Rome” CPU Review: A Groundbreaking Leap for HPC
https://www.microway.com/hpc-tech-tips/amd-epyc-rome-cpu-review/ (published Wed, 07 Aug 2019)


The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.

2nd Generation AMD EPYC “Rome” CPUs are offered with 8 to 64 cores and clock speeds from 2.2 to 3.2GHz. They are available in dual-socket SKUs as well as a select number of single-socket-only SKUs.

Important changes in AMD EPYC “Rome” CPUs include:

  • Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
  • PCI-E Gen 4 support, a first for an x86 server CPU, delivering 2X the I/O bandwidth of the x86 competition
  • 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
  • DDR4-3200 support for improved memory bandwidth across 8 channels, reaching approximately 205GB/sec of theoretical peak bandwidth per socket
  • Next Generation Infinity Fabric with higher bandwidth for intra and inter-die connection, with roots in PCI-E Gen4
  • New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node

Leadership HPC Performance

There’s no other way to say it: the 2nd Generation AMD EPYC “Rome” CPUs (EPYC 7xx2) break new ground for HPC performance. In our experience, we haven’t seen this kind of generational CPU performance advancement in many years, short of exotic architectural changes. This leap applies across both floating point and integer applications.

Note: This article focuses on SPEC benchmark performance (which is rooted in real integer and floating point applications). If you’re hunting for a more raw FLOPS/dollar calculation, please visit our Knowledge Center Article on AMD EPYC 7xx2 “Rome” CPUs.

Floating Point Benchmark Performance

In short: at the top bin, you may see up to 2.12X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_fp_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.79X the performance.
AMD Rome SPECfp 2017 vs Xeon CPUs - Top Bin

Integer Benchmark Performance

Integer performance largely mirrors the same story. At the top bin, you may see up to 2.49X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_int_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.90X the performance.
AMD Rome SPECint 2017 vs Xeon CPUs - Top Bin

What Makes EPYC 7xx2 Series Perform Strongly?

Contributions towards this leap in performance come from a combination of:

  • 2X the FLOPS per core available in the new architecture
  • Improved performance of Zen2 microarchitecture
  • Moderate increases in clock speeds
  • Most importantly, dramatic increases in core count

These last 2 items are facilitated by the new 7nm process node and the chiplet architecture of EPYC. Couple that with the advantages in memory bandwidth, and you have a recipe for HPC performance.

Performance Outlook


The dramatic increase in core count coupled with Zen2 means we predict that most of the 32-core models and above, about half of AMD's SKU stack, are likely to outperform the top Xeon Platinum 8200 series SKU. Stay tuned for the SPEC benchmarks that confirm this assertion.

If you’re comparing against more modest Xeon Gold 62xx or Silver 52xx/42xx SKUs, we predict an even more dramatic performance uplift. This is the first time in many years we’ve seen such an incredibly competitive product from the AMD Server Group.

Class Leading Price/Performance

AMD EPYC 7xx2 series isn’t just impressive from an absolute performance perspective. It’s also a price performance machine.

Examine these same two top-bin SKUs once again:
AMD Rome SPECfp 2017 vs Xeon CPUs - Price Performance

The top-bin AMD SKU does 1.79X the floating point work at approximately 2/3 the price of the Xeon Platinum 8280. It delivers 2.13X the floating point performance of the Xeon Gold 6252 at roughly similar price/performance.

Should you be willing to accept more modest core counts with the lower cost SKUs, these comparisons only get better.

Finally, if you’re looking to roughly match or exceed the performance of the top-bin Xeon Gold 6252 SKU, we predict you’ll be able to do so with the 24-core EPYC 7352. This will be at just over 1/3 the price of the Xeon socket.

This much more typical comparison is emblematic of the price-performance advantage AMD has delivered in the new generation of CPUs. Stay tuned for more benchmark results and charts to support the prediction.

A Few Caveats: Performance Tuning & Out of the Box

Application Performance Engineers have spent years optimizing applications for the most widely available x86 server CPU. For a number of years now, that has meant Intel’s Xeon processors. The benchmarks presented here represent performance-tuned results.

We don’t yet have great data on how easy it is to achieve optimized performance with these new AMD “Rome” CPUs. For those of us who have been in HPC for some time, we know out-of-the-box performance and optimized performance can mean very different things.

AMD does recommend specific compilers (AOCC, GCC, LLVM) and libraries (BLIS over BLAS and FLAME over LAPACK) to achieve optimized results with all EPYC CPUs. We don’t yet have a complete understanding of how much these help end users achieve these superior results. Does it require a lot of tuning to reach the most exceptional performance?

AMD has, however, released a new Compiler Options Quick Reference Guide for the new CPUs. We strongly recommend using these flags and options when tuning your application.
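
As a rough illustration of the kind of options involved (these are generic Zen2 flags rather than a substitute for AMD's guide; -march=znver2 requires GCC 9 or newer, or a recent AOCC/Clang, and the source file name is a placeholder):

# GCC 9 or newer
gcc -O3 -march=znver2 -mtune=znver2 -fopenmp -o mybench mybench.c
# AOCC (AMD's Clang/LLVM-based compiler) accepts the same target flag
clang -O3 -march=znver2 -fopenmp -o mybench mybench.c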

Chiplet and Multi-Die Architecture: IO and Compute Dies

AMD EPYC Rome Die

One of the chief innovations in the 2nd Generation AMD EPYC CPUs is in the evolution of the multi-die architecture pioneered in the first EPYC CPUs.

Rather than create one monolithic, hard-to-yield die, AMD has opted to lash together “chiplets” in a single socket with its Infinity Fabric technology.

Compute Dies (now in 7nm)

8 compute chiplets (formally, Core Complex Dies or CCDs) are brought together to create a single socket. These CCDs take advantage of the latest 7nm TSMC process node. By using 7nm for the compute cores in 2nd Generation EPYC, AMD takes advantage of the space and power efficiencies of the latest process—without the yield issues of a single monolithic die.

What does it mean for you? More cores than anticipated in a single socket, a reasonable power efficiency for the core count, and a less costly CPU.

The 14nm IO Die

In 2nd Generation EPYC CPUs, AMD has gone a step further with the chiplet architecture. The chiplets are now complemented by a separate I/O die, which contains the memory controllers, PCI-Express controllers, and the Infinity Fabric connection to the remote socket. This design also resolves the NUMA affinity quirks of the 1st generation EPYC processors.

Moreover, the I/O die is created in the established 14nm node process. It’s less important that it utilize the same 7nm power efficiencies.

DDR4-3200 and Improved Memory Bandwidth

AMD EPYC 7xx2 series improves its theoretical memory bandwidth when compared to both its predecessor and the competition.

DDR4-3200 DIMMs are supported, and they are clocked 20% faster than DDR4-2666 and 9% faster than DDR4-2933.
In summary, the platform offers:

  • Compared to Cascade Lake-SP (Xeon Platinum/Gold 82xx, 62xx): Up to a 45% improvement in memory bandwidth
  • Compared to Skylake-SP (Xeon Platinum/Gold 81xx, 61xx): Up to a 60% improvement in memory bandwidth
  • Compared to AMD EPYC 7xx1 Series (Naples): Up to a 20% improvement in memory bandwidth



These comparisons are created for a system where only the first DIMM per channel is populated. Part of this memory bandwidth advantage is derived from the increase in DIMM speeds (DDR4-3200 vs 2933/2666); part of it is derived from EPYC’s 8 memory channels (vs 6 on Xeon Skylake/Cascade Lake-SP).
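
As a quick sanity check on those theoretical figures:

  EPYC 7xx2 (Rome):   8 channels x 8 bytes x 3200 MT/s = 204.8 GB/sec per socket
  Cascade Lake-SP:    6 channels x 8 bytes x 2933 MT/s = 140.8 GB/sec per socket
  204.8 / 140.8 = 1.45, i.e., the roughly 45% advantage quoted above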

While we’ve yet to see final STREAM testing numbers for the new CPUs, we do anticipate them largely reflecting the changes in theoretical memory bandwidth.

PCI-E Gen4 Support: 2X the I/O bandwidth

EPYC “Rome” CPUs have an integrated PCI-E generation 4.0 controller on the I/O die. Each PCI-E lane doubles in maximum theoretical bandwidth to 4GB/sec (bidirectional).

A 16 lane connection (PCI-E x16 4.0 slot) can now deliver up to 64GB/sec of bidirectional bandwidth (32GB/uni). That’s 2X the bandwidth compared to first generation EPYC and the x86 competition.
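
The math behind that figure: PCI-E Gen4 signals at 16 GT/s per lane and uses 128b/130b encoding, which works out to roughly 1.97 GB/sec per lane in each direction. Sixteen lanes therefore provide about 31.5 GB/sec each way, or roughly 63 GB/sec bidirectional, in line with the 64GB/sec figure above.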

Broadening Support for High Bandwidth I/O Devices

Mellanox ConnectX-6 Adapter
The new support allows for higher bandwidth connection to InfiniBand and other fabric adapters, storage adapters, NVMe SSDs, and in the future GPU Accelerators and FPGAs.

Some of these devices, like Mellanox ConnectX-6 200Gb HDR InfiniBand adapters, were unable to realize their maximum bandwidth in a PCI-E Gen3 x16 slot. Their performance should improve in PCI-E Gen4 x16 slot with 2nd Generation AMD EPYC Processors.

2nd Generation AMD EPYC “Rome” is the only x86 server CPU with PCI-E Gen4 support at its launch in 3Q 2019. However, we have seen PCI-E Gen4 support before in the POWER9 platform.

System Support for PCI-E Gen4

Unlike in the previous generation AMD EPYC “Naples” CPUs, there is no strong affinity of PCI-E lanes to a particular chiplet inside the processor. In Rome, all I/O traffic routes through the I/O die and all chiplets reach PCI-E devices through this die.

In order to support PCI-E Gen4, server and motherboard manufacturers are producing brand new versions of their platforms. Not every Rome-ready platform supports Gen4, so if this is a requirement be sure to specify this to your hardware vendor. Our team can help you select a server with full Gen4 capability.
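
Once a platform is in hand, you can confirm that a device actually trained at Gen4 speed by checking its link status with lspci (the 41:00.0 device address below is just a placeholder for your adapter's address):

sudo lspci -s 41:00.0 -vv | grep -E "LnkCap|LnkSta"
# A "Speed 16GT/s" entry under LnkSta indicates a PCI-E Gen4 link; 8GT/s means Gen3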

Infinity Fabric

AMD Infinity Fabric Diagram

Deeply interrelated with PCI-Express Gen4, AMD has also improved the Infinity Fabric link between chiplets and sockets with the new generation of EPYC CPUs.

AMD’s Infinity Fabric has many commonalities with PCI-Express used to connect I/O devices. With 2nd Generation AMD EPYC “Rome” CPUs, the link speed of Infinity Fabric has doubled. This allows for higher bandwidth communication between dies on the same socket and to dies on remote sockets.

The result should be improved application performance for NUMA-aware and, especially, non-NUMA-aware applications. The increased bandwidth should also help hide any transport bandwidth issues to I/O devices on a remote socket. The overall result is “smoother” performance when applications scale across multiple chiplets and sockets.

SKUs and Strategies to Consider for HPC Clusters

Here is the complete list of SKUs and 1KU (1000-unit) prices (Source: AMD). Please note that these are prices for CPUs sold to channel integrators, not for fully integrated systems built with these CPUs.

Dual Socket SKUs

SKU  | Cores | Base Clock (GHz) | Boost Clock (GHz) | L3 Cache | TDP  | Price
7742 | 64    | 2.25             | 3.4               | 256MB    | 225W | $6950
7702 | 64    | 2.0              | 3.35              | 256MB    | 200W | $6450
7642 | 48    | 2.3              | 3.3               | 256MB    | 225W | $4775
7552 | 48    | 2.2              | 3.3               | 192MB    | 200W | $4025
7542 | 32    | 2.9              | 3.4               | 128MB    | 225W | $3400
7502 | 32    | 2.5              | 3.35              | 128MB    | 180W | $2600
7452 | 32    | 2.35             | 3.35              | 128MB    | 155W | $2025
7402 | 24    | 2.8              | 3.35              | 128MB    | 180W | $1783
7352 | 24    | 2.3              | 3.2               | 128MB    | 155W | $1350
7302 | 16    | 3.0              | 3.3               | 128MB    | 155W | $978
7282 | 16    | 2.8              | 3.2               | 64MB     | 120W | $650
7272 | 12    | 2.9              | 3.2               | 64MB     | 120W | $625
7262 | 8     | 3.2              | 3.4               | 128MB    | 155W | $575
7252 | 8     | 3.2              | 3.4               | 64MB     | 120W | $475

EPYC 7742 or 7702 (64c): Select a High-End SKU, yield up to 2X the performance

Assuming your application scales with core count, and maximum performance at a premium cost fits your budget, you can’t beat the top 64-core EPYC 7742 or 7702 SKUs. These will deliver greater throughput on a wide variety of multi-threaded applications.

Anything above EPYC 7452 (32c, 48c): Select a Mid-High Level SKU, reach new performance heights

While these SKUs aren’t inexpensive, they take application performance to new heights and break new benchmark ground. If your application is multi-threaded, you can take advantage of that performance. From a price/performance perspective, these SKUs may also be attractive.

EPYC 7452 (32c): Select a Mid Level SKU, improve price performance vs previous generation EPYC

Previous generation AMD EPYC 7xx1 Series CPUs also featured 32 cores. However, the 32 core entrant in the new 7xx2 stack is far less costly than the prior generation while delivering greater memory bandwidth and 2X the FLOPS per core.

EPYC 7452 (32c): Select a Mid Level SKU, match top Xeon Gold and Platinum with far better price/performance

If you’re optimizing for price/performance compared to the top Intel Xeon Platinum 8200 or Xeon Gold 6200 series SKUs, consider this SKU or ones near it. We predict this to be at or near the price/performance sweet-spot for the new platform.

EPYC 7402 (24c): Select a Mid Level SKU, come close to top Xeon Gold and Platinum SKUs

The higher clock speed of this SKU also means it is well suited to some applications.

EPYC 7272-7402 (12, 16, 24c): Select an affordable SKU, yield better performance and price performance

Treat these SKUs as much more affordable alternatives to most Xeon Gold or Silver CPUs. We’ll await further benchmarks to see exactly where the further sweet-spots are compared to these SKUs. They also compare favorably from a price/performance standpoint to 1st Generation EPYC 7xx1 processors with 12, 16, or 24 cores. Same performance, fewer dollars!

Single Socket Performance

As with the previous generation, AMD is heavily promoting the concept of replacing dual-socket Intel Xeon servers with single sockets of 2nd Generation AMD EPYC “Rome.” They are producing discounted “P” SKUs, with support for single-socket platforms only, to further boost the price-performance advantage of these systems.

Single Socket SKUs

SKU   | Cores | Base Clock (GHz) | Boost Clock (GHz) | L3 Cache | TDP  | Price
7702P | 64    | 2.0              | 3.35              | 256MB    | 200W | $4425
7502P | 32    | 2.5              | 3.35              | 128MB    | 180W | $2300
7402P | 24    | 2.8              | 3.35              | 128MB    | 180W | $1250
7302P | 16    | 3.0              | 3.3               | 128MB    | 155W | $825
7232P | 8     | 3.1              | 3.2               | 32MB     | 120W | $450

Due to the boosted capability of the new CPUs, a single-socket configuration may be an increasingly viable alternative to a dual-socket Xeon platform for many workloads.

Next Steps: get started today!

Read More

If you’d like to read more speeds and feeds about these new processors, check out our article with detailed specifications of the 2nd Gen AMD EPYC “Rome” CPUs. We summarize and compare the specifications of each model, and provide guidance over and beyond what you’ve seen here.

Try 2nd Gen AMD EPYC CPUs for Yourself

Groups that prefer to verify performance before finalizing a design are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with AMD EPYC CPUs, large memory, and more.

Browse Our Navion AMD EPYC Product Line

WhisperStation

Ultra-Quiet AMD EPYC workstations

Learn More

Servers

High performance AMD EPYC rackmount servers

Learn More

Clusters

Leadership performance clusters from 5-500 nodes

Learn More

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks
https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/ (published Fri, 15 Mar 2019)

Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.
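
The individual benchmarks can be run as standalone binaries, or the whole suite can be driven in one pass through the bundled driver script. A rough sketch of the latter, with the script name and flags taken from the SHOC documentation (verify against your checkout, and build with CUDA support first):

cd shoc/tools
perl driver.pl -cuda -s 4    # -s selects the problem-size class (1 through 4)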

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

Benchmark                        | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)               | 253.38   | 7072.86    | 4736.76
Fast Fourier Transform (GFLOPS)  | 132.60   | 1148.75    | 756.29
Matrix Multiplication (GFLOPS)   | 249.57   | 5920.01    | 4256.08
Molecular Dynamics (GFLOPS)      | 105.26   | 908.62     | 402.96
S3D (GFLOPS)                     | 59.97    | 227.85     | 161.54

 

Single Precision Results

Benchmark                        | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)               | 8073.26  | 14016.50   | 9322.46
Fast Fourier Transform (GFLOPS)  | 660.05   | 2301.32    | 1510.49
Matrix Multiplication (GFLOPS)   | 3290.94  | 13480.40   | 8793.33
Molecular Dynamics (GFLOPS)      | 572.91   | 997.61     | 480.02
S3D (GFLOPS)                     | 99.42    | 434.78     | 295.20

 

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.
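
Using the single-precision numbers above to make that "one fourth" comparison concrete:

  FFT throughput:        660.05 / 2301.32 GFLOPS, roughly 0.29 of the V100
  Matrix multiplication: 3290.94 / 13480.40 GFLOPS, roughly 0.24 of the V100
  List price:            ~$2.5k versus ~$10k, roughly 0.25 of the V100
  Board power:           70W versus 250~300W, roughly one quarter of the V100

Per dollar and per watt, the T4's throughput on these benchmarks lands about even with the larger GPU, which is the point made above.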

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 offers substantial single- and mixed-precision performance aimed at machine learning, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. That single-precision performance strongly caters to machine learning algorithms, with the potential to be applied to mixed precision as well. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithm adaptation, the T4 may also be a strong contender for scientific applications that want to use machine learning capabilities to analyze results, or that run a mix of machine learning and scientific computing algorithms on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

Next Steps for leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload, but you’d like to benchmark it first, please sign up for a Test Drive. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

DDR4 Memory on Xeon E5-2600v3 with 3 DIMMs per channel
https://www.microway.com/hpc-tech-tips/3-dimms-per-channel-on-xeon-e5-2600v3/ (published Thu, 24 Mar 2016)

This week I had the opportunity to run the STREAM memory benchmark on a Microway 2U NumberSmasher server which supports up to 3 DIMMs per channel. In practice, this system is typically configured with 768GB or 1.5TB of DDR4 memory. A key goal of this benchmarking was to examine how RAM quantity and clock frequency affect bandwidth performance. When fully loading all three DIMMs per channel, the memory frequency defaults to 1600MHz. At two DIMMs per channel, the default memory frequency increases to 1866MHz. With one DIMM per channel, the frequency maxes out at 2133MHz.

Photo of the Supermicro X10DRU-i motherboard

The Test System

System: NumberSmasher 2U Server based on SYS-6028U-TR4+
Motherboard: X10DRU-i+
Processors x 2: Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
DIMMs: 32GB DDR4-2133 ECC/Registered Samsung M393A4K40BB0-CPB0Q
Operating System: CentOS Linux release 7.2.1511 (Core)
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Compiler: Intel Parallel Studio XE 2016

Close-up photo of the Supermicro SYS-6028U-TR4 2U server supporting 3 DIMMs per channel

Benchmark Compilation and Execution

When compiling STREAM with the Intel compiler, I used the following compiler knobs in the makefile:

CC = icc
CFLAGS = -O3 -xHost -openmp -DSTREAM_ARRAY_SIZE=64000000 -opt-streaming-cache-evict=0 -opt-streaming-stores always -opt-prefetch-distance=64,8

Information on compiling STREAM can be found from an Intel Developer Zone article on STREAM Triad Optimization.  Also, reading through the STREAM FAQ at the University of Virginia site can be helpful.

I set the KMP_AFFINITY and OMP_NUM_THREADS environment variables before running STREAM:

export KMP_AFFINITY=granularity=core,compact
export OMP_NUM_THREADS=8
./stream_intel

On a system that has hyper-threading turned on, I could have used the GOMP_CPU_AFFINITY environment variable to focus on real cores, but I elected to turn off hyper-threading in the BIOS instead.
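
Before a run, it is easy to confirm that the node really is presenting one thread per core:

lscpu | grep -E "^Thread|^Core|^Socket"
# "Thread(s) per core: 1" confirms hyper-threading is disabled (a value of 2 means it is on)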

STREAM Performance Results

Results with 3 DIMMs per Channel – 768GB RAM @ 1600MHz

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 73,876.7       | 0.013882 | 0.013861 | 0.013905
Scale | 73,430.8       | 0.013967 | 0.013945 | 0.013989
Add   | 70,320.2       | 0.021891 | 0.021843 | 0.022147
Triad | 70,555.8       | 0.021859 | 0.021770 | 0.022379

Results with 2 DIMMs per Channel – 512GB RAM @ 1866MHz

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 88,413.8       | 0.011661 | 0.011582 | 0.011900
Scale | 87,867.6       | 0.011765 | 0.011654 | 0.012166
Add   | 90,289.8       | 0.017417 | 0.017012 | 0.018789
Triad | 89,756.5       | 0.017596 | 0.017113 | 0.018941

Results with 1 DIMM per Channel – 256GB RAM @ 2133MHz

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 89,242.5       | 0.011479 | 0.011468 | 0.011495
Scale | 87,724.0       | 0.011699 | 0.011673 | 0.011757
Add   | 90,363.3       | 0.017031 | 0.016998 | 0.017057
Triad | 90,411.5       | 0.017006 | 0.016989 | 0.017027

Plot of STREAM Triad memory performance for Intel Xeon E5-2637v3 CPUs with DDR4 Memory
Graph of STREAM Triad performance for 768GB, 512GB and 256GB memory

Summary of Results

Notice in the chart how much performance improves when moving from 3 DIMMs per channel (768GB at 1600MHz) to 2 DIMMs per channel (512GB at 1866MHz). Also notice that going from 2 DIMMs per channel to 1 DIMM per channel (256GB at 2133MHz) changes very little.

This is significant when deciding how much RAM to spec on a new system, or how much to add when upgrading. Outfitting a server with eight or sixteen DIMMs results in excellent performance. Outfitting a server with twenty-four DIMMs provides exceptional memory capacity, but results in reduced performance. Thus, there is a trade-off between memory capacity and memory performance.

Realize too that using the E5-2637 v3 processors – with only 4 real cores each – reduces the STREAM performance results.  Had I used something like the E5-2690 v3 processors – with 12 real cores each – the peak STREAM throughput results would be roughly 110GB/sec.

Results with 2 DIMMs per Channel – 512GB RAM @ 2133MHz (Forced in BIOS)

The best performance over all for the day (though not graphed above) came from forcing the 512GB configuration to 2133MHz in BIOS:

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 89,510.2       | 0.011477 | 0.011440 | 0.011605
Scale | 88,981.7       | 0.011523 | 0.011508 | 0.011539
Add   | 92,473.6       | 0.016640 | 0.016610 | 0.016665
Triad | 92,403.3       | 0.016674 | 0.016623 | 0.016710

Be careful though – a configuration like this needs to be heavily tested to ensure stability.  Call us at Microway if you are not sure or have questions about memory configuration on your next server.
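
As one example of that kind of burn-in, memtester can lock and repeatedly pattern-test a block of RAM (the size and pass count below are placeholders; size it to fit within the node's free memory):

sudo memtester 400G 3    # test 400GB of RAM for 3 passes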

Photo of the Supermicro SYS-6028U-TR4 2U server

Caffe Deep Learning Tutorial using NVIDIA DIGITS on Tesla K80 & K40 GPUs
https://www.microway.com/hpc-tech-tips/caffe-deep-learning-using-nvidia-digits-tesla-gpus/ (published Thu, 17 Sep 2015)

NVIDIA DIGITS Deep Learning Tutorial

In this Caffe deep learning tutorial, we will show how to use DIGITS in order to train a classifier on a small image set.  Along the way, we’ll see how to adjust certain run-time parameters, such as the learning rate, number of training epochs, and others, in order to tweak and optimize the network’s performance.  Other DIGITS features will be introduced, such as starting a training run using the network weights derived from a previous training run, and using a completed classifier from the command line.

Caffe Deep Learning Framework

The Caffe Deep Learning framework has gained great popularity. It originated in the Berkeley Vision and Learning Center (BVLC) and has since attracted a number of community contributors.

NVIDIA maintains their own branch of Caffe – the latest version (0.13 at the time of writing) can be downloaded from NVIDIA’s github.
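
If you would rather build from source than use a packaged installer, the code can be pulled directly from that repository (CUDA, cuDNN, and the other standard Caffe prerequisites are assumed to be installed already):

git clone https://github.com/NVIDIA/caffe.git nvidia-caffe
cd nvidia-caffe
# then follow the Makefile.config or CMake build instructions in the repository's README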

NVIDIA DIGITS & Caffe Deep Learning GPU Training System (DIGITS)

NVIDIA DIGITS is a production quality, artificial neural network image classifier available for free from NVIDIA. DIGITS provides an easy-to-use web interface for training and testing your classifiers, while using the underlying Caffe Deep Learning framework.

The latest version of NVIDIA DIGITS (2.1 at the time of writing) can be downloaded here.

NVIDIA DIGITS Deep Learning Tutorial
neural network distinguishes Land Rover from Jeep Cherokee

Hardware for NVIDIA DIGITS and Caffe Deep Learning Neural Networks

The hardware we will be using are two Tesla K80 GPU cards, on a single compute node, as well as a set of two Tesla K40 GPUs on a separate compute node. Each Tesla K80 card contains two Kepler GK210 chips, 24 GB of total shared GDDR5 memory, and 2,496 CUDA cores on each chip, for a total of 4,992 CUDA cores. The Tesla K40 cards, by comparison, each contain one GK110B chip, 12 GB of GDDR5 memory, and 2,880 CUDA cores.

Since the data associated with a trained neural network classifier is relatively small, a classifier could easily be deployed onto a mobile embedded system and run, for example, on an NVIDIA Tegra processor. In many cases, however, neural network image classifiers are run on GPU-accelerated servers at a fixed location.

Runtimes will be compared for various configurations of these Tesla GPUs (see gpu benchmarks below). The main objectives of this tutorial, however, can be achieved using other NVIDIA GPU accelerators, such as the NVIDIA GeForce GTX Titan X, or the NVIDIA Quadro line (K6000, for example).  Both of these GPUs are available in Microway’s Deep Learning WhisperStation™, a quiet, desktop-sized GPU supercomputer pre-configured for extensive Deep Learning computation.  The NVIDIA GPU hardware on Microway’s HPC cluster is available for “Test Driving”. Readers are encouraged to request a GPU Test Drive.

Introduction to Deep Learning with DIGITS

To begin, let’s examine the creation of a small image dataset.  The images were downloaded using a simple in-house bash shell script.  Images were chosen to consist of two categories: one of recent Land Rover SUV models and the other of recent Jeep Cherokee models – both comprised mostly of the 2014 or 2015 vehicles.
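
The script itself isn't reproduced here, but the idea is straightforward; a minimal hypothetical sketch, assuming one hand-curated text file of image URLs per category (urls-landrover.txt and urls-cherokee.txt are placeholder names):

#!/bin/bash
# Download each category's images into its own directory, ready for DIGITS
mkdir -p dataset/land_rover dataset/jeep_cherokee
wget --no-clobber -i urls-landrover.txt -P dataset/land_rover
wget --no-clobber -i urls-cherokee.txt -P dataset/jeep_cherokee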

The process of building a deep learning artificial neural network image classifier for these two types of SUVs using NVIDIA DIGITS is described in detail below in a video tutorial.  As a simple proof of concept, only these two SUV types were included in the data set.  A larger data set could be easily constructed including an arbitrary number of vehicle types.  Building a high quality data set is somewhat of an art, where consideration must be given to:

  • sizes of features in relation to convolution filter sizes
  • having somewhat uniform aspect ratios, so that potentially distinguishing features do not get distorted too differently from image to image during the squash transformation of DIGITS
  • ensuring that ample sampling of images taken from various angles are present in the data set (side view, front, back, close-ups, etc.) – this will train the network to recognize more facets of the objects to be classified

The laborious task of planning and creating a quality image data set is an investment into the final performance quality of the deep learning network, so care and attention at this stage will yield better performance during classifier testing and deployment.  The original SUV image data set was expanded by window sub-sampling the original set of images, and then by also applying horizontal, vertical, and combined flips of the sub-sampled, as well as of the original images.

Neural Network Image Classifier Performance Considerations

Beforehand, some performance-oriented questions we can pose are:

  • Can the classifier distinguish SUV type from front, back, side, and top viewpoints?
  • To what level of accuracy can the classifier distinguish image categories?
  • What sort of discernable, high-level object features will be learned by the network?

(We recommend viewing the NVIDIA DIGITS Deep Learning Tutorial video with 720p HD)

GPU Benchmarks for Caffe deep learning on Tesla K40 and K80

A GoogLeNet neural network model computation was benchmarked on the same learning parameters and dataset for the hardware configurations shown in the table below. All other aspects of hardware were the same across these configurations.

Hardware Configuration                  | Speedup Factor [1]
2 NVIDIA K80 GPU cards (4 GK210 chips)  | 2.55
2 NVIDIA K40 GPU cards (2 GK110B chips) | 1.56
1 NVIDIA K40 GPU card (1 GK110B chip)   | 1.00

[1] Compared against the runtime on a single Tesla K40 GPU

The runtimes in this table reflect 30 epochs of training the GoogLeNet model with a learning rate of 0.005. The batch size was set to 120, compared to the default of 24. This was done in order to use a greater percentage of GPU memory.

In this tutorial, we specified a local directory for DIGITS to construct the image set. If you instead provide text files for the training and validation images, you may want to ensure that the default setting of Shuffle lines is set to “Yes”. This is important if you downloaded your images sequentially, by category. If the lines from such files are not shuffled, then your validation set may not guide the training as well as it would if the image URLs are random in order.

Although NVIDIA DIGITS already supports Caffe deep learning, it will soon support the Torch and Theano frameworks, so check back with Microway’s blog for more information on exciting new developments and tips on how you can quickly get started on using Deep Learning in your research.

Further Reading on NVIDIA DIGITS Deep Learning and Neural Networks

1. Srivastava, et al., Journal of Machine Learning Research, 15 (2014), 1929-1958
2. NVIDIA devblog: Easy Multi-GPU Deep Learning with DIGITS 2, https://devblogs.nvidia.com/parallelforall/easy-multi-gpu-deep-learning-digits-2/
3. Szegedy, et al., Going Deeper with Convolutions, 2014, https://arxiv.org/abs/1409.4842
4. Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, ILSVRC-2010
5. LeCun, et al., Proc. of the IEEE, Nov. 1998, pgs. 1-46
6. Fukushima, K., Biol. Cybernetics, 36, 1980, pgs. 193-202

DDR4 RDIMM and LRDIMM Performance Comparison
https://www.microway.com/hpc-tech-tips/ddr4-rdimm-lrdimm-performance-comparison/ (published Fri, 10 Jul 2015)

Recently, while carrying out memory testing in our integration lab, Lead Systems Integrator Rick Warner was able to clearly identify when it is appropriate to choose load-reduced DIMMs (LRDIMMs) and when it is appropriate to choose registered DIMMs (RDIMMs) for servers running large amounts of DDR4 RAM (i.e., 256 Gigabytes and greater). The critical factors to consider are latency, speed, and capacity, along with what your computing objectives are with respect to them.

Misconceptions on Load Reduced DIMM Performance

Load-reduced DIMMs were built so that high-speed memory controllers in CPUs could drive larger quantities of memory. Thus, it’s often assumed that LRDIMMs will offer the best performance for memory-dense servers. This impression is strengthened by the fact that Intel’s guide for DDR4 memory population shows LRDIMMs running at a higher frequency than RDIMMs (e.g., 2133MHz vs 1866MHz). However, as we’ll show below, there are greater factors at play.

RDIMM vs LRDIMM Performance Testing

Using the STREAM memory benchmark, Rick took a look at 1 DIMM and 2 DIMMs per channel configurations using DDR4 LRDIMMs and RDIMMs on a Supermicro X10DAi motherboard with two Intel Xeon E5-2687W v3 CPUs. Both our WhisperStation and WhisperStation for R are available in this configuration. We also have several Xeon Rackmount Servers which support this configuration.

For each case, the DIMM speed was forced to 2133MHz in the BIOS. Tests were run with both RDIMMs and LRDIMMs in 256GB and 512GB configurations.
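
After changing the speed in the BIOS, it is worth verifying what the DIMMs actually trained at; dmidecode reports both the rated speed and the configured (running) speed for each module:

sudo dmidecode -t memory | grep -i "speed"
# the rated speed and the configured speed are listed for every installed DIMM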

LRDIMM Benchmark Results

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 81,383.5       | 0.004005  | 0.003932  | 0.004151
Scale    | 95,746.7       | 0.003409  | 0.003342  | 0.003561
Add      | 109,661.0      | 0.004505  | 0.004377  | 0.004862
Triad    | 109,315.6      | 0.004490  | 0.004391  | 0.004771
One LRDIMM Per Channel — 256GB RAM @ 2133MHz

 

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 72,499.2       | 0.004461  | 0.004414  | 0.004546
Scale    | 83,572.7       | 0.003901  | 0.003829  | 0.004036
Add      | 95,979.5       | 0.005103  | 0.005001  | 0.005220
Triad    | 96,541.0       | 0.005105  | 0.004972  | 0.005265
Two LRDIMMs Per Channel — 512GB RAM @ 2133MHz*

* for LRDIMMs, the 512GB configuration automatically operates at 2133MHz

LRDIMM Performance Summary

From these tests, we concluded that the latency imposed by the LRDIMMs results in approximately 12% reduction in overall performance when doubling the amount of RAM from 256GB to 512GB.

RDIMM Benchmark Results

Rick then tested RDIMMs using the same system for comparison (with 256GB and 512GB DDR4 memory configurations). Below are the stream results.

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 82,707.5       | 0.003939  | 0.003869  | 0.004093
Scale    | 101,973.7      | 0.003243  | 0.003138  | 0.003471
Add      | 111,966.3      | 0.004502  | 0.004287  | 0.004978
Triad    | 110,881.0      | 0.004468  | 0.004329  | 0.004843
One RDIMM Per Channel — 256GB RAM @ 2133MHz

 

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 75,049.1       | 0.004314  | 0.004264  | 0.004405
Scale    | 93,812.6       | 0.003460  | 0.003411  | 0.003550
Add      | 103,091.1      | 0.004729  | 0.004656  | 0.004969
Triad    | 103,493.9      | 0.004704  | 0.004638  | 0.004909
Two RDIMMs Per Channel — 512GB RAM @ 2133MHz*

* for RDIMMs, the 512GB configuration requires the memory speed to manually be increased to 2133MHz

RDIMM Performance Summary

Just as we saw with LRDIMMs, there is a reduction in performance between 1 DIMM per channel and 2 DIMMs per channel when using RDIMMs. However, this penalty is reduced to approximately 7% (compared to the 12% penalty suffered by LRDIMMs).

Side-by-Side Comparison of RDIMM and LRDIMM Performance

For clarity, here is a side by side table of DDR4 memory performance comparing LRDIMMs to RDIMMs. Note that RDIMM memory bandwidth is higher than LRDIMM bandwidth in every case.

Best Rate (MB/s) | 1 DIMM Per Channel      | 2 DIMMs Per Channel
Function         | LRDIMM    | RDIMM       | LRDIMM    | RDIMM
Copy             | 81,383.5  | 82,707.5    | 72,499.2  | 75,049.1
Scale            | 95,746.7  | 101,973.7   | 83,572.7  | 93,812.6
Add              | 109,661.0 | 111,966.3   | 95,979.5  | 103,091.1
Triad            | 109,315.6 | 110,881.0   | 96,541.0  | 103,493.9
LRDIMMs and RDIMMs Compared

 

When Registered DIMMs (RDIMMs) are Best

Many of our HPC customers are looking for high speed and low latency. In that realm, RDIMMs are the hands-down winner. At a slightly lower cost, and with the ability to ramp up memory frequency on certain motherboards, they are the right choice for fast memory performance.

When Load-Reduced DIMMs (LRDIMMs) are Best

When very large quantities of RAM are the goal, LRDIMMs are the way to go. In this chart from Intel’s Grantley Platform Memory Configuration Guide, you can see that when packing a system full of RAM you can achieve twice the capacity with LRDIMMs. However, 64GB DDR4 LRDIMMs are still quite costly.  There are also specific configurations using 3 DIMMs per channel that require LRDIMMs.  Contact one of our experts to discuss the best options when you are considering servers with more than 512GB of memory.

SKU        | Max DIMMs in Platform | Number of CPU Sockets | RDIMM Config              | LRDIMM Config
E5-1600 v3 | 12 DIMMs              | 1                     | 384GB (12x32GB) @ 1600MHz | 768GB (12x64GB) @ 1600MHz
E5-2600 v3 | 24 DIMMs              | 2                     | 768GB (24x32GB) @ 1600MHz | 1.5TB (24x64GB) @ 1600MHz
E5-4600 v3 | 48 DIMMs              | 4                     | 1.5TB (48x32GB) @ 1600MHz | 3TB (48x64GB) @ 1600MHz
Memory Configuration


Choosing between LRDIMMs and RDIMMs depends entirely on what performance characteristics meet the needs of your applications. Careful consideration of latency, speed and capacity as applied to your problem will show you the way to go. Our engineering team can help you work your way through this important design choice. Contact us or give us a call for assistance choosing the HPC platform that works best for you.

How to Benchmark GROMACS GPU Acceleration on HPC Clusters
https://www.microway.com/hpc-tech-tips/benchmark-gromacs-gpu-acceleration-hpc-clusters/ (published Tue, 21 Oct 2014)

Cropped shot of a GROMACS adh simulation (visualized with VMD)

We know that many of our readers are interested in seeing how molecular dynamics applications perform with GPUs, so we are continuing to highlight various packages. This time we will be looking at GROMACS, a well-established and free-to-use (under GNU GPL) application.  GROMACS is a popular choice for scientists interested in simulating molecular interaction. With NVIDIA Tesla K40 GPUs, it’s common to see 2X and 3X speedups compared to the latest multi-core CPUs.

Logging on to the Test Drive Cluster

To obtain access, fill out this quick and easy form: sign up for a GPU Test Drive. Once you obtain approval, you’ll receive an email with a list of commands to help you get your benchmark running. For your convenience, you can also reference a more detailed step-by-step guide below.

To begin, log in to the Microway Test Drive cluster using SSH. Don’t worry if you’re unfamiliar with SSH – we include an instruction manual for logging in. SSH is built-in on Linux and MacOS; Windows users need to install one application.

Run GROMACS on CPUs and GPUs

This first step is very easy. Simply enter the GROMACS directory and run the default benchmark script which we have pre-written for you:

cd gromacs
sbatch run-gromacs-on-TeslaK40.sh

Remember that Linux is case sensitive!

Managing GROMACS Jobs on the Cluster

Our cluster uses SLURM for resource management. Keeping track of your job is easy using the squeue command. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, you can tell the cluster to e-mail you when your job is finished by editing the GROMACS batch script file (although this must be done before submitting jobs with sbatch). Run:

nano run-gromacs-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (specifying your own e-mail address):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to monitor the compute node which is running your job, examine the output of squeue and take note of which node your job is running on. Log into that node using SSH and then use the tools of your choice to monitor it. For example:

ssh node2
nvidia-smi
htop

(hit q to exit htop)

See the speedup of GPUs vs. CPUs

The results from our benchmark script will be placed in an output file called gromacs-K40.xxxx.output.log – below is a sample of the output running on CPUs:

=======================================================================
= Run CPU-only water scaling benchmark system (1536)
=======================================================================
               Core t (s)   Wall t (s)        (%)
       Time:     1434.957       71.763     1999.6
                 (ns/day)    (hour/ns)
Performance:        1.206       19.894

Just below it is the GPU-accelerated run (showing a ~2.8X speedup):

=======================================================================
= Run Tesla_K40m GPU-accelerated water scaling benchmark system (1536)
=======================================================================
               Core t (s)   Wall t (s)        (%)
       Time:      508.847       25.518     1994.0
                 (ns/day)    (hour/ns)
Performance:        3.393        7.074

Should you require more information on a particular run, it’s available in the benchmarks/water/ directory. If your job has any problems, the errors will be logged to the file gromacs-K40.xxxx.output.errors

The chart below demonstrates the performance improvements between a CPU-only GROMACS run (on two 10-core Ivy Bridge Intel Xeon CPUs) and a GPU-accelerated GROMACS run (on two NVIDIA Tesla K40 GPUs):

GROMACS Speedups on NVIDIA Tesla K40 GPUs

Benchmarking your GROMACS Inputs

If you’re familiar with BASH, you can of course create your own batch script, but we recommend using the run-gromacs-your-files.sh file as a template when you want to run your own simulations.  You can upload your input files yourself or you can build them on the cluster. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda/6.5 gcc/4.8.3 openmpi-cuda/1.8.1 gromacs

Once your files are either created or uploaded, you’ll need to ensure that the batch script is referencing the correct input files. The relevant parts of the run-gromacs-your-files.sh file are:

echo  "=================================================================="
echo  "= Run CPU-only water scaling benchmark system (1536)"
echo  "=================================================================="

srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme 0 -resethway -noconfout -nb cpu -nsteps 10000 -pin on -v

and for execution on GPUs:

echo  "=================================================================="
echo  "= Run ${GPU_TYPE} GPU-accelerated benchmark"
echo  "=================================================================="

srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme  0 -resethway -noconfout -nsteps 1000 -pin on -v

Although you might not be familiar with all of the above GROMACS flags, you should hopefully recognize the .tpr file.  This binary file contains the atomic-level input of the equilibration, temperature, pressure, and other inputs that the grompp module has processed.  The flags themselves are important for benchmarking and are explained below:

  • -npme 0: This flag sets how many MPI ranks are dedicated to the PME (long-range electrostatics) calculation.  Setting it to 0 keeps PME on the same ranks as the rest of the work; unless you have tuned a separate PME rank count for your cluster, this is a reasonable default.
  • -resethway: As the name suggests, this flag acts as a time reset.  Half way through the job, GROMACS will reset the counter so that any overhead from memory initialization or load balancing won’t affect the benchmark score.
  • -noconfout: For when you want to once again reduce overhead, this flag tells GROMACS not to write a final confout.gro file.
  • -nsteps 1000: A tag that you’re probably familiar with, this one lets you set the maximum number of integration steps.  It’s useful to change if you don’t want to waste too much time waiting for your benchmark to finish.
  • -pin on: Finally, this tag lets you set affinities for the cores, meaning that threads will remain locked to cores and won’t jump around.
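
For reference, the topol.tpr file that these commands consume is produced by the grompp preprocessor from your run parameters, starting coordinates, and topology. A minimal sketch with placeholder file names (in newer GROMACS releases the same step is invoked as gmx grompp):

grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr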

If you’d like to visualize your results, you will need to initialize a graphical session on our cluster. You are welcome to contact us if you’re uncertain of this step. After you have access to an X-session, you can run VMD by typing the following:

module load vmd
vmd

Next Steps for GROMACS GPU Acceleration

As you can see, we’ve set up our Test Drive so that running GROMACS on a GPU cluster isn’t much more difficult than running it on your own workstation. Benchmarking CPU vs GPU performance is also very easy. If you’d like to learn more, contact one of our experts or sign up for a GPU Test Drive today!

Solvated alcohol dehydrogenase (ADH) protein in a rectangular box (134,000 atoms)

Citation for GROMACS:

https://www.gromacs.org/

Berendsen, H.J.C., van der Spoel, D. and van Drunen, R., GROMACS: A message-passing parallel molecular dynamics implementation, Comp. Phys. Comm. 91 (1995), 43-56.

Lindahl, E., Hess, B. and van der Spoel, D., GROMACS 3.0: A package for molecular simulation and trajectory analysis, J. Mol. Mod. 7 (2001) 306-317.

Featured Illustration:

Solvated alcohol dehydrogenase (ADH) protein in a rectangular box (134,000 atoms)
https://www.gromacs.org/topic/heterogeneous_parallelization.html

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs
https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/ (published Fri, 17 Oct 2014)

MATLAB solving a second order wave equation on Tesla GPUs

MATLAB is a well-known and widely-used application – and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built-in for a couple years, with better support in each release. If you haven’t tried yet, take this opportunity to test MATLAB performance on GPUs. Microway’s GPU Test Drive makes the process quick and easy. As we’ll show in this post, you can expect to see 3X to 6X performance increases for many tasks (with 30X to 60X speedups on select workloads).

Access a Compute Node with GPU-accelerated MATLAB

Getting started with MATLAB on our GPU cluster is easy: complete this form to sign up for MATLAB GPU benchmarking. We will send you an e-mail with detailed instructions for logging in and starting up MATLAB. Once you’re in, all you need to do is click the MATLAB icon and the latest version of GPU-Accelerated MATLAB will pop up:
Mathworks MATLAB R2014b splashscreen

We use NoMachine to export the graphical sessions from our cluster to your local PC/laptop. This makes login extremely user-friendly, ensures your interactive session performs well and provides a built-in method for file transfers in and out of the GPU cluster. MATLAB is fairly well-known for performing sluggishly over standard Unix/Linux graphical sessions (e.g., X11 forwarding, VNC), but you’ll have no such issues here.

You’ll be dropped into a standard MATLAB workspace. A variety of parallelized demonstrations of GPU usage are included with MATLAB. Pick one and give it a try! You can type paralleldemo_gpu and then hit <TAB> to see the full list of options.

Main MATLAB R2014b window

Measure MATLAB GPU Speedups

Below we show the output from several of the built-in MATLAB parallel GPU demos. A few are text-only, but several include a graphical component or performance plot. The first example runs a quick test on memory transfer speeds and computational throughput. Results from both the GPU and the host (CPUs) are shown:

>> paralleldemo_gpu_benchmark
Using a Tesla K40m GPU.
Achieved peak send speed of 3.44069 GB/s
Achieved peak gather speed of 2.20036 GB/s
Achieved peak read+write speed on the GPU: 233.613 GB/s
Achieved peak read+write speed on the host: 12.9773 GB/s
Achieved peak calculation rates of 398.9 GFLOPS (host), 1345.8 GFLOPS (GPU)

Note that the host results will be impacted by the number of local workers available in the Parallel Computing Toolbox. Since version R2011b, the default has been limited to 12 threads/CPU cores. With the release of R2014a, Mathworks removed that limit. For these tests we changed the number of workers to 20 in the Parallel Preferences dialog box.

The next demo generates plots of the speedup between matrix multiplications on dual 10-core Xeon CPUs versus a single NVIDIA Tesla K40 GPU. Both single-precision and double-precision floating-point calculations were run.

GPU-Accelerated Stencil Operations

MATLAB also includes a couple of Stencil Operation demos running on a GPU. These include both a “generic” implementation and an optimized implementation using GPU shared & texture memory. As shown below, with properly-optimized algorithms, MATLAB GPU speedups can reach 30X and beyond.

>> paralleldemo_gpu_mexstencil
First version using gpuArray:  1.119ms per generation.
MEX with shared memory: 0.038ms per generation (29.4x faster).
MEX with texture memory: 0.019ms per generation (58.9x faster).

Running your own test of MATLAB GPU speedups

To see a list of other useful demos, take a look at the GPU-accelerated examples on Mathworks FileExchange. You’ll find a large number of useful demonstrations, including:

  • GPU acceleration for FFTs
  • Heat transfer equations
  • Navier-Stokes equations for incompressible fluids
  • Anisotropic Diffusion
  • Gradient Vector Flow (GVF) force field calculation
  • 3D linear and trilinear interpolation
  • more than 60 others

Also consider that hundreds of MATLAB’s standard functions support GPU acceleration. Utilizing these capabilities is quite straightforward: load your data into a gpuArray, then pass that gpuArray to any of the supported functions and the operations will be carried out on the GPU!
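As a concrete illustration, here is a minimal sketch of our own (not taken from the built-in demos) that creates data on the host, moves it to the GPU, applies a standard function, and gathers the result back; gpuArray, fft, and gather are the documented Parallel Computing Toolbox calls:

# Launch MATLAB headless; the -r string is MATLAB code.
# rand creates host data, gpuArray moves it to the GPU, fft runs on the GPU
# because its input is a gpuArray, and gather copies the result back to host memory.
matlab -nodisplay -nosplash -r "A = gpuArray(rand(4096, 'single')); B = fft(A); C = gather(B); disp(class(B)); exit"

disp(class(B)) should print gpuArray, confirming that the computation stayed on the device.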

MATLAB paramSweep demo

Will GPU acceleration speed up your research?

With our pre-configured GPU cluster, running MATLAB on high-performance GPUs is as easy as running it on your own workstation. Find out for yourself how much faster you’ll be able to work if you add GPUs to your toolbelt. Sign up for a GPU Test Drive today!


Featured Illustration:

“Solving 2nd Order Wave Equation on the GPU Using Spectral Methods” by Jiro Doke
Mathworks MATLAB Central

Running GPU Benchmarks of HOOMD-blue on a Tesla K40 GPU-Accelerated Cluster
https://www.microway.com/hpc-tech-tips/running-gpu-benchmarks-hoomd-blue-tesla-k40-gpu-accelerated-cluster/ Tue, 14 Oct 2014 20:28:19 +0000

Cropped shot of a HOOMD-blue micellar crystals simulation (visualized with VMD)

This short tutorial explains the usage of the GPU-accelerated HOOMD-blue particle simulation toolkit on our GPU-accelerated HPC cluster. Microway allows you to quickly test your codes on the latest high-performance systems – you are free to upload and run your own software, although we also provide a variety of pre-compiled applications with built-in GPU acceleration. Our GPU Test Drive Cluster is a useful resource for benchmarking the performance gains that can be achieved with NVIDIA Tesla GPUs.

This post demonstrates HOOMD-blue, which comes out of the Glotzer group at the University of Michigan. HOOMD-blue supports a wide variety of integrators and potentials, as well as the capability to scale runs up to thousands of GPUs. We’ll demonstrate a single server with dual NVIDIA® Tesla® K40 GPUs delivering speedups of more than 13X!

Before continuing, please note that successful use of HOOMD-blue will require some familiarity with Python. However, you can reference their excellent Quick Start Tutorial. If you’re already familiar with a different software package, read through our list of pre-installed applications. There may be no need for you to learn a new tool.

Access a Tesla GPU-accelerated Compute Node

Getting started on our GPU system is fast and easy – complete this short form to sign up for HOOMD-blue benchmarking. We will send you an e-mail with a general list of commands when your request is accepted, but this post provides guidelines specific to HOOMD-blue tests.

First, you need SSH to access our GPU cluster. Don’t worry if you’re unfamiliar with SSH – we will send you step-by-step login instructions. Windows users have one extra step, but SSH is built-in on Linux and MacOS.

Run CPU and GPU-accelerated HOOMD-blue

Once you’re logged in, it’s easy to compare CPU and GPU performance: enter the HOOMD-blue directory and run the benchmark batch script which we have pre-written for you:

cd hoomd-blue
sbatch run-hoomd-on-TeslaK40.sh

Waiting for your HOOMD-blue job to finish

Our cluster uses SLURM to manage computational tasks. You should use the squeue command to check the status of your jobs. To watch as your job runs, use: watch squeue (hit CTRL+c to exit). Alternatively, the cluster can e-mail you when your job has finished if you update the HOOMD batch script file (although this must be done before submitting your job). Run:

nano run-hoomd-on-TeslaK40.sh

Within this file, add the following lines to the #SBATCH section (changing the e-mail address to your own):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to closely monitor the compute node which is executing your job, run squeue to check which compute node your job is running on. Log into that node via SSH and use one of the following tools to monitor the GPU and system status:

ssh node2
nvidia-smi
htop

(hit q to exit htop)
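If you’d rather have those readings refresh on their own, the standard watch utility works well. A small sketch (node2 is just the example node above; substitute whichever node squeue reports for your job):

ssh node2
watch -n 1 nvidia-smi

(hit CTRL+c to exit watch)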

Check the speedup of HOOMD-blue on GPUs vs. CPUs

The results from the HOOMD-blue benchmark script will be placed in an output file named hoomd-K40.xxxx.output.log – below is a sample of the output running on CPUs:

======================================================
= Run CPU only lj_liquid_bmark_512K
======================================================
Average TPS: 21.90750

and with HOOMD-blue running on two GPUs (demonstrating a 13X speed-up):

======================================================
= Run Tesla_K40m GPU-accelerated lj_liquid_bmark_512K
======================================================
Average TPS: 290.27084

If you would like to examine the full execution sequence of a particular input, you will see that a log file has been created for each of the inputs (e.g., lj_liquid_bmark_512K.20_cpu_cores.output). If the HOOMD-blue job has any problems, the errors will be logged to the file hoomd-K40.xxxx.output.errors

The chart below shows the performance improvements for a CPU-only HOOMD-blue run (on two 10-core Ivy Bridge Intel Xeon CPUs) compared to a GPU-accelerated HOOMD-blue run (on two NVIDIA Tesla K40 GPUs):

Plot of HOOMD-blue performance results on Xeon CPUs and Tesla GPUs

Running your own HOOMD-blue inputs on GPUs

If you’re comfortable with shell scripts you can write your own batch script from scratch, but we recommend using the run-hoomd-your-files.sh file as a template when you’d like to try your own simulations. For most HOOMD-blue runs, the batch script will only reference a single Python script as input (e.g., the lj_liquid_bmark_512K.hoomd script). For details on writing these input scripts, reference the HOOMD-blue Quick Start Tutorial.

Once your script is in place in your hoomd-blue/ directory, you’ll need to ensure that the batch script is referencing the correct .hoomd input file. The relevant lines of the run-hoomd-your-files.sh file are:

echo "==============================================================="
echo "= Run CPU-only"
echo "==============================================================="

srun --mpi=pmi2 hoomd input_file.hoomd --mode=cpu > hoomd_output__cpu_run.txt
grep "Average TPS:" hoomd_output__cpu_run.txt

and for execution on GPUs:

echo "==============================================================="
echo "= Run GPU-Accelerated"
echo "==============================================================="

srun --mpi=pmi2 -n $GPUS_PER_NODE hoomd input_file.hoomd > hoomd_output__gpu_run.txt
grep "Average TPS:" hoomd_output__gpu_run.txt

As shown above, both the CPU and GPU runs use the same input file (input_file.hoomd). They will each save their output to a separate text file (hoomd_output__cpu_run.txt and hoomd_output__gpu_run.txt). The final line of each section uses the grep tool to print the performance of that run. HOOMD-blue reports performance in time steps per second (TPS), where a higher number indicates better performance.
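Putting those pieces together, the sketch below shows what a complete custom batch script might look like. The #SBATCH resource values and the GPUS_PER_NODE setting are our assumptions for illustration, not the actual contents of run-hoomd-your-files.sh; only the srun and grep lines are taken from above:

#!/bin/bash
#SBATCH --job-name=hoomd-custom
#SBATCH --nodes=1
#SBATCH --gres=gpu:2                # assumption: request both Tesla K40s in the node
#SBATCH --output=hoomd-custom.%j.output.log
#SBATCH --error=hoomd-custom.%j.output.errors

GPUS_PER_NODE=2                     # assumption; the provided scripts set this for you

echo "= Run CPU-only"
srun --mpi=pmi2 hoomd input_file.hoomd --mode=cpu > hoomd_output__cpu_run.txt
grep "Average TPS:" hoomd_output__cpu_run.txt

echo "= Run GPU-Accelerated"
srun --mpi=pmi2 -n $GPUS_PER_NODE hoomd input_file.hoomd > hoomd_output__gpu_run.txt
grep "Average TPS:" hoomd_output__gpu_run.txt

Submit it with sbatch and monitor it with squeue, exactly as with the pre-written benchmark script.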

VMD visualization of micellar crystals
VMD visualization of micellar crystals

Will GPU acceleration speed up your research?

With our pre-configured GPU cluster, running HOOMD-blue across an HPC cluster isn’t much more difficult than running it on your own workstation. This makes it easy to compare HOOMD-blue simulations running on CPUs and GPUs. If you’d like to give it a try, contact one of our experts or sign up for a GPU Test Drive today!


Citation for HOOMD-blue:

Joshua A. Anderson, Chris D. Lorenz, and Alex Travesset – ‘General Purpose Molecular Dynamics Fully Implemented on Graphics Processing Units’, Journal of Computational Physics 227 (2008) 5342-5359
https://glotzerlab.engin.umich.edu/hoomd-blue/

Featured Illustration:

“Micellar crystals in solution from molecular dynamics simulations”, J. Chem. Phys. 128, 184906 (2008); DOI:10.1063/1.2913522
https://doi.org/10.1063/1.2913522

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

Benchmarking NAMD on a GPU-Accelerated HPC Cluster with NVIDIA Tesla K40
https://www.microway.com/hpc-tech-tips/benchmarking-namd-gpu-accelerated-hpc-cluster-nvidia-tesla-k40/ Fri, 10 Oct 2014 17:32:04 +0000

Cropped shot of a NAMD stmv simulation (visualized with VMD)

This is a tutorial on the usage of GPU-accelerated NAMD for molecular dynamics simulations. We make it simple to test your codes on the latest high-performance systems – you are free to use your own applications on our cluster and we also provide a variety of pre-installed applications with built-in GPU support. Our GPU Test Drive Cluster acts as a useful resource for demonstrating the increased application performance which can be achieved with NVIDIA Tesla GPUs.

This post describes the scalable molecular dynamics software NAMD, which comes out of the Theoretical and Computational Biophysics Group at the University of Illinois Urbana-Champaign. NAMD supports a variety of operational modes, including GPU-accelerated runs across large numbers of compute nodes. We’ll demonstrate how a single server with NVIDIA® Tesla®  K40 GPUs can deliver speedups over 4X!

Before continuing, please note that this post assumes you are familiar with NAMD. If you prefer a different molecular dynamics package (e.g., AMBER), read through the list of applications we have pre-installed. There may be no need for you to learn a new tool. If all of these tools are new to you, you will find a number of NAMD tutorials online.

Access the Tesla GPU-accelerated Cluster

Getting started with our GPU Benchmark cluster is fast and easy – fill out this short form to sign up for GPU benchmarking. Although we will send you an e-mail with a general list of commands when your request is accepted, this post goes into further detail.

First, you need to log in to the GPU cluster using SSH. Don’t worry if you haven’t used SSH before – we will send you step-by-step login instructions. Windows users have to perform one additional step, but SSH is built-in on Linux and MacOS.

Run CPU and GPU-accelerated versions of NAMD

Once you’re logged in, it’s easy to compare CPU and GPU performance: enter the NAMD directory and run the NAMD batch script which we have pre-written for you:

cd namd
sbatch run-namd-on-TeslaK40.sh

Waiting for your NAMD job to finish

Our cluster uses SLURM to manage users’ jobs. You can use the squeue command to keep track of your jobs. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, the cluster can e-mail you when your job is finished if you update the NAMD batch script file (although this must be done before submitting your job). Run:

nano run-namd-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (changing the e-mail address to your own):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to closely monitor the compute node which is running your job, check the output of squeue and take note of which compute node your job is running on. Log into that node with SSH and then use one of the following tools to keep an eye on GPU and system status:

ssh node2
nvidia-smi
htop

(hit q to exit htop)

Check the speedup of NAMD on GPUs vs. CPUs

The results from the NAMD batch script will be placed in an output file named namd-K40.xxxx.output.log – below is a sample of the output running on CPUs:

======================================================
= Run CPU only stmv
======================================================
Info: Benchmark time: 20 CPUs 0.531318 s/step 6.14951 days/ns 4769.63 MB memory

and with NAMD running on two GPUs (demonstrating over 4X speed-up):

======================================================
= Run Tesla_K40m GPU-accelerated stmv
======================================================
Info: Benchmark time: 18 CPUs 0.112677 s/step 1.30413 days/ns 2475.9 MB memory

Should you require further details on a particular run, you will see that a separate log file has been created for each of the inputs (e.g., stmv.20_cpu_cores.output). The NAMD output files are available in the benchmarks/ directory (with a separate subdirectory for each test case). If your job has any problems, the errors will be logged to the file namd-K40.xxxx.output.errors

The following chart shows the performance improvements for a CPU-only NAMD run (on two 10-core Ivy Bridge Intel Xeon CPUs) versus a GPU-accelerated NAMD run (on two NVIDIA Tesla K40 GPUs):

Plot comparing NAMD performance on Xeon CPUs and NVIDIA Tesla K40 GPUs

Running your own NAMD inputs on GPUs

If you’re familiar with BASH you can write your own batch script from scratch, but we recommend using the run-namd-your-files.sh file as a template when you’d like to try your own simulations. For most NAMD runs, the batch script will only reference a single input file (e.g., the stmv.namd script). This input script will reference any other input files which NAMD might require:

  • Structure file (e.g., stmv.psf)
  • Coordinates file (e.g., stmv.pdb)
  • Input parameters file (e.g., par_all27_prot_na.inp)

You can upload existing inputs from your own workstation/laptop or you can assemble an input job on the cluster. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda gcc namd

Once your files are in place in your namd/ directory, you’ll need to ensure that the batch script is referencing the correct .namd input file. The relevant lines of the run-namd-your-files.sh file are:

echo "==============================================================="
echo "= Run CPU-only"
echo "==============================================================="

namd2 +p $num_cores_cpu input_file.namd > namd_output__cpu_run.txt
grep Benchmark namd_output__cpu_run.txt

and for execution on GPUs:

echo "==============================================================="
echo "= Run GPU-Accelerated"
echo "==============================================================="

namd2 +p $num_cores_gpu +devices $CUDA_VISIBLE_DEVICES +idlepoll input_file.namd > namd_output__gpu_run.txt
grep Benchmark namd_output__gpu_run.txt

As is hopefully clear, both the CPU and GPU runs use the same input file (input_file.namd). They will each output to a separate log file (namd_output__cpu_run.txt and namd_output__gpu_run.txt). The final line of each section uses the grep utility to print the performance of each run in days per nanosecond (where a lower number indicates better performance).
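If you want the speedup as a single number rather than reading it off the two log files, here is a short sketch of our own using grep and awk (the field position assumes the "Info: Benchmark time: ... s/step ... days/ns ..." line format shown above):

cpu_days_per_ns=$(grep Benchmark namd_output__cpu_run.txt | tail -n1 | awk '{print $8}')
gpu_days_per_ns=$(grep Benchmark namd_output__gpu_run.txt | tail -n1 | awk '{print $8}')
# days/ns is "lower is better", so the speedup is the CPU figure divided by the GPU figure
awk -v c="$cpu_days_per_ns" -v g="$gpu_days_per_ns" 'BEGIN { printf "Speedup: %.2fx\n", c/g }'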

If you’d like to visualize your results, you will need an SSH client which properly forwards your X-session. You are welcome to contact us if you’re uncertain of this step. Once that’s done, the VMD visualization tool can be run:

module load vmd
vmd
VMD visualization of the Satellite Tobacco Mosaic Virus
VMD visualization of the Satellite Tobacco Mosaic Virus

Ready to try GPUs?

Once properly configured (which we’ve already done for you), running NAMD on a GPU cluster isn’t much more difficult than running it on your own workstation. This makes it easy to compare NAMD simulations running on CPUs and GPUs. If you’d like to give it a try, contact one of our experts or sign up for a GPU Test Drive today!


Citations for NAMD:

“NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign.”

James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26:1781-1802, 2005. abstract, journal
https://www.ks.uiuc.edu/Research/namd/

Featured Illustration:

Molecular Dynamics of Viruses – Satellite Tobacco Mosaic Virus (STMV)

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

Running AMBER on a GPU Cluster
https://www.microway.com/hpc-tech-tips/running-amber-gpus/ Mon, 06 Oct 2014 14:14:39 +0000

Cropped shot of an AMBER nucleosome simulation (visualized with VMD)

Welcome to our tutorial on GPU-accelerated AMBER! We make it easy to benchmark your applications and problem sets on the latest hardware. Our GPU Test Drive Cluster provides developers, scientists, academics, and anyone else interested in GPU computing with the opportunity to test their code. While Test Drive users are given free rein to use their own applications on the cluster, Microway also provides a variety of pre-installed GPU accelerated applications.

In this post, we will look at the molecular dynamics package AMBER. Collaboratively developed by professors at a variety of university labs, the latest versions of AMBER natively support GPU acceleration. We’ll demonstrate how NVIDIA® Tesla®  K40 GPUs can deliver a speedup of up to 86X!

Before we jump in, we should mention that this post assumes you are familiar with AMBER and/or AmberTools. If you are more familiar with another molecular dynamics package (e.g., GROMACS), check to see what we already have pre-installed on our cluster. There may be no need for you to learn a new tool. If you’re new to these tools in general, you can find quite a large number of AMBER tutorials online.

Access our GPU-accelerated Test Cluster

Getting access to the Microway Test Drive cluster is fast and easy – fill out a short form to sign up for a GPU Test Drive. Although our approval e-mail includes a list of commands to help you get your benchmark running, we’ll go over the steps in more detail below.

First, you need to log in to the Microway Test Drive cluster using SSH. Don’t worry if you’re unfamiliar with SSH – we include a step-by-step instruction manual for logging in. SSH is built-in on Linux and MacOS; Windows users need to install one application.

Run CPU and GPU versions of AMBER

This is one of the easiest steps. Just enter the AMBER directory and run the default benchmark script which we have pre-written for you:

cd amber
sbatch run-amber-on-TeslaK40.sh

Waiting for jobs to complete

Our cluster uses SLURM for resource management. Keeping track of your job is easy using the squeue command. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, you can tell the cluster to e-mail you when your job is finished by editing the AMBER batch script file (although this must be done before submitting jobs with sbatch). Run:

nano run-amber-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (specifying your own e-mail address):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to monitor the compute node which is running your job, examine the output of squeue and take note of which node your job is running on. Log into that node using SSH and then use the tools of your choice to monitor it. For example:

ssh node2
nvidia-smi
htop

(hit q to exit htop)

See the speedup of GPUs vs. CPUs

The results from our benchmark script will be placed in an output file called amber-K40.xxxx.output.log – below is a sample of the output running on CPUs:

===============================================================
= Run CPU-only: JAC_PRODUCTION_NVE - 23,558 atoms PME
===============================================================
|         ns/day =      25.95   seconds/ns =    3329.90

and with AMBER running on GPUs (demonstrating a 6X speed-up):

========================================================================
= Run Tesla_K40m GPU-accelerated: JAC_PRODUCTION_NVE - 23,558 atoms PME
========================================================================
|         ns/day =     157.24   seconds/ns =     549.47

Should you require more information on a particular run, it’s available in the benchmarks/ directory (with a separate subdirectory for each test case). If your job has any problems, the errors will be logged to the file amber-K40.xxxx.output.errors

The chart below demonstrates the performance improvements between a CPU-only AMBER run (on two 10-core Ivy Bridge Intel Xeon CPUs) and a GPU-accelerated AMBER run (on two NVIDIA Tesla K40 GPUs):

AMBER Speedups on NVIDIA Tesla K40 GPUs

Running your own AMBER inputs on GPUs

If you’re familiar with BASH, you can of course create your own batch script, but we recommend using the run-amber-your-files.sh file as a template for when you want to run your own simulations. For AMBER, the key files are the prmtop, inpcrd, and mdin files. You can upload these files yourself or you can build them. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda gcc mvapich2-cuda amber

Once your files are either created or uploaded, you’ll need to ensure that the batch script is referencing the correct input files. The relevant parts of the run-amber-your-files.sh file are:

echo "==============================================================="
echo "= Run CPU-only"
echo "==============================================================="

srun -n $NPROCS pmemd.MPI -O -i mdin -o mdout.cpu -p prmtop -inf mdinfo.cpu -c inpcrd -r restrt.cpu -x mdcrd.cpu
grep "ns/day" mdinfo.cpu | tail -n1

and for execution on GPUs:

echo "==============================================================="
echo "= Run GPU-Accelerated"
echo "==============================================================="

srun -n $GPUS_PER_NODE pmemd.cuda.MPI -O -i mdin -o mdout.gpu -p prmtop -inf mdinfo.gpu -c inpcrd -r restrt.gpu -x mdcrd.gpu
grep "ns/day" mdinfo.gpu | tail -n1

The above script assumes that mdin (control data: variables and simulation options), prmtop (topology: the molecular topology and force field parameters), and inpcrd (coordinates: the atom coordinates, velocities, box dimensions) are the main input files, but you are free to add additional levels of complexity as well. The output files (mdout, mdinfo, restrt, mdcrd) are labeled with the suffixes .cpu and .gpu. The grep line populates the amber-K40.xxxx.output.log file with the ns/day benchmark figures (just as shown in the sample output listed above).
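If you only want a quick single-GPU run with your own file names, AMBER’s non-MPI GPU binary can also be called directly inside a batch script or an interactive allocation. A minimal sketch of our own (the my_system.* names are placeholders; pmemd.cuda and its flags match the pmemd options used in the script above):

module load cuda gcc mvapich2-cuda amber
# -O overwrite outputs, -i control input, -p topology, -c coordinates,
# -r restart, -x trajectory, -inf benchmark/timing info
pmemd.cuda -O -i my_system.mdin -o my_system.mdout -p my_system.prmtop -c my_system.inpcrd -r my_system.restrt -x my_system.mdcrd -inf my_system.mdinfo
grep "ns/day" my_system.mdinfo | tail -n1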

If you’d like to visualize your results, you will need an SSH client which properly forwards your X-session. You are welcome to contact us if you’re uncertain of this step. Once that’s done, the VMD visualization tool can be accessed by running:

module load vmd
vmd
VMD visualization of a nucleosome
VMD visualization of a nucleosome

What’s next?

With the right setup (which we’ve already done for you), running AMBER on a GPU cluster isn’t much more difficult than running it on your own workstation. We also make it easy to compare benchmark results between CPUs and GPUs. If you’d like to learn more, contact one of our experts or sign up for a GPU Test Drive today!


Citations for AMBER and AmberTools:

D.A. Case, T.A. Darden, T.E. Cheatham, III, C.L. Simmerling, J. Wang, R.E. Duke, R. Luo, R.C. Walker, W. Zhang, K.M. Merz, B. Roberts, S. Hayik, A. Roitberg, G. Seabra, J. Swails, A.W. Goetz, I. Kolossváry, K.F. Wong, F. Paesani, J. Vanicek, R.M. Wolf, J. Liu, X. Wu, S.R. Brozell, T. Steinbrecher, H. Gohlke, Q. Cai, X. Ye, J. Wang, M.-J. Hsieh, G. Cui, D.R. Roe, D.H. Mathews, M.G. Seetin, R. Salomon-Ferrer, C. Sagui, V. Babin, T. Luchko, S. Gusarov, A. Kovalenko, and P.A. Kollman (2012), AMBER 12, University of California, San Francisco.

PME: Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan Poole; Scott Le Grand; & Ross C. Walker* “Routine microsecond molecular dynamics simulations with AMBER – Part II: Particle Mesh Ewald” , J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888, DOI: 10.1021/ct400314y

GB: Andreas W. Goetz; Mark J. Williamson; Dong Xu; Duncan Poole; Scott Le Grand; & Ross C. Walker* “Routine microsecond molecular dynamics simulations with AMBER – Part I: Generalized Born”, J. Chem. Theory Comput., (2012), 8 (5), pp 1542-1555, DOI: 10.1021/ct200909j

https://ambermd.org/

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38

https://www.ks.uiuc.edu/Research/vmd/
