benchmark Archives - Microway
https://www.microway.com/tag/benchmark/

2nd Gen AMD EPYC “Rome” CPU Review: A Groundbreaking Leap for HPC
https://www.microway.com/hpc-tech-tips/amd-epyc-rome-cpu-review/ (published Wed, 07 Aug 2019)


The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.

2nd Generation AMD EPYC “Rome” CPUs are offered with 8 to 64 cores and clock speeds from 2.2 to 3.2GHz. They are available in dual-socket SKUs as well as a select number of single-socket-only SKUs.

Important changes in AMD EPYC “Rome” CPUs include:

  • Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
  • PCI-E Gen 4 support, a first for an x86 server CPU, delivering 2X the I/O bandwidth of the x86 competition
  • 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
  • DDR4-3200 support for improved memory bandwidth across 8 channels, reaching approximately 205GB/sec of theoretical peak bandwidth per socket
  • Next Generation Infinity Fabric with higher bandwidth for intra and inter-die connection, with roots in PCI-E Gen4
  • New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node

Leadership HPC Performance

There’s no other way to say it: the 2nd Generation AMD EPYC “Rome” CPUs (EPYC 7xx2) break new ground for HPC performance. In our experience, we haven’t seen this kind of generational CPU performance advancement in many years, short of exotic architectural changes. This leap applies across both floating point and integer applications.

Note: This article focuses on SPEC benchmark performance (which is rooted in real integer and floating point applications). If you’re hunting for a more raw FLOPS/dollar calculation, please visit our Knowledge Center Article on AMD EPYC 7xx2 “Rome” CPUs.

Floating Point Benchmark Performance

In short: at the top bin, you may see up to 2.12X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_fp_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.79X the performance.
AMD Rome SPECfp 2017 vs Xeon CPUs - Top Bin

Integer Benchmark Performance

Integer performance largely mirrors the same story. At the top bin, you may see up to 2.49X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_int_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.90X the performance.
AMD Rome SPECint 2017 vs Xeon CPUs - Top Bin

What Makes EPYC 7xx2 Series Perform Strongly?

Contributions towards this leap in performance come from a combination of:

  • 2X the FLOPS per core available in the new architecture
  • Improved performance of Zen2 microarchitecture
  • Moderate increases in clock speeds
  • Most importantly, dramatic increases in core count

These last 2 items are facilitated by the new 7nm process node and the chiplet architecture of EPYC. Couple that with the advantages in memory bandwidth, and you have a recipe for HPC performance.

Performance Outlook


The dramatic increase in core count coupled with Zen2 means we predict that most of the 32-core models and above, about half of AMD's SKU stack, are likely to outperform the top Xeon Platinum 8200 series SKU. Stay tuned for the SPEC benchmarks that confirm this assertion.

If you’re comparing against more modest Xeon Gold 62xx or Silver 52xx/42xx SKUs, we predict an even more dramatic performance uplift. This is the first time in many years we’ve seen such an incredibly competitive product from the AMD Server Group.

Class Leading Price/Performance

AMD EPYC 7xx2 series isn’t just impressive from an absolute performance perspective. It’s also a price performance machine.

Examine these same two top-bin SKUs once again:
AMD Rome SPECfp 2017 vs Xeon CPUs - Price Performance

The top-bin AMD SKU does 1.79X the floating point work at approximately 2/3 the price of the Xeon Platinum 8280. It delivers 2.13X the floating point performance of the Xeon Gold 6252 at roughly similar price/performance.

Should you be willing to accept more modest core counts with the lower cost SKUs, these comparisons only get better.

Finally, if you’re looking to roughly match or exceed the performance of the top-bin Xeon Gold 6252 SKU, we predict you’ll be able to do so with the 24-core EPYC 7352. This will be at just over 1/3 the price of the Xeon socket.

This much more typical comparison is emblematic of the price-performance advantage AMD has delivered in the new generation of CPUs. Stay tuned for more benchmark results and charts to support the prediction.

A Few Caveats: Performance Tuning & Out of the Box

Application Performance Engineers have spent years optimizing applications for the most widely available x86 server CPU. For a number of years now, that has meant Intel’s Xeon processors. The benchmarks presented here represent performance-tuned results.

We don’t yet have great data on how easy it is to achieve optimized performance with these new AMD “Rome” CPUs. For those of us who have been in HPC for some time, we know out-of-the-box performance and optimized performance can mean very different things.

AMD does recommend specific compilers (AOCC, GCC, LLVM) and libraries (BLIS over BLAS and FLAME over LAPACK) to achieve optimized results with all EPYC CPUs. We don’t yet have a complete understanding of how much these help end users achieve these superior results. Does it require a lot of tuning to reach the most exceptional performance?

AMD has, however, released a new Compiler Options Quick Reference Guide for the new CPUs. We strongly recommend using these flags and options when tuning your application.
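
As a rough illustration of the kind of options involved (these are generic Zen2 flags rather than a substitute for AMD's guide; -march=znver2 requires GCC 9 or newer, or a recent AOCC/Clang, and the source file name is a placeholder):

# GCC 9 or newer
gcc -O3 -march=znver2 -mtune=znver2 -fopenmp -o mybench mybench.c
# AOCC (AMD's Clang/LLVM-based compiler) accepts the same target flag
clang -O3 -march=znver2 -fopenmp -o mybench mybench.c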

Chiplet and Multi-Die Architecture: IO and Compute Dies

AMD EPYC Rome Die

One of the chief innovations in the 2nd Generation AMD EPYC CPUs is in the evolution of the multi-die architecture pioneered in the first EPYC CPUs.

Rather than create one monolithic, hard-to-yield die, AMD has opted to lash together “chiplets” in a single socket with its Infinity Fabric technology.

Compute Dies (now in 7nm)

8 compute chiplets (formally, Core Complex Dies or CCDs) are brought together to create a single socket. These CCDs take advantage of the latest 7nm TSMC process node. By using 7nm for the compute cores in 2nd Generation EPYC, AMD takes advantage of the space and power efficiencies of the latest process—without the yield issues of a single monolithic die.

What does it mean for you? More cores than anticipated in a single socket, a reasonable power efficiency for the core count, and a less costly CPU.

The 14nm IO Die

In 2nd Generation EPYC CPUs, AMD has gone a step further with the chiplet architecture. The chiplets are now complemented by a separate I/O die, which contains the memory controllers, PCI-Express controllers, and the Infinity Fabric connection to the remote socket. This design also resolves the NUMA affinity quirks of the 1st generation EPYC processors.

Moreover, the I/O die is created in the established 14nm node process. It’s less important that it utilize the same 7nm power efficiencies.

DDR4-3200 and Improved Memory Bandwidth

AMD EPYC 7xx2 series improves its theoretical memory bandwidth when compared to both its predecessor and the competition.

DDR4-3200 DIMMs are supported, and they are clocked 20% faster than DDR4-2666 and 9% faster than DDR4-2933.
In summary, the platform offers:

  • Compared to Cascade Lake-SP (Xeon Platinum/Gold 82xx, 62xx): Up to a 45% improvement in memory bandwidth
  • Compared to Skylake-SP (Xeon Platinum/Gold 81xx, 61xx): Up to a 60% improvement in memory bandwidth
  • Compared to AMD EPYC 7xx1 Series (Naples): Up to a 20% improvement in memory bandwidth



These comparisons are created for a system where only the first DIMM per channel is populated. Part of this memory bandwidth advantage is derived from the increase in DIMM speeds (DDR4-3200 vs 2933/2666); part of it is derived from EPYC’s 8 memory channels (vs 6 on Xeon Skylake/Cascade Lake-SP).
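
As a quick sanity check on those theoretical figures:

  EPYC 7xx2 (Rome):   8 channels x 8 bytes x 3200 MT/s = 204.8 GB/sec per socket
  Cascade Lake-SP:    6 channels x 8 bytes x 2933 MT/s = 140.8 GB/sec per socket
  204.8 / 140.8 = 1.45, i.e., the roughly 45% advantage quoted above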

While we’ve yet to see final STREAM testing numbers for the new CPUs, we do anticipate them largely reflecting the changes in theoretical memory bandwidth.

PCI-E Gen4 Support: 2X the I/O bandwidth

EPYC “Rome” CPUs have an integrated PCI-E generation 4.0 controller on the I/O die. Each PCI-E lane doubles in maximum theoretical bandwidth to 4GB/sec (bidirectional).

A 16 lane connection (PCI-E x16 4.0 slot) can now deliver up to 64GB/sec of bidirectional bandwidth (32GB/uni). That’s 2X the bandwidth compared to first generation EPYC and the x86 competition.
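
The math behind that figure: PCI-E Gen4 signals at 16 GT/s per lane and uses 128b/130b encoding, which works out to roughly 1.97 GB/sec per lane in each direction. Sixteen lanes therefore provide about 31.5 GB/sec each way, or roughly 63 GB/sec bidirectional, in line with the 64GB/sec figure above.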

Broadening Support for High Bandwidth I/O Devices

Mellanox ConnectX-6 Adapter
The new support allows for higher bandwidth connection to InfiniBand and other fabric adapters, storage adapters, NVMe SSDs, and in the future GPU Accelerators and FPGAs.

Some of these devices, like Mellanox ConnectX-6 200Gb HDR InfiniBand adapters, were unable to realize their maximum bandwidth in a PCI-E Gen3 x16 slot. Their performance should improve in PCI-E Gen4 x16 slot with 2nd Generation AMD EPYC Processors.

2nd Generation AMD EPYC “Rome” is the only x86 server CPU with PCI-E Gen4 support at its launch in 3Q 2019. However, we have seen PCI-E Gen4 support before in the POWER9 platform.

System Support for PCI-E Gen4

Unlike in the previous generation AMD EPYC “Naples” CPUs, there is no strong affinity of PCI-E lanes to a particular chiplet inside the processor. In Rome, all I/O traffic routes through the I/O die and all chiplets reach PCI-E devices through this die.

In order to support PCI-E Gen4, server and motherboard manufacturers are producing brand new versions of their platforms. Not every Rome-ready platform supports Gen4, so if this is a requirement be sure to specify this to your hardware vendor. Our team can help you select a server with full Gen4 capability.
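
Once a platform is in hand, you can confirm that a device actually trained at Gen4 speed by checking its link status with lspci (the 41:00.0 device address below is just a placeholder for your adapter's address):

sudo lspci -s 41:00.0 -vv | grep -E "LnkCap|LnkSta"
# A "Speed 16GT/s" entry under LnkSta indicates a PCI-E Gen4 link; 8GT/s means Gen3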

Infinity Fabric

AMD Infinity Fabric Diagram

Deeply interrelated with PCI-Express Gen4, AMD has also improved the Infinity Fabric link between chiplets and sockets with the new generation of EPYC CPUs.

AMD’s Infinity Fabric has many commonalities with PCI-Express used to connect I/O devices. With 2nd Generation AMD EPYC “Rome” CPUs, the link speed of Infinity Fabric has doubled. This allows for higher bandwidth communication between dies on the same socket and to dies on remote sockets.

The result should be improved application performance for NUMA-aware and, especially, non-NUMA-aware applications. The increased bandwidth should also help hide any transport bandwidth issues to I/O devices on a remote socket. The overall result is “smoother” performance when applications scale across multiple chiplets and sockets.

SKUs and Strategies to Consider for HPC Clusters

Here is the complete list of SKUs and 1KU (1000-unit) prices (Source: AMD). Please note that these are prices for CPUs sold to channel integrators, not for fully integrated systems built with these CPUs.

Dual Socket SKUs

SKU  | Cores | Base Clock (GHz) | Boost Clock (GHz) | L3 Cache | TDP  | Price
7742 | 64    | 2.25             | 3.4               | 256MB    | 225W | $6950
7702 | 64    | 2.0              | 3.35              | 256MB    | 200W | $6450
7642 | 48    | 2.3              | 3.3               | 256MB    | 225W | $4775
7552 | 48    | 2.2              | 3.3               | 192MB    | 200W | $4025
7542 | 32    | 2.9              | 3.4               | 128MB    | 225W | $3400
7502 | 32    | 2.5              | 3.35              | 128MB    | 180W | $2600
7452 | 32    | 2.35             | 3.35              | 128MB    | 155W | $2025
7402 | 24    | 2.8              | 3.35              | 128MB    | 180W | $1783
7352 | 24    | 2.3              | 3.2               | 128MB    | 155W | $1350
7302 | 16    | 3.0              | 3.3               | 128MB    | 155W | $978
7282 | 16    | 2.8              | 3.2               | 64MB     | 120W | $650
7272 | 12    | 2.9              | 3.2               | 64MB     | 120W | $625
7262 | 8     | 3.2              | 3.4               | 128MB    | 155W | $575
7252 | 8     | 3.2              | 3.4               | 64MB     | 120W | $475

EPYC 7742 or 7702 (64c): Select a High-End SKU, yield up to 2X the performance

Assuming your application scales with core count, and maximum performance at a premium cost fits your budget, you can’t beat the top 64-core EPYC 7742 or 7702 SKUs. These will deliver greater throughput on a wide variety of multi-threaded applications.

Anything above EPYC 7452 (32c, 48c): Select a Mid-High Level SKU, reach new performance heights

While these SKUs aren’t inexpensive, they take application performance to new heights and break new benchmark ground. If your application is multi-threaded, you can take advantage of that performance. From a price/performance perspective, these SKUs may also be attractive.

EPYC 7452 (32c): Select a Mid Level SKU, improve price performance vs previous generation EPYC

Previous generation AMD EPYC 7xx1 Series CPUs also featured 32 cores. However, the 32 core entrant in the new 7xx2 stack is far less costly than the prior generation while delivering greater memory bandwidth and 2X the FLOPS per core.

EPYC 7452 (32c): Select a Mid Level SKU, match top Xeon Gold and Platinum with far better price/performance

If you’re optimizing for price/performance compared to the top Intel Xeon Platinum 8200 or Xeon Gold 6200 series SKUs, consider this SKU or ones near it. We predict this to be at or near the price/performance sweet-spot for the new platform.

EPYC 7402 (24c): Select a Mid Level SKU, come close to top Xeon Gold and Platinum SKUs

The higher clock speed of this SKU also means it is well suited to some applications.

EPYC 7272-7402 (12, 16, 24c): Select an affordable SKU, yield better performance and price performance

Treat these SKUs as much more affordable alternatives to most Xeon Gold or Silver CPUs. We’ll await further benchmarks to see exactly where the further sweet-spots are compared to these SKUs. They also compare favorably from a price/performance standpoint to 1st Generation EPYC 7xx1 processors with 12, 16, or 24 cores. Same performance, fewer dollars!

Single Socket Performance

As with the previous generation, AMD is heavily promoting the concept of replacing dual-socket Intel Xeon servers with single sockets of 2nd Generation AMD EPYC “Rome.” They are producing discounted “P” SKUs, with support for single-socket platforms only, to further boost the price-performance advantage of these systems.

Single Socket SKUs

SKU   | Cores | Base Clock (GHz) | Boost Clock (GHz) | L3 Cache | TDP  | Price
7702P | 64    | 2.0              | 3.35              | 256MB    | 200W | $4425
7502P | 32    | 2.5              | 3.35              | 128MB    | 180W | $2300
7402P | 24    | 2.8              | 3.35              | 128MB    | 180W | $1250
7302P | 16    | 3.0              | 3.3               | 128MB    | 155W | $825
7232P | 8     | 3.1              | 3.2               | 32MB     | 120W | $450

Due to the boosted capability of the new CPUs, a single-socket configuration may be an increasingly viable alternative to a dual-socket Xeon platform for many workloads.

Next Steps: get started today!

Read More

If you’d like to read more speeds and feeds about these new processors, check out our article with detailed specifications of the 2nd Gen AMD EPYC “Rome” CPUs. We summarize and compare the specifications of each model, and provide guidance over and beyond what you’ve seen here.

Try 2nd Gen AMD EPYC CPUs for Yourself

Groups that prefer to verify performance before finalizing a design are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with AMD EPYC CPUs, large memory, and more.

Browse Our Navion AMD EPYC Product Line

WhisperStation

Ultra-Quiet AMD EPYC workstations

Learn More

Servers

High performance AMD EPYC rackmount servers

Learn More

Clusters

Leadership performance clusters from 5-500 nodes

Learn More

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks
https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/ (published Fri, 15 Mar 2019)

Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.
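
The individual benchmarks can be run as standalone binaries, or the whole suite can be driven in one pass through the bundled driver script. A rough sketch of the latter, with the script name and flags taken from the SHOC documentation (verify against your checkout, and build with CUDA support first):

cd shoc/tools
perl driver.pl -cuda -s 4    # -s selects the problem-size class (1 through 4)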

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

Benchmark                        | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)               | 253.38   | 7072.86    | 4736.76
Fast Fourier Transform (GFLOPS)  | 132.60   | 1148.75    | 756.29
Matrix Multiplication (GFLOPS)   | 249.57   | 5920.01    | 4256.08
Molecular Dynamics (GFLOPS)      | 105.26   | 908.62     | 402.96
S3D (GFLOPS)                     | 59.97    | 227.85     | 161.54

 

Single Precision Results

Benchmark                        | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)               | 8073.26  | 14016.50   | 9322.46
Fast Fourier Transform (GFLOPS)  | 660.05   | 2301.32    | 1510.49
Matrix Multiplication (GFLOPS)   | 3290.94  | 13480.40   | 8793.33
Molecular Dynamics (GFLOPS)      | 572.91   | 997.61     | 480.02
S3D (GFLOPS)                     | 99.42    | 434.78     | 295.20

 

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.
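
Using the single-precision numbers above to make that "one fourth" comparison concrete:

  FFT throughput:        660.05 / 2301.32 GFLOPS, roughly 0.29 of the V100
  Matrix multiplication: 3290.94 / 13480.40 GFLOPS, roughly 0.24 of the V100
  List price:            ~$2.5k versus ~$10k, roughly 0.25 of the V100
  Board power:           70W versus 250~300W, roughly one quarter of the V100

Per dollar and per watt, the T4's throughput on these benchmarks lands about even with the larger GPU, which is the point made above.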

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 offers substantial single- and mixed-precision performance aimed at machine learning, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. That single-precision performance strongly caters to machine learning algorithms, with the potential to be applied to mixed precision as well. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithm adaptation, the T4 may also be a strong contender for scientific applications that want to use machine learning capabilities to analyze results, or that run a mix of machine learning and scientific computing algorithms on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

Next Steps for leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload, but you’d like to benchmark it first, please sign up for a Test Drive. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

DDR4 Memory on Xeon E5-2600v3 with 3 DIMMs per channel
https://www.microway.com/hpc-tech-tips/3-dimms-per-channel-on-xeon-e5-2600v3/ (published Thu, 24 Mar 2016)

This week I had the opportunity to run the STREAM memory benchmark on a Microway 2U NumberSmasher server which supports up to 3 DIMMs per channel. In practice, this system is typically configured with 768GB or 1.5TB of DDR4 memory. A key goal of this benchmarking was to examine how RAM quantity and clock frequency affect bandwidth performance. When fully loading all three DIMMs per channel, the memory frequency defaults to 1600MHz. At two DIMMs per channel, the default memory frequency increases to 1866MHz. With one DIMM per channel, the frequency maxes out at 2133MHz.

Photo of the Supermicro X10DRU-i motherboard

The Test System

System: NumberSmasher 2U Server based on SYS-6028U-TR4+
Motherboard: X10DRU-i+
Processors x 2: Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
DIMMs: 32GB DDR4-2133 ECC/Registered Samsung M393A4K40BB0-CPB0Q
Operating System: CentOS Linux release 7.2.1511 (Core)
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Compiler: Intel Parallel Studio XE 2016

Close-up photo of the Supermicro SYS-6028U-TR4 2U server supporting 3 DIMMs per channel

Benchmark Compilation and Execution

When compiling STREAM with the Intel compiler, I used the following compiler knobs in the makefile:

CC = icc
CFLAGS = -O3 -xHost -openmp -DSTREAM_ARRAY_SIZE=64000000 -opt-streaming-cache-evict=0 -opt-streaming-stores always -opt-prefetch-distance=64,8

Information on compiling STREAM can be found from an Intel Developer Zone article on STREAM Triad Optimization.  Also, reading through the STREAM FAQ at the University of Virginia site can be helpful.

I set the KMP_AFFINITY and OMP_NUM_THREADS environment variables before running STREAM:

export KMP_AFFINITY=granularity=core,compact
export OMP_NUM_THREADS=8
./stream_intel

On a system that has hyper-threading turned on, I could have used the GOMP_CPU_AFFINITY environment variable to focus on real cores, but I elected to turn off hyper-threading in the BIOS instead.
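
Before a run, it is easy to confirm that the node really is presenting one thread per core:

lscpu | grep -E "^Thread|^Core|^Socket"
# "Thread(s) per core: 1" confirms hyper-threading is disabled (a value of 2 means it is on)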

STREAM Performance Results

Results with 3 DIMMs per Channel – 768GB RAM @ 1600MHz

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 73,876.7       | 0.013882 | 0.013861 | 0.013905
Scale | 73,430.8       | 0.013967 | 0.013945 | 0.013989
Add   | 70,320.2       | 0.021891 | 0.021843 | 0.022147
Triad | 70,555.8       | 0.021859 | 0.021770 | 0.022379

Results with 2 DIMMs per Channel – 512GB RAM @ 1866MHz

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 88,413.8       | 0.011661 | 0.011582 | 0.011900
Scale | 87,867.6       | 0.011765 | 0.011654 | 0.012166
Add   | 90,289.8       | 0.017417 | 0.017012 | 0.018789
Triad | 89,756.5       | 0.017596 | 0.017113 | 0.018941

Results with 1 DIMM per Channel – 256GB RAM @ 2133MHz

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 89,242.5       | 0.011479 | 0.011468 | 0.011495
Scale | 87,724.0       | 0.011699 | 0.011673 | 0.011757
Add   | 90,363.3       | 0.017031 | 0.016998 | 0.017057
Triad | 90,411.5       | 0.017006 | 0.016989 | 0.017027

Plot of STREAM Triad memory performance for Intel Xeon E5-2637v3 CPUs with DDR4 Memory
Graph of STREAM Triad performance for 768GB, 512GB and 256GB memory

Summary of Results

Notice in the chart how much performance improves when moving from 3 DIMMs per channel (768GB at 1600MHz) to 2 DIMMs per channel (512GB at 1866MHz). Also notice that going from 2 DIMMs per channel to 1 DIMM per channel (256GB at 2133MHz) changes very little.

This is significant when deciding how much RAM to spec on a new system, or how much to add when upgrading. Outfitting a server with eight or sixteen DIMMs results in excellent performance. Outfitting a server with twenty-four DIMMs provides exceptional memory capacity, but results in reduced performance. Thus, there is a trade-off between memory capacity and memory performance.

Realize too that using the E5-2637 v3 processors – with only 4 real cores each – reduces the STREAM performance results.  Had I used something like the E5-2690 v3 processors – with 12 real cores each – the peak STREAM throughput results would be roughly 110GB/sec.

Results with 2 DIMMs per Channel – 512GB RAM @ 2133MHz (Forced in BIOS)

The best performance over all for the day (though not graphed above) came from forcing the 512GB configuration to 2133MHz in BIOS:

Task  | Best Rate MB/s | Avg time | Min time | Max time
Copy  | 89,510.2       | 0.011477 | 0.011440 | 0.011605
Scale | 88,981.7       | 0.011523 | 0.011508 | 0.011539
Add   | 92,473.6       | 0.016640 | 0.016610 | 0.016665
Triad | 92,403.3       | 0.016674 | 0.016623 | 0.016710

Be careful though – a configuration like this needs to be heavily tested to ensure stability.  Call us at Microway if you are not sure or have questions about memory configuration on your next server.
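
As one example of that kind of burn-in, memtester can lock and repeatedly pattern-test a block of RAM (the size and pass count below are placeholders; size it to fit within the node's free memory):

sudo memtester 400G 3    # test 400GB of RAM for 3 passes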

Photo of the Supermicro SYS-6028U-TR4 2U server

Caffe Deep Learning Tutorial using NVIDIA DIGITS on Tesla K80 & K40 GPUs
https://www.microway.com/hpc-tech-tips/caffe-deep-learning-using-nvidia-digits-tesla-gpus/ (published Thu, 17 Sep 2015)

NVIDIA DIGITS Deep Learning Tutorial

In this Caffe deep learning tutorial, we will show how to use DIGITS in order to train a classifier on a small image set.  Along the way, we’ll see how to adjust certain run-time parameters, such as the learning rate, number of training epochs, and others, in order to tweak and optimize the network’s performance.  Other DIGITS features will be introduced, such as starting a training run using the network weights derived from a previous training run, and using a completed classifier from the command line.

Caffe Deep Learning Framework

The Caffe Deep Learning framework has gained great popularity. It originated in the Berkeley Vision and Learning Center (BVLC) and has since attracted a number of community contributors.

NVIDIA maintains their own branch of Caffe – the latest version (0.13 at the time of writing) can be downloaded from NVIDIA’s github.
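
If you would rather build from source than use a packaged installer, the code can be pulled directly from that repository (CUDA, cuDNN, and the other standard Caffe prerequisites are assumed to be installed already):

git clone https://github.com/NVIDIA/caffe.git nvidia-caffe
cd nvidia-caffe
# then follow the Makefile.config or CMake build instructions in the repository's README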

NVIDIA DIGITS & Caffe Deep Learning GPU Training System (DIGITS)

NVIDIA DIGITS is a production quality, artificial neural network image classifier available for free from NVIDIA. DIGITS provides an easy-to-use web interface for training and testing your classifiers, while using the underlying Caffe Deep Learning framework.

The latest version of NVIDIA DIGITS (2.1 at the time of writing) can be downloaded here.

NVIDIA DIGITS Deep Learning Tutorial
neural network distinguishes Land Rover from Jeep Cherokee

Hardware for NVIDIA DIGITS and Caffe Deep Learning Neural Networks

The hardware we will be using are two Tesla K80 GPU cards, on a single compute node, as well as a set of two Tesla K40 GPUs on a separate compute node. Each Tesla K80 card contains two Kepler GK210 chips, 24 GB of total shared GDDR5 memory, and 2,496 CUDA cores on each chip, for a total of 4,992 CUDA cores. The Tesla K40 cards, by comparison, each contain one GK110B chip, 12 GB of GDDR5 memory, and 2,880 CUDA cores.

Since the data associated with a trained neural network classifier is relatively small, a classifier could easily be deployed onto a mobile embedded system and run, for example, on an NVIDIA Tegra processor. In many cases, however, neural network image classifiers are run on GPU-accelerated servers at a fixed location.

Runtimes will be compared for various configurations of these Tesla GPUs (see gpu benchmarks below). The main objectives of this tutorial, however, can be achieved using other NVIDIA GPU accelerators, such as the NVIDIA GeForce GTX Titan X, or the NVIDIA Quadro line (K6000, for example).  Both of these GPUs are available in Microway’s Deep Learning WhisperStation™, a quiet, desktop-sized GPU supercomputer pre-configured for extensive Deep Learning computation.  The NVIDIA GPU hardware on Microway’s HPC cluster is available for “Test Driving”. Readers are encouraged to request a GPU Test Drive.

Introduction to Deep Learning with DIGITS

To begin, let’s examine the creation of a small image dataset.  The images were downloaded using a simple in-house bash shell script.  Images were chosen to consist of two categories: one of recent Land Rover SUV models and the other of recent Jeep Cherokee models – both comprised mostly of the 2014 or 2015 vehicles.
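
The script itself isn't reproduced here, but the idea is straightforward; a minimal hypothetical sketch, assuming one hand-curated text file of image URLs per category (urls-landrover.txt and urls-cherokee.txt are placeholder names):

#!/bin/bash
# Download each category's images into its own directory, ready for DIGITS
mkdir -p dataset/land_rover dataset/jeep_cherokee
wget --no-clobber -i urls-landrover.txt -P dataset/land_rover
wget --no-clobber -i urls-cherokee.txt -P dataset/jeep_cherokee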

The process of building a deep learning artificial neural network image classifier for these two types of SUVs using NVIDIA DIGITS is described in detail below in a video tutorial.  As a simple proof of concept, only these two SUV types were included in the data set.  A larger data set could be easily constructed including an arbitrary number of vehicle types.  Building a high quality data set is somewhat of an art, where consideration must be given to:

  • sizes of features in relation to convolution filter sizes
  • having somewhat uniform aspect ratios, so that potentially distinguishing features do not get distorted too differently from image to image during the squash transformation of DIGITS
  • ensuring that ample sampling of images taken from various angles are present in the data set (side view, front, back, close-ups, etc.) – this will train the network to recognize more facets of the objects to be classified

The laborious task of planning and creating a quality image data set is an investment into the final performance quality of the deep learning network, so care and attention at this stage will yield better performance during classifier testing and deployment.  The original SUV image data set was expanded by window sub-sampling the original set of images, and then by also applying horizontal, vertical, and combined flips of the sub-sampled, as well as of the original images.

Neural Network Image Classifier Performance Considerations

Beforehand, some performance-oriented questions we can pose are:

  • Can the classifier distinguish SUV type from front, back, side, and top viewpoints?
  • To what level of accuracy can the classifier distinguish image categories?
  • What sort of discernable, high-level object features will be learned by the network?

(We recommend viewing the NVIDIA DIGITS Deep Learning Tutorial video with 720p HD)

GPU Benchmarks for Caffe deep learning on Tesla K40 and K80

A GoogLeNet neural network model computation was benchmarked on the same learning parameters and dataset for the hardware configurations shown in the table below. All other aspects of hardware were the same across these configurations.

Hardware Configuration                  | Speedup Factor [1]
2 NVIDIA K80 GPU cards (4 GK210 chips)  | 2.55
2 NVIDIA K40 GPU cards (2 GK110B chips) | 1.56
1 NVIDIA K40 GPU card (1 GK110B chip)   | 1.00

[1] Compared against the runtime on a single Tesla K40 GPU

The runtimes in this table reflect 30 epochs of training the GoogLeNet model with a learning rate of 0.005. The batch size was set to 120, compared to the default of 24. This was done in order to use a greater percentage of GPU memory.

In this tutorial, we specified a local directory for DIGITS to construct the image set. If you instead provide text files for the training and validation images, you may want to ensure that the default setting of Shuffle lines is set to “Yes”. This is important if you downloaded your images sequentially, by category. If the lines from such files are not shuffled, then your validation set may not guide the training as well as it would if the image URLs are random in order.

Although NVIDIA DIGITS already supports Caffe deep learning, it will soon support the Torch and Theano frameworks, so check back with Microway’s blog for more information on exciting new developments and tips on how you can quickly get started on using Deep Learning in your research.

Further Reading on NVIDIA DIGITS Deep Learning and Neural Networks

1. Srivastava, et al., Journal of Machine Learning Research, 15 (2014), 1929-1958
2. NVIDIA devblog: Easy Multi-GPU Deep Learning with DIGITS 2, https://devblogs.nvidia.com/parallelforall/easy-multi-gpu-deep-learning-digits-2/
3. Szegedy, et al., Going Deeper with Convolutions, 2014, https://arxiv.org/abs/1409.4842
4. Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, ILSVRC-2010
5. LeCun, et al., Proc. of the IEEE, Nov. 1998, pgs. 1-46
6. Fukushima, K., Biol. Cybernetics, 36, 1980, pgs. 193-202

DDR4 RDIMM and LRDIMM Performance Comparison
https://www.microway.com/hpc-tech-tips/ddr4-rdimm-lrdimm-performance-comparison/ (published Fri, 10 Jul 2015)

Recently, while carrying out memory testing in our integration lab, Lead Systems Integrator Rick Warner was able to clearly identify when it is appropriate to choose load-reduced DIMMs (LRDIMMs) and when it is appropriate to choose registered DIMMs (RDIMMs) for servers running large amounts of DDR4 RAM (i.e., 256 Gigabytes and greater). The critical factors to consider are latency, speed, and capacity, along with what your computing objectives are with respect to them.

Misconceptions on Load Reduced DIMM Performance

Load-reduced DIMMs were built so that high-speed memory controllers in CPUs could drive larger quantities of memory. Thus, it’s often assumed that LRDIMMs will offer the best performance for memory-dense servers. This impression is strengthened by the fact that Intel’s guide for DDR4 memory population shows LRDIMMs running at a higher frequency than RDIMMs (e.g., 2133MHz vs 1866MHz). However, as we’ll show below, there are greater factors at play.

RDIMM vs LRDIMM Performance Testing

Using the STREAM memory benchmark, Rick took a look at 1 DIMM and 2 DIMMs per channel configurations using DDR4 LRDIMMs and RDIMMs on a Supermicro X10DAi motherboard with two Intel Xeon E5-2687W v3 CPUs. Both our WhisperStation and WhisperStation for R are available in this configuration. We also have several Xeon Rackmount Servers which support this configuration.

For each case, the DIMM speed was forced to 2133MHz in the BIOS. Tests were run with both RDIMMs and LRDIMMs in 256GB and 512GB configurations.
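
After changing the speed in the BIOS, it is worth verifying what the DIMMs actually trained at; dmidecode reports both the rated speed and the configured (running) speed for each module:

sudo dmidecode -t memory | grep -i "speed"
# the rated speed and the configured speed are listed for every installed DIMM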

LRDIMM Benchmark Results

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 81,383.5       | 0.004005  | 0.003932  | 0.004151
Scale    | 95,746.7       | 0.003409  | 0.003342  | 0.003561
Add      | 109,661.0      | 0.004505  | 0.004377  | 0.004862
Triad    | 109,315.6      | 0.004490  | 0.004391  | 0.004771
One LRDIMM Per Channel — 256GB RAM @ 2133MHz

 

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 72,499.2       | 0.004461  | 0.004414  | 0.004546
Scale    | 83,572.7       | 0.003901  | 0.003829  | 0.004036
Add      | 95,979.5       | 0.005103  | 0.005001  | 0.005220
Triad    | 96,541.0       | 0.005105  | 0.004972  | 0.005265
Two LRDIMMs Per Channel — 512GB RAM @ 2133MHz*

* for LRDIMMs, the 512GB configuration automatically operates at 2133MHz

LRDIMM Performance Summary

From these tests, we concluded that the latency imposed by the LRDIMMs results in approximately 12% reduction in overall performance when doubling the amount of RAM from 256GB to 512GB.

RDIMM Benchmark Results

Rick then tested RDIMMs using the same system for comparison (with 256GB and 512GB DDR4 memory configurations). Below are the stream results.

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 82,707.5       | 0.003939  | 0.003869  | 0.004093
Scale    | 101,973.7      | 0.003243  | 0.003138  | 0.003471
Add      | 111,966.3      | 0.004502  | 0.004287  | 0.004978
Triad    | 110,881.0      | 0.004468  | 0.004329  | 0.004843
One RDIMM Per Channel — 256GB RAM @ 2133MHz

 

Function | Best Rate MB/s | Avg. Time | Min. Time | Max. Time
Copy     | 75,049.1       | 0.004314  | 0.004264  | 0.004405
Scale    | 93,812.6       | 0.003460  | 0.003411  | 0.003550
Add      | 103,091.1      | 0.004729  | 0.004656  | 0.004969
Triad    | 103,493.9      | 0.004704  | 0.004638  | 0.004909
Two RDIMMs Per Channel — 512GB RAM @ 2133MHz*

* for RDIMMs, the 512GB configuration requires the memory speed to manually be increased to 2133MHz

RDIMM Performance Summary

Just as we saw with LRDIMMs, there is a reduction in performance between 1 DIMM per channel and 2 DIMMs per channel when using RDIMMs. However, this penalty is reduced to approximately 7% (compared to the 12% penalty suffered by LRDIMMs).

Side-by-Side Comparison of RDIMM and LRDIMM Performance

For clarity, here is a side by side table of DDR4 memory performance comparing LRDIMMs to RDIMMs. Note that RDIMM memory bandwidth is higher than LRDIMM bandwidth in every case.

Best Rate (MB/s) | 1 DIMM Per Channel      | 2 DIMMs Per Channel
Function         | LRDIMM    | RDIMM       | LRDIMM    | RDIMM
Copy             | 81,383.5  | 82,707.5    | 72,499.2  | 75,049.1
Scale            | 95,746.7  | 101,973.7   | 83,572.7  | 93,812.6
Add              | 109,661.0 | 111,966.3   | 95,979.5  | 103,091.1
Triad            | 109,315.6 | 110,881.0   | 96,541.0  | 103,493.9
LRDIMMs and RDIMMs Compared

 

When Registered DIMMs (RDIMMs) are Best

Many of our HPC customers are looking for high speed and low latency. In that realm, RDIMMs are the hands-down winner. At a slightly lower cost, and with the ability to ramp up memory frequency on certain motherboards, they are the right choice for fast memory performance.

When Load-Reduced DIMMs (LRDIMMs) are Best

When very large quantities of RAM are the goal, LRDIMMs are the way to go. In this chart from Intel’s Grantley Platform Memory Configuration Guide, you can see that when packing a system full of RAM you can achieve twice the capacity with LRDIMMs. However, 64GB DDR4 LRDIMMs are still quite costly.  There are also specific configurations using 3 DIMMs per channel that require LRDIMMs.  Contact one of our experts to discuss the best options when you are considering servers with more than 512GB of memory.

SKU        | Max DIMMs in Platform | Number of CPU Sockets | RDIMM Config              | LRDIMM Config
E5-1600 v3 | 12 DIMMs              | 1                     | 384GB (12x32GB) @ 1600MHz | 768GB (12x64GB) @ 1600MHz
E5-2600 v3 | 24 DIMMs              | 2                     | 768GB (24x32GB) @ 1600MHz | 1.5TB (24x64GB) @ 1600MHz
E5-4600 v3 | 48 DIMMs              | 4                     | 1.5TB (48x32GB) @ 1600MHz | 3TB (48x64GB) @ 1600MHz
Memory Configuration


Choosing between LRDIMMs and RDIMMs depends entirely on what performance characteristics meet the needs of your applications. Careful consideration of latency, speed and capacity as applied to your problem will show you the way to go. Our engineering team can help you work your way through this important design choice. Contact us or give us a call for assistance choosing the HPC platform that works best for you.

How to Benchmark GROMACS GPU Acceleration on HPC Clusters
https://www.microway.com/hpc-tech-tips/benchmark-gromacs-gpu-acceleration-hpc-clusters/ (published Tue, 21 Oct 2014)

Cropped shot of a GROMACS adh simulation (visualized with VMD)

We know that many of our readers are interested in seeing how molecular dynamics applications perform with GPUs, so we are continuing to highlight various packages. This time we will be looking at GROMACS, a well-established and free-to-use (under GNU GPL) application.  GROMACS is a popular choice for scientists interested in simulating molecular interaction. With NVIDIA Tesla K40 GPUs, it’s common to see 2X and 3X speedups compared to the latest multi-core CPUs.

Logging on to the Test Drive Cluster

To obtain access, fill out this quick and easy form: sign up for a GPU Test Drive. Once you obtain approval, you’ll receive an email with a list of commands to help you get your benchmark running. For your convenience, you can also reference a more detailed step-by-step guide below.

To begin, log in to the Microway Test Drive cluster using SSH. Don’t worry if you’re unfamiliar with SSH – we include an instruction manual for logging in. SSH is built-in on Linux and MacOS; Windows users need to install one application.

Run GROMACS on CPUs and GPUs

This first step is very easy. Simply enter the GROMACS directory and run the default benchmark script which we have pre-written for you:

cd gromacs
sbatch run-gromacs-on-TeslaK40.sh

Remember that Linux is case sensitive!

Managing GROMACS Jobs on the Cluster

Our cluster uses SLURM for resource management. Keeping track of your job is easy using the squeue command. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, you can tell the cluster to e-mail you when your job is finished by editing the GROMACS batch script file (although this must be done before submitting jobs with sbatch). Run:

nano run-gromacs-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (specifying your own e-mail address):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to monitor the compute node which is running your job, examine the output of squeue and take note of which node your job is running on. Log into that node using SSH and then use the tools of your choice to monitor it. For example:

ssh node2
nvidia-smi
htop

(hit q to exit htop)

See the speedup of GPUs vs. CPUs

The results from our benchmark script will be placed in an output file called gromacs-K40.xxxx.output.log – below is a sample of the output running on CPUs:

=======================================================================
= Run CPU-only water scaling benchmark system (1536)
=======================================================================
               Core t (s)   Wall t (s)        (%)
       Time:     1434.957       71.763     1999.6
                 (ns/day)    (hour/ns)
Performance:        1.206       19.894

Just below it is the GPU-accelerated run (showing a ~2.8X speedup):

=======================================================================
= Run Tesla_K40m GPU-accelerated water scaling benchmark system (1536)
=======================================================================
               Core t (s)   Wall t (s)        (%)
       Time:      508.847       25.518     1994.0
                 (ns/day)    (hour/ns)
Performance:        3.393        7.074

Should you require more information on a particular run, it’s available in the benchmarks/water/ directory. If your job has any problems, the errors will be logged to the file gromacs-K40.xxxx.output.errors

The chart below demonstrates the performance improvements between a CPU-only GROMACS run (on two 10-core Ivy Bridge Intel Xeon CPUs) and a GPU-accelerated GROMACS run (on two NVIDIA Tesla K40 GPUs):

GROMACS Speedups on NVIDIA Tesla K40 GPUs

Benchmarking your GROMACS Inputs

If you’re familiar with BASH, you can of course create your own batch script, but we recommend using the run-gromacs-your-files.sh file as a template when you want to run your own simulations.  You can upload your input files yourself or you can build them on the cluster. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda/6.5 gcc/4.8.3 openmpi-cuda/1.8.1 gromacs

Once your files are either created or uploaded, you’ll need to ensure that the batch script is referencing the correct input files. The relevant parts of the run-gromacs-your-files.sh file are:

echo  "=================================================================="
echo  "= Run CPU-only water scaling benchmark system (1536)"
echo  "=================================================================="

srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme 0 -resethway -noconfout -nb cpu -nsteps 10000 -pin on -v

and for execution on GPUs:

echo  "=================================================================="
echo  "= Run ${GPU_TYPE} GPU-accelerated benchmark"
echo  "=================================================================="

srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme  0 -resethway -noconfout -nsteps 1000 -pin on -v

Although you might not be familiar with all of the above GROMACS flags, you should hopefully recognize the .tpr file.  This binary file contains the atomic-level input of the equilibration, temperature, pressure, and other inputs that the grompp module has processed.  The flags themselves are important for benchmarking and are explained below:

  • -npme 0: This flag sets how many MPI ranks are dedicated to the PME (long-range electrostatics) calculation.  Setting it to 0 keeps PME on the same ranks as the rest of the work; unless you have tuned a separate PME rank count for your cluster, this is a reasonable default.
  • -resethway: As the name suggests, this flag acts as a time reset.  Half way through the job, GROMACS will reset the counter so that any overhead from memory initialization or load balancing won’t affect the benchmark score.
  • -noconfout: For when you want to once again reduce overhead, this flag tells GROMACS not to write a final confout.gro file.
  • -nsteps 1000: A tag that you’re probably familiar with, this one lets you set the maximum number of integration steps.  It’s useful to change if you don’t want to waste too much time waiting for your benchmark to finish.
  • -pin on: Finally, this tag lets you set affinities for the cores, meaning that threads will remain locked to cores and won’t jump around.
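
For reference, the topol.tpr file that these commands consume is produced by the grompp preprocessor from your run parameters, starting coordinates, and topology. A minimal sketch with placeholder file names (in newer GROMACS releases the same step is invoked as gmx grompp):

grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr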

If you’d like to visualize your results, you will need to initialize a graphical session on our cluster. You are welcome to contact us if you’re uncertain of this step. After you have access to an X-session, you can run VMD by typing the following:

module load vmd
vmd

Next Steps for GROMACS GPU Acceleration

As you can see, we’ve set up our Test Drive so that running GROMACS on a GPU cluster isn’t much more difficult than running it on your own workstation. Benchmarking CPU vs GPU performance is also very easy. If you’d like to learn more, contact one of our experts or sign up for a GPU Test Drive today!

Solvated alcohol dehydrogenase (ADH) protein in a rectangular box (134,000 atoms)

Citation for GROMACS:

https://www.gromacs.org/

Berendsen, H.J.C., van der Spoel, D. and van Drunen, R., GROMACS: A message-passing parallel molecular dynamics implementation, Comp. Phys. Comm. 91 (1995), 43-56.

Lindahl, E., Hess, B. and van der Spoel, D., GROMACS 3.0: A package for molecular simulation and trajectory analysis, J. Mol. Mod. 7 (2001) 306-317.

Featured Illustration:

Solvated alcohol dehydrogenase (ADH) protein in a rectangular box (134,000 atoms)
https://www.gromacs.org/topic/heterogeneous_parallelization.html

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs
https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/ (published Fri, 17 Oct 2014)

MATLAB solving a second order wave equation on Tesla GPUs

MATLAB is a well-known and widely-used application – and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built-in for a couple years, with better support in each release. If you haven’t tried yet, take this opportunity to test MATLAB performance on GPUs. Microway’s GPU Test Drive makes the process quick and easy. As we’ll show in this post, you can expect to see 3X to 6X performance increases for many tasks (with 30X to 60X speedups on select workloads).

Access a Compute Node with GPU-accelerated MATLAB

Getting started with MATLAB on our GPU cluster is easy: complete this form to sign up for MATLAB GPU benchmarking. We will send you an e-mail with detailed instructions for logging in and starting up MATLAB. Once you’re in, all you need to do is click the MATLAB icon and the latest version of GPU-Accelerated MATLAB will pop up:
Mathworks MATLAB R2014b splashscreen

We use NoMachine to export the graphical sessions from our cluster to your local PC/laptop. This makes login extremely user-friendly, ensures your interactive session performs well and provides a built-in method for file transfers in and out of the GPU cluster. MATLAB is fairly well-known for performing sluggishly over standard Unix/Linux graphical sessions (e.g., X11 forwarding, VNC), but you’ll have no such issues here.

You’ll be dropped into a standard MATLAB workspace. A variety of parallelized demonstrations of GPU usage are included with MATLAB. Pick one and give it a try! You can type paralleldemo_gpu and then hit <TAB> to see the full list of options.

Main MATLAB R2014b window

Measure MATLAB GPU Speedups

Below we show the output from several of the built-in MATLAB parallel GPU demos. A few are text-only, but several include a graphical component or performance plot. The first example runs a quick test on memory transfer speeds and computational throughput. Results from both the GPU and the host (CPUs) are shown:

>> paralleldemo_gpu_benchmark
Using a Tesla K40m GPU.
Achieved peak send speed of 3.44069 GB/s
Achieved peak gather speed of 2.20036 GB/s
Achieved peak read+write speed on the GPU: 233.613 GB/s
Achieved peak read+write speed on the host: 12.9773 GB/s
Achieved peak calculation rates of 398.9 GFLOPS (host), 1345.8 GFLOPS (GPU)

Note that the host results will be impacted by the number of local workers available in the Parallel Computing Toolbox. Since version R2011b, the default has been limited to 12 threads/CPU cores. With the release of R2014a, Mathworks removed that limit. For these tests we changed the number of workers to 20 in the Parallel Preferences dialog box.

The next demo generates plots of the speedup between matrix multiplications on dual 10-core Xeon CPUs versus a single NVIDIA Tesla K40 GPU. Both single-precision and double-precision floating-point calculations were run.

GPU-Accelerated Stencil Operations

MATLAB also includes a couple of Stencil Operation demos running on a GPU. These include both a “generic” implementation and an optimized implementation using GPU shared & texture memory. As shown below, with properly-optimized algorithms, MATLAB GPU speedups can reach 30X and beyond.

>> paralleldemo_gpu_mexstencil
First version using gpuArray:  1.119ms per generation.
MEX with shared memory: 0.038ms per generation (29.4x faster).
MEX with texture memory: 0.019ms per generation (58.9x faster).

Running your own test of MATLAB GPU speedups

To see a list of other useful demos, take a look at the GPU-accelerated examples on Mathworks FileExchange. You’ll find a large number of useful demonstrations, including:

  • GPU acceleration for FFTs
  • Heat transfer equations
  • Navier-Stokes equations for incompressible fluids
  • Anisotropic Diffusion
  • Gradient Vector Flow (GVF) force field calculation
  • 3D linear and trilinear interpolation
  • more than 60 others

Also consider that hundreds of MATLAB’s standard functions support GPU acceleration. Utilizing these capabilities is quite straightforward: load your data into a gpuArray, then pass that gpuArray to any of the supported functions and the operations will be carried out on the GPU!
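As a concrete illustration, here is a minimal sketch of our own (not taken from the built-in demos) that creates data on the host, moves it to the GPU, applies a standard function, and gathers the result back; gpuArray, fft, and gather are the documented Parallel Computing Toolbox calls:

# Launch MATLAB headless; the -r string is MATLAB code.
# rand creates host data, gpuArray moves it to the GPU, fft runs on the GPU
# because its input is a gpuArray, and gather copies the result back to host memory.
matlab -nodisplay -nosplash -r "A = gpuArray(rand(4096, 'single')); B = fft(A); C = gather(B); disp(class(B)); exit"

disp(class(B)) should print gpuArray, confirming that the computation stayed on the device.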

MATLAB paramSweep demo

Will GPU acceleration speed up your research?

With our pre-configured GPU cluster, running MATLAB on high-performance GPUs is as easy as running it on your own workstation. Find out for yourself how much faster you’ll be able to work if you add GPUs to your toolbelt. Sign up for a GPU Test Drive today!


Featured Illustration:

“Solving 2nd Order Wave Equation on the GPU Using Spectral Methods” by Jiro Doke
Mathworks MATLAB Central

Running GPU Benchmarks of HOOMD-blue on a Tesla K40 GPU-Accelerated Cluster
https://www.microway.com/hpc-tech-tips/running-gpu-benchmarks-hoomd-blue-tesla-k40-gpu-accelerated-cluster/ Tue, 14 Oct 2014 20:28:19 +0000

Cropped shot of a HOOMD-blue micellar crystals simulation (visualized with VMD)

This short tutorial explains the usage of the GPU-accelerated HOOMD-blue particle simulation toolkit on our GPU-accelerated HPC cluster. Microway allows you to quickly test your codes on the latest high-performance systems – you are free to upload and run your own software, although we also provide a variety of pre-compiled applications with built-in GPU acceleration. Our GPU Test Drive Cluster is a useful resource for benchmarking the performance gains that can be achieved with NVIDIA Tesla GPUs.

This post demonstrates HOOMD-blue, which comes out of the Glotzer group at the University of Michigan. HOOMD-blue supports a wide variety of integrators and potentials, as well as the capability to scale runs up to thousands of GPUs. We’ll demonstrate a single server with dual NVIDIA® Tesla® K40 GPUs delivering speedups of more than 13X!

Before continuing, please note that successful use of HOOMD-blue will require some familiarity with Python. However, you can reference their excellent Quick Start Tutorial. If you’re already familiar with a different software package, read through our list of pre-installed applications. There may be no need for you to learn a new tool.

Access a Tesla GPU-accelerated Compute Node

Getting started on our GPU system is fast and easy – complete this short form to sign up for HOOMD-blue benchmarking. We will send you an e-mail with a general list of commands when your request is accepted, but this post provides guidelines specific to HOOMD-blue tests.

First, you need SSH to access our GPU cluster. Don’t worry if you’re unfamiliar with SSH – we will send you step-by-step login instructions. Windows users have one extra step, but SSH is built-in on Linux and MacOS.

Run CPU and GPU-accelerated HOOMD-blue

Once you’re logged in, it’s easy to compare CPU and GPU performance: enter the HOOMD-blue directory and run the benchmark batch script which we have pre-written for you:

cd hoomd-blue
sbatch run-hoomd-on-TeslaK40.sh

Waiting for your HOOMD-blue job to finish

Our cluster uses SLURM to manage computational tasks. You should use the squeue command to check the status of your jobs. To watch as your job runs, use: watch squeue (hit CTRL+c to exit). Alternatively, the cluster can e-mail you when your job has finished if you update the HOOMD batch script file (although this must be done before submitting your job). Run:

nano run-hoomd-on-TeslaK40.sh

Within this file, add the following lines to the #SBATCH section (changing the e-mail address to your own):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to closely monitor the compute node which is executing your job, run squeue to check which compute node your job is running on. Log into that node via SSH and use one of the following tools to monitor the GPU and system status:

ssh node2
nvidia-smi
htop

(hit q to exit htop)
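If you’d rather have those readings refresh on their own, the standard watch utility works well. A small sketch (node2 is just the example node above; substitute whichever node squeue reports for your job):

ssh node2
watch -n 1 nvidia-smi

(hit CTRL+c to exit watch)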

Check the speedup of HOOMD-blue on GPUs vs. CPUs

The results from the HOOMD-blue benchmark script will be placed in an output file named hoomd-K40.xxxx.output.log – below is a sample of the output running on CPUs:

======================================================
= Run CPU only lj_liquid_bmark_512K
======================================================
Average TPS: 21.90750

and with HOOMD-blue running on two GPUs (demonstrating a 13X speed-up):

======================================================
= Run Tesla_K40m GPU-accelerated lj_liquid_bmark_512K
======================================================
Average TPS: 290.27084

If you would like to examine the full execution sequence of a particular input, you will see that a log file has been created for each of the inputs (e.g., lj_liquid_bmark_512K.20_cpu_cores.output). If the HOOMD-blue job has any problems, the errors will be logged to the file hoomd-K40.xxxx.output.errors

The chart below shows the performance improvements for a CPU-only HOOMD-blue run (on two 10-core Ivy Bridge Intel Xeon CPUs) compared to a GPU-accelerated HOOMD-blue run (on two NVIDIA Tesla K40 GPUs):

Plot of HOOMD-blue performance results on Xeon CPUs and Tesla GPUs

Running your own HOOMD-blue inputs on GPUs

If you’re comfortable with shell scripts you can write your own batch script from scratch, but we recommend using the run-hoomd-your-files.sh file as a template when you’d like to try your own simulations. For most HOOMD-blue runs, the batch script will only reference a single Python script as input (e.g., the lj_liquid_bmark_512K.hoomd script). For details on writing these input scripts, reference the HOOMD-blue Quick Start Tutorial.

Once your script is in place in your hoomd-blue/ directory, you’ll need to ensure that the batch script is referencing the correct .hoomd input file. The relevant lines of the run-hoomd-your-files.sh file are:

echo "==============================================================="
echo "= Run CPU-only"
echo "==============================================================="

srun --mpi=pmi2 hoomd input_file.hoomd --mode=cpu > hoomd_output__cpu_run.txt
grep "Average TPS:" hoomd_output__cpu_run.txt

and for execution on GPUs:

echo "==============================================================="
echo "= Run GPU-Accelerated"
echo "==============================================================="

srun --mpi=pmi2 -n $GPUS_PER_NODE hoomd input_file.hoomd > hoomd_output__gpu_run.txt
grep "Average TPS:" hoomd_output__gpu_run.txt

As shown above, both the CPU and GPU runs use the same input file (input_file.hoomd). They will each save their output to a separate text file (hoomd_output__cpu_run.txt and hoomd_output__gpu_run.txt). The final line of each section uses the grep tool to print the performance of that run. HOOMD-blue reports performance in time steps per second (TPS), where a higher number indicates better performance.
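Putting those pieces together, the sketch below shows what a complete custom batch script might look like. The #SBATCH resource values and the GPUS_PER_NODE setting are our assumptions for illustration, not the actual contents of run-hoomd-your-files.sh; only the srun and grep lines are taken from above:

#!/bin/bash
#SBATCH --job-name=hoomd-custom
#SBATCH --nodes=1
#SBATCH --gres=gpu:2                # assumption: request both Tesla K40s in the node
#SBATCH --output=hoomd-custom.%j.output.log
#SBATCH --error=hoomd-custom.%j.output.errors

GPUS_PER_NODE=2                     # assumption; the provided scripts set this for you

echo "= Run CPU-only"
srun --mpi=pmi2 hoomd input_file.hoomd --mode=cpu > hoomd_output__cpu_run.txt
grep "Average TPS:" hoomd_output__cpu_run.txt

echo "= Run GPU-Accelerated"
srun --mpi=pmi2 -n $GPUS_PER_NODE hoomd input_file.hoomd > hoomd_output__gpu_run.txt
grep "Average TPS:" hoomd_output__gpu_run.txt

Submit it with sbatch and monitor it with squeue, exactly as with the pre-written benchmark script.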

VMD visualization of micellar crystals
VMD visualization of micellar crystals

Will GPU acceleration speed up your research?

With our pre-configured GPU cluster, running HOOMD-blue across an HPC cluster isn’t much more difficult than running it on your own workstation. This makes it easy to compare HOOMD-blue simulations running on CPUs and GPUs. If you’d like to give it a try, contact one of our experts or sign up for a GPU Test Drive today!


Citation for HOOMD-blue:

Joshua A. Anderson, Chris D. Lorenz, and Alex Travesset – ‘General Purpose Molecular Dynamics Fully Implemented on Graphics Processing Units’, Journal of Computational Physics 227 (2008) 5342-5359
https://glotzerlab.engin.umich.edu/hoomd-blue/

Featured Illustration:

“Micellar crystals in solution from molecular dynamics simulations”, J. Chem. Phys. 128, 184906 (2008); DOI:10.1063/1.2913522
https://doi.org/10.1063/1.2913522

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

Benchmarking NAMD on a GPU-Accelerated HPC Cluster with NVIDIA Tesla K40
https://www.microway.com/hpc-tech-tips/benchmarking-namd-gpu-accelerated-hpc-cluster-nvidia-tesla-k40/ Fri, 10 Oct 2014 17:32:04 +0000

Cropped shot of a NAMD stmv simulation (visualized with VMD)

This is a tutorial on the usage of GPU-accelerated NAMD for molecular dynamics simulations. We make it simple to test your codes on the latest high-performance systems – you are free to use your own applications on our cluster and we also provide a variety of pre-installed applications with built-in GPU support. Our GPU Test Drive Cluster acts as a useful resource for demonstrating the increased application performance which can be achieved with NVIDIA Tesla GPUs.

This post describes the scalable molecular dynamics software NAMD, which comes out of the Theoretical and Computational Biophysics Group at the University of Illinois Urbana-Champaign. NAMD supports a variety of operational modes, including GPU-accelerated runs across large numbers of compute nodes. We’ll demonstrate how a single server with NVIDIA® Tesla®  K40 GPUs can deliver speedups over 4X!

Before continuing, please note that this post assumes you are familiar with NAMD. If you prefer a different molecular dynamics package (e.g., AMBER), read through the list of applications we have pre-installed. There may be no need for you to learn a new tool. If all of these tools are new to you, you will find a number of NAMD tutorials online.

Access the Tesla GPU-accelerated Cluster

Getting started with our GPU Benchmark cluster is fast and easy – fill out this short form to sign up for GPU benchmarking. Although we will send you an e-mail with a general list of commands when your request is accepted, this post goes into further detail.

First, you need to log in to the GPU cluster using SSH. Don’t worry if you haven’t used SSH before – we will send you step-by-step login instructions. Windows users have to perform one additional step, but SSH is built-in on Linux and MacOS.

Run CPU and GPU-accelerated versions of NAMD

Once you’re logged in, it’s easy to compare CPU and GPU performance: enter the NAMD directory and run the NAMD batch script which we have pre-written for you:

cd namd
sbatch run-namd-on-TeslaK40.sh

Waiting for your NAMD job to finish

Our cluster uses SLURM to manage users’ jobs. You can use the squeue command to keep track of your jobs. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, the cluster can e-mail you when your job is finished if you update the NAMD batch script file (although this must be done before submitting your job). Run:

nano run-namd-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (changing the e-mail address to your own):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to closely monitor the compute node which is running your job, check the output of squeue and take note of which compute node your job is running on. Log into that node with SSH and then use one of the following tools to keep an eye on GPU and system status:

ssh node2
nvidia-smi
htop

(hit q to exit htop)

Check the speedup of NAMD on GPUs vs. CPUs

The results from the NAMD batch script will be placed in an output file named namd-K40.xxxx.output.log – below is a sample of the output running on CPUs:

======================================================
= Run CPU only stmv
======================================================
Info: Benchmark time: 20 CPUs 0.531318 s/step 6.14951 days/ns 4769.63 MB memory

and with NAMD running on two GPUs (demonstrating over 4X speed-up):

======================================================
= Run Tesla_K40m GPU-accelerated stmv
======================================================
Info: Benchmark time: 18 CPUs 0.112677 s/step 1.30413 days/ns 2475.9 MB memory

Should you require further details on a particular run, you will see that a separate log file has been created for each of the inputs (e.g., stmv.20_cpu_cores.output). The NAMD output files are available in the benchmarks/ directory (with a separate subdirectory for each test case). If your job has any problems, the errors will be logged to the file namd-K40.xxxx.output.errors

The following chart shows the performance improvements for a CPU-only NAMD run (on two 10-core Ivy Bridge Intel Xeon CPUs) versus a GPU-accelerated NAMD run (on two NVIDIA Tesla K40 GPUs):

Plot comparing NAMD performance on Xeon CPUs and NVIDIA Tesla K40 GPUs

Running your own NAMD inputs on GPUs

If you’re familiar with BASH you can write your own batch script from scratch, but we recommend using the run-namd-your-files.sh file as a template when you’d like to try your own simulations. For most NAMD runs, the batch script will only reference a single input file (e.g., the stmv.namd script). This input script will reference any other input files which NAMD might require:

  • Structure file (e.g., stmv.psf)
  • Coordinates file (e.g., stmv.pdb)
  • Input parameters file (e.g., par_all27_prot_na.inp)

You can upload existing inputs from your own workstation/laptop or you can assemble an input job on the cluster. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda gcc namd

Once your files are in place in your namd/ directory, you’ll need to ensure that the batch script is referencing the correct .namd input file. The relevant lines of the run-namd-your-files.sh file are:

echo "==============================================================="
echo "= Run CPU-only"
echo "==============================================================="

namd2 +p $num_cores_cpu input_file.namd > namd_output__cpu_run.txt
grep Benchmark namd_output__cpu_run.txt

and for execution on GPUs:

echo "==============================================================="
echo "= Run GPU-Accelerated"
echo "==============================================================="

namd2 +p $num_cores_gpu +devices $CUDA_VISIBLE_DEVICES +idlepoll input_file.namd > namd_output__gpu_run.txt
grep Benchmark namd_output__gpu_run.txt

As is hopefully clear, both the CPU and GPU runs use the same input file (input_file.namd). They will each output to a separate log file (namd_output__cpu_run.txt and namd_output__gpu_run.txt). The final line of each section uses the grep utility to print the performance of each run in days per nanosecond (where a lower number indicates better performance).
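If you want the speedup as a single number rather than reading it off the two log files, here is a short sketch of our own using grep and awk (the field position assumes the "Info: Benchmark time: ... s/step ... days/ns ..." line format shown above):

cpu_days_per_ns=$(grep Benchmark namd_output__cpu_run.txt | tail -n1 | awk '{print $8}')
gpu_days_per_ns=$(grep Benchmark namd_output__gpu_run.txt | tail -n1 | awk '{print $8}')
# days/ns is "lower is better", so the speedup is the CPU figure divided by the GPU figure
awk -v c="$cpu_days_per_ns" -v g="$gpu_days_per_ns" 'BEGIN { printf "Speedup: %.2fx\n", c/g }'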

If you’d like to visualize your results, you will need an SSH client which properly forwards your X-session. You are welcome to contact us if you’re uncertain of this step. Once that’s done, the VMD visualization tool can be run:

module load vmd
vmd
VMD visualization of the Satellite Tobacco Mosaic Virus
VMD visualization of the Satellite Tobacco Mosaic Virus

Ready to try GPUs?

Once properly configured (which we’ve already done for you), running NAMD on a GPU cluster isn’t much more difficult than running it on your own workstation. This makes it easy to compare NAMD simulations running on CPUs and GPUs. If you’d like to give it a try, contact one of our experts or sign up for a GPU Test Drive today!


Citations for NAMD:

“NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign.”

James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26:1781-1802, 2005. abstract, journal
https://www.ks.uiuc.edu/Research/namd/

Featured Illustration:

Molecular Dynamics of Viruses – Satellite Tobacco Mosaic Virus (STMV)

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

Running AMBER on a GPU Cluster
https://www.microway.com/hpc-tech-tips/running-amber-gpus/ Mon, 06 Oct 2014 14:14:39 +0000

Cropped shot of an AMBER nucleosome simulation (visualized with VMD)

Welcome to our tutorial on GPU-accelerated AMBER! We make it easy to benchmark your applications and problem sets on the latest hardware. Our GPU Test Drive Cluster provides developers, scientists, academics, and anyone else interested in GPU computing with the opportunity to test their code. While Test Drive users are given free rein to use their own applications on the cluster, Microway also provides a variety of pre-installed GPU accelerated applications.

In this post, we will look at the molecular dynamics package AMBER. Collaboratively developed by professors at a variety of university labs, the latest versions of AMBER natively support GPU acceleration. We’ll demonstrate how NVIDIA® Tesla®  K40 GPUs can deliver a speedup of up to 86X!

Before we jump in, we should mention that this post assumes you are familiar with AMBER and/or AmberTools. If you are more familiar with another molecular dynamics package (e.g., GROMACS), check to see what we already have pre-installed on our cluster. There may be no need for you to learn a new tool. If you’re new to these tools in general, you can find quite a large number of AMBER tutorials online.

Access our GPU-accelerated Test Cluster

Getting access to the Microway Test Drive cluster is fast and easy – fill out a short form to sign up for a GPU Test Drive. Although our approval e-mail includes a list of commands to help you get your benchmark running, we’ll go over the steps in more detail below.

First, you need to log in to the Microway Test Drive cluster using SSH. Don’t worry if you’re unfamiliar with SSH – we include a step-by-step instruction manual for logging in. SSH is built-in on Linux and MacOS; Windows users need to install one application.

Run CPU and GPU versions of AMBER

This is one of the easiest steps. Just enter the AMBER directory and run the default benchmark script which we have pre-written for you:

cd amber
sbatch run-amber-on-TeslaK40.sh

Waiting for jobs to complete

Our cluster uses SLURM for resource management. Keeping track of your job is easy using the squeue command. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, you can tell the cluster to e-mail you when your job is finished by editing the AMBER batch script file (although this must be done before submitting jobs with sbatch). Run:

nano run-amber-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (specifying your own e-mail address):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to monitor the compute node which is running your job, examine the output of squeue and take note of which node your job is running on. Log into that node using SSH and then use the tools of your choice to monitor it. For example:

ssh node2
nvidia-smi
htop

(hit q to exit htop)

See the speedup of GPUs vs. CPUs

The results from our benchmark script will be placed in an output file called amber-K40.xxxx.output.log – below is a sample of the output running on CPUs:

===============================================================
= Run CPU-only: JAC_PRODUCTION_NVE - 23,558 atoms PME
===============================================================
|         ns/day =      25.95   seconds/ns =    3329.90

and with AMBER running on GPUs (demonstrating a 6X speed-up):

========================================================================
= Run Tesla_K40m GPU-accelerated: JAC_PRODUCTION_NVE - 23,558 atoms PME
========================================================================
|         ns/day =     157.24   seconds/ns =     549.47

Should you require more information on a particular run, it’s available in the benchmarks/ directory (with a separate subdirectory for each test case). If your job has any problems, the errors will be logged to the file amber-K40.xxxx.output.errors

The chart below demonstrates the performance improvements between a CPU-only AMBER run (on two 10-core Ivy Bridge Intel Xeon CPUs) and a GPU-accelerated AMBER run (on two NVIDIA Tesla K40 GPUs):

AMBER Speedups on NVIDIA Tesla K40 GPUs

Running your own AMBER inputs on GPUs

If you’re familiar with BASH, you can of course create your own batch script, but we recommend using the run-amber-your-files.sh file as a template for when you want to run your own simulations. For AMBER, the key files are the prmtop, inpcrd, and mdin files. You can upload these files yourself or you can build them. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda gcc mvapich2-cuda amber

Once your files are either created or uploaded, you’ll need to ensure that the batch script is referencing the correct input files. The relevant parts of the run-amber-your-files.sh file are:

echo "==============================================================="
echo "= Run CPU-only"
echo "==============================================================="

srun -n $NPROCS pmemd.MPI -O -i mdin -o mdout.cpu -p prmtop -inf mdinfo.cpu -c inpcrd -r restrt.cpu -x mdcrd.cpu
grep "ns/day" mdinfo.cpu | tail -n1

and for execution on GPUs:

echo "==============================================================="
echo "= Run GPU-Accelerated"
echo "==============================================================="

srun -n $GPUS_PER_NODE pmemd.cuda.MPI -O -i mdin -o mdout.gpu -p prmtop -inf mdinfo.gpu -c inpcrd -r restrt.gpu -x mdcrd.gpu
grep "ns/day" mdinfo.gpu | tail -n1

The above script assumes that mdin (control data: variables and simulation options), prmtop (topology: the molecular topology and force field parameters), and inpcrd (coordinates: the atom coordinates, velocities, box dimensions) are the main input files, but you are free to add additional levels of complexity as well. The output files (mdout, mdinfo, restrt, mdcrd) are labeled with the suffixes .cpu and .gpu. The grep line populates the amber-K40.xxxx.output.log file with the ns/day benchmark figures (just as shown in the sample output listed above).
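If you only want a quick single-GPU run with your own file names, AMBER’s non-MPI GPU binary can also be called directly inside a batch script or an interactive allocation. A minimal sketch of our own (the my_system.* names are placeholders; pmemd.cuda and its flags match the pmemd options used in the script above):

module load cuda gcc mvapich2-cuda amber
# -O overwrite outputs, -i control input, -p topology, -c coordinates,
# -r restart, -x trajectory, -inf benchmark/timing info
pmemd.cuda -O -i my_system.mdin -o my_system.mdout -p my_system.prmtop -c my_system.inpcrd -r my_system.restrt -x my_system.mdcrd -inf my_system.mdinfo
grep "ns/day" my_system.mdinfo | tail -n1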

If you’d like to visualize your results, you will need an SSH client which properly forwards your X-session. You are welcome to contact us if you’re uncertain of this step. Once that’s done, the VMD visualization tool can be accessed by running:

module load vmd
vmd
VMD visualization of a nucleosome
VMD visualization of a nucleosome

What’s next?

With the right setup (which we’ve already done for you), running AMBER on a GPU cluster isn’t much more difficult than running it on your own workstation. We also make it easy to compare benchmark results between CPUs and GPUs. If you’d like to learn more, contact one of our experts or sign up for a GPU Test Drive today!


Citations for AMBER and AmberTools:

D.A. Case, T.A. Darden, T.E. Cheatham, III, C.L. Simmerling, J. Wang, R.E. Duke, R. Luo, R.C. Walker, W. Zhang, K.M. Merz, B. Roberts, S. Hayik, A. Roitberg, G. Seabra, J. Swails, A.W. Goetz, I. Kolossváry, K.F. Wong, F. Paesani, J. Vanicek, R.M. Wolf, J. Liu, X. Wu, S.R. Brozell, T. Steinbrecher, H. Gohlke, Q. Cai, X. Ye, J. Wang, M.-J. Hsieh, G. Cui, D.R. Roe, D.H. Mathews, M.G. Seetin, R. Salomon-Ferrer, C. Sagui, V. Babin, T. Luchko, S. Gusarov, A. Kovalenko, and P.A. Kollman (2012), AMBER 12, University of California, San Francisco.

PME: Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan Poole; Scott Le Grand; & Ross C. Walker* “Routine microsecond molecular dynamics simulations with AMBER – Part II: Particle Mesh Ewald” , J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888, DOI: 10.1021/ct400314y

GB: Andreas W. Goetz; Mark J. Williamson; Dong Xu; Duncan Poole; Scott Le Grand; & Ross C. Walker* “Routine microsecond molecular dynamics simulations with AMBER – Part I: Generalized Born”, J. Chem. Theory Comput., (2012), 8 (5), pp 1542-1555, DOI: 10.1021/ct200909j

https://ambermd.org/

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38

https://www.ks.uiuc.edu/Research/vmd/
