Published on 2018-02-23 HPL ASC

HPL (High Performance Linpack) Benchmarking on Raspberry Pis

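Introduction

Benchmarking is the process of running a specific program on a system to measure its computing speed. Many benchmark programs exist; on Linux we will use a well-known one, HPL (High Performance Linpack).

This post first walks through setting up an HPL test environment on a single node (one Raspberry Pi), then covers the setup for multiple nodes (one Raspberry Pi 3 per node). Note that the multi-node setup assumes an MPI implementation (for example MPICH or OpenMPI) is already installed on every Raspberry Pi; this post assumes MPICH.

I have recently been working on the ASC2018 competition. Task 2 is HPL/HPCG benchmarking, but I could not log in to the compute node provided by the university organizing committee (2 x P100 GPU), so I decided to use Raspberry Pis instead.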

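What is HPL

HPL is a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. The package provides a testing and timing program that quantifies the accuracy of the computed solution as well as the time it took to compute it. The best performance this software can achieve on your system depends on a large number of factors. The implementation is scalable in the sense that its parallel efficiency stays constant with respect to the per-processor memory usage. We can therefore use it to benchmark a single processor or a set of distributed processors running in parallel. So let's start installing HPL.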

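Steps

Step1 Install dependencies

gfortran: Fortran compiler
MPICH2: an implementation of MPI (installed below as the mpich package)
mpich2-dev: MPI development files (installed below as libmpich-dev)
BLAS: Basic Linear Algebra Subprograms (provided here by ATLAS via libatlas-base-dev)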

sudo apt-get update

sudo apt-get install gcc g++ gfortran

sudo apt-get install mpich

sudo apt-get install libatlas-base-dev libmpich-dev
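Before moving on, it can help to confirm that the compilers and MPI tools are actually on the PATH. The following is a minimal sketch, not part of the original post; `missing_tools` is a hypothetical helper name, and the tool list assumes the packages installed above provide `mpicc` and `mpiexec`.

```shell
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools the later steps rely on: compilers, the MPI wrapper and the launcher.
missing=$(missing_tools gcc g++ gfortran mpicc mpiexec)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all build tools found"
fi
```

If anything is reported as missing, rerun the corresponding apt-get command before continuing to the build.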

Step2 Download and set up HPL

mkdir /home/username/hpl-test

cd /home/username/hpl-test

wget -O hpl-2.2.tar.gz http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz

tar -xvf hpl-2.2.tar.gz
cd hpl-2.2/setup
sh make_generic
cd ..
cp setup/Make.UNKNOWN Make.rpi

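Step3 Adjust the Make.rpi file

This is a very important step. The changes to make in the file below depend on your own system. Note that the parameters shown below are scattered throughout Make.rpi, so I suggest locating each parameter, replacing or adding the change, and then moving on to the next one.

Please read the INSTALL file in hpl-2.2 carefully.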

nano Make.rpi

ARCH      = rpi
TOPdir      = $(HOME)/hpl-test/hpl-2.2
MPinc      = # /usr/lib/mpich/include   note: with mpich installed via apt-get, the library files are under /usr/lib/mpich
MPlib      = # /usr/lib/mpich/lib/libmpich.a
LAdir      = /usr/lib/atlas-base/
LAlib      = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
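Because these settings are spread across the file, it is easy to miss one. As a rough sketch (the helper name `show_hpl_vars` is hypothetical and not from the original post), the edited assignments can be listed in one go with grep so they can be compared against the values above:

```shell
# Print the assignments of the variables edited in this step.
# $1 is the path to the Make.<arch> file (Make.rpi in this post).
show_hpl_vars() {
    grep -E '^(ARCH|TOPdir|MPinc|MPlib|LAdir|LAlib)[[:space:]]*=' "$1"
}

# Only run the check when the file is present in the current directory.
if [ -f Make.rpi ]; then
    show_hpl_vars Make.rpi
fi
```

Run it from the hpl-2.2 directory after saving the file in nano.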

Step4 Build HPL according to Make.rpi


cd /home/username/hpl-test/hpl-2.2

make arch=rpi

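The first build attempt failed (cause and solution).

Step5 Create the HPL input file

Below is an example of an HPL.dat file, the input file HPL reads at run time. The values in it are used to generate and solve the problem. You can use this file as is to run the test on a single node. Create a file named HPL.dat in the bin/rpi folder and copy the following contents into it.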

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5040         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

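The contents of this file have to be adjusted by trial and error until the output is satisfactory. To learn about each parameter and how to change it, refer to this paper; to skip to the main points, start reading from page 6 of that document.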
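For the problem size N, a commonly quoted rule of thumb (not from the original post) is to let the N x N matrix of 8-byte doubles fill roughly 80% of the available memory, rounded down to a multiple of NB. The sketch below turns that into a starting value for the trial-and-error search; `suggest_n` is a hypothetical helper, and both the 0.8 fraction and the 1 GiB of RAM per board are assumptions.

```shell
# Suggest a problem size N for HPL.dat.
#   $1  total memory across all nodes, in MiB
#   $2  block size NB
#   $3  fraction of memory given to the matrix (assumed default: 0.8)
suggest_n() {
    awk -v mem_mib="$1" -v nb="$2" -v frac="${3:-0.8}" 'BEGIN {
        elements = frac * mem_mib * 1048576 / 8   # doubles that fit in the budget
        n = sqrt(elements)                        # the matrix is N x N
        printf "%d\n", int(n / nb) * nb           # round down to a multiple of NB
    }'
}

suggest_n 1024 128   # one node with 1 GiB of RAM, NB = 128; prints 10240
```

Treat the result as an upper starting point only. Memory reserved for the operating system or the GPU lowers the usable amount, so smaller values of N may well run faster in practice.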

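Step6 Run HPL on a single node

Once the HPL.dat file is ready, we can run HPL. The HPL.dat file above is for a single node or processor: the product P × Q is 1 × 1 = 1, so the run uses a single process.

cd bin/rpi
./xhpl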

The output looks similar to:

================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    5040 
NB     :     128 
PMAP   : Row-major process mapping
P      :       1 
Q      :       1 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

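The final output on the terminal looks like the following. The last value on the result line is the speed, and the values before it are the parameters that were supplied. The speed is reported in Gflops; here it is 1.679e-01 Gflops, which is about 168 Mflops.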

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4        5040   128     1     1             508.58              1.679e-01
HPL_pdgesv() start time Fri Feb 23 10:13:39 2018

HPL_pdgesv() end time   Fri Feb 23 10:22:07 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0021492 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
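When several runs are compared, only the result lines matter. With "device out" set to 6 in HPL.dat the report goes to the terminal, so one option is to filter it. The sketch below is an illustration rather than part of the original post: `hpl_results` is a hypothetical helper, and it assumes result lines start with an encoded variant such as WR11C2R4.

```shell
# Print N, NB, P, Q, time and Gflops for every result line in an HPL report.
# Reads the files given as arguments, or standard input when there are none.
hpl_results() {
    awk '/^W[RC][0-9]/ { print $2, $3, $4, $5, $6, $7 }' "$@"
}

# Demonstration on the result line shown above.
printf '%s\n' 'WR11C2R4        5040   128     1     1             508.58              1.679e-01' | hpl_results
```

On a real run the report can be piped through the helper, for example `./xhpl | tee hpl.log | hpl_results`, which keeps the full log in hpl.log as well.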

Your numbers will differ from the ones above; the compute performance of a single Raspberry Pi is not very high.
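The Gflops figure can be checked by hand. HPL counts 2/3·N³ + 3/2·N² floating-point operations for a system of order N and divides by the wall time. The sketch below redoes that arithmetic for the run above; `hpl_gflops` is a hypothetical helper name.

```shell
# Recompute the Gflops figure from N and the wall time in seconds,
# using the operation count 2/3*N^3 + 3/2*N^2.
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        flops = (2 / 3) * n * n * n + (3 / 2) * n * n
        printf "%.3e\n", flops / t / 1e9
    }'
}

hpl_gflops 5040 508.58   # N and Time from the result line; prints 1.679e-01
```

The same helper is handy when comparing runs with different N, since the time alone does not show which configuration is faster.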

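Step7 Run HPL on multiple nodes

To run HPL on multiple nodes, we have to change the HPL.dat file. Suppose we have 32 nodes, so the product P × Q, which is the number of MPI processes in the grid, should be 32. I chose P = 4 and Q = 8. Besides this, we must change the value of N; by trial and error, we got the highest speed at N = 17400. The final file contents are shown below.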

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
17400        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
8            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
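The grid used here, P = 4 and Q = 8, is the factor pair of 32 that is closest to square with P no larger than Q, which is a common starting point for HPL grids. As a sketch under that assumption (`hpl_grid` is a hypothetical helper, not from the original post), the same choice can be computed for any process count:

```shell
# Print "P Q" for a given number of MPI processes: the factor pair with
# P <= Q that is closest to square.
hpl_grid() {
    n=$1
    p=1
    best=1
    while [ $((p * p)) -le "$n" ]; do
        if [ $((n % p)) -eq 0 ]; then
            best=$p
        fi
        p=$((p + 1))
    done
    echo "$best $((n / best))"
}

hpl_grid 32   # prints: 4 8
```

Other factor pairs such as 2 x 16 are also valid inputs, so it is still worth timing more than one grid.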

Remember to change the path in the command below so that it points to the machine file on your system.

cd bin/rpi
mpiexec -f ~/mpi-testing/machinefile -n 32 ./xhpl
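The machine file passed with -f is a plain text list of hosts, one per line. The post does not show its contents, so the sketch below only illustrates the format; `make_machinefile` is a hypothetical helper and the host names rpi01, rpi02, ... are placeholders for whatever names or IP addresses your nodes actually have.

```shell
# Write one host name per line, the plain host-list format read by
# mpiexec -f. The names rpi01, rpi02, ... are hypothetical placeholders.
make_machinefile() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'rpi%02d\n' "$i"
        i=$((i + 1))
    done
}

make_machinefile 3   # preview: prints rpi01, rpi02 and rpi03 on separate lines
```

For the 32-node run, the output could be redirected to the path used in the command above, for example `make_machinefile 32 > ~/mpi-testing/machinefile`.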

Reference blog post
