Summing Matrices with a 2D Grid and 2D Blocks in CUDA

In this section we will write a matrix addition kernel that uses a two-dimensional grid and two-dimensional blocks. First, write a host-side verification function to check that the kernel produces the correct result:

void sumMatrixOnHost(float *A, float *B, float *C, const int nx, const int ny){
    float *ia = A;
    float *ib = B;
    float *ic = C;

    // Walk the matrices row by row; within each row, add element-wise.
    for(int iy = 0; iy < ny; iy++){
        for(int ix = 0; ix < nx; ix++){
            ic[ix] = ia[ix] + ib[ix];
        }
        // Advance the row pointers to the next row.
        ia += nx;
        ib += nx;
        ic += nx;
    }
}

Next, create a new kernel that uses two-dimensional thread blocks to perform the matrix addition:

__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC, int nx, int ny){
    // Map each thread to a unique (ix, iy) matrix coordinate ...
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    // ... and from there to a global linear memory index.
    unsigned int idx = iy * nx + ix;

    // Guard against threads that fall outside the matrix bounds.
    if(ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}

The key step in this kernel is mapping each thread from its thread index to a global linear memory index, as shown in Figure 2-12.
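
As a concrete illustration (the numbers are chosen purely for demonstration): with 32×32 blocks, the thread at blockIdx = (1, 2), threadIdx = (3, 4) computes

ix  = threadIdx.x + blockIdx.x * blockDim.x = 3 + 1 * 32 = 35
iy  = threadIdx.y + blockIdx.y * blockDim.y = 4 + 2 * 32 = 68
idx = iy * nx + ix = 68 * 16384 + 35 = 1114147

so consecutive threads in x access consecutive memory addresses.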

Next, the matrix size can be set to 16,384 elements in each dimension as follows:

int nx = 1<<14;   // 2^14 = 16384 elements per dimension
int ny = 1<<14;
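
With nx = ny = 16384, each matrix holds 2^28 floats, i.e. 1 GiB at 4 bytes per element, so the three device matrices alone occupy 3 GiB of GPU global memory (and the four host arrays occupy 4 GiB of host memory). Make sure the target device has enough free memory before using these sizes.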

Then, set up the kernel's execution configuration using a 2D grid and 2D blocks as follows:

int dimx = 32;
int dimy = 32;
dim3 block(dimx, dimy);
// Round up so the grid covers the whole matrix even when nx, ny
// are not multiples of the block dimensions.
dim3 grid((nx + block.x - 1)/block.x, (ny + block.y - 1)/block.y);
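
With these values the grid works out to (16384 + 31)/32 = 512 blocks in each dimension, i.e. a 512×512 grid of 32×32 blocks: 262,144 blocks of 1,024 threads each, exactly one thread per matrix element.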

Put all of the code together in a file named sumMatrixOnGPU-2D-grid-2D-block.cu. The main function is shown in Listing 2-7.

Listing 2-7: Matrix addition with a 2D grid and 2D blocks (sumMatrixOnGPU-2D-grid-2D-block.cu)

int main(int argc, char **argv){
    printf("%s Starting...\n", argv[0]);

    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    CHECK(cudaSetDevice(dev));

    // set up data size of matrix
    int nx = 1<<14;
    int ny = 1<<14;

    int nxy = nx * ny;
    int nBytes = nxy * sizeof(float);
    printf("Matrix size: nx %d ny %d\n", nx, ny);

    // malloc host memory
    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A = (float *)malloc(nBytes);
    h_B = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef = (float *)malloc(nBytes);

    // initialize data at host side
    double iStart = cpuSecond();
    initialData(h_A, nxy);
    initialData(h_B, nxy);
    double iElaps = cpuSecond() - iStart;

    memset(hostRef, 0, nBytes);
    memset(gpuRef, 0, nBytes);

    // add matrix at host side for result checks
    iStart = cpuSecond();
    sumMatrixOnHost(h_A, h_B, hostRef, nx, ny);
    iElaps = cpuSecond() - iStart;

    // malloc device global memory
    float *d_MatA, *d_MatB, *d_MatC;
    cudaMalloc((void **)&d_MatA, nBytes);
    cudaMalloc((void **)&d_MatB, nBytes);
    cudaMalloc((void **)&d_MatC, nBytes);

    // transfer data from host to device
    cudaMemcpy(d_MatA, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_MatB, h_B, nBytes, cudaMemcpyHostToDevice);

    // invoke kernel at host side
    int dimx = 32;
    int dimy = 32;
    dim3 block(dimx, dimy);
    dim3 grid((nx+block.x-1)/block.x, (ny+block.y-1)/block.y);

    iStart = cpuSecond();
    sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
    cudaDeviceSynchronize();
    iElaps = cpuSecond() - iStart;
    printf("sumMatrixOnGPU2D<<<(%d,%d),(%d,%d)>>> elapsed %f sec\n",
           grid.x, grid.y, block.x, block.y, iElaps);

    // copy kernel result back to host side
    cudaMemcpy(gpuRef, d_MatC, nBytes, cudaMemcpyDeviceToHost);

    // check device results
    checkResult(hostRef, gpuRef, nxy);

    // free device global memory
    cudaFree(d_MatA);
    cudaFree(d_MatB);
    cudaFree(d_MatC);

    // free host memory
    free(h_A);
    free(h_B);
    free(hostRef);
    free(gpuRef);

    // reset device
    cudaDeviceReset();

    return (0);
}
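
Listing 2-7 relies on a few helpers (CHECK, cpuSecond, initialData, checkResult) that the book keeps in a shared header and that are not shown above. Below is a minimal sketch of those helpers, consistent with how the listing uses them; the exact definitions here are reconstructions, not the book's verbatim code. The full .cu file also needs these includes:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

// Abort with a message if a CUDA API call returns an error.
#define CHECK(call)                                                     \
{                                                                       \
    const cudaError_t error = (call);                                   \
    if (error != cudaSuccess)                                           \
    {                                                                   \
        printf("Error: %s:%d, code:%d, reason: %s\n",                   \
               __FILE__, __LINE__, error, cudaGetErrorString(error));   \
        exit(1);                                                        \
    }                                                                   \
}

// Wall-clock time in seconds (POSIX), used for coarse timing.
double cpuSecond(void){
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

// Fill an array with small pseudo-random values.
void initialData(float *ip, int size){
    for (int i = 0; i < size; i++)
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
}

// Compare host and device results element by element.
void checkResult(float *hostRef, float *gpuRef, const int N){
    double epsilon = 1.0E-8;
    for (int i = 0; i < N; i++){
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon){
            printf("Arrays do not match! host %5.2f gpu %5.2f at index %d\n",
                   hostRef[i], gpuRef[i], i);
            return;
        }
    }
    printf("Arrays match.\n");
}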

Compile and run the code with the following commands:

nvcc -arch=sm_20 sumMatrixOnGPU-2D-grid-2D-block.cu -o matrix2D
./matrix2D
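
Note that -arch=sm_20 targets the Fermi-class Tesla M2070 used in the book; recent CUDA toolkits have dropped sm_20 support, so on a newer GPU substitute the compute capability of your device, for example:

nvcc -arch=sm_70 sumMatrixOnGPU-2D-grid-2D-block.cu -o matrix2D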

The result of a run on a Tesla M2070:

[Figure: program output on the Tesla M2070 with 32×32 blocks]

Next, change the block dimensions to 32×16, then recompile and run. The kernel runs almost twice as fast:

[Figure: program output with 32×16 blocks]

You might wonder why merely changing the execution configuration nearly doubled the kernel's performance. Intuitively, you might reason that the second configuration has twice as many thread blocks as the first, and therefore twice the parallelism. That intuition is correct. However, if you shrink the blocks further to 16×16, you have quadrupled the number of blocks relative to the first configuration, and as shown below this configuration beats the first result but is slower than the second.

[Figure: program output with 16×16 blocks]

Table 2-3 summarizes the performance of the different execution configurations. The results show that increasing the number of blocks does not necessarily improve kernel performance. A sketch for repeating these experiments from the command line follows the table.

[Table 2-3: kernel elapsed time for the 32×32, 32×16, and 16×16 execution configurations]
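
To repeat these block-size experiments without editing and recompiling the source each time, one common variation (not part of the book's Listing 2-7; the parameter handling here is a hypothetical sketch) is to read the block dimensions from the command line:

// Hypothetical variation: read block dimensions from argv,
// e.g. "./matrix2D 32 16"; default to 32x32 when absent.
int dimx = 32, dimy = 32;
if(argc > 2){
    dimx = atoi(argv[1]);
    dimy = atoi(argv[2]);
}
dim3 block(dimx, dimy);
dim3 grid((nx + block.x - 1)/block.x, (ny + block.y - 1)/block.y);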

References

CUDA C编程权威指南 (Professional CUDA C Programming), 程润伟 (John Cheng), Max Grossman, Ty McKercher