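AVX Vectorization Study (2): Applying Memory Alignment

Simple operations with the AVX instruction set (aligned-memory version)

Adding two arrays of double using the AVX instruction set

Common memory-alignment functions

The aligned AVX load and store intrinsics used below (such as _mm256_load_pd and _mm256_stream_pd) require mem_addr to be aligned on a 32-byte boundary; otherwise a general-protection exception may be generated. The arrays therefore have to be allocated or declared with 32-byte alignment.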

1.

double*	a =(double*)memalign(32,9*sizeof(double));

2.

double*	a =(double*)_mm_malloc(9*sizeof(double),32);

3.

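double*	a =(double*)aligned_alloc(32,12*sizeof(double)); /* C11 requires the size to be a multiple of the alignment, so the 72 bytes needed for 9 doubles are rounded up to 96 */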

4.

__attribute__ ((aligned(32)))double a[9]  ={1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,2.1};
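The four options above can be checked with a small test. The following is a minimal sketch, assuming a C11 compiler; `is_aligned`, `round_up_32` and `alloc_doubles_32` are hypothetical helper names introduced only for illustration. It uses `aligned_alloc`, rounds the size up to a multiple of 32, and verifies that the returned address really sits on a 32-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero when p sits on an `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: round a byte count up to the next multiple of 32. */
size_t round_up_32(size_t bytes)
{
    return (bytes + 31) / 32 * 32;
}

/* Hypothetical helper: allocate room for n doubles on a 32-byte boundary.
   Returns NULL on failure; release the result with free(). */
double *alloc_doubles_32(size_t n)
{
    double *p = aligned_alloc(32, round_up_32(n * sizeof(double)));
    assert(p == NULL || is_aligned(p, 32));
    return p;
}
```

A call such as `double *a = alloc_doubles_32(9);` would replace the raw allocation calls shown above. Note that memory obtained from `_mm_malloc` has to be released with `_mm_free`, while memory from `memalign` and `aligned_alloc` is released with `free`.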

Introduction to the AVX functions used

1.

__m256d _mm256_load_pd (double const * mem_addr)

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
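To make the alignment requirement concrete, here is a small sketch that picks between the aligned load and its unaligned counterpart `_mm256_loadu_pd` depending on the address. `sum4` is a hypothetical helper, and the `target("avx")` attribute is a GCC/Clang feature assumed here so that the function compiles without passing `-mavx` for the whole file. In real code one would usually just pick one of the two loads; the branch is only there to show the difference.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: sum the four doubles starting at p. */
__attribute__((target("avx")))
double sum4(const double *p)
{
    double lanes[4];
    __m256d v;

    assert(p != NULL);
    if (((uintptr_t)p % 32) == 0)
        v = _mm256_load_pd(p);   /* aligned load: p must be 32-byte aligned */
    else
        v = _mm256_loadu_pd(p);  /* unaligned load: any address is allowed */

    _mm256_storeu_pd(lanes, v);  /* lanes[] has no alignment guarantee */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With a 32-byte aligned array `a`, `sum4(a)` takes the aligned path, while `sum4(a + 1)` starts 8 bytes past the boundary and takes the unaligned one.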

2.

__m256d _mm256_add_pd (__m256d a, __m256d b)

Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

Operation

FOR j := 0 to 3
i := j*64
dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0
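The sample program in this post works on double arrays, so the variant it calls is `_mm256_add_pd`, which adds four 64-bit lanes at once (the `_ps` variant adds eight 32-bit floats). A minimal sketch of one such addition follows; `add4_pd` is a hypothetical helper, and the GCC/Clang `target("avx")` attribute is assumed in place of the `-mavx` flag.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0..3.
   All three pointers are assumed to be 32-byte aligned. */
__attribute__((target("avx")))
void add4_pd(const double *a, const double *b, double *out)
{
    assert(((uintptr_t)a % 32) == 0);
    assert(((uintptr_t)b % 32) == 0);
    assert(((uintptr_t)out % 32) == 0);

    __m256d va = _mm256_load_pd(a);               /* 4 doubles from a */
    __m256d vb = _mm256_load_pd(b);               /* 4 doubles from b */
    _mm256_store_pd(out, _mm256_add_pd(va, vb));  /* lane-wise sum    */
}
```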

3. What stream does: write directly to memory, bypassing the cache

void _mm256_stream_pd (double * mem_addr, __m256d a)

Description

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Operation

MEM[mem_addr+255:mem_addr] := a[255:0]
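Non-temporal stores are mostly useful for data that will not be read again soon, such as filling a large output buffer. The sketch below shows that pattern under a few assumptions: `fill_stream` is a hypothetical helper, `dst` is assumed to be 32-byte aligned, and the GCC/Clang `target("avx")` attribute stands in for `-mavx`. It ends with `_mm_sfence()`, which is commonly used after a run of streaming stores to order them before later stores.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set dst[0..n-1] to value using non-temporal stores. */
__attribute__((target("avx")))
void fill_stream(double *dst, size_t n, double value)
{
    assert(((uintptr_t)dst % 32) == 0);

    __m256d v = _mm256_set1_pd(value);  /* broadcast value to all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm256_stream_pd(dst + i, v);   /* 4 doubles with a non-temporal hint */
    for (; i < n; i++)
        dst[i] = value;                 /* scalar tail for the leftover elements */

    _mm_sfence();                       /* order the streaming stores before later stores */
}
```

Whether streaming stores actually help depends on the buffer size and the access pattern, so this is a starting point for measurement rather than a guaranteed speed-up.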

Sample program:

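#include <stdio.h>
#include <stdlib.h>     /* free */
#include <malloc.h>     /* memalign */
#include <immintrin.h>  /* AVX intrinsics; compile with -mavx */

int main()
{
    /* All three arrays hold 9 doubles and start on a 32-byte boundary. */
    double* a = (double*)memalign(32, 9 * sizeof(double));
    double* b = (double*)memalign(32, 9 * sizeof(double));
    double* c = (double*)memalign(32, 9 * sizeof(double));
    double af[9] = {1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 2.1};
    double bf[9] = {2.1, 3.2, 6.4, 8.6, 3.7, 9.9, 5.1, 4.2, 6.6};
    if (!a || !b || !c)
    {
        return 1;
    }
    for (int i = 0; i < 9; i++)
    {
        a[i] = af[i];
        b[i] = bf[i];
    }
    int i = 0;
    __m256d v0;
    __m256d v1;
    __m256d v2;
    /* Vector part: 4 doubles per iteration (i = 0 and i = 4). */
    for (; i + 4 <= 9; i += 4)
    {
        v0 = _mm256_load_pd(a + i);
        v1 = _mm256_load_pd(b + i);
        v2 = _mm256_add_pd(v0, v1);
        _mm256_stream_pd(c + i, v2);
    }
    /* Scalar part: the one remaining element (i = 8). */
    for (; i < 9; i++)
    {
        c[i] = a[i] + b[i];
    }
    printf("this is c.\n");
    for (int i = 0; i < 9; i++)
    {
        printf("%lf\n", c[i]);
    }
    free(a);
    free(b);
    free(c);
    return 0;
}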
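The same loop structure (a vector body followed by a scalar tail) can be wrapped in a function that takes the length as a parameter. The following is a sketch under stated assumptions, not the program above: `add_arrays` is a hypothetical name, all three pointers are assumed to be 32-byte aligned, the GCC/Clang `target("avx")` attribute replaces `-mavx`, and a regular aligned store `_mm256_store_pd` is swapped in for the streaming store because the caller may read the result right away.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: c[k] = a[k] + b[k] for k = 0..n-1. */
__attribute__((target("avx")))
void add_arrays(const double *a, const double *b, double *c, size_t n)
{
    assert(((uintptr_t)a % 32) == 0);
    assert(((uintptr_t)b % 32) == 0);
    assert(((uintptr_t)c % 32) == 0);

    size_t i = 0;
    /* Vector body: runs while at least 4 elements remain. */
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_load_pd(a + i);
        __m256d vb = _mm256_load_pd(b + i);
        _mm256_store_pd(c + i, _mm256_add_pd(va, vb));
    }
    /* Scalar tail: handles the last n % 4 elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Because the index advances by 4 doubles (32 bytes) per iteration, `a + i`, `b + i` and `c + i` stay on a 32-byte boundary as long as the base pointers are aligned.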

Sample program output:

this is c.
3.200000
5.400000
9.700000
13.000000
9.200000
16.500000
12.800000
13.000000
8.700000

Related links

Intel® Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.