BLAS (Basic Linear Algebra Subprograms)
how-to-optimize-gemm
Project repository: flame/how-to-optimize-gemm (github.com)
Computing four elements of C at a time
Hiding computation in a subroutine - Amicoyuan (xingyuanjie.top)
Computing four elements at a time - Amicoyuan (xingyuanjie.top)
Further optimizing - Amicoyuan (xingyuanjie.top)
Computing a 4 x 4 block of C at a time
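To use vector instructions and vector registers effectively, we now compute a 4 x 4 block of C at a time. The idea is as follows: the SSE3 instruction set includes special instructions that perform two "multiply accumulate" operations (two multiplies and two adds) per clock cycle, for a total of four floating-point operations per clock cycle. To use them, the data must be placed in "vector registers". There are sixteen of these, each of which can hold two double-precision numbers, so we can keep 32 double-precision numbers in registers. We will use sixteen of them to hold elements of C, a 4 x 4 block.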
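As a rough illustration of the idea above, here is a minimal sketch of a 4 x 4 kernel that keeps the block of C in eight two-wide vector accumulators (16 doubles in total). The function name `add_dot_4x4`, the column-major layout, and the leading-dimension parameters `lda`, `ldb`, `ldc` are assumptions made for this example and are not taken from the project's code. It uses the SSE2 intrinsic `_mm_set1_pd` to broadcast an element of B, instead of an SSE3 load-and-duplicate, so that it should build on x86-64 without extra compiler flags; on other targets it falls back to plain scalar loops.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_loadu_pd, ... */
#endif

/* Hypothetical helper (name and signature are assumptions):
 * computes C(0:3, 0:3) += A(0:3, 0:k-1) * B(0:k-1, 0:3).
 * All matrices are column-major: X(i, j) is stored at x[i + j * ldx]. */
void add_dot_4x4(int k, const double *a, int lda,
                 const double *b, int ldb, double *c, int ldc)
{
#if defined(__SSE2__)
    /* Eight accumulators of two doubles each hold the 4 x 4 block of C:
     * c01[j] holds rows 0-1 of column j, c23[j] holds rows 2-3.
     * An optimizing compiler is expected to keep them in registers. */
    __m128d c01[4], c23[4];
    for (int j = 0; j < 4; j++) {
        c01[j] = _mm_setzero_pd();
        c23[j] = _mm_setzero_pd();
    }

    for (int p = 0; p < k; p++) {
        /* Load four elements of column p of A as two vectors. */
        __m128d a01 = _mm_loadu_pd(&a[0 + p * lda]);
        __m128d a23 = _mm_loadu_pd(&a[2 + p * lda]);
        for (int j = 0; j < 4; j++) {
            /* Broadcast B(p, j) into both lanes of a vector. */
            __m128d bpj = _mm_set1_pd(b[p + j * ldb]);
            /* Each line is two multiplies and two adds. */
            c01[j] = _mm_add_pd(c01[j], _mm_mul_pd(a01, bpj));
            c23[j] = _mm_add_pd(c23[j], _mm_mul_pd(a23, bpj));
        }
    }

    /* Add the accumulated block back into C. */
    for (int j = 0; j < 4; j++) {
        double *cj = &c[j * ldc];
        _mm_storeu_pd(cj, _mm_add_pd(_mm_loadu_pd(cj), c01[j]));
        _mm_storeu_pd(cj + 2, _mm_add_pd(_mm_loadu_pd(cj + 2), c23[j]));
    }
#else
    /* Scalar fallback for targets without SSE2. */
    for (int p = 0; p < k; p++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                c[i + j * ldc] += a[i + p * lda] * b[p + j * ldb];
#endif
}
```

A driver would typically call such a kernel once per 4 x 4 block, stepping over the rows and columns of C in increments of four; that assumes the matrix dimensions are multiples of four, otherwise the edges need separate handling.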
Repeating the same optimizations - Amicoyuan (xingyuanjie.top)
Further optimizing - Amicoyuan (xingyuanjie.top)
Blocking to maintain performance - Amicoyuan (xingyuanjie.top)
Packing into contiguous memory - Amicoyuan (xingyuanjie.top)
Acknowledgement
This material was partially sponsored by grants from the National Science Foundation (Awards ACI-1148125/1340293).
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
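Paper reading
Paper reading: Implementation and optimization of SpMV on the domestic Sunway 26010 many-core processor - Amicoyuan (xingyuanjie.top)
Paper reading: Performance optimization of sparse matrix-vector multiplication on the Sunway many-core architecture - Amicoyuan (xingyuanjie.top)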
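References
Blogs:
The ultimate optimization of matrix multiplication on GPU: an in-depth analysis of how the Maxas assembler works - Jianshu (jianshu.com)
The OpenBLAS project and matrix multiplication optimization | AI Yanxishe | Leiphone (leiphone.com)
Matrix multiplication and SIMD | Chenfan Blog (jcf94.com)
General matrix multiplication (GEMM) optimization algorithms | 黎明灰烬 blog (zhenhuaw.me)
How do the experts implement matrix multiplication elegantly? - Zhihu (zhihu.com)
OpenBLAS gemm from scratch - Zhihu (zhihu.com)
Introduction · The growth path of a CV algorithm engineer (harleyszhang.github.io)
GPU optimization made accessible series: GEMM optimization (1) - Zhihu (zhihu.com)
The ultimate guide to optimizing CUDA matrix multiplication - Zhihu (zhihu.com)
Parallel optimization of matrix multiplication (1): OPENMP and CUDA implementations - Zhihu (zhihu.com)
Introduction to parallel computing: UIUC ECE408 Lecture 7&8 - Zhihu (zhihu.com)
Mobile ARM CPU optimization study notes, part 4: introduction to inline assembly - Zhihu (zhihu.com)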
#define barrier() asm volatile("": : :"memory"): the "memory" here is a GCC thing - unbutun's blog (CSDN)
MIPS instruction set: introduction to inline assembly asm syntax (daddu instruction) - 无色云's blog (CSDN)
Papers:
Publications Related to the FLAME Project (utexas.edu)
Anatomy of high-performance matrix multiplication | ACM Transactions on Mathematical Software
[1804.06826] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking (arxiv.org)
Fast implementation of DGEMM on Fermi GPU | IEEE Conference Publication | IEEE Xplore
Communities/forums:
2. Vector Add — Dive into Deep Learning Compiler 0.1 documentation (d2l.ai)
https://github.com/pytorch/QNNPACK
work/sghpc (index) (uni-ulm.de)
The Science of High-Performance Computing Group (utexas.edu)
GitHub - BBuf/how-to-optimize-gemm
CUDA C++ Programming Guide (nvidia.com)
SGEMM · NervanaSystems/maxas Wiki · GitHub
GitHub - Cjkkkk/CUDA_gemm: A simple high performance CUDA GEMM implementation.
https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
Class Schedule - ECE408 - Illinois Wiki
GCC-Inline-Assembly-HOWTO (ibiblio.org)
The missing lesson in computer science education · the missing semester of your cs education (missing-semester-cn.github.io)
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!