BLAS (Basic Linear Algebra Subprograms) - Amicoyuan

how-to-optimize-gemm

Project repository: flame/how-to-optimize-gemm (github.com)

Computing four elements of C at a time

Hiding computation in a subroutine - Amicoyuan (xingyuanjie.top)

Computing four elements at a time - Amicoyuan (xingyuanjie.top)

Further optimizing - Amicoyuan (xingyuanjie.top)

Computing a 4 x 4 block of C at a time

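To use vector instructions and vector registers effectively, we now compute a 4 x 4 block of C at a time. The idea is as follows. The SSE3 instruction set includes special instructions that perform two "multiply-accumulate" operations (two multiplies and two adds) per clock cycle, for a total of four floating-point operations per clock cycle. To use them, the data must be placed in "vector registers". There are sixteen such registers, and each can hold two double-precision numbers, so 32 double-precision numbers fit in registers at once. We will use sixteen of those doubles (eight registers) to hold the elements of C: one 4 x 4 block.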
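The register layout described above can be sketched as follows. This is a minimal illustration, not the project's actual kernel: the function name add_dot_4x4 is hypothetical, the column-major indexing macros are assumed to follow the project's convention, and the vector path uses SSE2 intrinsics from emmintrin.h with a separate multiply and add. A scalar fallback is included so the sketch still compiles on targets without SSE2.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: __m128d holds two doubles */
#endif

/* Column-major indexing (assumed to match the project's convention). */
#define A(i, j) a[(size_t)(j) * lda + (i)]
#define B(i, j) b[(size_t)(j) * ldb + (i)]
#define C(i, j) c[(size_t)(j) * ldc + (i)]

/* Hypothetical helper: C(0:3, 0:3) += A(0:3, 0:k-1) * B(0:k-1, 0:3). */
void add_dot_4x4(int k, const double *a, int lda,
                 const double *b, int ldb, double *c, int ldc)
{
#if defined(__SSE2__)
    /* Eight accumulators cover the 4 x 4 block:
       acc[j][0] holds C(0:1, j), acc[j][1] holds C(2:3, j).
       The compiler is expected (not guaranteed) to keep them in registers. */
    __m128d acc[4][2];
    for (int j = 0; j < 4; j++) {
        acc[j][0] = _mm_setzero_pd();
        acc[j][1] = _mm_setzero_pd();
    }

    for (int p = 0; p < k; p++) {
        /* Load column p of the 4-row panel of A as two vectors. */
        __m128d a01 = _mm_loadu_pd(&A(0, p));
        __m128d a23 = _mm_loadu_pd(&A(2, p));
        for (int j = 0; j < 4; j++) {
            /* Broadcast B(p, j) into both lanes. */
            __m128d bpj = _mm_set1_pd(B(p, j));
            acc[j][0] = _mm_add_pd(acc[j][0], _mm_mul_pd(a01, bpj));
            acc[j][1] = _mm_add_pd(acc[j][1], _mm_mul_pd(a23, bpj));
        }
    }

    /* Add the accumulated block into C. */
    for (int j = 0; j < 4; j++) {
        _mm_storeu_pd(&C(0, j), _mm_add_pd(_mm_loadu_pd(&C(0, j)), acc[j][0]));
        _mm_storeu_pd(&C(2, j), _mm_add_pd(_mm_loadu_pd(&C(2, j)), acc[j][1]));
    }
#else
    /* Scalar fallback with the same result, for targets without SSE2. */
    for (int p = 0; p < k; p++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                C(i, j) += A(i, p) * B(p, j);
#endif
}
```

A full matrix multiply would call this helper once for each 4 x 4 block of C, stepping i and j by four. Unaligned loads and stores are used so the sketch makes no assumption about how the matrices are allocated.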

Repeating the same optimizations - Amicoyuan (xingyuanjie.top)

Further optimizing - Amicoyuan (xingyuanjie.top)

Blocking to maintain performance - Amicoyuan (xingyuanjie.top)

Packing into contiguous memory - Amicoyuan (xingyuanjie.top)

Acknowledgement

This material was partially sponsored by grants from the National Science Foundation (Awards ACI-1148125/1340293).

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

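Paper reading notes

Paper reading: Implementation and Optimization of SpMV on the Domestic Sunway 26010 Many-Core Processor - Amicoyuan (xingyuanjie.top)

Paper reading: Performance Optimization of Sparse Matrix-Vector Multiplication on the Sunway Many-Core Architecture - Amicoyuan (xingyuanjie.top)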

References

Blogs:

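The Ultimate Optimization of Matrix Multiplication on GPUs: An In-Depth Look at How the Maxas Assembler Works - Jianshu (jianshu.com)

The OpenBLAS Project and Matrix Multiplication Optimization | AI Yanxishe | Leiphone (leiphone.com)

Matrix Multiplication and SIMD | Chenfan Blog (jcf94.com)

General Matrix Multiplication (GEMM) Optimization Algorithms | 黎明灰烬 Blog (zhenhuaw.me)

How Do the Experts Implement Matrix Multiplication Elegantly? - Zhihu (zhihu.com)

OpenBLAS GEMM from Scratch - Zhihu (zhihu.com)

Introduction · The Road to Becoming a CV Algorithm Engineer (harleyszhang.github.io)

GPU Optimization Made Accessible Series: GEMM Optimization (Part 1) - Zhihu (zhihu.com)

The Ultimate Guide to Optimizing CUDA Matrix Multiplication - Zhihu (zhihu.com)

Parallel Optimization of Matrix Multiplication (1): OpenMP and CUDA Implementations - Zhihu (zhihu.com)

Introduction to Parallel Computing: UIUC ECE408 Lectures 7 & 8 - Zhihu (zhihu.com)

Mobile ARM CPU Optimization Study Notes, Part 4: Getting Started with Inline Assembly - Zhihu (zhihu.com)

Inline Assembly in C - Zhihu (zhihu.com)

Learning Inline Assembly - Zhihu (zhihu.com)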

The memory in #define barrier() asm volatile("": : :"memory") Is a GCC Feature - unbutun's blog - CSDN Blog

MIPS Instruction Set: Introduction to Inline asm Syntax (daddu instruction) - 无色云's blog - CSDN Blog

Papers:

Publications Related to the FLAME Project (utexas.edu)

Anatomy of high-performance matrix multiplication | ACM Transactions on Mathematical Software

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning | Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

[1804.06826] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking (arxiv.org)

Fast implementation of DGEMM on Fermi GPU | IEEE Conference Publication | IEEE Xplore

High Performance is All about Minimizing Data Movement | Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (acm.org)

Communities/Forums:

2. Vector Add — Dive into Deep Learning Compiler 0.1 documentation (d2l.ai)

Intel® Intrinsics Guide

https://github.com/pytorch/QNNPACK

https://github.com/flame/blis

ulmBLAS (index) (uni-ulm.de)

work/sghpc (index) (uni-ulm.de)

The Science of High-Performance Computing Group (utexas.edu)

GitHub - BBuf/how-to-optimize-gemm

GitHub - Liu-xiandong/How_to_optimize_in_GPU: This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.

CUDA C++ Programming Guide (nvidia.com)

SGEMM · NervanaSystems/maxas Wiki · GitHub

GitHub - Cjkkkk/CUDA_gemm: A simple high performance CUDA GEMM implementation.

GitHub - yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs: Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/

Class Schedule - ECE408 - Illinois Wiki

GCC-Inline-Assembly-HOWTO (ibiblio.org)

The Missing Semester of Your CS Education, Chinese edition (missing-semester-cn.github.io)

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.
