
PyTorch Learning Notes

Preface

I used to work mainly with TensorFlow's Keras API, but one has to keep up with the times: sooner or later PyTorch will take over most of the market.

Prerequisites: matching CUDA, cuDNN, and TensorRT versions

After searching for quite a while, I finally found all of the information on the ONNX Runtime site; the link is:
https://onnxruntime.ai/docs/execution-providers
The version table (ONNX Runtime / TensorRT / CUDA) is as follows:
| ONNX Runtime | TensorRT | CUDA   |
|--------------|----------|--------|
| main         | 8.5      | 11.6   |
| 1.14         | 8.5      | 11.6   |
| 1.12-1.13    | 8.4      | 11.4   |
| 1.11         | 8.2      | 11.4   |
| 1.10         | 8.0      | 11.4   |
| 1.9          | 8.0      | 11.4   |
| 1.7-1.8      | 7.2      | 11.0.3 |
| 1.5-1.6      | 7.1      | 10.2   |
| 1.2-1.4      | 7.0      | 10.1   |
There is another, larger table that was only available as an image (not reproduced here).

Main

Convolutional neural networks

| Parameter | Type | Description |
|-----------|------|-------------|
| in_channels | int | Number of channels in the input image |
| out_channels | int | Number of channels produced by the convolution |
| kernel_size | int or tuple | Size of the convolving kernel; either a single int or an (int, int) tuple, e.g. (2, 3) is a kernel 2 high and 3 wide |
| stride | int or tuple, optional | Stride of the convolution; either a single int or an (int, int) tuple. Default: 1 |
| padding | int or tuple, optional | Zero-padding added to both sides of the input. Default: 0 |
| padding_mode | string, optional | 'zeros', 'reflect', 'replicate' or 'circular'. Default: 'zeros' |
| dilation | int or tuple, optional | Spacing between kernel elements (the distance between kernel points). Default: 1 |
| groups | int, optional | Number of blocked connections from input channels to output channels; controls grouped convolution. Default: 1 (no grouping) |
| bias | bool, optional | If True, adds a learnable bias to the output. Default: True |
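
As a quick reference, here is a minimal nn.Conv2d sketch; the channel counts and input size are purely illustrative:

import torch
import torch.nn as nn

# 3 input channels, 16 output channels, 3x3 kernel, stride 1, 1-pixel zero padding
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
y = conv(x)
print(y.shape)                 # torch.Size([1, 16, 32, 32]); padding=1 preserves the spatial size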

The vit-pytorch library

Usage:

import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)

Parameters:

  • image_size: int.
    Image size. If you have rectangular images, make sure image_size is the maximum of the width and height.
  • patch_size: int.
    Size of the patches. image_size must be divisible by patch_size.
    The number of patches is n = (image_size // patch_size) ** 2, and n must be greater than 16. (With image_size = 256 and patch_size = 32 above, n = 64.)
  • num_classes: int.
    Number of classes to classify.
  • dim: int.
    Last dimension of the output tensor after the linear transformation nn.Linear(..., dim).
  • depth: int.
    Number of Transformer blocks.
  • heads: int.
    Number of heads in the multi-head attention layers.
  • mlp_dim: int.
    Dimension of the MLP (feed-forward) layer.
  • channels: int, default 3.
    Number of image channels.
  • dropout: float between [0, 1], default 0.
    Dropout rate, usually around 0.2.
  • emb_dropout: float between [0, 1], default 0.
    Embedding dropout rate.
  • pool: string, either cls token pooling or mean pooling (see the sketch after this list).
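
For instance, switching from the default cls-token pooling to mean pooling is a single extra argument; a sketch reusing the settings above ('cls' is the vit-pytorch default):

from vit_pytorch import ViT

v_mean = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    pool = 'mean'   # pool the patch tokens by their mean instead of using the cls token
)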

Ways to inspect a model

PyTorch model files originally used the .pt extension; a bunch of aliases (.pth and so on) appeared later, but they are essentially the same thing. Because PyTorch recommends saving only the weights (the state_dict) rather than the model structure, a pretrained model we copy from elsewhere often cannot be inspected structurally in an intuitive way before fine-tuning. Here I summarize the approaches and the reasoning behind them.
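
As a reminder of the weights-only convention, a minimal sketch using the ViT instance v built above (the file name is arbitrary):

import torch

# save only the weights (the recommended way)
torch.save(v.state_dict(), 'my_vit.pth')

# to load them, the same architecture must first be rebuilt in code,
# because the structure itself is not stored in the file
v2 = ViT(image_size=256, patch_size=32, num_classes=1000, dim=1024,
         depth=6, heads=16, mlp_dim=2048)
v2.load_state_dict(torch.load('my_vit.pth'))
v2.eval()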

1. Call print(model) directly; it prints a long, nested block of text:

ViT(
  (to_patch_embedding): Sequential(
    (0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=32, p2=32)
    (1): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
    (2): Linear(in_features=3072, out_features=1024, bias=True)
    (3): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (transformer): Transformer(
    (layers): ModuleList(
      (0): ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (dropout): Dropout(p=0.1, inplace=False)
            (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate=none)
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (1): ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (dropout): Dropout(p=0.1, inplace=False)
            (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate=none)
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (2): ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (dropout): Dropout(p=0.1, inplace=False)
            (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate=none)
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (3): ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (dropout): Dropout(p=0.1, inplace=False)
            (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate=none)
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (4): ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (dropout): Dropout(p=0.1, inplace=False)
            (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate=none)
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (5): ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (dropout): Dropout(p=0.1, inplace=False)
            (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate=none)
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
    )
  )
  (to_latent): Identity()
  (mlp_head): Sequential(
    (0): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=1024, out_features=128, bias=True)
  )
)

Taking the ViT transformer as an example, every layer can be accessed as model.xx. If you want to change a layer, define a new one with torch.nn and assign it directly with the equals sign (see the sketch below). However, the structure printed by print is not always easy to read, and with a messy structure it is easy to pick the wrong spot, which is why a second method is needed.
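
A minimal sketch of this kind of replacement, assuming v is the ViT instance built earlier; the attribute name mlp_head and the new 128-dimensional output come from the dump above:

import torch.nn as nn

# swap the classification head: keep the LayerNorm, map 1024 features to 128 classes
v.mlp_head = nn.Sequential(
    nn.LayerNorm(1024),
    nn.Linear(1024, 128)
)
print(v.mlp_head)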

2. Use Netron (highly recommended)

If you call netron.start directly on a pretrained PyTorch model, you will find that only some of the boxes are displayed, with no connecting edges. So first create a random input tensor, export the model to ONNX, and view that instead: the ONNX graph shows the layer names, weights, and data flow in full, and you can inspect how any layer is used right in the browser. When we want to change a layer, taking the head of the ViT transformer as an example, assigning model.heads.head = ... is enough.
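
For reference, a minimal sketch of opening an exported ONNX file with the netron Python package (pip install netron; the path matches the export code below):

import netron

# serves a local web page that renders the full graph of the exported model
netron.start('./models/my_vit.onnx')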

The ONNX conversion code is as follows:

t = torch.from_numpy(img).cuda()  # img has shape (600, 800, 4), i.e. H x W x C
t = t.type(torch.float32)
t = t.permute(2, 0, 1)            # HWC -> CHW (permute, not reshape, so the pixels are not scrambled)
t = torch.unsqueeze(t, dim=0)     # add a batch dimension (equivalent to batch size 1)
torch.onnx.export(v, t, './models/my_vit.onnx')

Here v is the model to be converted and t is a single input (random data or a real sample both work); we just have to add one extra leading dimension to the input (equivalent to a batch size of 1). The third argument is the save path.
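
If no real sample is at hand, a random tensor of the expected input shape works just as well as the dummy input; a sketch for the 256x256 ViT above (the output path is arbitrary):

import torch

dummy = torch.randn(1, 3, 256, 256).cuda()  # batch of 1, matching the ViT input size
torch.onnx.export(v.cuda().eval(), dummy, './models/my_vit.onnx')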

As shown in the figure (not reproduced here), this is the transformer model after the head was changed to a 128-dimensional output.

Some notes on the Transformer and its variants

There was nowhere else to put these notes, so they are all collected here; please bear with the arrangement.

Swin Transformer

torchvision ships with this model, along with several pretrained weights. Of course, these models are best not used as-is; they still need some modification. The documentation link is:

SwinTransformer — Torchvision main documentation (pytorch.org)

After this model I appended a (1000, 3) fully connected layer (a sketch follows below).
Then I built a special dataset:
Dataset a: an object exactly covers the centre of the image.
Dataset b: the image contains no such object, or the object does not cover the centre of the image.
Judging from the results, the model is able to associate positional information, which is remarkable: it means that, if high precision is not required, the cumbersome object-detection task can be turned into a multi-class classification problem. This conclusion fits the original idea exactly: every task can be decomposed into a superposition of a finite number of classification problems.
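
A minimal sketch of the setup described above, assuming torchvision >= 0.13 (swin_t and Swin_T_Weights) with the default 1000-class head, followed by the extra (1000, 3) fully connected layer:

import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

# pretrained Swin-T backbone; its built-in head outputs 1000 ImageNet logits
backbone = swin_t(weights=Swin_T_Weights.DEFAULT)

# append a (1000, 3) fully connected layer for the 3-way classification
model = nn.Sequential(
    backbone,
    nn.Linear(1000, 3)
)

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 3])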