[Beginner's Deep Tutorial 1.10] A Step-by-Step Guide to Disparity Estimation with Deep Learning (PSMNet), with a Python Code Walkthrough

Article link: https://blog.csdn.net/2401_87064292/article/details/141956535

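In an earlier chapter, we showed how to build a stereo camera and use stereo matching for binocular depth estimation:

A Step-by-Step Guide to Building a Low-Cost Stereo Camera with OpenCV (Python and C++ Code)

However, traditional disparity matching and estimation methods have limitations, for example in textureless regions, repetitive patterns, and occlusions.

So here we introduce disparity estimation methods based on deep learning.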

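1. Introduction

Conventional video and images capture the 3D world in 2D and lose the depth information that many applications need. Depth estimation is a challenging problem, and many approaches have been proposed to solve it.

The most popular setup is stereo vision: it uses a pair of cameras, finds corresponding points in the two views, and estimates depth from the disparity between them. More recently, deep-learning-based methods have also been used to estimate disparity from stereo image pairs.

In this article, we discuss some of these methods.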

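2. Classical Methods for Disparity Estimation

The classical way to generate a disparity map from a pair of rectified stereo images is stereo matching. It finds corresponding pixels by comparing pixel neighborhoods in the left and right images. The horizontal offset between each pair of corresponding pixels is the disparity, and collecting these offsets for every pixel yields the disparity map.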
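
To make the neighborhood comparison concrete, here is a minimal brute-force sketch of block matching with a sum-of-absolute-differences (SAD) cost. It is an illustration written for this article, not the OpenCV implementation; the function name `block_matching_sad`, the window size, and the synthetic test pair are all assumptions.

```python
import numpy as np

def block_matching_sad(left, right, max_disp, half_win=2):
    """Brute-force SAD block matching on a rectified grayscale pair."""
    h, w = left.shape
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(half_win, h - half_win):
        for x in range(half_win, w - half_win):
            patch = left[y - half_win:y + half_win + 1,
                         x - half_win:x + half_win + 1]
            best_cost, best_d = np.inf, 0
            # A point at column x in the left image appears at column x - d
            # in the right image; only try shifts that stay inside the image.
            for d in range(0, min(max_disp, x - half_win) + 1):
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                cost = np.abs(patch - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity

# Synthetic pair (hypothetical data): the left view is the right view
# shifted by 3 pixels, so the true disparity is 3 wherever it is observable.
rng = np.random.default_rng(0)
right_img = rng.random((20, 40))
left_img = np.roll(right_img, 3, axis=1)
disp = block_matching_sad(left_img, right_img, max_disp=8)
print(disp[10, 5:15])
```

On real images the cost curve is rarely this clean, which is where the limitations mentioned earlier come from.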


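2. Deep-Learning-Based Disparity Estimation

2.1. Datasets

A good dataset is at the heart of any deep-learning-based solution, and disparity estimation is no exception.

The most popular dataset is KITTI.

Other useful datasets include SceneFlow, Middlebury, and Holopix50K.

2.2. Estimating Disparity with a Network

We want to estimate disparity from a pair of rectified images through dense stereo matching, and this can also be done with a deep-learning model. One of the earliest solutions was the Matching Cost Convolutional Neural Network (MC-CNN), and many other methods followed.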

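Two common approaches are:

2.2.1 Direct Regression Methods

Direct regression methods try to estimate the per-pixel disparity directly from the input images, without taking the geometric constraints of stereo matching into account. This is a purely data-driven approach that relies on large U-shaped 2D convolutional networks. However, ignoring the geometric constraints makes them perform worse than volumetric methods.

2.2.2 Volumetric Methods

Volumetric methods build on the idea of semi-global matching: they construct a 4D feature volume by concatenating the features at every disparity shift. They have four main components:

- A feature network, which extracts features from the input images
- A cost volume module, which concatenates the features extracted from the left and right images
- A matching network, which computes the matching cost from the 4D feature volume using 3D convolutions
- A regression module, which performs disparity regression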

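3. A Closer Look at the PSMNet Architecture

We will study the best-known volumetric method for disparity estimation, the Pyramid Stereo Matching Network (PSMNet).

PSMNet is no longer the state of the art in stereo matching, but many volumetric stereo matching networks were inspired by it, which makes it a good starting point for learning network-based stereo matching.

Let's look at the PSMNet architecture:

[Figure: PSMNet architecture]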

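3.1. 2D Feature Extraction

A convolutional neural network (CNN) extracts features from an image, such as edges or textures. In PSMNet:

- The feature maps output by the CNN are 1/4 the size of the input image and summarize the features detected in the input.
- The CNNs applied to the two input images share the same weights, so that similar features are extracted from both images, which is essential for the steps that follow.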
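
Where the 1/4 factor comes from can be checked with the usual convolution output-size formula. The sketch below is plain Python; the helper name `conv_out_size` and the 256-pixel input are assumptions made for illustration, while the strides, padding, and dilation are the ones used in this article's feature extractor listing.

```python
def conv_out_size(n, kernel=3, stride=1, pad=1, dilation=1):
    """Output length of a convolution along one axis."""
    return (n + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

h = 256                                   # hypothetical input height
h = conv_out_size(h, stride=2)            # first conv of firstconv: 256 -> 128
h = conv_out_size(h, stride=2)            # first block of layer2:   128 -> 64
h = conv_out_size(h, pad=2, dilation=2)   # dilated convs in layer4 keep the size
print(h)  # 64, i.e. 256 / 4
```

Only two layers use stride 2, so each spatial side shrinks by a factor of 4 overall. The dilated convolutions enlarge the receptive field without reducing the resolution further.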

Here is the corresponding code:

# importing modules
import torch
import torch.nn as nn
import numpy as np

# helper: 2D convolution followed by batch normalization
def convbn(in_planes, out_planes, kernel_size, stride, pad, dilation):
    return nn.Sequential(
        nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride,
                  padding=dilation if dilation > 1 else pad,
                  dilation=dilation, bias=False),
        nn.BatchNorm2d(out_planes))

# define Basic Block: two 3x3 conv layers with a residual connection
class BasicBlock(nn.Module):
    expansion = 1
    def __init__(self, inplanes, planes, stride, downsample, pad, dilation):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Sequential(convbn(inplanes, planes, 3, stride, pad, dilation),
                                   nn.ReLU(inplace=True))
        self.conv2 = convbn(planes, planes, 3, 1, pad, dilation)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        out = self.conv1(x)
        out = self.conv2(out)

        # project the input when its shape differs from the block output
        if self.downsample is not None:
            x = self.downsample(x)

        out += x
        return out

# Define the 2D Feature Extraction Module
class CNN_2D(nn.Module):
    def __init__(self):
        super(CNN_2D, self).__init__()
        self.inplanes = 32
        # the first conv has stride 2: resolution drops to 1/2
        self.firstconv = nn.Sequential(convbn(3, 32, 3, 2, 1, 1),
                                       nn.ReLU(inplace=True),
                                       convbn(32, 32, 3, 1, 1, 1),
                                       nn.ReLU(inplace=True),
                                       convbn(32, 32, 3, 1, 1, 1),
                                       nn.ReLU(inplace=True))

        self.layer1 = self._make_layer(BasicBlock, 32, 3, 1, 1, 1)
        # layer2 has stride 2: resolution drops to 1/4
        self.layer2 = self._make_layer(BasicBlock, 64, 16, 2, 1, 1)
        self.layer3 = self._make_layer(BasicBlock, 128, 3, 1, 1, 1)
        # layer4 uses dilation 2 to enlarge the receptive field
        self.layer4 = self._make_layer(BasicBlock, 128, 3, 1, 1, 2)

    def _make_layer(self, block, planes, blocks, stride, pad, dilation):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(nn.Conv2d(self.inplanes, planes * block.expansion,
                                                 kernel_size=1, stride=stride,
                                                 bias=False),
                                       nn.BatchNorm2d(planes * block.expansion))
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, pad, dilation))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, 1, None, pad, dilation))

        return nn.Sequential(*layers)

    def forward(self, x):
        output = self.firstconv(x)
        output = self.layer1(output)
        output_raw = self.layer2(output)
        output = self.layer3(output_raw)
        output_skip = self.layer4(output)

        # output_raw (64 channels) and output_skip (128 channels) are both
        # needed for the 320-channel concatenation in the SPP module
        return output_raw, output_skip

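3.2. SPP Module

The SPP (spatial pyramid pooling) module extracts feature maps at different scales. The figure below illustrates the SPP module in detail.

[Figure: SPP module]

In the SPP module:

To capture context at different scales, the feature map is passed through average pooling filters of four different sizes:
- 64×64
- 32×32
- 16×16
- 8×8
Each pooled map then goes through a 1×1 convolution to reduce its channel dimension.
Next, the maps are upsampled back to the size of the original feature map.
Finally, they are concatenated with the original feature map.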
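
The four steps above can be traced on a single feature map with plain NumPy. This is a simplified sketch rather than the PSMNet layer itself: batch normalization is left out, the 1×1 convolution weights are random stand-ins for learned ones, and nearest-neighbor upsampling replaces bilinear interpolation. All helper names and the 128×64×128 input size are assumptions.

```python
import numpy as np

def avg_pool(x, k):
    """Non-overlapping k x k average pooling on a (C, H, W) array.
    H and W must be multiples of k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.
    weight has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, x)

def upsample_nearest(x, k):
    """Nearest-neighbor upsampling by an integer factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def spp_sketch(features, pool_sizes=(64, 32, 16, 8), branch_channels=32, seed=0):
    rng = np.random.default_rng(seed)
    c = features.shape[0]
    branches = []
    for k in pool_sizes:
        pooled = avg_pool(features, k)                        # step 1: pool
        weight = rng.standard_normal((branch_channels, c)) / np.sqrt(c)
        reduced = np.maximum(conv1x1(pooled, weight), 0.0)    # step 2: 1x1 conv + ReLU
        branches.append(upsample_nearest(reduced, k))         # step 3: upsample
    return np.concatenate([features] + branches, axis=0)      # step 4: concatenate

features = np.random.default_rng(1).standard_normal((128, 64, 128))
out = spp_sketch(features)
print(out.shape)  # (256, 64, 128): 128 original + 4 * 32 branch channels
```

The coarsest branch sees the feature map as only 1×2 cells, so every output pixel carries a summary of a very large image region alongside its local features.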

Let's look at the code for the SPP module:

# Define the SPP module
# (uses convbn from the 2D feature extraction listing above)
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP_Module(nn.Module):
    def __init__(self):
        super(SPP_Module, self).__init__()

        # four average-pooling branches, each followed by a 1x1 convolution
        self.branch1 = nn.Sequential(nn.AvgPool2d((64, 64), stride=(64, 64)),
                                     convbn(128, 32, 1, 1, 0, 1),
                                     nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(nn.AvgPool2d((32, 32), stride=(32, 32)),
                                     convbn(128, 32, 1, 1, 0, 1),
                                     nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.AvgPool2d((16, 16), stride=(16, 16)),
                                     convbn(128, 32, 1, 1, 0, 1),
                                     nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(nn.AvgPool2d((8, 8), stride=(8, 8)),
                                     convbn(128, 32, 1, 1, 0, 1),
                                     nn.ReLU(inplace=True))
        # fuse the concatenation: 64 (output_raw) + 128 (output_skip)
        # + 4 * 32 (branches) = 320 channels
        self.lastconv = nn.Sequential(convbn(320, 128, 3, 1, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv2d(128, 32, kernel_size=1,
                                                padding=0, stride=1, bias=False))

    def forward(self, output_raw, output_skip):
        # output_raw: 64-channel map from layer2; output_skip: 128-channel map from layer4
        size = (output_skip.size()[2], output_skip.size()[3])

        # pool, reduce channels, then upsample each branch back to the input size
        op1 = self.branch1(output_skip)
        op1 = F.interpolate(op1, size, mode='bilinear', align_corners=False)

        op2 = self.branch2(output_skip)
        op2 = F.interpolate(op2, size, mode='bilinear', align_corners=False)

        op3 = self.branch3(output_skip)
        op3 = F.interpolate(op3, size, mode='bilinear', align_corners=False)

        op4 = self.branch4(output_skip)
        op4 = F.interpolate(op4, size, mode='bilinear', align_corners=False)

        op_concat = torch.cat((output_raw, output_skip, op4, op3, op2, op1), 1)
        op_concat = self.lastconv(op_concat)

        return op_concat

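3.3. Cost Volume

After obtaining feature maps at different scales, we need to combine the features from the left and right images. The cost volume module:

- Concatenates the features extracted from the left and right images.
- Stores, for each candidate disparity, the information needed to measure the distance between the paired features.

This yields a 4D array of size height × width × (max disparity + 1) × feature size, called the cost volume. The PSMNet code below builds it at 1/4 resolution, so its tensor shape is batch × 2C × maxdisp/4 × H/4 × W/4, where C is the number of feature channels.

Each voxel of the cost volume holds the matching cost of the corresponding features projected from the left and right views. Computing the cost volume is expensive in both computation and memory. For a survey of cost volume techniques, see this study, which also proposes an optimized cost volume estimation method that produces a "dense" 3D cost volume.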
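
To get a feel for the memory cost, the following sketch computes the size of a concatenation-based cost volume built at 1/4 resolution. The function name and the example numbers (a 256×512 crop, maximum disparity 192, 32 feature channels per image, 32-bit floats) are assumptions chosen for illustration.

```python
def cost_volume_bytes(height, width, max_disp, feat_channels=32,
                      batch=1, scale=4, bytes_per_value=4):
    """Shape and size in bytes of a concatenation cost volume at 1/scale resolution."""
    shape = (batch, 2 * feat_channels, max_disp // scale,
             height // scale, width // scale)
    n_values = 1
    for s in shape:
        n_values *= s
    return shape, n_values * bytes_per_value

shape, n_bytes = cost_volume_bytes(height=256, width=512, max_disp=192)
print(shape, n_bytes / 2**20, "MiB")  # (1, 64, 48, 64, 128) 96.0 MiB
```

That is 96 MiB for a single sample before any 3D convolution has run, and every intermediate activation of the 3D CNN has a comparable size, which is why volumetric methods are memory hungry.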

Here is the code that computes the cost volume in PSMNet:

# import modules
import torch

# build the concatenation cost volume
def build_cost_volume(left_img_ftrs, rght_img_ftrs, maxdisp):
    batch, channels, height, width = left_img_ftrs.size()
    # features are at 1/4 resolution, so only maxdisp // 4 shifts are needed
    cost = left_img_ftrs.new_zeros(batch, channels * 2, maxdisp // 4, height, width)

    # concatenate the features in the cost volume
    for i in range(maxdisp // 4):
        if i > 0:
            # left pixel x is paired with right pixel x - i;
            # columns x < i have no partner and stay zero
            cost[:, :channels, i, :, i:] = left_img_ftrs[:, :, :, i:]
            cost[:, channels:, i, :, i:] = rght_img_ftrs[:, :, :, :-i]
        else:
            cost[:, :channels, i, :, :] = left_img_ftrs
            cost[:, channels:, i, :, :] = rght_img_ftrs

    return cost.contiguous()
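
A quick way to see what the concatenation stores is to build the volume for features whose true shift is known. The NumPy sketch below mirrors the indexing of the listing above for a single sample. The function name `build_cost_volume_np`, the random features, and the mismatch measure are illustrative assumptions; in PSMNet the comparison of the two halves is learned by the 3D CNN instead.

```python
import numpy as np

def build_cost_volume_np(left_feat, right_feat, num_disp):
    """Concatenation cost volume for (C, H, W) feature maps.
    Returns an array of shape (2C, num_disp, H, W)."""
    c, h, w = left_feat.shape
    cost = np.zeros((2 * c, num_disp, h, w), dtype=left_feat.dtype)
    for d in range(num_disp):
        # left pixel x is paired with right pixel x - d
        cost[:c, d, :, d:] = left_feat[:, :, d:]
        cost[c:, d, :, d:] = right_feat[:, :, :w - d]
    return cost

# Hypothetical features: the left map is the right map shifted by 3 columns.
rng = np.random.default_rng(0)
right_feat = rng.standard_normal((4, 6, 16))
left_feat = np.roll(right_feat, 3, axis=2)

cost = build_cost_volume_np(left_feat, right_feat, num_disp=8)
c = left_feat.shape[0]
# Hand-written stand-in for the learned matching: mean absolute difference
# between the two halves, per disparity, over columns valid for every shift.
mismatch = [np.abs(cost[:c, d, :, 8:] - cost[c:, d, :, 8:]).mean()
            for d in range(8)]
best = int(np.argmin(mismatch))
print(cost.shape, best)  # (8, 8, 6, 16) 3
```

At the true disparity the two halves of the channel axis hold identical features, so the mismatch is zero there and positive elsewhere.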


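3.4. 3D CNN and Regression Module

The cost volume module gives us a 4D cost volume, and we now need to aggregate information along two kinds of dimensions:

- The disparity dimension (the third dimension of the cost volume tensor)
- The spatial dimensions (height and width, the fourth and fifth dimensions of the tensor)

For this we use a 3D CNN. PSMNet has two 3D CNN architectures:

The first is the basic architecture, which uses twelve 3×3×3 convolutional layers with skip connections, as shown below:

[Figure: basic 3D CNN architecture]

Here is the corresponding code: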

# the 3D Feature matching module
# (convbn_3d is defined in the stacked hourglass listing further below)
import torch.nn as nn

class CNN_3D_basic(nn.Module):
    def __init__(self):
        super(CNN_3D_basic, self).__init__()
        # define layers for 3D Feature matching module
        self.dres0 = nn.Sequential(convbn_3d(64, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True))
 
        self.dres1 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1))
 
        self.dres2 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1))
 
        self.dres3 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1))
 
        self.dres4 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1))
 
        self.dres5 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   nn.Conv3d(32,1,kernel_size=3,
                                             padding=1,
                                             stride=1,
                                             bias=False))
 
    def forward(self, cost_4D):
        # combine all the layers to make the 3D Feature matching net
        cost0 = self.dres0(cost_4D)
        cost0 = self.dres1(cost0) + cost0
        cost0 = self.dres2(cost0) + cost0
        cost0 = self.dres3(cost0) + cost0
        cost0 = self.dres4(cost0) + cost0
        cost = self.dres5(cost0)
        return cost

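To capture more context information than the basic architecture can, PSMNet also has an encoder-decoder variant called the stacked hourglass architecture.

[Figure: 3D CNN module of the stacked hourglass variant of the network]

Next, let's look at its code: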


# Import modules
import torch
import torch.nn as nn
import torch.nn.functional as F

# helper: 3D convolution followed by batch normalization
def convbn_3d(in_planes, out_planes, kernel_size, stride, pad):
    return nn.Sequential(nn.Conv3d(in_planes, out_planes, kernel_size=kernel_size,
                                   padding=pad, stride=stride, bias=False),
                         nn.BatchNorm3d(out_planes))

# define the hourglass structures for stacked hourglass architecture
class hourglass(nn.Module):
    def __init__(self, inplanes):
        super(hourglass, self).__init__()
        # encoder: conv1 and conv3 use stride 2, so the volume is halved twice
        self.conv1 = nn.Sequential(convbn_3d(inplanes, inplanes*2,
                                             kernel_size=3,
                                             stride=2, pad=1),
                                   nn.ReLU(inplace=True))

        self.conv2 = convbn_3d(inplanes*2, inplanes*2, kernel_size=3,
                               stride=1, pad=1)

        self.conv3 = nn.Sequential(convbn_3d(inplanes*2, inplanes*2,
                                             kernel_size=3,
                                             stride=2, pad=1),
                                   nn.ReLU(inplace=True))

        self.conv4 = nn.Sequential(convbn_3d(inplanes*2, inplanes*2,
                                             kernel_size=3,
                                             stride=1, pad=1),
                                   nn.ReLU(inplace=True))

        # decoder: two transposed convolutions undo the two stride-2 stages
        self.conv5 = nn.Sequential(nn.ConvTranspose3d(inplanes*2,
                                                      inplanes*2,
                                                      kernel_size=3,
                                                      stride=2, padding=1,
                                                      output_padding=1,
                                                      bias=False),
                                   nn.BatchNorm3d(inplanes*2))

        self.conv6 = nn.Sequential(nn.ConvTranspose3d(inplanes*2,
                                                      inplanes,
                                                      kernel_size=3,
                                                      stride=2, padding=1,
                                                      output_padding=1,
                                                      bias=False),
                                   nn.BatchNorm3d(inplanes))

    def forward(self, x, pre_skip, post_skip):
        # combine all the layers to make the hourglass structures
        out = self.conv1(x)       # 1/2 of the input volume
        pre = self.conv2(out)

        # skip connection from the decoder of the previous hourglass
        if post_skip is not None:
            pre = pre + post_skip

        pre = F.relu(pre, inplace=True)
        out = self.conv3(pre)     # 1/4 of the input volume
        out = self.conv4(out)

        post = self.conv5(out)    # back to 1/2
        # skip connection from the encoder of an earlier hourglass, or from this one
        if pre_skip is not None:
            post = post + pre_skip
        else:
            post = post + pre

        post = F.relu(post, inplace=True)
        out = self.conv6(post)    # back to the input size
        return out, pre, post
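# Disparity regression module
class disparityregression(nn.Module):
    def __init__(self, maxdisp):
        super(disparityregression, self).__init__()
        # candidate disparities 0 .. maxdisp-1, shaped to broadcast over (B, D, H, W)
        self.disp = torch.arange(maxdisp, dtype=torch.float32).view(1, maxdisp, 1, 1)

    def forward(self, x):
        # x holds per-pixel probabilities over disparities; return their expectation
        disp = self.disp.to(x.device)
        out = torch.sum(x * disp, 1, keepdim=True)
        return out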
# define stacked hourglass module
class CNN_3D_hourglass(nn.Module):
    def __init__(self, maxdisp):
        super(CNN_3D_hourglass, self).__init__()
        self.maxdisp = maxdisp
        # define the layers for stacked hourglass module
        self.dres0 = nn.Sequential(convbn_3d(64, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True))
        self.dres1 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(32, 32, 3, 1, 1))
        # use the hourglass modules created
        self.dres2 = hourglass(32)
        self.dres3 = hourglass(32)
        self.dres4 = hourglass(32)
        # one output head per hourglass, each producing a single-channel volume
        self.dres5 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   nn.Conv3d(32, 1, kernel_size=3,
                                             padding=1,
                                             stride=1,
                                             bias=False))

        self.dres6 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   nn.Conv3d(32, 1, kernel_size=3,
                                             padding=1,
                                             stride=1,
                                             bias=False))

        self.dres7 = nn.Sequential(convbn_3d(32, 32, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   nn.Conv3d(32, 1, kernel_size=3,
                                             padding=1,
                                             stride=1,
                                             bias=False))

    def forward(self, cost, left):
        # `left` is the left input image tensor (B, 3, H, W); only its size is used
        # combine the layers to form the stacked hourglass module
        cost0 = self.dres0(cost)
        cost0 = self.dres1(cost0) + cost0
        out1, pre1, post1 = self.dres2(cost0, None, None)
        out1 = out1 + cost0
        out2, pre2, post2 = self.dres3(out1, pre1, post1)
        out2 = out2 + cost0
        out3, pre3, post3 = self.dres4(out2, pre2, post2)
        out3 = out3 + cost0
        # each hourglass output goes through its own head; the results accumulate
        cost1 = self.dres5(out1)
        cost2 = self.dres6(out2) + cost1
        cost3 = self.dres7(out3) + cost2
        # upsample to full resolution: (B, 1, maxdisp, H, W)
        cost3 = F.interpolate(cost3, [self.maxdisp, left.size()[2],
                                      left.size()[3]],
                              mode='trilinear', align_corners=False)
        cost3 = torch.squeeze(cost3, 1)
        pred3 = F.softmax(cost3, dim=1)

        # also add the disparityregression module
        pred3 = disparityregression(self.maxdisp)(pred3)
        return pred3

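- The output of the 3D CNN is upsampled by interpolation to a size of height × width × (max disparity + 1). Trilinear interpolation is an upsampling method that sets the value of each newly created cell to a distance-weighted average of the nearest cells in the 3D volume.
- After interpolation, we have a 3D cost volume of size height × width × (max disparity + 1), which contains (max disparity + 1) disparity images of size height × width.
- A softmax operation normalizes the values along the disparity dimension into probabilities.
- The normalized probabilities are multiplied by their disparity values and summed to obtain the final disparity.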
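
The last two steps fit in a few lines of NumPy. This is a sketch under the assumption that a higher score means a better match, which is how the listing above uses the network output; the function names and the toy scores are made up for illustration.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def regress_disparity(scores):
    """scores: (D, H, W) array, higher score = better match at that disparity."""
    prob = softmax(scores, axis=0)
    candidates = np.arange(scores.shape[0], dtype=np.float64).reshape(-1, 1, 1)
    return (prob * candidates).sum(axis=0)    # expected disparity per pixel

scores = np.full((5, 1, 2), -20.0)
scores[2, 0, 0] = 20.0                        # pixel 0: one sharp peak at d = 2
scores[1, 0, 1] = scores[2, 0, 1] = 20.0      # pixel 1: equal peaks at d = 1 and d = 2
disp = regress_disparity(scores)
print(disp)  # approximately 2.0 for pixel 0 and 1.5 for pixel 1
```

The second pixel shows why the result is continuous: when the probability mass is split between two neighboring disparities, the regression returns a sub-pixel value between them instead of picking one.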

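Note that softmax generalizes the logistic (sigmoid) function to multiple classes. It is commonly used as the last activation function to normalize a network's outputs into a probability distribution. Its formula is:

[Image: softmax formula]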
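
For reference, the standard softmax over K scores can be written as follows. The symbols z and K are notation chosen here and are not taken from the original image.

```latex
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```

Each output lies between 0 and 1 and the K outputs sum to 1, which is what lets them be read as a probability distribution over disparities.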

Plugging softmax into the weighted disparity formula gives the disparity of a pixel:

[Image: disparity regression formula]
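
A likely form of this formula, assuming the original image showed the soft-argmin expression from the PSMNet paper, is given below. Here c_d denotes the predicted cost for disparity d and D_max the maximum disparity; both symbols are assumptions about the notation.

```latex
\hat{d} = \sum_{d=0}^{D_{\max}} d \cdot \sigma(-c_d)
```

The minus sign turns a low cost into a high probability. The code listing above applies softmax to the network output directly, which amounts to treating that output as a matching score rather than a cost.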

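Softmax is a widely studied function and has the properties we want for training this network:

- It is differentiable, so it works with backpropagation.
- It is not discrete, so it produces smooth disparities.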

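Finally, training the network requires a loss function to back-propagate the prediction error. As is common in similar tasks, PSMNet uses the Smooth L1 loss rather than the L2 loss, because it is less sensitive to outliers. PSMNet's Smooth L1 loss is:

[Image: Smooth L1 loss formula]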
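
In its usual form, Smooth L1 applies 0.5·x² when |x| < 1 and |x| − 0.5 otherwise to each per-pixel error x, then averages over the N labeled pixels. The NumPy sketch below implements that definition. The validity mask is an assumption added here to handle pixels without ground truth, and the sample values are invented.

```python
import numpy as np

def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def smooth_l1_loss(pred, target, valid=None):
    """Mean Smooth L1 error, optionally restricted to pixels where valid is True."""
    diff = pred - target
    if valid is not None:
        diff = diff[valid]
    return smooth_l1(diff).mean()

pred = np.array([10.0, 12.5, 30.0, 7.0])
target = np.array([10.5, 12.5, 20.0, 0.0])   # 0 marks a pixel without ground truth here
valid = target > 0
loss = smooth_l1_loss(pred, target, valid)
print(loss)  # (0.125 + 0.0 + 9.5) / 3
```

The 10-pixel error contributes 9.5 instead of the 100 an L2 loss would assign, so a few badly wrong pixels do not dominate the gradient.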

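The regression module outputs the final disparity map for the input image pair, but this still has to be converted to a depth map. Depth is inversely proportional to disparity: depth = focal length × baseline / disparity.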
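
The conversion can be sketched as below, assuming a rectified pair with the focal length given in pixels and the baseline in meters. The function name and the camera values are hypothetical; real values come from your stereo calibration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to depth (meters): depth = f * B / d.
    Pixels with (near) zero disparity are mapped to infinity."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

depth = disparity_to_depth([[72.0, 36.0, 0.0]], focal_px=720.0, baseline_m=0.5)
print(depth)  # 5 m, 10 m, and inf for the zero-disparity pixel
```

Halving the disparity doubles the depth, so a fixed disparity error costs much more accuracy for distant objects than for near ones.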

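For the complete PSMNet training and testing code, see the official repository.

Below are some example inputs (left images) and their disparity maps.

[Figure: example left images and their disparity maps]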
