[Beginner's Deep-Dive Tutorial 1.8] A Step-by-Step Guide to Monocular Depth Estimation with Depth Anything V2 and Mapping It to a 3D Point Cloud (with Python Code)

Article link: https://blog.csdn.net/2401_87064292/article/details/141940141

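In an earlier chapter, we showed how to use stereo matching for binocular depth estimation and 3D point cloud generation:

[Beginner's Deep-Dive Tutorial 1.5] A Step-by-Step Guide to Binocular Depth Estimation with Stereo Matching and 3D Point Cloud Generation (with a Python Code Walkthrough)

However, capturing stereo image pairs requires dedicated hardware, which is often unavailable in practice.

So this time we use monocular depth estimation to predict depth from a single image and convert the scene into a 3D point cloud. The Python code is at the end.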

The final result looks like this:

[Image]

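1. Introduction to Monocular Depth Estimation

Monocular depth estimation is the task of predicting scene depth from a single RGB image. Unlike binocular stereo vision, it relies on one viewpoint only, which makes the task harder because there are no direct geometric constraints.

1.1 Main Challenges

Lack of geometric information: without stereo depth cues, object distance is hard to infer directly from a single image.
Texture and lighting variation: texture, lighting changes, shadows, and similar factors affect the accuracy of depth estimation.
Ambiguity: the same image can correspond to many different depth interpretations, which leads to uncertainty.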
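The ambiguity in the last point can be made concrete with a pinhole camera: scaling a whole scene, including its distance, by any factor leaves the projected pixels unchanged. The following is a minimal sketch assuming an ideal pinhole model. The focal length, principal point, and the `project` helper are made-up values and names for illustration only.

```python
def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Project a 3D point (X, Y, Z) in camera coordinates to a pixel (u, v).

    focal, cx and cy are illustrative intrinsics, not values from the article.
    """
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


# A point 2 m away, and the same point with every coordinate scaled by 4
near_point = (0.2, -0.1, 2.0)
scale = 4.0
far_point = tuple(scale * c for c in near_point)

pixel_near = project(near_point)
pixel_far = project(far_point)

# Both points land on the same pixel, so a single image cannot tell them apart
print(pixel_near, pixel_far)
```

This is why a monocular network can only recover depth up to an unknown scale unless it learns additional priors about object sizes.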

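1.2 Classic Works

Eigen et al., 2014

Paper: Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
Contribution: One of the first works to tackle monocular depth estimation with deep learning. It proposes a multi-scale network that produces the depth map through coarse-to-fine refinement.
Method: Two convolutional network modules are used: one predicts a coarse global depth map, and the other refines it.

Laina et al., 2016

Paper: Deeper Depth Prediction with Fully Convolutional Residual Networks
Contribution: Introduces residual networks (ResNet) to improve depth estimation accuracy, using an improved fully convolutional network (FCN) architecture.
Method: Fully convolutional layers and residual connections increase the depth and capacity of the network and raise the resolution of the predicted depth map.

Godard et al., 2017

Paper: Unsupervised Monocular Depth Estimation with Left-Right Consistency
Contribution: Proposes an unsupervised method that builds a left-right consistency constraint from disparity maps to optimize depth estimation.
Method: The network is trained with the consistency constraint between left and right images, so no ground-truth depth is needed as supervision.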
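Methods of this kind predict disparity rather than depth directly. For a rectified stereo pair the two are related by depth = focal length × baseline / disparity. Below is a small sketch of that conversion. The focal length and baseline are illustrative values, not taken from the paper, and `disparity_to_depth` is a hypothetical helper name.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (meters) for a rectified stereo pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Larger disparity means the point is closer to the camera
depths = [
    disparity_to_depth(d, focal_px=720.0, baseline_m=0.54)
    for d in (8.0, 16.0, 64.0)
]
print(depths)
```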

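Fu et al., 2018

Paper: Deep Ordinal Regression Network for Monocular Depth Estimation
Contribution: Proposes an ordinal regression network that treats depth estimation as an ordinal regression problem over discretized, ordered depth intervals instead of direct regression, which improves the fine-grained quality of the estimate.
Method: The network predicts ordered depth classes rather than raw values, which improves the consistency and stability of the estimated depth.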
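To illustrate what "ordered depth intervals" can look like, here is a sketch of bin edges spaced uniformly in log depth, so that bins grow wider with distance, plus a function that maps a depth value to its bin index. The depth range, the bin count, and both helper names are assumptions made for illustration. This is not the paper's implementation.

```python
import math


def log_spaced_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges spaced uniformly in log depth."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    return [
        math.exp(log_min + (log_max - log_min) * i / num_bins)
        for i in range(num_bins + 1)
    ]


def ordinal_label(depth, edges):
    """Index of the bin containing depth, clamped to [0, num_bins - 1]."""
    inner_edges = edges[1:-1]
    return sum(1 for edge in inner_edges if depth >= edge)


edges = log_spaced_edges(1.0, 80.0, num_bins=8)
labels = [ordinal_label(d, edges) for d in (1.5, 10.0, 79.0)]
print(edges)
print(labels)
```

Because the labels are ordered, predicting bin 6 when the truth is bin 7 is a smaller error than predicting bin 0, which is the property an ordinal loss exploits.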

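Ranftl et al., 2021

Paper: Vision Transformers for Dense Prediction
Contribution: Uses a Vision Transformer (ViT) as the backbone in place of a conventional convolutional network, substantially improving depth estimation performance.
Method: Exploits the transformer's strength at capturing global context to better handle depth estimation in complex scenes.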

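2. Depth Anything V2

2.1 The Predecessor: Depth Anything

Depth Anything is a recent solution for robust monocular depth estimation. Its goal is to build a simple yet powerful foundation model that handles any image under any circumstances.

To that end, the authors design a data engine that collects and automatically annotates large-scale unlabeled data (~62M images). This greatly widens data coverage and thereby reduces generalization error.

The authors investigate two simple yet effective strategies that make scaling up the data promising. First, data augmentation tools are used to create a more challenging optimization target, which forces the model to actively seek extra visual knowledge and learn robust representations. Second, an auxiliary supervision is added that makes the model inherit rich semantic priors from a pre-trained encoder.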

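2.2 Improvements in Depth Anything V2

Compared with V1, this version produces finer and more robust depth predictions through three key practices:

Replacing all labeled real images with synthetic images;
Scaling up the capacity of the teacher model;
Teaching student models via the bridge of large-scale pseudo-labeled real images.

Synthetic images have the following advantages:

All fine details are labeled correctly;
The actual depth of challenging transparent objects and reflective surfaces can be obtained.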

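3. Predicting Depth with Depth Anything V2

We reuse the data from the earlier chapter, which can be downloaded there:

[Beginner's Deep-Dive Tutorial 1.5] A Step-by-Step Guide to Binocular Depth Estimation with Stereo Matching and 3D Point Cloud Generation (with a Python Code Walkthrough)

We estimate the depth of this image:

[Image]

First install Depth Anything V2, following the official repository: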

git clone https://github.com/DepthAnything/Depth-Anything-V2
cd Depth-Anything-V2
pip install -r requirements.txt

Then download the pre-trained weights from https://github.com/DepthAnything/Depth-Anything-V2#pre-trained-models and put them in the checkpoints directory.

Then run the following code to predict depth:

import cv2
import numpy as np
import torch

from depth_anything_v2.dpt import DepthAnythingV2

# Pick the best available device: CUDA GPU, Apple MPS, or CPU
DEVICE = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    'vitg': {'encoder': 'vitg', 'features': 384, 'out_channels': [1536, 1536, 1536, 1536]}
}

encoder = 'vitl' # or 'vits', 'vitb', 'vitg'

model = DepthAnythingV2(**model_configs[encoder])
model.load_state_dict(torch.load(f'checkpoints/depth_anything_v2_{encoder}.pth', map_location='cpu'))
model = model.to(DEVICE).eval()

raw_img = cv2.imread('your/image/path') # change this to your own image path
depth = model.infer_image(raw_img) # HxW raw depth map, already a NumPy array
np.save("monocular_depth.npy", depth) # save the depth map to disk
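To sanity-check the saved prediction, it helps to view it as an image. The following is a minimal sketch that min-max normalizes a float depth map to 8-bit gray levels. It assumes a 2D float array like the one `infer_image` returns. A synthetic ramp stands in for `monocular_depth.npy` so the snippet runs on its own, and `to_uint8` is a hypothetical helper name.

```python
import numpy as np


def to_uint8(depth):
    """Min-max normalize a float depth map to the 0-255 range."""
    depth = depth.astype(np.float32)
    span = depth.max() - depth.min()
    if span == 0:
        # A constant map has no contrast; return all zeros instead of dividing by 0
        return np.zeros(depth.shape, dtype=np.uint8)
    return ((depth - depth.min()) / span * 255.0).astype(np.uint8)


# Synthetic stand-in for np.load("monocular_depth.npy")
depth = np.linspace(0.0, 10.0, 12, dtype=np.float32).reshape(3, 4)
gray = to_uint8(depth)
print(gray)
```

With the real prediction, the resulting array can be written to disk with `cv2.imwrite` or colorized first with `cv2.applyColorMap(gray, cv2.COLORMAP_JET)`.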

4. Mapping Monocular Depth to a 3D Point Cloud

disparitymap and the other helper functions used here can be found in the earlier chapter.
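For readers who do not have the earlier chapter at hand, the sketch below shows roughly what back-projecting a disparity map to colored 3D points and writing an ASCII PLY file involves. These are hypothetical stand-ins, not the author's `depth_map` and `create_output`. The focal length, the unit baseline, and the principal point at the image center are all assumptions.

```python
import numpy as np


def backproject(disp, colors, focal_px, baseline=1.0):
    """Turn a disparity map into an (N, 6) array of XYZRGB points.

    Assumes a pinhole camera with the principal point at the image center.
    """
    h, w = disp.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disp > 0                      # skip pixels with no disparity
    z = focal_px * baseline / disp[valid]
    x = (u[valid] - w / 2.0) * z / focal_px
    y = (v[valid] - h / 2.0) * z / focal_px
    return np.column_stack([x, y, z, colors[valid]])


def write_ply(points, path):
    """Write XYZRGB points as an ASCII PLY file."""
    header = "\n".join([
        "ply", "format ascii 1.0", f"element vertex {len(points)}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for x, y, z, r, g, b in points:
            f.write(f"{x:.4f} {y:.4f} {z:.4f} {int(r)} {int(g)} {int(b)}\n")


# Tiny 2x2 example: one pixel has zero disparity and is dropped
disp = np.array([[10.0, 20.0], [0.0, 40.0]])
colors = np.full((2, 2, 3), 128, dtype=np.uint8)  # assumed to be RGB
points = backproject(disp, colors, focal_px=100.0)
print(points.shape)
```

Calling `write_ply(points, "example.ply")` would then produce a file that Open3D can read. Note that images loaded with `cv2.imread` are in BGR order and would need converting to RGB first.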

import cv2
import numpy as np
from depth import depth_map
from configs import img_path1
from image import downsample_image, create_output


def main():
    # Load the left image; a downsample factor of 1 keeps the original size
    img = cv2.imread(img_path1, 1)
    img = downsample_image(img, 1)

    # Stereo disparity from the earlier chapter, used only as a reference range
    Map = np.load("disp/demo_middle/left.png.npy")
    Map = downsample_image(Map, 1) // 2

    # Resize the monocular prediction to the image size
    H, W = img.shape[:2]
    disp = np.load("monocular_depth.npy")
    disp = cv2.resize(disp, (W, H))

    # Min-max normalize to [0, 1], then stretch to the stereo disparity range
    min_m, max_m = Map.min(), Map.max()
    disp = (disp - disp.min()) / (disp.max() - disp.min())
    disp = disp * (max_m - min_m) + min_m

    # If you have no stereo result, comment out the rescaling above and
    # set the scale by hand instead, e.g. scale=25 (try other values)
    # disp = disp * 25

    coordinates = depth_map(disp, img)
    print('\n Creating the output file... \n')
    create_output(coordinates, 'praxis_mono.ply')


if __name__ == "__main__":
    main()
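The rescaling step in the script is needed because the monocular prediction is relative and has no physical unit, so it is stretched linearly onto the range of the stereo disparity. Here is a self-contained sketch of that step on toy arrays. `match_range` is a hypothetical helper name, and the guard for a constant prediction is an addition that the script above does not have.

```python
import numpy as np


def match_range(pred, reference):
    """Linearly stretch pred so its min and max equal those of reference."""
    pred = pred.astype(np.float64)
    span = pred.max() - pred.min()
    if span == 0:
        # Constant prediction: avoid division by zero
        return np.full(pred.shape, float(reference.min()))
    unit = (pred - pred.min()) / span     # now in [0, 1]
    return unit * (reference.max() - reference.min()) + reference.min()


pred = np.array([[0.1, 0.4], [0.7, 1.0]])            # relative, unitless prediction
reference = np.array([[12.0, 30.0], [48.0, 60.0]])   # e.g. stereo disparity in pixels
aligned = match_range(pred, reference)
print(aligned)
```

Because only the two extremes are matched, a single outlier pixel in either map shifts the whole result. Clipping both maps to percentiles before matching is one way to reduce that sensitivity.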

5. Visualizing and Interacting with the Point Cloud

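import open3d as o3d

# Read the .ply file written in the previous step
ply_file_path = "praxis_mono.ply"  # replace with the path to your own .ply file
point_cloud = o3d.io.read_point_cloud(ply_file_path)

# Create the visualization window
vis = o3d.visualization.Visualizer()
vis.create_window()

# Add the point cloud to the window
vis.add_geometry(point_cloud)

# Get the render options and adjust the point size
render_option = vis.get_render_option()
render_option.point_size = 2.5  # the default is usually 5.0; smaller values give smaller points

# Start the interactive viewer
vis.run()

# Destroy the window
vis.destroy_window()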

