Python | OpenCV Dataset Organization and Validation

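📚 Preface

In the previous article, LabelImg Annotation Tool in Practice, we finished annotating the images.
Once annotation is done, the data still has to be arranged into the directory structure that model training expects and split into a training set and a validation set. The last step is to check the annotations programmatically.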

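✂️ Splitting into Training and Validation Sets

After annotation, the data must be divided into a training set and a validation set. There are two ways to do this:

Item	Option 1: record two videos (recommended)	Option 2: split by ratio
When to use	You are able to record again	Only one video is available
Scene diversity	✅ Training and validation sets come from different scenes	❌ Both sets come from the same scene and lighting
Generalization	Better; it tests whether the model has learned the object itself	More limited
How to do it	Run `collect_from_video.py` once per video, annotate each set separately, and save them to different directories	Run `split_dataset.py` below to split automatically

💡 The script for Option 2 assigns the first 80% of the images (in sorted filename order) to the training set and the remaining 20% to the validation set, copying each image together with its label file: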

# split_dataset.py
import os
import shutil

src_dir       = "dataset/toy_car"             # source directory with the annotated data (.jpg and .txt)
train_img_dir = "dataset/toy_car/images/train"
val_img_dir   = "dataset/toy_car/images/val"
train_lbl_dir = "dataset/toy_car/labels/train"
val_lbl_dir   = "dataset/toy_car/labels/val"

for d in [train_img_dir, val_img_dir, train_lbl_dir, val_lbl_dir]:
    os.makedirs(d, exist_ok=True)             # create the target directories (no error if they exist)

files = sorted(f for f in os.listdir(src_dir) if f.lower().endswith(".jpg"))
split = int(len(files) * 0.8)                # first 80% for training, last 20% for validation

for i, f in enumerate(files):
    img_dst = train_img_dir if i < split else val_img_dir
    lbl_dst = train_lbl_dir if i < split else val_lbl_dir
    shutil.copy(os.path.join(src_dir, f), os.path.join(img_dst, f))  # copy the image
    label_f = os.path.splitext(f)[0] + ".txt"                        # label shares the image's base name
    label_src = os.path.join(src_dir, label_f)
    if os.path.exists(label_src):
        shutil.copy(label_src, os.path.join(lbl_dst, label_f))        # copy the matching label

print(f"Training set: {split} images → {train_img_dir}")
print(f"Validation set: {len(files) - split} images → {val_img_dir}")

Figure: split_dataset.py splits the images and labels together into training and validation sets
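Before copying anything, it can help to preview how the split would fall. The following is a minimal sketch, not part of the original script: `plan_split` is a hypothetical helper that applies the same 80/20 rule to a list of filenames and returns the two lists without touching the disk. The frame names in the example are made up.

```python
def plan_split(files, train_ratio=0.8):
    """Return (train_files, val_files) using the same rule as split_dataset.py."""
    ordered = sorted(files)                    # same ordering as the script
    split = int(len(ordered) * train_ratio)    # index of the first validation file
    return ordered[:split], ordered[split:]


# Example with five hypothetical frame names
train, val = plan_split(
    ["00003.jpg", "00000.jpg", "00001.jpg", "00004.jpg", "00002.jpg"]
)
print(len(train), len(val))  # 4 1
```

Because the split follows sorted filename order, consecutive video frames stay together in one set, which matches the behavior of the script above.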

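🗃️ Dataset Directory Structure

Figure: comparison of the directory structures for classification and object detection tasks

⚠️ The directory structure and the way the data is organized depend on the task type:

🖼️ Classification (no annotation tool needed)

A classification task assigns a class to the whole image, so there are no bounding boxes to draw and LabelImg is not needed. Simply place each image in the folder for its class; the folder name is the label.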

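dataset/
├── train/          # training set (images sorted into folders by hand)
│   ├── cat/        # class folder; its name is the label
│   └── dog/
└── val/            # validation set (images sorted into folders by hand)
    ├── cat/
    └── dog/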

📝 Both torchvision's ImageFolder (for PyTorch) and Keras's image_dataset_from_directory can read this structure automatically.
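To make the "folder name is the label" convention concrete, here is a standard-library sketch that scans such a directory and counts the images per class. It assigns indices to class names in sorted order, which is an assumption chosen for illustration; `scan_class_folders` and the extension list are hypothetical, and the sketch does not reproduce how either library is implemented.

```python
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")  # assumed set of image extensions


def scan_class_folders(root):
    """Map each class folder under root to (class index, image count)."""
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    summary = {}
    for idx, name in enumerate(classes):
        folder = os.path.join(root, name)
        n_images = sum(
            1 for f in os.listdir(folder) if f.lower().endswith(IMAGE_EXTS)
        )
        summary[name] = (idx, n_images)
    return summary
```

Calling `scan_class_folders("dataset/train")` on the layout above would return one entry for `cat` and one for `dog`, which is a quick way to spot an empty or misnamed class folder before training.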

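🔍 Object Detection (YOLO annotation format)

Object detection requires drawing bounding boxes in LabelImg, which exports a .txt label file for each image; the data is then split with the split_dataset.py script above. Images and labels are stored in separate directories, and each label file has the same base name as its image: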

dataset/
├── images/
│   ├── train/      # training images (.jpg)
│   └── val/        # validation images (.jpg)
└── labels/
    ├── train/      # training labels (.txt), same base names as the images
    └── val/        # validation labels (.txt), same base names as the images

📝 For example, images/train/0001.jpg corresponds to labels/train/0001.txt.
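Since the pairing relies entirely on matching base names, a quick consistency check can catch files that lost their counterpart. The following is a sketch under that assumption; `find_unpaired` is a hypothetical helper, not part of the original scripts.

```python
import os


def find_unpaired(img_dir, lbl_dir):
    """Return (images without a label, labels without an image) as sorted base names."""
    img_stems = {
        os.path.splitext(f)[0]
        for f in os.listdir(img_dir)
        if f.lower().endswith(".jpg")
    }
    lbl_stems = {
        os.path.splitext(f)[0]
        for f in os.listdir(lbl_dir)
        if f.lower().endswith(".txt") and f != "classes.txt"
    }
    return sorted(img_stems - lbl_stems), sorted(lbl_stems - img_stems)
```

An image without a label is not necessarily an error, since a frame that contains no objects may legitimately have no label file. A label without an image, on the other hand, usually points to a copy or rename problem.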

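💻 Verifying the Annotations with Python

Once the data is annotated and split, the following program draws the annotation boxes on an image so you can confirm that they are correct.

Verifying YOLO-format annotations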

# verify_annotation.py
import os
import sys

import cv2

image_path = "dataset/toy_car/images/train/00000.jpg"
label_path = "dataset/toy_car/labels/train/00000.txt"

img = cv2.imread(image_path)
if img is None:
    sys.exit(f"❌ Cannot read image: {image_path}")
if not os.path.exists(label_path):
    sys.exit(f"❌ Label file not found: {label_path}")
h, w = img.shape[:2]

with open(label_path, "r") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) < 5:                       # skip blank or malformed lines
            continue
        class_id = int(parts[0])
        cx, cy, bw, bh = map(float, parts[1:5])

        # Convert normalized YOLO coordinates to pixel coordinates
        x1 = int((cx - bw / 2) * w)
        y1 = int((cy - bh / 2) * h)
        x2 = int((cx + bw / 2) * w)
        y2 = int((cy + bh / 2) * h)

        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        # Keep the class ID text inside the image when the box touches the top edge
        cv2.putText(img, str(class_id), (x1, max(y1 - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

# Shrink large images for display so the window fits on screen
max_display = 900
if max(h, w) > max_display:
    scale = max_display / max(h, w)
    display = cv2.resize(img, (int(w * scale), int(h * scale)))
else:
    display = img

cv2.imshow("Annotation Check", display)
cv2.waitKey(0)
cv2.destroyAllWindows()

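Figure: reading a YOLO-format label file and drawing the bounding boxes on the image to check that they are positioned correctly

Counting annotations in bulk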
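The coordinate conversion in the script is easier to follow in isolation. Below is a small sketch that wraps the same arithmetic in a function and works through one example; `yolo_to_pixel_box` is a hypothetical name, and the 640x480 image size is chosen only for illustration.

```python
def yolo_to_pixel_box(cx, cy, bw, bh, img_w, img_h):
    """Convert a normalized YOLO box (center x, center y, width, height) to pixel corners."""
    x1 = int((cx - bw / 2) * img_w)   # left edge
    y1 = int((cy - bh / 2) * img_h)   # top edge
    x2 = int((cx + bw / 2) * img_w)   # right edge
    y2 = int((cy + bh / 2) * img_h)   # bottom edge
    return x1, y1, x2, y2


# A box centered in a 640x480 image that covers half of each dimension
print(yolo_to_pixel_box(0.5, 0.5, 0.5, 0.5, 640, 480))  # (160, 120, 480, 360)
```

The center sits at (320, 240) and the box is 320 pixels wide and 240 pixels tall, so the corners fall a quarter of the image in from each edge.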

# count_labels.py
import os

label_dir = "dataset/toy_car/labels/train"  # label directory to scan
class_count = {}                             # annotation count per class_id

for fname in os.listdir(label_dir):
    if not fname.endswith(".txt") or fname == "classes.txt":  # skip non-label files
        continue
    with open(os.path.join(label_dir, fname), "r") as f:
        for line in f:
            parts = line.split()
            if not parts:                    # skip blank lines
                continue
            class_id = int(parts[0])         # the first field on each line is the class_id
            class_count[class_id] = class_count.get(class_id, 0) + 1

for cls, count in sorted(class_count.items()):
    print(f"Class {cls}: {count} annotations")

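Figure: scanning the label directory and printing the number of annotations per class

💡 The class name for each class ID is recorded in the classes.txt file generated by LabelImg (stored in the same directory as the images), so you can look it up there.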
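The lookup can also be done in code. This sketch assumes that classes.txt lists one class name per line and that the zero-based line index is the class ID; `load_class_names` and `name_for` are hypothetical helpers, not part of the original scripts.

```python
def load_class_names(classes_path):
    """Read classes.txt, assuming one class name per line (line index = class ID)."""
    with open(classes_path, "r", encoding="utf-8") as f:
        names = [line.strip() for line in f]
    while names and not names[-1]:   # drop trailing blank lines only
        names.pop()
    return names


def name_for(class_id, names):
    """Return the class name, or a placeholder when the ID is out of range."""
    if 0 <= class_id < len(names):
        return names[class_id]
    return f"unknown({class_id})"
```

With the names loaded, the counting script could print `name_for(cls, names)` in place of the raw class ID. An `unknown(...)` result suggests that a label file refers to a class missing from classes.txt.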

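⚠️ Notes

Keep training and validation sets separate: the two must never overlap, and the validation set should ideally come from a different scene or a different recording session.
Class balance: when class counts differ greatly, the trained model tends to favor the classes with more samples; collecting more data or applying data augmentation can help.
Back up regularly: annotations are time-consuming manual work, so keep versioned backups.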
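The first point can be partly checked automatically. The sketch below compares filenames between two directories; `overlapping_files` is a hypothetical helper, and it only detects files with identical names, not near-duplicate frames from the same scene.

```python
import os


def overlapping_files(train_dir, val_dir):
    """Return the sorted filenames that appear in both directories."""
    return sorted(set(os.listdir(train_dir)) & set(os.listdir(val_dir)))
```

An empty list from `overlapping_files("dataset/toy_car/images/train", "dataset/toy_car/images/val")` means no filename is shared between the two sets. Visually similar frames still need a manual look.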

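🎯 Conclusion

With annotation, splitting, and verification complete, the data preparation workflow is finished.
The next article, Model Selection and Training, feeds the annotated data into a model for training.

📖 If you run into questions along the way or want to explore related topics, revisit the Python | OpenCV Series Guide for the full table of contents, which makes it easy to find what you need.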

Note: this article draws on:
LabelImg GitHub


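This article was written by J.J. Huang and is licensed under CC BY 3.0 TW. It may be freely reposted and quoted, provided the author is credited and the source is cited.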

J.J.'s Blogs

J.J. Huang 2026-03-25 Python OpenCV 07. Object Detection and Recognition
