Building an AI Image Dataset with HDF5 – 羔羊的實驗紀錄簿

Introduction

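Building a dataset usually means saving out a large number of images. To use disk space efficiently and to make the files easier to move around, I pack the images into a single file in HDF5 format. NumPy's npy or npz binary files, or some other approach, would also work, and I may try them when I get the chance; for now HDF5 looks sufficient for my needs.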
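For comparison, the npz route mentioned above could look roughly like the following. This is only a minimal sketch: the arrays are synthetic stand-ins for real images, and the file name `dataset.npz` is a hypothetical choice, not something from the original program.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in data (assumption): 8 grayscale 96x96 images, two classes.
images = np.zeros((8, 96, 96), dtype=np.uint8)
labels = np.array([[1, 0]] * 4 + [[0, 1]] * 4, dtype=np.uint8)

# Hypothetical output location inside a temporary directory.
npz_path = os.path.join(tempfile.mkdtemp(), "dataset.npz")

# Store both arrays in one compressed archive, keyed by name.
np.savez_compressed(npz_path, images=images, labels=labels)

# Load them back by the same key names.
with np.load(npz_path) as data:
    loaded_images = data["images"]
    loaded_labels = data["labels"]
```

With npz, each array is read in full when its key is accessed, which is fine for small datasets.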

Code

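The original program was written at work, so I modified it a bit before uploading it here. I'm not sure whether it can be copy-pasted and run as-is XD.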

Writing

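The line with h5py.File(h5file,'w') as h5f: opens an h5 file; all write operations then happen inside the with block.
h5f.create_dataset creates a dataset. It takes the dataset key name, the dataset shape, and the data type. Here I create two datasets, images and labels.
Each image is read with OpenCV, resized and normalized, then assigned into img_ds.
Only two classes are assumed here, so the one-hot encoding is written straight into lbl_ds.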

import glob
import os
import cv2
import h5py
import numpy as np
IMG_WIDTH = 96
IMG_HEIGHT = 96

LBL_NUM = 2

save_path = "."  # assumption: output directory, change as needed
h5file = os.path.join(save_path, "dataset.h5")

im_0_path = glob.glob("d:\\image\\0\\*.png")
im_1_path = glob.glob("d:\\image\\1\\*.png")
nfiles_0 = len(im_0_path)
nfiles_1 = len(im_1_path)
print(nfiles_0)
print(nfiles_1)

# 1. Read each image as grayscale, resize it, and store it in the "images" dataset
# 2. Store the ground-truth label of each image in the "labels" dataset
with h5py.File(h5file, 'w') as h5f:
    img_ds = h5f.create_dataset("images", shape=(nfiles_0+nfiles_1, IMG_HEIGHT, IMG_WIDTH), dtype=float)
    lbl_ds = h5f.create_dataset("labels", shape=(nfiles_0+nfiles_1, LBL_NUM), dtype=float)

    # Class 0 fills rows [0, nfiles_0)
    step_0 = max(1, nfiles_0 // 10)  # integer step, so about ten progress marks are printed
    for cnt, ifile in enumerate(im_0_path):
        if cnt % step_0 == 0:
            print("=", end="")
        img = cv2.imread(ifile, 0)  # flag 0 = read as grayscale
        img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))  # dsize is (width, height)
        img_ds[cnt] = img_resize / 255.  # scale pixel values to [0, 1]
        lbl_ds[cnt] = np.array([1, 0])   # one-hot label for class 0

    # Class 1 fills rows [nfiles_0, nfiles_0 + nfiles_1)
    step_1 = max(1, nfiles_1 // 10)
    for cnt, ifile in enumerate(im_1_path):
        if cnt % step_1 == 0:
            print("=", end="")
        img = cv2.imread(ifile, 0)
        img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))
        img_ds[nfiles_0+cnt] = img_resize / 255.
        lbl_ds[nfiles_0+cnt] = np.array([0, 1])  # one-hot label for class 1
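
Since dtype=float stores every pixel as a 64-bit float, one possible variation on the write step is to keep the pixels as uint8, turn on gzip compression, and normalize when reading instead. The following is only a sketch under assumptions: the images are synthetic random arrays standing in for the PNG files, and `dataset_compact.h5`, `class_ids`, and the chunk shape are illustrative choices, not part of the original program.

```python
import os
import tempfile

import h5py
import numpy as np

IMG_HEIGHT, IMG_WIDTH = 96, 96
LBL_NUM = 2

# Synthetic stand-in data (assumption): 20 grayscale images, 10 per class.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(20, IMG_HEIGHT, IMG_WIDTH), dtype=np.uint8)
class_ids = np.array([0] * 10 + [1] * 10)

# One-hot rows picked from an identity matrix: class 0 -> [1, 0], class 1 -> [0, 1].
one_hot = np.eye(LBL_NUM, dtype=np.uint8)[class_ids]

# Hypothetical output location inside a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "dataset_compact.h5")

with h5py.File(demo_path, "w") as h5f:
    # uint8 keeps one byte per pixel; gzip compression is applied chunk by chunk,
    # and one image per chunk keeps single-image reads cheap.
    img_ds = h5f.create_dataset(
        "images",
        shape=images.shape,
        dtype=np.uint8,
        chunks=(1, IMG_HEIGHT, IMG_WIDTH),
        compression="gzip",
    )
    lbl_ds = h5f.create_dataset("labels", shape=one_hot.shape, dtype=np.uint8)
    img_ds[...] = images
    lbl_ds[...] = one_hot

with h5py.File(demo_path, "r") as h5f:
    # Normalize on read instead of on write.
    first = h5f["images"][0].astype(np.float32) / 255.0
    stored_dtype = h5f["images"].dtype
    stored_compression = h5f["images"].compression
    first_label = h5f["labels"][0]
```

How much space this saves depends on the image content; random noise like the stand-in data here compresses poorly, while real images usually do better.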

Reading

First, open the file with with h5py.File(h5file, "r") as f:
Inside the with block, fetch images and labels (each dataset is looked up by its key name).

h5file = os.path.join(save_path, "dataset.h5")

with h5py.File(h5file, "r") as f:
    # List all top-level keys (here they are datasets, not groups)
    print("Keys: %s" % list(f.keys()))

    # Get images by key name; [:] loads the whole dataset into a NumPy array
    images = f["images"][:]
    print("\nimages dataset")
    print(type(images))
    print(images.dtype)
    print(images.shape)

    # Get labels by key name
    labels = f["labels"][:]
    print("\nlabels dataset")
    print(type(labels))
    print(labels.dtype)
    print(labels.shape)
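
When the dataset is too large to load at once, an h5py dataset can also be sliced so that only the requested rows are read from disk. The following is a minimal sketch of batch-wise reading, assuming the same images/labels layout as above; the throwaway file `dataset_demo.h5`, its all-zero contents, and `BATCH_SIZE` are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file (assumption: same "images"/"labels" layout
# as in the post, filled with synthetic values).
demo_path = os.path.join(tempfile.mkdtemp(), "dataset_demo.h5")
with h5py.File(demo_path, "w") as h5f:
    h5f.create_dataset("images", data=np.zeros((50, 96, 96), dtype=float))
    h5f.create_dataset("labels", data=np.tile([1.0, 0.0], (50, 1)))

BATCH_SIZE = 16  # hypothetical batch size

batch_shapes = []
with h5py.File(demo_path, "r") as f:
    img_ds = f["images"]  # a dataset handle; no pixel data is loaded yet
    n_samples = img_ds.shape[0]
    for start in range(0, n_samples, BATCH_SIZE):
        stop = min(start + BATCH_SIZE, n_samples)
        # Slicing reads only rows [start, stop) into a NumPy array.
        batch_images = img_ds[start:stop]
        batch_labels = f["labels"][start:stop]
        batch_shapes.append((batch_images.shape, batch_labels.shape))
```

The slices must be taken while the file is still open, so any per-batch work belongs inside the with block.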

Conclusion

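Using h5py feels much like ordinary file reading and writing, which makes it quite convenient. Writing data works like assigning to a NumPy array, so it is easy to pick up, and the with block lets the data processing live alongside the writes. Once the file is built, only a single file needs to be read, and reading is fast. Most importantly, moving the data around becomes much easier: large datasets easily start at tens of thousands of images, so handling one file beats handling all of them individually.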