pytorch - GPU

Lecture 24

Dr. Colin Rundel

CUDA

CUDA (or Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements for the execution of compute kernels.

Core libraries:

cuBLAS
cuSOLVER
cuSPARSE
cuFFT
cuTENSOR
cuRAND
Thrust
cuDNN

CUDA Kernels

// Kernel - Adding two matrices MatA and MatB
__global__ void MatAdd(float MatA[N][N], float MatB[N][N], float MatC[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        MatC[i][j] = MatA[i][j] + MatB[i][j];
}
 
int main()
{
    ...
    // Matrix addition kernel launch from host code
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(
        (N + threadsPerBlock.x - 1) / threadsPerBlock.x,
        (N + threadsPerBlock.y - 1) / threadsPerBlock.y
    );
    
    MatAdd<<<numBlocks, threadsPerBlock>>>(MatA, MatB, MatC);
    ...
}
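
The launch configuration above rounds the number of blocks up so that every matrix element is covered by at least one thread, which is why the kernel needs the `i < N && j < N` guard. A minimal Python sketch of that arithmetic, assuming `N = 100` purely for illustration (`num_blocks` is a hypothetical helper, not part of CUDA):

```python
def num_blocks(n, threads_per_block):
    """Ceiling division: smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block

N = 100  # assumed matrix dimension, for illustration only
blocks = num_blocks(N, 16)
threads = blocks * 16

# Blocks per dimension, threads per dimension, threads with no element to process
print(blocks, threads, threads - N)
```

With these values each dimension gets 7 blocks (112 threads), so 12 threads per dimension fall outside the matrix and are skipped by the bounds check.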

GPU Status

nvidia-smi

Wed Apr 12 10:32:48 2023      
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB            Off| 00000000:02:00.0 Off |                    0 |
| N/A   39C    P0               31W / 250W|   1002MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB            Off| 00000000:03:00.0 Off |                    0 |
| N/A   39C    P0               27W / 250W|      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                        
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1825870      C   /usr/lib/rstudio-server/bin/rsession       1000MiB |
+---------------------------------------------------------------------------------------+

Torch GPU Information

torch.cuda.is_available()

True

torch.cuda.device_count()

2

torch.cuda.get_device_name("cuda:0")

'Tesla P100-PCIE-16GB'

torch.cuda.get_device_name("cuda:1")

'Tesla P100-PCIE-16GB'

torch.cuda.get_device_properties(0)

_CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16276MB, multi_processor_count=56)

torch.cuda.get_device_properties(1)

_CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16276MB, multi_processor_count=56)

GPU Tensors

Whether an operation runs on the GPU is determined by where its tensors live: to compute on the GPU, allocate the tensors on (or copy them to) a GPU device.

cpu = torch.device('cpu')
cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')

x = torch.linspace(0,1,5, device=cuda0); x

tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000], device='cuda:0')

y = torch.randn(5,2, device=cuda0); y

tensor([[ 0.6879, -2.3114],
        [ 0.5199,  0.5865],
        [-0.5277,  0.9261],
        [-0.4613, -0.7858],
        [-1.8057,  1.2171]], device='cuda:0')

z = torch.rand(2,3, device=cpu); z

tensor([[0.8585, 0.9918, 0.5125],
        [0.3261, 0.3992, 0.7753]])

x @ y

tensor([-2.2856,  1.2374], device='cuda:0')

y @ z

Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

y @ z.to(cuda0)

tensor([[-0.1631, -0.2403, -1.4393],
        [ 0.6375,  0.7497,  0.7211],
        [-0.1511, -0.1537,  0.4475],
        [-0.6523, -0.7712, -0.8456],
        [-1.1534, -1.3051,  0.0182]], device='cuda:0')
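
The examples above hard-code `cuda:0`, which fails on a machine without a GPU. A common pattern is to pick the device once and use it everywhere; the sketch below assumes only that falling back to the CPU is acceptable, and the variable name `device` is illustrative:

```python
import torch

# Fall back to the CPU when no CUDA device is present, so the script runs anywhere.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

y = torch.randn(5, 2, device=device)   # allocated directly on `device`
z = torch.rand(2, 3)                   # allocated on the CPU by default

# Both operands must be on the same device before they can be multiplied.
out = y @ z.to(device)
print(out.shape, out.device)
```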

NN Layers + GPU

NN layers (parameters) also need to be assigned to the GPU to be used with GPU tensors,

nn = torch.nn.Linear(5,5)
X = torch.randn(10,5).cuda()

nn(X)

Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

nn.cuda()(X)

tensor([[ 0.0588, -0.4098, -0.6487,  0.4108,  0.0322],
        [-0.0797, -0.4688, -1.7741,  0.2954, -0.0537],
        [-0.2057, -0.4174, -1.3236, -0.0518, -0.4419],
        [ 0.7228, -0.2675,  1.3836, -0.4886,  0.3568],
        [-0.9248,  0.1247, -0.1394, -0.5943,  0.1210],
        [-1.0062, -0.1228, -0.5563,  0.8400,  0.8503],
        [-0.6758, -0.1711,  0.6801,  0.5007,  0.7845],
        [-0.1415, -0.0750, -0.1112,  0.2573,  0.8202],
        [-0.6765,  0.0109,  1.0709, -0.6299, -0.1831],
        [-1.4872, -0.0788,  0.9389,  0.2557,  0.1959]], device='cuda:0',
       grad_fn=<AddmmBackward0>)

nn.to(device="cuda")(X)

tensor([[ 0.0588, -0.4098, -0.6487,  0.4108,  0.0322],
        [-0.0797, -0.4688, -1.7741,  0.2954, -0.0537],
        [-0.2057, -0.4174, -1.3236, -0.0518, -0.4419],
        [ 0.7228, -0.2675,  1.3836, -0.4886,  0.3568],
        [-0.9248,  0.1247, -0.1394, -0.5943,  0.1210],
        [-1.0062, -0.1228, -0.5563,  0.8400,  0.8503],
        [-0.6758, -0.1711,  0.6801,  0.5007,  0.7845],
        [-0.1415, -0.0750, -0.1112,  0.2573,  0.8202],
        [-0.6765,  0.0109,  1.0709, -0.6299, -0.1831],
        [-1.4872, -0.0788,  0.9389,  0.2557,  0.1959]], device='cuda:0',
       grad_fn=<AddmmBackward0>)
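
One detail worth noting from the calls above: `Module.to()` and `Module.cuda()` modify the module in place and return the module itself, while `Tensor.to()` returns a new tensor and leaves the original where it was. The sketch below uses a dtype conversion as a stand-in for a device move so that it runs without a GPU; the same in-place versus copy behaviour applies to devices.

```python
import torch

layer = torch.nn.Linear(5, 5)
x = torch.randn(10, 5)

# Module.to() converts the module's parameters in place and returns the module.
same = layer.to(torch.float64)
print(same is layer, layer.weight.dtype)

# Tensor.to() leaves the original untouched and returns a converted copy.
x64 = x.to(torch.float64)
print(x.dtype, x64.dtype)

out = layer(x64)
```

This is why the layer stays on the GPU after `nn.cuda()`, whereas `z.to(cuda0)` in the earlier example leaves `z` itself on the CPU.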

Back to MNIST

Same MNIST data from last time (1x8x8 images),

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=1234
)

X_train = torch.from_numpy(X_train).float()
y_train = torch.from_numpy(y_train)
X_test = torch.from_numpy(X_test).float()
y_test = torch.from_numpy(y_test)

To use the GPU for computation we need to copy these tensors to the GPU,

X_train_cuda = X_train.to(device=cuda0)
y_train_cuda = y_train.to(device=cuda0)
X_test_cuda = X_test.to(device=cuda0)
y_test_cuda = y_test.to(device=cuda0)

Convolutional NN

class mnist_conv_model(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = torch.device(device)
        
        self.model = torch.nn.Sequential(
          torch.nn.Unflatten(1, (1,8,8)),
          torch.nn.Conv2d(
            in_channels=1, out_channels=8,
            kernel_size=3, stride=1, padding=1
          ),
          torch.nn.ReLU(),
          torch.nn.MaxPool2d(kernel_size=2),
          torch.nn.Flatten(),
          torch.nn.Linear(8 * 4 * 4, 10)
        ).to(device=self.device)
        
    def forward(self, X):
        return self.model(X)
    
    def fit(self, X, y, lr=0.001, n=1000, acc_step=10):
        # Full-batch gradient descent: every step uses all of X and y
        opt = torch.optim.SGD(self.parameters(), lr=lr, momentum=0.9)
        loss_fn = torch.nn.CrossEntropyLoss()
        losses = []
        for i in range(n):
            opt.zero_grad()
            loss = loss_fn(self(X), y)
            loss.backward()
            opt.step()
            losses.append(loss.item())

        return losses

    def accuracy(self, X, y):
        # Predicted class is the index of the largest logit in each row
        val, pred = torch.max(self(X), dim=1)
        return (pred == y).sum() / len(y)
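
The input size of the final `Linear` layer, `8 * 4 * 4`, follows from how the convolution and pooling layers change the image size. A short sketch of that arithmetic using the standard output-size formula; `conv_out` is a hypothetical helper written for this illustration:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

side = 8                                                # 8x8 input image
side = conv_out(side, kernel=3, stride=1, padding=1)    # Conv2d keeps the side at 8
side = conv_out(side, kernel=2, stride=2)               # MaxPool2d(2) halves it to 4
features = 8 * side * side                              # 8 output channels

print(side, features)
```

This prints `4 128`, matching `torch.nn.Linear(8 * 4 * 4, 10)` in the model.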

CPU vs CUDA

m = mnist_conv_model(device="cpu")
loss = m.fit(X_train, y_train, n=1000)
loss[-1]

0.034776557236909866

m.accuracy(X_test, y_test)

tensor(0.9750)

m_cuda = mnist_conv_model(device="cuda")
loss = m_cuda.fit(X_train_cuda, y_train_cuda, n=1000)
loss[-1]

0.03830884024500847

m_cuda.accuracy(X_test_cuda, y_test_cuda)

tensor(0.9750, device='cuda:0')

Performance

CPU performance:

m = mnist_conv_model(device="cpu")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = m.fit(X_train, y_train, n=1000)
end.record()

torch.cuda.synchronize()
print(start.elapsed_time(end) / 1000)

2.75747021484375

GPU performance:

m_cuda = mnist_conv_model(device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = m_cuda.fit(X_train_cuda, y_train_cuda, n=1000)
end.record()

torch.cuda.synchronize()
print(start.elapsed_time(end) / 1000)

2.358865234375
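
CUDA events are designed for timing work queued on a GPU stream. For the CPU model a host-side clock is a reasonable alternative. The following is a minimal sketch, not the approach used in these slides; `timed` and its `sync` argument are hypothetical names:

```python
import time

def timed(fn, sync=None):
    """Return (result, seconds) for one call of fn, measured on the host clock.

    `sync` is an optional callable run before each clock reading; for GPU work
    one would pass torch.cuda.synchronize so that queued kernels have finished.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start

# CPU-only usage with a stand-in workload
result, seconds = timed(lambda: sum(i * i for i in range(100_000)))
print(f"{seconds:.4f} s")
```

For the GPU model the call would look like `timed(lambda: m_cuda.fit(X_train_cuda, y_train_cuda, n=1000), sync=torch.cuda.synchronize)`.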

Profiling CPU - 1 forward step

m = mnist_conv_model(device="cpu")
with torch.autograd.profiler.profile(with_stack=True, profile_memory=True) as prof_cpu:
    tmp = m(X_train)

print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::mkldnn_convolution        53.28%       1.576ms        54.06%       1.599ms       1.599ms       2.81 Mb           0 b             1  
    aten::max_pool2d_with_indices        25.90%     766.000us        25.90%     766.000us     766.000us       2.10 Mb       2.10 Mb             1  
                  aten::clamp_min        11.76%     348.000us        11.76%     348.000us     348.000us       2.81 Mb       2.81 Mb             1  
                      aten::addmm         3.11%      92.000us         3.99%     118.000us     118.000us      56.13 Kb      56.13 Kb             1  
                      aten::copy_         0.74%      22.000us         0.74%      22.000us      22.000us           0 b           0 b             1  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.958ms

Profiling GPU - 1 forward step

m_cuda = mnist_conv_model(device="cuda")
with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    tmp = m_cuda(X_train_cuda)

print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                     aten::conv2d        47.36%       1.902ms        47.36%       1.902ms       1.902ms             1  
                          aten::cudnn_convolution        44.82%       1.800ms        45.44%       1.825ms       1.825ms             1  
                                      aten::addmm         1.84%      74.000us         2.39%      96.000us      96.000us             1  
                                 cudaLaunchKernel         1.52%      61.000us         1.52%      61.000us       8.714us             7  
                    aten::max_pool2d_with_indices         0.60%      24.000us         0.80%      32.000us      32.000us             1  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.016ms

Profiling CPU - fit

m = mnist_conv_model(device="cpu")
with torch.autograd.profiler.profile(with_stack=True, profile_memory=True) as prof_cpu:
    losses = m.fit(X_train, y_train, n=1000)

print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             aten::convolution_backward        28.11%     847.371ms        28.31%     853.388ms     853.388us     312.50 Kb       2.03 Kb          1000  
                          aten::max_pool2d_with_indices        19.47%     586.731ms        19.47%     586.731ms     586.731us       2.06 Gb       2.06 Gb          1000  
                               aten::mkldnn_convolution        10.07%     303.619ms        10.30%     310.468ms     310.468us       2.74 Gb       8.42 Mb          1000  
                               aten::threshold_backward         6.98%     210.441ms         6.98%     210.441ms     210.441us       2.74 Gb       2.74 Gb          1000  
                                               aten::mm         6.34%     191.005ms         6.34%     191.005ms      95.502us     706.54 Mb     706.54 Mb          2000  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.014s

Profiling GPU - fit

m_cuda = mnist_conv_model(device="cuda")
with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    losses = m_cuda.fit(X_train_cuda, y_train_cuda, n=1000)

print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       cudaLaunchKernel        13.89%     374.915ms        13.89%     374.915ms      15.623us         23998  
                                Optimizer.step#SGD.step        13.57%     366.201ms        19.16%     517.321ms     517.321us          1000  
                                            aten::addmm         4.54%     122.508ms         6.26%     169.086ms     169.086us          1000  
                             aten::convolution_backward         4.10%     110.628ms         8.23%     222.153ms     222.153us          1000  
                                aten::cudnn_convolution         3.89%     105.139ms         5.15%     139.069ms     139.069us          1000  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.700s
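
When reading these tables, "CPU time avg" is "CPU total" divided by "# of Calls". A quick check of the `cudaLaunchKernel` row, with the values copied from the table above:

```python
# Values copied from the cudaLaunchKernel row of the table above
cpu_total_ms = 374.915
n_calls = 23998
n_iterations = 1000            # n=1000 in the fit() call

avg_us = cpu_total_ms * 1000 / n_calls
launches_per_iteration = n_calls / n_iterations

print(f"{avg_us:.3f} us per launch, about {launches_per_iteration:.0f} launches per iteration")
```

Each training iteration therefore issues roughly 24 kernel launches, and the time shown is the host-side cost of launching them, not the time the GPU spends executing them.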

CIFAR10

Loading the data

import torchvision

training_data = torchvision.datasets.CIFAR10(
    root="/data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

test_data = torchvision.datasets.CIFAR10(
    root="/data",
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

CIFAR10 data

training_data.classes

['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

training_data.data.shape

(50000, 32, 32, 3)

test_data.data.shape

(10000, 32, 32, 3)

training_data[0]

(tensor([[[0.2314, 0.1686, 0.1961,  ..., 0.6196, 0.5961, 0.5804],
         [0.0627, 0.0000, 0.0706,  ..., 0.4824, 0.4667, 0.4784],
         [0.0980, 0.0627, 0.1922,  ..., 0.4627, 0.4706, 0.4275],
         ...,
         [0.8157, 0.7882, 0.7765,  ..., 0.6275, 0.2196, 0.2078],
         [0.7059, 0.6784, 0.7294,  ..., 0.7216, 0.3804, 0.3255],
         [0.6941, 0.6588, 0.7020,  ..., 0.8471, 0.5922, 0.4824]],

        [[0.2431, 0.1804, 0.1882,  ..., 0.5176, 0.4902, 0.4863],
         [0.0784, 0.0000, 0.0314,  ..., 0.3451, 0.3255, 0.3412],
         [0.0941, 0.0275, 0.1059,  ..., 0.3294, 0.3294, 0.2863],
         ...,
         [0.6667, 0.6000, 0.6314,  ..., 0.5216, 0.1216, 0.1333],
         [0.5451, 0.4824, 0.5647,  ..., 0.5804, 0.2431, 0.2078],
         [0.5647, 0.5059, 0.5569,  ..., 0.7216, 0.4627, 0.3608]],

        [[0.2471, 0.1765, 0.1686,  ..., 0.4235, 0.4000, 0.4039],
         [0.0784, 0.0000, 0.0000,  ..., 0.2157, 0.1961, 0.2235],
         [0.0824, 0.0000, 0.0314,  ..., 0.1961, 0.1961, 0.1647],
         ...,
         [0.3765, 0.1333, 0.1020,  ..., 0.2745, 0.0275, 0.0784],
         [0.3765, 0.1647, 0.1176,  ..., 0.3686, 0.1333, 0.1333],
         [0.4549, 0.3686, 0.3412,  ..., 0.5490, 0.3294, 0.2824]]]), 6)

Example data

Data Loaders

batch_size = 100

training_loader = torch.utils.data.DataLoader(
    training_data, 
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)

test_loader = torch.utils.data.DataLoader(
    test_data, 
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)

Loader generator

training_loader

<torch.utils.data.dataloader.DataLoader object at 0x7f42e510d210>

X, y = next(iter(training_loader))
X.shape

torch.Size([100, 3, 32, 32])

y.shape

torch.Size([100])

CIFAR CNN

class cifar_conv_model(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = torch.device(device)
        self.epoch = 0
        self.model = torch.nn.Sequential(
            torch.nn.Conv2d(3, 6, kernel_size=5),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2, 2),
            torch.nn.Conv2d(6, 16, kernel_size=5),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2, 2),
            torch.nn.Flatten(),
            torch.nn.Linear(16 * 5 * 5, 120),
            torch.nn.ReLU(),
            torch.nn.Linear(120, 84),
            torch.nn.ReLU(),
            torch.nn.Linear(84, 10)
        ).to(device=self.device)
        
    def forward(self, X):
        return self.model(X)
    
    def fit(self, loader, epochs=10, n_report=250, lr=0.001):
        opt = torch.optim.SGD(self.parameters(), lr=lr, momentum=0.9) 
      
        for j in range(epochs):
            running_loss = 0.0
            for i, (X, y) in enumerate(loader):
                X, y = X.to(self.device), y.to(self.device)
                opt.zero_grad()
                loss = torch.nn.CrossEntropyLoss()(self(X), y)
                loss.backward()
                opt.step()
    
                # print statistics
                running_loss += loss.item()
                if i % n_report == (n_report-1):    # print every n_report mini-batches
                    print(f'[Epoch {self.epoch + 1}, Minibatch {i + 1:4d}] loss: {running_loss / n_report:.3f}')
                    running_loss = 0.0
            
            self.epoch += 1
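
How often `fit()` reports depends on the number of minibatches per epoch. With 50,000 training images and a batch size of 100 there are 500 minibatches, so the choice of `n_report` decides whether anything is printed at all. A small sketch of that logic; `report_steps` is a hypothetical helper that mirrors the `if` condition in `fit()`:

```python
import math

n_train, batch_size = 50_000, 100            # values from the slides above
batches_per_epoch = math.ceil(n_train / batch_size)

def report_steps(n_batches, n_report):
    """1-based minibatch numbers at which fit() prints the running loss."""
    return [i + 1 for i in range(n_batches) if i % n_report == n_report - 1]

print(batches_per_epoch)
print(report_steps(batches_per_epoch, 250))
print(report_steps(batches_per_epoch, 501))
```

The default `n_report=250` reports at minibatches 250 and 500, while `n_report=501` never triggers within a 500-batch epoch, which is a convenient way to silence the output while profiling.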

CNN Performance - CPU (1 step)

X, y = next(iter(training_loader))

m_cpu = cifar_conv_model(device="cpu")
tmp = m_cpu(X)

with torch.autograd.profiler.profile(with_stack=True) as prof_cpu:
    tmp = m_cpu(X)

print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::mkldnn_convolution        86.35%      12.410ms        86.57%      12.441ms       6.221ms             2  
    aten::max_pool2d_with_indices         7.72%       1.109ms         7.72%       1.109ms     554.500us             2  
                  aten::clamp_min         2.67%     383.000us         2.67%     383.000us      95.750us             4  
                      aten::addmm         1.52%     219.000us         1.79%     257.000us      85.667us             3  
                       aten::relu         0.28%      40.000us         2.94%     423.000us     105.750us             4  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 14.371ms

CNN Performance - GPU (1 step)

m_cuda = cifar_conv_model(device="cuda")
Xc, yc = X.to(device="cuda"), y.to(device="cuda")
tmp = m_cuda(Xc)
    
with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    tmp = m_cuda(Xc)

print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          aten::cudnn_convolution        50.86%       3.451ms        51.70%       3.508ms       1.754ms             2  
                    aten::max_pool2d_with_indices        23.57%       1.599ms        23.79%       1.614ms     807.000us             2  
                                  aten::clamp_min         7.46%     506.000us        11.23%     762.000us     190.500us             4  
                                       aten::relu         6.88%     467.000us        18.11%       1.229ms     307.250us             4  
                                       cudaMalloc         3.17%     215.000us         3.17%     215.000us     215.000us             1  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.785ms

CNN Performance - CPU (1 epoch)

m_cpu = cifar_conv_model(device="cpu")

with torch.autograd.profiler.profile(with_stack=True) as prof_cpu:
    m_cpu.fit(loader=training_loader, epochs=1, n_report=501)

print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             aten::convolution_backward        30.65%        1.361s        31.40%        1.394s       1.394ms          1000  
                               aten::mkldnn_convolution        17.39%     772.124ms        17.77%     788.803ms     788.803us          1000  
                          aten::max_pool2d_with_indices         7.89%     350.091ms         7.90%     350.800ms     350.800us          1000  
                               aten::threshold_backward         6.84%     303.662ms         6.85%     304.006ms     152.003us          2000  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...         6.31%     280.147ms         6.33%     280.930ms     560.739us           501  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.439s

CNN Performance - GPU (1 epoch)

m_cuda = cifar_conv_model(device="cuda")

with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    m_cuda.fit(loader=training_loader, epochs=1, n_report=501)

print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        12.04%     360.143ms        12.09%     361.647ms     721.850us           501  
                                       cudaLaunchKernel        11.75%     351.313ms        11.76%     351.527ms      12.555us         27998  
                                Optimizer.step#SGD.step         7.75%     231.842ms        10.60%     316.884ms     633.768us           500  
                             aten::convolution_backward         5.73%     171.197ms        10.23%     305.927ms     305.927us          1000  
                                            aten::addmm         4.62%     138.160ms         6.21%     185.662ms     123.775us          1500  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.990s

Loaders & Accuracy

def accuracy(model, loader, device):
    total, correct = 0, 0
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device=device), y.to(device=device)
            pred = model(X)
            # the class with the highest energy is what we choose as prediction
            val, idx = torch.max(pred, 1)
            total += pred.size(0)
            correct += (idx == y).sum().item()
            
    return correct / total

Model fitting

m = cifar_conv_model("cuda")
m.fit(training_loader, epochs=10, n_report=500, lr=0.01)
## [Epoch 1, Minibatch  500] loss: 2.098
## [Epoch 2, Minibatch  500] loss: 1.692
## [Epoch 3, Minibatch  500] loss: 1.482
## [Epoch 4, Minibatch  500] loss: 1.374
## [Epoch 5, Minibatch  500] loss: 1.292
## [Epoch 6, Minibatch  500] loss: 1.226
## [Epoch 7, Minibatch  500] loss: 1.173
## [Epoch 8, Minibatch  500] loss: 1.117
## [Epoch 9, Minibatch  500] loss: 1.071
## [Epoch 10, Minibatch  500] loss: 1.035

accuracy(m, training_loader, "cuda")
## 0.63444
accuracy(m, test_loader, "cuda")
## 0.572

More epochs

If we continue fitting the existing model,

m.fit(training_loader, epochs=10, n_report=500)
## [Epoch 11, Minibatch  500] loss: 0.885
## [Epoch 12, Minibatch  500] loss: 0.853
## [Epoch 13, Minibatch  500] loss: 0.839
## [Epoch 14, Minibatch  500] loss: 0.828
## [Epoch 15, Minibatch  500] loss: 0.817
## [Epoch 16, Minibatch  500] loss: 0.806
## [Epoch 17, Minibatch  500] loss: 0.798
## [Epoch 18, Minibatch  500] loss: 0.787
## [Epoch 19, Minibatch  500] loss: 0.780
## [Epoch 20, Minibatch  500] loss: 0.773

accuracy(m, training_loader, "cuda")
## 0.73914
accuracy(m, test_loader, "cuda")
## 0.624

More epochs (again)

m.fit(training_loader, epochs=10, n_report=500)
## [Epoch 21, Minibatch  500] loss: 0.764
## [Epoch 22, Minibatch  500] loss: 0.756
## [Epoch 23, Minibatch  500] loss: 0.748
## [Epoch 24, Minibatch  500] loss: 0.739
## [Epoch 25, Minibatch  500] loss: 0.733
## [Epoch 26, Minibatch  500] loss: 0.726
## [Epoch 27, Minibatch  500] loss: 0.718
## [Epoch 28, Minibatch  500] loss: 0.710
## [Epoch 29, Minibatch  500] loss: 0.702
## [Epoch 30, Minibatch  500] loss: 0.698

accuracy(m, training_loader, "cuda")
## 0.76438
accuracy(m, test_loader, "cuda")
## 0.6217
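
Comparing the three sets of results, training accuracy keeps rising while test accuracy levels off. The sketch below only restates the accuracies reported above and computes the gap between them:

```python
# (train accuracy, test accuracy) after 10, 20 and 30 epochs, copied from the output above
results = {10: (0.63444, 0.572), 20: (0.73914, 0.624), 30: (0.76438, 0.6217)}

gaps = {epochs: train - test for epochs, (train, test) in results.items()}

for epochs, gap in gaps.items():
    print(f"{epochs} epochs: train - test = {gap:.3f}")
```

A gap that widens with each additional round of fitting is the usual sign that the model is starting to overfit the training data.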

The VGG16 model

class VGG16(torch.nn.Module):
    def make_layers(self):
        cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [torch.nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [torch.nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           torch.nn.BatchNorm2d(x),
                           torch.nn.ReLU(inplace=True)]
                in_channels = x
        layers += [
            torch.nn.AvgPool2d(kernel_size=1, stride=1),
            torch.nn.Flatten(),
            torch.nn.Linear(512,10)
        ]
        
        return torch.nn.Sequential(*layers).to(self.device)
    
    def __init__(self, device):
        super().__init__()
        self.device = torch.device(device)
        self.model = self.make_layers()
    
    def forward(self, X):
        return self.model(X)
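
The final `Linear(512, 10)` layer works because the five max-pooling layers reduce a 32x32 CIFAR10 image to a single pixel with 512 channels. A short sketch that traces the shape through `cfg`, assuming a 3 x 32 x 32 input:

```python
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

channels, side = 3, 32          # CIFAR10 input: 3 x 32 x 32
for x in cfg:
    if x == 'M':
        side //= 2              # MaxPool2d(kernel_size=2, stride=2) halves each side
    else:
        channels = x            # 3x3 conv with padding=1 keeps the side length

print(channels, side, channels * side * side)
```

This prints `512 1 512`; the `AvgPool2d(kernel_size=1, stride=1)` layer that follows leaves the shape unchanged, so `Flatten` produces 512 features per image.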

Model

VGG16("cpu").model

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (5): ReLU(inplace=True)
  (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (9): ReLU(inplace=True)
  (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (12): ReLU(inplace=True)
  (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (16): ReLU(inplace=True)
  (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (19): ReLU(inplace=True)
  (20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (22): ReLU(inplace=True)
  (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (24): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (26): ReLU(inplace=True)
  (27): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (29): ReLU(inplace=True)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (32): ReLU(inplace=True)
  (33): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (36): ReLU(inplace=True)
  (37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (39): ReLU(inplace=True)
  (40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (42): ReLU(inplace=True)
  (43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (44): AvgPool2d(kernel_size=1, stride=1, padding=0)
  (45): Flatten(start_dim=1, end_dim=-1)
  (46): Linear(in_features=512, out_features=10, bias=True)
)
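
The `in_features=512` of the final `Linear` layer follows from the spatial arithmetic of the stack. The sketch below assumes 32x32 input images and the standard VGG16 layout of 2, 2, 3, 3 and 3 convolutions per block, each block ending in a max pool; the helper name `out_size` is made up for illustration.

```python
def out_size(size, kernel_size, stride, padding):
    """Output length along one spatial dimension for Conv2d / MaxPool2d
    (dilation=1, ceil_mode=False)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# (number of 3x3 convolutions, output channels) per block; assumed layout
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size = 32  # assumed input height and width
for n_convs, channels in blocks:
    for _ in range(n_convs):
        # 3x3 convolution with padding 1 keeps the spatial size
        size = out_size(size, kernel_size=3, stride=1, padding=1)
    # 2x2 max pool with stride 2 halves the spatial size
    size = out_size(size, kernel_size=2, stride=2, padding=0)
    print(f"after block with {channels} channels: {size}x{size}")

flat_features = channels * size * size
print(flat_features)  # 512
```

Five halvings take 32 down to 1, so flattening leaves only the 512 channels.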

VGG16 performance - CPU

X, y = next(iter(training_loader))
m_cpu = VGG16(device="cpu")
tmp = m_cpu(X)

with torch.autograd.profiler.profile(with_stack=True) as prof_cpu:
    tmp = m_cpu(X)

print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::mkldnn_convolution        81.00%     193.414ms        82.43%     196.826ms      15.140ms            13  
          aten::native_batch_norm         8.97%      21.413ms         9.63%      23.005ms       1.770ms            13  
    aten::max_pool2d_with_indices         6.30%      15.053ms         6.30%      15.053ms       3.011ms             5  
                      aten::empty         2.04%       4.881ms         2.04%       4.881ms      37.546us           130  
                 aten::clamp_min_         1.06%       2.530ms         1.06%       2.530ms     194.615us            13  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 238.793ms

VGG16 performance - GPU

m_cuda = VGG16(device="cuda")
Xc, yc = X.to(device="cuda"), y.to(device="cuda")
tmp = m_cuda(Xc)

with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    tmp = m_cuda(Xc)

print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          aten::cudnn_convolution        38.88%       3.379ms        48.86%       4.246ms     326.615us            13  
                           aten::cudnn_batch_norm        24.98%       2.171ms        33.44%       2.906ms     223.538us            13  
                                       cudaMalloc        11.40%     991.000us        11.40%     991.000us     165.167us             6  
                                 cudaLaunchKernel         6.40%     556.000us         6.40%     556.000us       5.009us           111  
                                       aten::add_         3.43%     298.000us         5.05%     439.000us      16.885us            26  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 8.691ms
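
The table above is sorted by CPU-side time. CUDA kernels are launched asynchronously, so these numbers need not reflect the time the GPU spends computing. A rough alternative is wall-clock timing with an explicit synchronization. The sketch below uses a small stand-in model (`toy`, not the VGG16 above) and a hypothetical helper `time_forward`, and falls back to the CPU when no GPU is available.

```python
import time
import torch

def time_forward(model, X, n_rep=5):
    """Average wall-clock seconds per forward pass, waiting for the device to finish."""
    def sync():
        # Block until all queued CUDA work is done (no-op on the CPU)
        if X.device.type == "cuda":
            torch.cuda.synchronize()

    with torch.no_grad():
        model(X)  # warm-up run, excluded from the timing
        sync()
        start = time.perf_counter()
        for _ in range(n_rep):
            model(X)
        sync()
        end = time.perf_counter()
    return (end - start) / n_rep

device = "cuda" if torch.cuda.is_available() else "cpu"
toy = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).to(device)
X_toy = torch.randn(4, 3, 32, 32, device=device)

elapsed = time_forward(toy, X_toy)
print(f"{elapsed * 1e3:.3f} ms per forward pass on {device}")
```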

VGG16 performance - Apple M1 GPU (mps)

m_mps = VGG16(device="mps")
Xm, ym = X.to(device="mps"), y.to(device="mps")

with torch.autograd.profiler.profile(with_stack=True) as prof_mps:
    tmp = m_mps(Xm)

print(prof_mps.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::native_batch_norm        35.71%       3.045ms        35.71%       3.045ms     234.231us            13  
          aten::_mps_convolution        19.67%       1.677ms        19.88%       1.695ms     130.385us            13  
    aten::_batch_norm_impl_index        11.92%       1.016ms        36.02%       3.071ms     236.231us            13  
                     aten::relu_        11.29%     963.000us        11.29%     963.000us      74.077us            13  
                      aten::add_        10.40%     887.000us        10.44%     890.000us      68.462us            13  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 8.526ms

Fitting w/ `lr = 0.01`

m = VGG16(device="cuda")
fit(m, training_loader, epochs=10, n_report=500, lr=0.01)

## [Epoch 1, Minibatch  500] loss: 1.345
## [Epoch 2, Minibatch  500] loss: 0.790
## [Epoch 3, Minibatch  500] loss: 0.577
## [Epoch 4, Minibatch  500] loss: 0.445
## [Epoch 5, Minibatch  500] loss: 0.350
## [Epoch 6, Minibatch  500] loss: 0.274
## [Epoch 7, Minibatch  500] loss: 0.215
## [Epoch 8, Minibatch  500] loss: 0.167
## [Epoch 9, Minibatch  500] loss: 0.127
## [Epoch 10, Minibatch  500] loss: 0.103

accuracy(model=m, loader=training_loader, device="cuda")
## 0.97008
accuracy(model=m, loader=test_loader, device="cuda")
## 0.8318

Fitting w/ `lr = 0.001`

m = VGG16(device="cuda")
fit(m, training_loader, epochs=10, n_report=500, lr=0.001)

## [Epoch 1, Minibatch  500] loss: 1.279
## [Epoch 2, Minibatch  500] loss: 0.827
## [Epoch 3, Minibatch  500] loss: 0.599
## [Epoch 4, Minibatch  500] loss: 0.428
## [Epoch 5, Minibatch  500] loss: 0.303
## [Epoch 6, Minibatch  500] loss: 0.210
## [Epoch 7, Minibatch  500] loss: 0.144
## [Epoch 8, Minibatch  500] loss: 0.108
## [Epoch 9, Minibatch  500] loss: 0.088
## [Epoch 10, Minibatch  500] loss: 0.063

accuracy(model=m, loader=training_loader, device="cuda")
## 0.9815
accuracy(model=m, loader=test_loader, device="cuda")
## 0.7816

Report

from sklearn.metrics import classification_report

def report(model, loader, device):
    # Collect true and predicted labels over all minibatches
    y_true, y_pred = [], []
    with torch.no_grad():  # no gradients are needed for prediction
        for X, y in loader:
            X = X.to(device=device)
            y_true.append( y.cpu().numpy() )
            # predicted class = index of the largest output along dim 1
            y_pred.append( model(X).argmax(dim=1).cpu().numpy() )
    
    y_true = np.concatenate(y_true)
    y_pred = np.concatenate(y_pred)

    return classification_report(y_true, y_pred, target_names=loader.dataset.classes)

print(report(model=m, loader=test_loader, device="cuda"))

##               precision    recall  f1-score   support
## 
##     airplane       0.82      0.88      0.85      1000
##   automobile       0.95      0.89      0.92      1000
##         bird       0.85      0.70      0.77      1000
##          cat       0.68      0.74      0.71      1000
##         deer       0.84      0.83      0.83      1000
##          dog       0.81      0.73      0.77      1000
##         frog       0.83      0.92      0.87      1000
##        horse       0.87      0.87      0.87      1000
##         ship       0.89      0.92      0.90      1000
##        truck       0.86      0.93      0.89      1000
## 
##     accuracy                           0.84     10000
##    macro avg       0.84      0.84      0.84     10000
## weighted avg       0.84      0.84      0.84     10000
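
Each row of the report is derived from the counts of true positives, false positives and false negatives for that class. The sketch below shows the arithmetic on made-up labels (not the predictions above); `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, made up for illustration
y_true = ["cat", "cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "cat", "dog", "dog"]

precision, recall, f1 = per_class_scores(y_true, y_pred, "cat")
print(precision, recall, f1)
```

The support column is the number of true instances of each class, 1000 per class here.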

Some state-of-the-art examples

Hugging Face

Hugging Face is an online community and platform for sharing machine learning models (architectures and weights), data, and related artifacts. It also maintains a number of packages and training materials that help with building, training, and deploying ML models.

Some notable resources:

transformers - APIs and tools to easily download and train state-of-the-art pretrained transformer-based models
diffusers - provides pretrained vision and audio diffusion models, and serves as a modular toolbox for inference and training
timm - a library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts

Stable Diffusion

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

prompt = "a picture of thomas bayes with a cat on his lap"
generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(6)]
fit = pipe(prompt, generator=generator, num_inference_steps=20, num_images_per_prompt=6)

fit.images

[<PIL.Image.Image image mode=RGB size=512x512 at 0x7FB991D3B910>, <PIL.Image.Image image mode=RGB size=512x512 at
   0x7FB993365850>, <PIL.Image.Image image mode=RGB size=512x512 at 0x7FB9933D0D90>, <PIL.Image.Image image mode=RGB
   size=512x512 at 0x7FB993094110>, <PIL.Image.Image image mode=RGB size=512x512 at 0x7FB991D82410>, <PIL.Image.Image
   image mode=RGB size=512x512 at 0x7FBB53A9C6D0>]

Customizing prompts

prompt = "a picture of thomas bayes with a cat on his lap"
prompts = [
  # join with ", " so the suffix does not run into the last word of the prompt
  f"{prompt}, {t}" for t in
  ["in the style of a japanese wood block print",
   "as a hipster with facial hair and glasses",
   "as a simpsons character, cartoon, yellow",
   "in the style of a vincent van gogh painting",
   "in the style of a picasso painting",
   "with flowery wall paper"
  ]
]

generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(6)]
fit = pipe(prompts, generator=generator, num_inference_steps=20, num_images_per_prompt=1)

Increasing inference steps

generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(6)]
fit = pipe(prompts, generator=generator, num_inference_steps=50, num_images_per_prompt=1)

Alpaca LoRA

from transformers import GenerationConfig, LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("chainyo/alpaca-lora-7b")

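model = LlamaForCausalLM.from_pretrained(
    "chainyo/alpaca-lora-7b",
    load_in_8bit=True,          # 8-bit weights; requires the bitsandbytes package
    torch_dtype=torch.float16,
    device_map="auto",          # let accelerate place layers on the available devices
)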

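generation_config = GenerationConfig(
    temperature=0.2,     # sampling settings: temperature, top_p and top_k
    top_p=0.75,          # only take effect when do_sample=True
    top_k=40,
    num_beams=4,         # beam search with 4 beams
    max_new_tokens=128,  # upper bound on the number of generated tokens
)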

Generate a prompt

instruction = "Write a short childrens story about Thomas Bayes and his pet cat"
input_ctxt = None 
prompt = generate_prompt(instruction, input_ctxt)
print(prompt)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a short childrens story about Thomas Bayes and his pet cat

### Response:
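
`generate_prompt` is not defined on these slides. Judging from the printed output, it fills in the Alpaca instruction template. In the sketch below, the no-input branch matches the output above, while the wording of the with-input branch is an assumption based on the Alpaca prompt format.

```python
def generate_prompt(instruction, input_ctxt=None):
    """Build an Alpaca-style prompt (sketch; the with-input wording is assumed)."""
    if input_ctxt:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_ctxt}\n\n"
            "### Response:"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:"
    )

example = generate_prompt("Write a short childrens story about Thomas Bayes and his pet cat")
print(example)
```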

Running the model

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )

response = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a short childrens story about Thomas Bayes and his pet cat

### Response:
Once upon a time, there was a little boy named Thomas Bayes. He had a pet cat named Fluffy, and 
they were the best of friends. One day, Thomas and Fluffy decided to go on an adventure. They 
traveled far and wide, exploring new places and meeting new people. Along the way, Thomas and 
Fluffy learned many valuable lessons, such as the importance of friendship and the joy of discovery.
Eventually, Thomas and Fluffy made their way back home, where they were welcomed with open arms. 
Thomas and Fluffy had a wonderful time.
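
The decoded text repeats the prompt, because `outputs.sequences` holds the prompt tokens followed by the generated ones. One simple way to keep only the generated part is to split on the `### Response:` marker. The sketch below does this on a shortened copy of the text above; `extract_response` is a hypothetical helper, and it assumes the marker does not occur again inside the generated text.

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the first occurrence of the marker ('' if it is absent)."""
    _, _, tail = decoded.partition(marker)
    return tail.strip()

decoded = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Write a short childrens story about Thomas Bayes and his pet cat\n\n"
    "### Response:\n"
    "Once upon a time, there was a little boy named Thomas Bayes."
)

story = extract_response(decoded)
print(story)  # Once upon a time, there was a little boy named Thomas Bayes.
```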