DL-Intro

Lucas-TY

HPC | Jan 30, 2024 | Last edited: Feb 1, 2024

High Performance Deep Learning (CSE 5449)

Intro

History

  • ImageNet
  • AlexNet

Definitions

  • Machine Learning
    • Ability of machines to learn without being explicitly programmed
  • Supervised Learning
    • We provide the machine with the “right answers”
      • Classification – Discrete value output (e.g. email is spam or not-spam)
      • Regression – Continuous output values (e.g. house prices)
  • Unsupervised Learning
    • No “right answers” given. Learn yourself; no labels for you!
      • Clustering – Group the data points that are “close” to each other (e.g. the cocktail party problem)
      • Finding structure in the data is the key here!

DNN Training

  • Backward Pass
    • Compute the gradients of the loss and use them to update the weights (see the sketch below)
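
A minimal sketch of one training step, assuming a PyTorch-style framework; the model, loss, and optimizer below are placeholders, not from the notes:

```python
import torch
import torch.nn as nn

# Placeholder network, loss, and optimizer for illustration only.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10)           # a batch of 8 samples
y = torch.randint(0, 2, (8,))    # their labels

loss = loss_fn(model(x), y)      # forward pass: predictions and loss

optimizer.zero_grad()
loss.backward()                  # backward pass: gradients of the loss w.r.t. the weights
optimizer.step()                 # update the weights using those gradients
```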

Essential Concepts: Activation function and Back-propagation

  • Back-propagation
    • Involves complicated mathematics.
    • Luckily, most DL frameworks give you a one-line implementation: model.backward()
  • Activation functions
    • Introduce non-linearity into the network
      • Non-linearities allow the network to approximate complex, non-linear functions.
    • ReLU (a max function) is the most common activation function.
    • Common choices: Sigmoid, Tanh, ReLU, Leaky ReLU (sketched below)
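
A small NumPy sketch of the activation functions listed above, just to make the non-linearity concrete (the 0.01 slope for Leaky ReLU is a common default, not something stated in the notes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                    # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)            # a max function, as noted above

def leaky_relu(x, slope=0.01):           # slope of 0.01 is an assumed default
    return np.where(x > 0, x, slope * x)

x = np.linspace(-3, 3, 7)
print(relu(x))      # negative inputs are clamped to 0
print(sigmoid(x))
```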

Parameters vs. Hyperparameters

Parameters

  • Estimated during training from historical data
  • Part of the model
  • The estimated values are saved with the trained model
  • Dependent on the dataset the system is trained with

Hyperparameters

  • Values are set beforehand
  • External to the model.
  • Not a part of the trained model and hence the values are not saved.
  • Independent of the dataset
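
A hedged illustration of the split, assuming PyTorch: the weights inside the model are parameters (estimated during training and saved), while lr, batch_size, and epochs are hyperparameters (set beforehand and not stored with the trained model):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)     # its weight and bias are parameters: learned from data

# Hyperparameters: chosen by us before training, external to the model.
lr = 0.1
batch_size = 64
epochs = 10

optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Only the parameters are saved with the trained model; lr, batch_size,
# and epochs are not part of the checkpoint.
torch.save(model.state_dict(), "model.pt")
```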

Stochastic Gradient Descent (SGD)

  • Goal of SGD:
    • Minimize a cost function J(\theta) as a function of \theta
  • SGD is iterative
  • Only two equations to remember (sketched below)
  • Learning rate
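
The notes do not list the equations themselves; a sketch of the usual gradient-descent pair (the gradient of the cost, then the parameter update scaled by the learning rate \alpha):

```latex
% Gradient of the cost J(\theta) with respect to parameter \theta_j
g_j = \frac{\partial J(\theta)}{\partial \theta_j}

% Update rule, repeated each iteration (\alpha is the learning rate)
\theta_j \leftarrow \theta_j - \alpha \, g_j
```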

Learning Rate (\alpha)

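A minimal numeric sketch of the learning rate's effect, assuming the toy cost J(\theta) = \theta^2 (gradient 2\theta): a small \alpha converges slowly, while a too-large \alpha overshoots and diverges.

```python
def gradient_descent(theta, alpha, steps=5):
    """Minimize the toy cost J(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        grad = 2 * theta
        theta = theta - alpha * grad     # the learning rate scales every update
    return theta

print(gradient_descent(1.0, alpha=0.1))   # approaches 0 slowly
print(gradient_descent(1.0, alpha=0.5))   # reaches 0 immediately for this cost
print(gradient_descent(1.0, alpha=1.1))   # diverges: steps are too large
```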

Batch Size

  • Batch Gradient Descent - Batch Size = N
    • In each iteration, the gradient of the loss function is computed using the entire training dataset
    • Weights are updated once per pass over the entire dataset
    • Since all samples are used, the direction of the gradient is typically stable, but the computation can be very slow, especially with large datasets
  • Stochastic Gradient Descent – Batch Size = 1
    • In each iteration, a single randomly chosen sample from the training dataset is used to compute the gradient of the loss function.
    • Weight updates can be very frequent, with an update occurring for every individual sample.
    • As only one sample is used at a time, the direction of the gradient can be noisy and have high variance, but each iteration is computationally fast.
  • Mini-batch Gradient Descent – Somewhere in the middle – Common
    • Batch Size = 64, 128, 256, etc.
  • Finding the optimal batch size will yield the fastest learning.
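
A hedged NumPy sketch covering all three regimes with one loop: batch_size = N gives batch gradient descent, batch_size = 1 gives SGD, and values like 64 to 256 give the common mini-batch setting (the linear-regression loss here is just for illustration):

```python
import numpy as np

def train_epoch(X, y, w, lr, batch_size):
    """One pass over the data for linear regression with squared error."""
    N = len(X)
    order = np.random.permutation(N)                   # shuffle once per epoch
    for start in range(0, N, batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient on this batch only
        w = w - lr * grad                              # one weight update per batch
    return w

X = np.random.randn(1000, 5)
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.zeros(5)
w = train_epoch(X, y, w, lr=0.1, batch_size=64)    # mini-batch gradient descent
# batch_size=len(X) -> batch gradient descent; batch_size=1 -> SGD
```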

Model Size

  • Model Size: # of parameters (weights on edges)
  • Model Size: # of layers (model depth)
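
A quick way to see both notions of model size on a concrete toy network, assuming PyTorch:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

num_params = sum(p.numel() for p in model.parameters())   # weights + biases on edges
num_layers = len(list(model.children()))                  # depth, counting activations here
print(num_params, num_layers)
```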

Accuracy and Throughput (Speed)

  • Accuracy of the trained model on “new” data is the metric of success
  • In Computer Vision: images/second is the metric of throughput/speed
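
A rough sketch of how an images/second number could be measured, assuming PyTorch; a real benchmark would warm up the model and synchronize the GPU, which this skips:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))   # stand-in vision model
images = torch.randn(64, 3, 224, 224)                               # one batch of 64 images
n_batches = 20

start = time.perf_counter()
with torch.no_grad():                       # inference only, no gradients
    for _ in range(n_batches):
        model(images)
elapsed = time.perf_counter() - start

print(f"throughput: {n_batches * images.shape[0] / elapsed:.1f} images/sec")
```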

Impact of Model Size and Dataset Size

  • Large models → better accuracy
  • More data → better accuracy
  • Single-node Training: good for
    • Small models and small datasets
  • Distributed Training: good for
    • Large models and large datasets

Overfitting and Underfitting

  • Overfitting – model > data
    • The model is not learning but memorizing your data
  • Underfitting – data > model
    • The model is not learning because it cannot capture the complexity of your data

How to Deal with Overfitting

  • Regularization
    • L1 and L2
  • Dropout
  • Data Augmentation
  • Early stopping
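
A hedged sketch of how two of these remedies typically show up in code, assuming PyTorch: Dropout layers inside the model and L2 regularization via the optimizer's weight_decay argument (data augmentation and early stopping live in the data pipeline and the training loop instead):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # dropout: randomly zeroes activations during training
    nn.Linear(64, 10),
)

# weight_decay adds an L2 penalty on the weights at every update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout is active while training
# ... training loop; early stopping would monitor validation loss here ...
model.eval()    # dropout is disabled at evaluation time
```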

Convolution Operation

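A naive NumPy sketch of the 2D convolution operation (implemented as cross-correlation, as deep-learning frameworks do), with stride 1 and no padding; the edge-detector kernel is just an example:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation): stride 1, no padding."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the kernel with the patch under it and sum the result.
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])
print(conv2d(image, edge_kernel))                   # 3x3 feature map
```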

Transformer Models

  • RNN
    • Has a short reference window
  • Attention Mechanism
    • Has an infinite reference window
  • Encoder
    • Output is a continuous vector representation of the inputs
  • Decoder
    • Feed previous outputs into the decoder recurrently until <end> is generated
  1. Input Embedding
      • Each word maps to a vector
  2. Positional Encoding
      • word vector + positional encoding = positional input embeddings
      • Even embedding dimensions use the sine function; odd dimensions use the cosine function (sketched after the encoder steps below)
      • The encodings have a linear property: the encoding of a shifted position is a linear function of the original encoding
  3. Multi-Headed Attention
    1. Self-Attention
        • Feed the input into three fully connected layers
        • Create query, key, value vectors
        • Scores come from comparing queries with keys; the higher the score, the higher the focus
        • The scores are divided by the square root of the key dimension to allow more stable gradients
        • Softmax lets the model be more confident about which word to attend to
        • Then feed the output into a linear layer for further processing
  4. Residual Connection, Layer Normalization & Pointwise Feed-Forward

Each encoder can learn a different representation.
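
A sketch of the sinusoidal positional encoding from step 2 above, following the original Transformer formula (sine on even dimensions, cosine on odd ones); max_len and d_model are arbitrary here:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(max_len)[:, None]       # positions 0..max_len-1, shape (max_len, 1)
    i = np.arange(d_model)[None, :]         # dimension indices, shape (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# positional input embeddings = word embeddings + pe[:sequence_length]
```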

  5. Output Embedding & Positional Encoding
  6. Decoder Multi-Headed Attention
    1. Look-Ahead Mask
        • Softmax produces attention probabilities between 0 and 1
        • Each decoder layer can take different attention from the encoder output (see the sketch below)
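
A NumPy sketch of scaled dot-product attention with the decoder's look-ahead mask (steps 3 and 6); the query, key, and value matrices are random stand-ins for the outputs of the three fully connected layers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a look-ahead (causal) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scale for more stable gradients
    mask = np.triu(np.ones_like(scores), k=1)    # 1s mark future positions
    scores = np.where(mask == 1, -1e9, scores)   # block attention to future words
    weights = softmax(scores)                    # probabilities between 0 and 1
    return weights @ V                           # weighted sum of the value vectors

seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
out = masked_attention(Q, K, V)                  # shape (4, 8): one vector per position
```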