High Performance Deep Learning (CSE 5449)
Intro
History
- ImageNet
- AlexNet
Definitions
- Machine Learning
- Ability of machines to learn without being explicitly programmed
- Supervised Learning
- We provide the machine with the “right answers”
- Classification – Discrete value output (e.g. email is spam or not-spam)
- Regression – Continuous output values (e.g. house prices)
- Unsupervised Learning
- No “right answers” given. Learn yourself; no labels for you!
- Clustering – Group the data points that are “close” to each other (e.g. cocktail party problem)
- Finding structure in the data is the key here!
DNN Training
- Backward Pass
- Compute the gradients of the loss with respect to the weights, then update the weights
Essential Concepts: Activation function and Back-propagation
- Back-propagation
- Involves repeated application of the chain rule, so the math gets complicated quickly
- Luckily, most DL frameworks give you a one-line implementation (sketched below):
model.backward()
- Activation functions
- Introduce non-linearity into the network
- Non-linearities allow the network to approximate complex, non-linear functions.
- ReLU (a max function: max(0, x)) is the most common activation function.
- Sigmoid, Tanh, ReLU, Leaky ReLU
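To make the forward/backward pass concrete, here is a minimal PyTorch-style training step. The two-layer network, random data, and SGD optimizer are made-up placeholders; in plain PyTorch the one-line back-propagation call is loss.backward() on the loss tensor (some wrapper libraries expose an equivalent on the model object).

```python
# Minimal sketch: one training step (toy model and fake data for illustration).
import torch
import torch.nn as nn

model = nn.Sequential(          # tiny network: linear -> ReLU -> linear
    nn.Linear(10, 32),
    nn.ReLU(),                  # non-linearity between the layers
    nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 10)         # one mini-batch of fake inputs
y = torch.randn(64, 1)          # fake "right answers" (supervised learning)

pred = model(x)                 # forward pass
loss = loss_fn(pred, y)         # cost to minimize

optimizer.zero_grad()           # clear old gradients
loss.backward()                 # back-propagation: one line, chain rule done for you
optimizer.step()                # weight update using the gradients
```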
Parameters vs. Hyperparameters
Parameters
- Estimated during the training with historical data
- Part of the model
- The estimated values are saved with the trained model
- Dependent on the dataset that the system is trained with
Hyperparameters (see the sketch after this list)
- Values are set beforehand
- External to the model.
- Not a part of the trained model and hence the values are not saved.
- Independent of the dataset
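A quick contrast as a hedged PyTorch sketch (the layer sizes and values are made up): the weights returned by model.parameters() are learned from data and stored in the model's state_dict, while hyperparameters like the learning rate and batch size are chosen beforehand and are not saved inside the model.

```python
# Sketch: parameters live inside the model; hyperparameters are set by hand.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)        # its weight and bias are *parameters*

# Hyperparameters: chosen beforehand, external to the model, not saved with it.
learning_rate = 0.01
batch_size = 64
num_epochs = 10

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Parameters are estimated during training and saved with the trained model:
torch.save(model.state_dict(), "model.pt")   # contains weight/bias only,
                                             # not learning_rate or batch_size
```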
Stochastic Gradient Descent (SGD)
- Goal of SGD:
- Minimize a cost function
- J(\theta), viewed as a function of the parameters \theta
- SGD is iterative
- Only two equations to remember (a sketch follows this list):
- Gradient of the cost: g = \nabla_\theta J(\theta)
- Weight update: \theta \leftarrow \theta - \alpha \cdot g
- Learning rate \alpha controls the size of each update step
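A minimal NumPy sketch of those two equations, using a made-up least-squares cost on random data (all names and sizes here are illustrative, not from the lecture):

```python
# Sketch: SGD on a least-squares cost J(theta) = (x . theta - y)^2 / 2, one sample at a time.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # fake dataset
true_theta = rng.normal(size=5)
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)                       # parameters to estimate
alpha = 0.01                              # learning rate (hyperparameter)

for step in range(10000):
    i = rng.integers(len(X))              # batch size = 1: pick one random sample
    grad = (X[i] @ theta - y[i]) * X[i]   # equation 1: gradient of the cost
    theta = theta - alpha * grad          # equation 2: weight update
```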
Learning Rate (\alpha)
Batch Size
- Batch Gradient Descent - Batch Size = N
- In each iteration, the gradient of the loss function is computed using the entire training dataset
- Weights are updated once per pass over the entire dataset
- Since all samples are used, the direction of the gradient is typically stable, but the computation can be very slow, especially with large datasets
- Stochastic Gradient Descent – Batch Size = 1
- In each iteration, a single randomly chosen sample from the training dataset is used to compute the gradient of the loss function.
- Weight updates can be very frequent, with an update occurring for every individual sample.
- As only one sample is used at a time, the direction of the gradient can be noisy and have high variance, but each iteration is computationally fast.
- Mini-batch Gradient Descent – Somewhere in the middle – Common
- Batch Size = 64, 128, 256, etc.
- Finding the optimal batch size will yield the fastest learning (see the sketch below)
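The three variants differ only in how many samples feed each gradient computation. A hedged NumPy sketch extending the SGD example above (batch_size is the knob: N recovers batch gradient descent, 1 recovers pure SGD):

```python
# Sketch: mini-batch gradient descent; batch_size = N, 1, or something in between.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)

theta = np.zeros(5)
alpha, batch_size = 0.05, 128             # batch_size = 1000 -> batch GD, 1 -> SGD

for epoch in range(20):
    order = rng.permutation(len(X))       # shuffle once per pass over the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient averaged over the mini-batch
        theta -= alpha * grad                        # one update per mini-batch
```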
Model Size
- Model Size: # of parameters (weights on edges)
- Model Size: # of layers (model depth)
Accuracy and Throughput (Speed)
- accuracy of the trained model on “new” data is the metric of success
- In Computer Vision: images/second is the metric of throughput/speed
Impact of Model Size and Dataset Size
- Large models → better accuracy
- More data → better accuracy
- Single-node training – good for:
- Small models and small datasets
- Distributed training – good for (see the launch sketch after this list):
- Large models and large datasets
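For the distributed case, a minimal data-parallel sketch with PyTorch's DistributedDataParallel. Assumptions: one machine with multiple GPUs, launched via torchrun; the model and data are toy placeholders.

```python
# Sketch: data-parallel training across GPUs with PyTorch DDP.
# Assumed launch: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

model = nn.Linear(10, 1).to(device)
model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced across ranks

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 10, device=device)           # each rank sees its own shard of data
y = torch.randn(64, 1, device=device)

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()                                  # DDP synchronizes gradients here
optimizer.step()

dist.destroy_process_group()
```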
Overfitting and Underfitting
- Overfitting – model > data
- so the model is not learning but memorizing your data
- Underfitting – data > model
- so the model is not learning because it cannot capture the complexity of your data
How to Deal with Overfitting
- Regularization
- L1 and L2
- Dropout
- Data Augmentation
- Early stopping (see the sketch after this list for L2, dropout, and early stopping together)
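A hedged PyTorch sketch showing three of these knobs together: L2 regularization via weight_decay, dropout, and a simple early-stopping check. The sizes, thresholds, and the placeholder validation loss are made up.

```python
# Sketch: L2 regularization (weight_decay), dropout, and early stopping.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),                  # dropout: randomly zero activations during training
    nn.Linear(64, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            weight_decay=1e-4)   # L2 penalty on the weights

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    # ... training pass over the data would go here ...
    val_loss = torch.rand(1).item()     # placeholder for the real validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:          # early stopping: quit when validation stalls
        break
```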
Convolution Operation
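A minimal NumPy sketch of the operation (strictly, the cross-correlation that CNN layers compute): a small kernel slides over the input and produces a weighted sum at each location. The input and kernel values are made up.

```python
# Sketch: 2D convolution (cross-correlation) of one channel with one kernel.
import numpy as np

def conv2d(x, k):
    """Slide kernel k over input x (no padding, stride 1)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)   # elementwise multiply + sum
    return out

x = np.random.rand(5, 5)                 # toy 5x5 input feature map
k = np.array([[1., 0., -1.],             # toy 3x3 edge-detection-style kernel
              [1., 0., -1.],
              [1., 0., -1.]])
print(conv2d(x, k).shape)                # -> (3, 3)
```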
Transformer Models
- RNN
- Has a short reference window
- Attention Mechanism
- Has an infinite reference window
- Encoder
- Output is a continuous vector representation of the inputs
- Decoder
- Previous outputs are fed back into the decoder, one step at a time, until the <end> token is generated
- Input Embedding
- Each word maps to a vector
- Positional Encoding
- word vector + positional encoding = positional input embeddings
- Even embedding dimensions use the sine function, odd dimensions use the cosine function: PE(pos, 2i) = sin(pos / 10000^{2i/d_model}), PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
- These sinusoids have a linear property: the encoding at a shifted position is a linear function of the original one, which helps the model learn relative positions (see the sketch at the end of this section)
- Multi-headed attention
- Self-Attention
- Feed the input into three fully connected (linear) layers
- Create query, key, and value vectors
- Scores are the dot products of queries with keys; the higher the score, the higher the focus on that word
- Scores are divided by sqrt(d_k) to allow more stable gradients
- Softmax turns the scores into weights, letting the model be more confident about which word to attend to
- The weights are applied to the values, and the output is fed into a linear layer for further processing (sketched at the end of this section)
- Residual Connection, Layer Normalization & Pointwise Feed Forward
- Each encoder layer can learn a different representation
- Output Embedding & Positional Encoding
- Decoder Multi-Headed Attention
- Look-Ahead Mask – prevents each position from attending to future tokens
- Softmax – turns the final scores into probabilities between 0 and 1
- Each decoder layer can take different attention from the encoder's output
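As referenced above, a minimal NumPy sketch of sinusoidal positional encoding and single-head, unbatched scaled dot-product self-attention. The sequence length, model width, and weight matrices are made up for illustration.

```python
# Sketch: sinusoidal positional encoding + scaled dot-product self-attention (one head).
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)          # odd dimensions: cosine
    return pe

def self_attention(x, Wq, Wk, Wv):
    """Queries, keys, and values come from three linear maps of the same input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # dot-product scores, scaled for stable gradients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                    # higher weight -> more focus on that word

seq_len, d_model = 6, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))               # word embeddings
x = x + positional_encoding(seq_len, d_model)         # positional input embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)                                      # -> (6, 16)
```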