HPC-Introduction

Lucas-TY

HPC | Jan 30, 2024 | Last edited: Feb 1, 2024

HPC Introduction

Von Neumann Computer Architecture

  • Comprised of four main components: memory, control unit, arithmetic logic unit, and input/output

Flynn's Classical Taxonomy

  • Classifies parallel computer architectures along the two independent dimensions of Instruction Stream and Data Stream
    • SISD: Single Instruction stream, Single Data stream
    • SIMD: Single Instruction stream, Multiple Data streams
    • MISD: Multiple Instruction streams, Single Data stream
    • MIMD: Multiple Instruction streams, Multiple Data streams

General Parallel Computing Terminology

  • Node
    • A standalone "computer in a box." Usually comprised of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer
  • Task
    • A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors
  • Pipelining
    • Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing
  • Shared Memory
    • Describes a computer architecture where all processors have direct access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists
  • Symmetric Multi-Processor (SMP)
    • Shared memory hardware architecture where multiple processors share a single address space and have equal access to all resources - memory, disk, etc.
  • Distributed Memory
    • In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing
  • Communications
    • Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network
  • Synchronization
    • The coordination of parallel tasks in real time, very often associated with communications
    • Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.
  • Computational Granularity
    • Coarse: relatively large amounts of computational work are done between communication events
    • Fine: relatively small amounts of computational work are done between communication events
  • Observed Speedup
    • Observed speedup of a code which has been parallelized, defined as
      • (wall-clock time of serial execution)/(wall-clock time of parallel execution)
  • Parallel Overhead
    • Required execution time that is unique to parallel tasks, as opposed to that for doing useful work. Includes factors such as: task start-up/termination time; synchronization and communications; software overhead imposed by parallel languages, libraries, operating system, etc.
  • Embarrassingly (Ideally) Parallel
    • Solving many similar, but independent tasks simultaneously; little to no need for coordination between the tasks
  • Scalability
    • Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources. Factors that contribute to scalability include:
      • Hardware: particularly memory/CPU bandwidths and network communication properties
      • Application: the algorithm, the associated parallel overhead, and the characteristics of your specific application

Amdahl’s Law

  • Speedup = 1 / (1 – P), where P is the fraction of code that can be parallelized
  • If number of processors is increased
    • Speedup = 1 / (P/N + S)
    • P = parallel fraction
    • N = number of processors
    • S = serial fraction
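
As a quick illustration of the formulas above, the minimal sketch below evaluates the Amdahl's law bound for a few processor counts, assuming an illustrative parallel fraction of P = 0.95 (a made-up value, not from these notes):

```c
/* Minimal sketch: Amdahl's law speedup bound, Speedup = 1 / (P/N + S),
 * for an assumed parallel fraction P = 0.95. */
#include <stdio.h>

int main(void)
{
    const double P = 0.95;        /* assumed parallel fraction */
    const double S = 1.0 - P;     /* serial fraction */
    const int counts[] = {1, 2, 4, 8, 16, 64, 1024};

    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; ++i) {
        int N = counts[i];
        double speedup = 1.0 / (P / N + S);
        printf("N = %4d  ->  speedup = %6.2f\n", N, speedup);
    }
    return 0;
}
```

Even at N = 1024 the speedup stays below 1/S = 20, which is the practical point of the law: the serial fraction limits how far adding processors can help.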

Moore’s Law

  • The number of transistors in a dense integrated circuit (IC) doubles about every two years

Scalability

  • Strong scaling
    • The total problem size stays fixed as more processors are added
    • Goal is to run the same problem size faster
    • Perfect scaling means the problem is solved in 1/P of the serial time on P processors
  • Weak scaling (Gustafson)
    • The problem size per processor stays fixed as more processors are added. The total problem size is proportional to the number of processors used.
    • Goal is to run larger problem in same amount of time
    • Perfect scaling means a problem P times larger runs in the same time as the single-processor run
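
A minimal sketch of how the two efficiencies are usually computed from wall-clock timings; every timing number below is a made-up placeholder, not a measurement:

```c
/* Minimal sketch: strong- vs. weak-scaling efficiency from wall-clock
 * timings (placeholder values). */
#include <stdio.h>

int main(void)
{
    int N = 8;                                   /* processor count */

    /* Strong scaling: total problem size fixed.
     * Efficiency = T(1) / (N * T(N)); perfect scaling gives 1.0. */
    double t1_strong = 100.0, tN_strong = 14.0;  /* assumed seconds */
    printf("strong-scaling efficiency: %.2f\n", t1_strong / (N * tN_strong));

    /* Weak scaling: problem size per processor fixed.
     * Efficiency = T(1) / T(N); perfect scaling keeps run time flat. */
    double t1_weak = 100.0, tN_weak = 108.0;     /* assumed seconds */
    printf("weak-scaling efficiency:   %.2f\n", t1_weak / tN_weak);
    return 0;
}
```

The strong-scaling efficiency divides by N because the same work is spread over N processors; the weak-scaling efficiency does not, because the work grows with N.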

Shared Memory

  • Ability for all processors to access all memory as global address space
  • Multiple processors can operate independently but share the same memory resources
  • Changes in a memory location effected by one processor are visible to all other processors
  • Classified as Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA), based upon memory access times.
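
A minimal OpenMP sketch of the shared-memory model: every thread addresses the same array through one global address space. The compile command is an assumption about your toolchain (e.g. `gcc -fopenmp shared.c` for GCC):

```c
/* Minimal sketch of shared-memory parallelism with OpenMP: one array,
 * visible to all threads, updated and reduced in parallel. */
#include <stdio.h>

#define N 1000000

static double data[N];            /* one copy, visible to every thread */

int main(void)
{
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        data[i] = (double)i;      /* threads write shared memory directly */

    /* The reduction clause is one of the synchronization constructs that
     * shared-memory programming leaves to the programmer. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += data[i];

    printf("sum = %.0f\n", sum);  /* N*(N-1)/2 = 499999500000 */
    return 0;
}
```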

Uniform Memory Access (UMA)

  • Identical processors
  • Equal access and access times to memory
  • Sometimes called CC-UMA - Cache Coherent UMA
    • Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update
    • Cache coherency is accomplished at the hardware level

Non-Uniform Memory Access (NUMA)

  • Often made by physically linking two or more SMPs
  • One SMP can directly access memory of another SMP
  • Not all processors have equal access time to all memories
  • Memory access across link is slower
  • If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

Shared Memory: Advantages and Disadvantages

  • Advantages
    • User-friendly programming perspective to memory
    • Data sharing between tasks is both fast and uniform
  • Disadvantages
    • Lack of scalability between memory and CPUs
    • The programmer is responsible for synchronization constructs that ensure correct access to global memory

Distributed Memory

  • Requires a communication network to connect inter-processor memory
  • Processors have their own local memory
  • Memory addresses in one processor do not map to another processor
  • No concept of global address space across all processors
  • No concept of cache coherency
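
A minimal MPI sketch of the distributed-memory model: each rank owns private memory, and data moves only through explicit messages (launch with at least two ranks, e.g. `mpirun -np 2 ./a.out`):

```c
/* Minimal sketch of distributed-memory parallelism with MPI: rank 0
 * sends one integer to rank 1; no memory is shared between ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;                       /* local to this rank only */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```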

Distributed Memory: Advantages and Disadvantages

  • Advantages
    • Memory is scalable with the number of processors
    • each processor can rapidly access its own memory
    • Cost effectiveness: can use commodity, off-the-shelf processors and networking
  • Disadvantages
    • The programmer is responsible for many of the details of communication between processors
    • It may be difficult to map existing data structures to this memory organization
    • Non-uniform memory access times

Hybrid Distributed-Shared Memory

  • Advantages and Disadvantages
    • Whatever is common to both shared and distributed memory architectures
    • Increased scalability is an important advantage
    • Increased programmer complexity is an important disadvantage
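
A minimal sketch of the hybrid model: MPI ranks across nodes, OpenMP threads sharing memory within each node. How ranks are placed on nodes is up to the job launcher and is not shown here:

```c
/* Minimal sketch of hybrid MPI + OpenMP: each MPI rank spawns a team of
 * threads that share that rank's (node's) memory. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Threads make no MPI calls here, so plain MPI_Init is enough. */
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```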

Parallel Programming Models

  • Shared Memory (without threads)
  • Threads
  • Distributed Memory / Message Passing
  • Data Parallel
  • Hybrid
  • Single Program Multiple Data (SPMD)
  • Multiple Program Multiple Data (MPMD)

Synchronous vs. Asynchronous Communications

  • Synchronous communications require some type of "handshaking" between tasks
  • Synchronous communications are often referred to as blocking communications
  • Asynchronous communications are often referred to as non-blocking communications
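
A minimal MPI sketch of the two styles in the sense used above: a blocking receive on one rank, and a non-blocking send on the other that is completed later with MPI_Wait (needs at least two ranks):

```c
/* Minimal sketch: blocking vs. non-blocking MPI communication. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, value = 7, received = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Non-blocking ("asynchronous") send: returns immediately, so
         * useful work can overlap the transfer. */
        MPI_Request req;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other computation could happen here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the send */
    } else if (rank == 1) {
        /* Blocking ("synchronous") receive: returns only after the data
         * has arrived. */
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got %d\n", received);
    }

    MPI_Finalize();
    return 0;
}
```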

Scope of Communications

  • Point-to-point
    • involves two tasks with one task acting as the sender/producer of data, and the other acting as the receiver/consumer.
  • Collective
    • involves data sharing between more than two tasks, which are often specified as being members in a common group, or collective.

Types of Collective Communication

  • Broadcast
  • Scatter
  • Gather
  • Reduction
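
A minimal MPI sketch showing two of the collective operations listed above, a broadcast and a reduction; point-to-point is the Send/Recv pattern shown earlier:

```c
/* Minimal sketch of collective communication in MPI: one-to-all
 * broadcast, then all-to-one sum reduction. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int seed = 0;
    if (rank == 0)
        seed = 100;                       /* root provides the data */

    /* Broadcast: one-to-all */
    MPI_Bcast(&seed, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: all-to-one, summing each rank's contribution */
    int local = seed + rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```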

Synchronization

  • Types of Synchronization
    • Barrier
      • Each task performs its work until it reaches the barrier, then blocks; when the last task reaches the barrier, all tasks are synchronized.
    • Lock/Semaphore
      • The first task to acquire the lock "sets" it
      • This task can then safely (serially) access the protected data or code.
      • Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
    • Synchronous communication operations
      • For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send
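
A minimal OpenMP sketch of two of the constructs above: a lock-style critical section protecting shared data, and a barrier that holds every thread until the last one arrives:

```c
/* Minimal sketch of synchronization with OpenMP: a critical section
 * (lock-like) and an explicit barrier. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int counter = 0;                /* shared data, needs protection */

    #pragma omp parallel
    {
        /* Critical section: only one thread at a time updates counter. */
        #pragma omp critical
        counter++;

        /* Barrier: no thread continues until all threads reach here. */
        #pragma omp barrier

        /* One thread reports after everyone has passed the barrier. */
        #pragma omp single
        printf("all %d threads passed the barrier, counter = %d\n",
               omp_get_num_threads(), counter);
    }
    return 0;
}
```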

“Best Practices” for I/O

Rule #1: Reduce overall I/O as much as possible.

  • I/O operations are generally regarded as inhibitors to parallelism.
  • I/O operations require orders of magnitude more time than memory operations.