GPU Architecture Explained: From Silicon to Screen - A Visual Guide

Close-up of a modern GPU with cooling fans on a desk, illuminated by colorful ambient lighting and a blurred monitor background.

Understanding graphics processing unit (GPU) architecture starts with a remarkable disparity: while a high-end server might contain 24 to 48 CPU cores, adding just 4 to 8 GPUs to that same system can introduce up to 40,000 additional cores. This massive parallel processing capability is what makes GPUs so powerful for specific tasks.

Modern GPUs contain hundreds to thousands of specialized cores designed to excel at performing repetitive calculations simultaneously. For example, the NVIDIA GeForce GTX 480's Fermi architecture featured 480 stream processors (also called CUDA cores), delivering exceptional performance for graphics rendering and computational tasks. At a typical 1920×1080 resolution, your screen displays more than 2 million pixels, each requiring precise color calculations that GPUs handle efficiently.

The value of GPU technology extends beyond gaming. NVIDIA's market valuation has soared past US$2 trillion as demand for its products continues to surge, primarily because these specialized processors have become essential for artificial intelligence applications. Throughout this article, we will examine GPU design from silicon to screen, breaking down how these complex devices transform mathematical data into visual output.

The Graphics Pipeline: From Vertex to Pixel

Diagram illustrating the real-time graphics pipeline stages from application to display including geometry and rasterization processes.

Image Source: Gamers Nexus

The graphics pipeline forms the core of GPU architecture, transforming 3D model data into pixels on your screen. Unlike CPUs, which process instructions sequentially, GPUs are designed specifically for this pipeline's parallel operations.

Vertex Processing and Screen Space Transformation

At the beginning of the pipeline, the vertex shader processes individual vertices from input data. Each vertex undergoes transformations from object space to clip space, along with operations like skinning and morphing. Additionally, this stage attaches necessary vertex attributes and lighting calculations. Although highly parallel, this stage must process each vertex separately before the pipeline can continue.
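The transform described above can be sketched in plain Python. This is only an illustration of the 4x4 matrix math a vertex shader applies to each vertex; the projection matrix values here are illustrative, not taken from any real camera setup.

```python
# Minimal sketch of a vertex transform: object space -> clip space -> NDC.
# A real vertex shader runs in parallel on the GPU; this only shows the math.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (row-major nested lists) by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# A toy projection matrix (hypothetical values, for illustration only).
projection = [
    [0.5, 0.0,  0.0,  0.0],
    [0.0, 0.5,  0.0,  0.0],
    [0.0, 0.0, -0.1, -1.0],
    [0.0, 0.0,  0.0,  1.0],
]

vertex = [2.0, 4.0, -3.0, 1.0]        # homogeneous object-space position
clip = mat_vec(projection, vertex)     # clip-space position
ndc = [c / clip[3] for c in clip[:3]]  # perspective divide -> normalized device coords
print(ndc)
```

In real hardware this multiply runs once per vertex across many vertices at the same time, which is why the stage parallelizes so well.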


Primitive Assembly and Clipping

Following vertex processing, individual vertices are assembled into primitives—typically triangles, lines, or points. The primitives are then tested against the view volume. Those completely outside are discarded, whereas partially visible primitives undergo clipping to create new vertices at the boundaries. Modern GPUs implement techniques like guard-band clipping to optimize this process and minimize geometric cracks between adjacent primitives.
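The visibility test above can be sketched as a simple classifier against the standard clip-space volume (-w ≤ x, y, z ≤ w). This is a deliberate simplification: real clippers test per plane and generate new vertices for partially visible primitives, which this sketch only labels.

```python
# Hedged sketch: classify a triangle against the clip volume.
# Real hardware trivially rejects only when all vertices fall outside the
# SAME plane, and actually computes new vertices when clipping is needed.

def inside(v):
    x, y, z, w = v
    return all(-w <= c <= w for c in (x, y, z))

def classify(triangle):
    flags = [inside(v) for v in triangle]
    if all(flags):
        return "accept"   # fully visible, no clipping needed
    if not any(flags):
        return "reject"   # treated as fully outside (simplified test)
    return "clip"         # straddles the boundary: new vertices required

tri_visible = [(0, 0, 0, 1), (0.5, 0, 0, 1), (0, 0.5, 0, 1)]
tri_outside = [(5, 0, 0, 1), (6, 0, 0, 1), (5, 1, 0, 1)]
print(classify(tri_visible), classify(tri_outside))
```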

Rasterization into Pixel Fragments

Next, the rasterizer converts the geometric primitives into pixel fragments. This stage determines which pixels are covered by each primitive, essentially turning vector information into a raster image. Each fragment corresponds to a pixel in the framebuffer, containing data needed for coloring. For complex scenes, this stage generates millions of fragments per frame.
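Coverage testing can be sketched with edge functions, the same signed-area test real rasterizers build on. This naive version visits every pixel; actual GPUs test tiles hierarchically and in parallel, and also apply fill rules (such as the top-left rule) that this sketch omits.

```python
# Minimal rasterizer sketch: an edge-function coverage test per pixel center.

def edge(a, b, p):
    """Signed area test: positive if p lies to the left of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri, width, height):
    a, b, c = tri  # counter-clockwise vertex order assumed
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                fragments.append((x, y))
    return fragments

frags = rasterize([(0, 0), (8, 0), (0, 8)], 8, 8)
print(len(frags), "fragments covered")  # 36 fragments covered
```

Each covered pixel becomes a fragment carrying the data the next stage needs, which is how millions of fragments per frame arise from a modest number of triangles.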

Fragment Shading and Color Computation

The fragment shader receives these fragments and calculates their final color values. This shader interpolates vertex attributes across the primitive and applies textures, lighting effects, and other visual calculations. On high-end GPUs, hundreds of fragment shaders operate simultaneously, processing multiple fragments in parallel to maintain high frame rates.
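The attribute interpolation described above is typically done with barycentric weights. The sketch below blends a per-vertex color across a triangle; the vertex positions and colors are illustrative, and this is plain Python rather than a real shading language.

```python
# Sketch of fragment-stage attribute interpolation via barycentric weights.

def barycentric(tri, p):
    (ax, ay), (bx, by), (cx, cy) = tri
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w_a = ((by - cy) * (p[0] - cx) + (cx - bx) * (p[1] - cy)) / denom
    w_b = ((cy - ay) * (p[0] - cx) + (ax - cx) * (p[1] - cy)) / denom
    return w_a, w_b, 1.0 - w_a - w_b

def shade(tri, colors, p):
    """Interpolate a per-vertex RGB color at fragment position p."""
    wa, wb, wc = barycentric(tri, p)
    return tuple(wa * ca + wb * cb + wc * cc
                 for ca, cb, cc in zip(*colors))

tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]  # red, green, blue
print(shade(tri, colors, (1.0, 1.0)))  # a point inside the triangle
```

On hardware, this same computation runs for many fragments at once, one per shader lane, which is why fragment shading scales with core count.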

Pixel Operations and Framebuffer Output

In the final stage, the render output unit (ROP) performs depth testing, stencil operations, and alpha blending. This component determines which fragments are actually visible when primitives overlap and blends transparent objects appropriately. The results are then written to the framebuffer, which stores the complete image before it's sent to your display.


Inside GPU Architecture: Core Components Explained

Peering beneath the surface of a GPU reveals an intricate system of computational units working in unison. Understanding these core components unlocks insights into how a GPU works and why it excels at parallel processing tasks.

SIMD Execution Units and ALU Arrays

At the heart of modern GPUs lie Arithmetic Logic Units (ALUs) organized in a Single Instruction Multiple Data (SIMD) architecture. Unlike CPUs, these ALUs execute identical instructions across multiple data points simultaneously. The NVIDIA Fermi architecture, for example, featured 32 cores per Streaming Multiprocessor (SM), with each core containing floating-point and integer execution units [1]. This design enables GPUs to process hundreds or thousands of operations in parallel, making them ideal for graphics and AI workloads.
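The SIMD idea can be sketched in plain Python, with each list position standing in for one ALU lane; the fused multiply-add below is just an illustrative instruction, and real hardware issues it to all lanes in a single cycle rather than looping.

```python
# Sketch of SIMD execution: ONE instruction (a fused multiply-add) applied
# across a whole array of data elements, one per ALU lane.

def simd_fma(a, b, c):
    """One 'instruction' operating on every lane: a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]

# 8 lanes of data, all processed by the same instruction in lockstep.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2] * 8
c = [10] * 8
print(simd_fma(a, b, c))  # [12, 14, 16, 18, 20, 22, 24, 26]
```

Because every lane runs the same instruction, a single decoder can drive all of them, which is the control-logic saving the next sections return to.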

Shader Core Context Switching for Latency Hiding

GPUs employ sophisticated context switching to maximize throughput. Notably, while a full GPU context switch costs 25-50μs, launching individual warps takes less than 10 cycles [2]. This rapid switching between execution contexts effectively hides memory access latency. Instead of waiting for data, the GPU simply switches to another available warp, keeping computational units busy. This technique particularly benefits machine learning applications where data dependencies could otherwise create bottlenecks.


Instruction Stream Sharing Across Fragments

To optimize efficiency, GPUs share instruction streams across multiple threads within warps (NVIDIA terminology) or wavefronts (AMD terminology). Each warp typically consists of 32 threads executing identical instructions in lockstep [1]. This approach reduces control logic overhead since multiple execution units can share a single instruction decoder. The tradeoff is that divergent code paths within a warp force the GPU to execute both branches sequentially, reducing performance.

Memory Hierarchy: Shared, Global, and Texture Memory

GPU memory architecture includes several specialized types:

  • Register Files: Fastest memory, private to each thread
  • Shared Memory: On-chip memory (192KB per SM in A100) shared within thread blocks [3]
  • Global Memory: Largest memory pool accessible by all threads (HBM2 with 1555 GB/sec bandwidth in A100) [3]
  • Texture Memory: Read-only memory optimized for spatial locality and filtering operations [4]

This hierarchy balances performance with capacity, enabling developers to optimize memory access patterns according to their application's needs.
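A toy cost model illustrates why this hierarchy matters for optimization. The cycle counts below are illustrative assumptions (roughly matching the latency ranges discussed later in this article), not vendor specifications; the point is the payoff from staging reusable data in fast on-chip memory.

```python
# Toy memory-hierarchy cost model (cycle counts are assumed, not measured).

LATENCY = {"register": 1, "shared": 30, "global": 500}

def cost(accesses):
    """accesses: list of (memory_kind, count) pairs."""
    return sum(LATENCY[kind] * n for kind, n in accesses)

# A data tile reused 16 times, re-read from global memory every time:
naive = cost([("global", 16)])
# The same tile loaded into shared memory once, then reused 16 times:
tiled = cost([("global", 1), ("shared", 16)])
print(naive, tiled)  # 8000 vs 980 cycles in this toy model
```

This is the basic argument behind tiling strategies in GPU kernels: pay the global-memory latency once, then serve repeated accesses from shared memory.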


Parallelism in Action: How GPUs Achieve High Throughput

Diagram comparing CPU architecture with centralized control and cache to GPU architecture with multiple control-cache units and shared cache.

Image Source: Khushi Agrawal

The secret of GPU performance lies in its sophisticated parallelism mechanisms. Modern GPUs achieve extraordinary throughput not through higher clock speeds but through massive thread parallelism and clever latency-management techniques.

Thread-Level Parallelism in Fragment Processing

Unlike CPUs, GPUs employ a Single Instruction Multiple Thread (SIMT) execution model where hundreds of threads process different data elements using identical instructions. This model groups threads into units that execute concurrently—NVIDIA calls these "warps" while AMD uses "wavefronts" [5]. Indeed, this design allows GPUs to process thousands of fragments simultaneously, making them ideal for graphics rendering where each pixel can be calculated independently.

Warp Scheduling and Execution Contexts

The warp scheduler determines which thread groups execute on available processing units. Moreover, GPU warp scheduling occurs with virtually zero overhead—switching between warps happens in approximately one nanosecond, compared to microseconds on CPUs [6]. This efficiency comes from the fact that each thread maintains its private registers in the SM's register file, eliminating the need to save or restore contexts during switches [6].

Latency Hiding via Interleaved Execution


GPUs excel at hiding two critical latency types: arithmetic latency (6-24 cycles) and memory latency (400-800 cycles) [7]. They accomplish this through multithreading—executing many program instances simultaneously. Whenever a warp stalls on a memory request, the scheduler immediately switches to another ready warp, keeping computational units busy [8]. This technique requires sufficient warp occupancy; interestingly, some applications need more warps to hide intermediate arithmetic intensity than very low or very high intensity—a phenomenon called "cusp behavior" [7].
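The effect of adding warps can be seen in a toy scheduler simulation. Each simulated warp issues a few cycles of compute, then stalls on a memory access; the scheduler switches to any ready warp. The cycle counts are illustrative assumptions in the ranges quoted above, and real warp scheduling is far more sophisticated than this round-robin sketch.

```python
# Toy simulation of latency hiding: utilization vs. number of resident warps.
# compute/mem_latency values are assumptions for illustration only.

def utilization(n_warps, compute=4, mem_latency=400, total=10_000):
    ready_at = [0] * n_warps   # cycle at which each warp can issue again
    busy = 0
    cycle = 0
    while cycle < total:
        runnable = [w for w in range(n_warps) if ready_at[w] <= cycle]
        if runnable:
            w = runnable[0]
            busy += compute                    # issue this warp's work
            cycle += compute
            ready_at[w] = cycle + mem_latency  # then it stalls on memory
        else:
            cycle = min(ready_at)              # everyone stalled: units idle
    return busy / cycle

for n in (1, 8, 32, 128):
    print(n, round(utilization(n), 2))
```

With one warp, the units sit idle for nearly the whole memory stall; with enough warps, there is always ready work and utilization approaches 100%, which is exactly the latency-hiding behavior described above.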

Branch Divergence and Coherence Challenges

Branch divergence occurs when threads within a warp take different code paths. When this happens, the GPU must execute both paths sequentially, significantly reducing efficiency [9]. For instance, in a 32-thread warp, if just one thread takes a different branch, execution time may increase by up to 32 times [10]. This challenge particularly affects raytracing applications, which typically have high thread divergence [11]. Hence, developers must structure algorithms to minimize divergent branches wherever possible.
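The divergence penalty can be modeled with a simple cost function: a warp pays for every path that any of its lanes takes, with inactive lanes masked off. The per-path cycle costs below are illustrative assumptions.

```python
# Sketch of branch divergence cost: a warp executes BOTH sides of a branch
# if its lanes disagree, masking off inactive lanes on each side.

def warp_branch_cost(conditions, then_cost=10, else_cost=10):
    """Cost for one warp: pay for each path that any lane takes."""
    cost = 0
    if any(conditions):        # at least one lane takes the 'then' path
        cost += then_cost
    if not all(conditions):    # at least one lane takes the 'else' path
        cost += else_cost
    return cost

uniform = [True] * 32              # all 32 lanes agree: one path executed
divergent = [True] * 31 + [False]  # one lane disagrees: both paths executed
print(warp_branch_cost(uniform), warp_branch_cost(divergent))  # 10 20
```

A single dissenting lane doubles the branch cost for the whole warp in this model, which is why divergence-heavy workloads such as raytracing are hard on SIMT hardware.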

Case Study: NVIDIA Fermi Architecture (GTX 480)

Block diagram of Nvidia GF100 GPU architecture highlighting 512 CUDA cores, 16 geometry units, and 384-bit GDDR5 memory interface.

Image Source: Bjorn3D.com

To see these architectural principles in action, consider NVIDIA's Fermi, which represents a watershed moment in GPU design history. Released in 2010 with the GeForce GTX 480, this architecture marked NVIDIA's most significant leap forward since the original G80 chip [12].

SIMT Execution Model in CUDA Cores

The Fermi GF100 implementation featured 480 CUDA cores (stream processors) organized into 15 Streaming Multiprocessors (SMs) with 32 cores per SM [13]. At the heart of this design lies the Single Instruction Multiple Thread (SIMT) execution model, wherein each SM executes threads in groups of 32 called warps [12]. This arrangement allows identical instructions to run concurrently across different data points. Notably, Fermi executes one instruction per clock per core [14], prioritizing single-threaded program performance while maintaining high throughput for parallel tasks.

Warp Scheduling and Context Storage (128KB)

Each SM featured dual warp schedulers and instruction dispatch units, enabling two warps to be issued and executed simultaneously [12]. This innovation allowed near peak hardware performance without checking for dependencies within instruction streams [12]. The architecture supported up to 48 warps per SM (1,536 threads total) [15], stored in 128KB of execution context memory [16]. This increased the number of registers per thread to 21 compared to 16 in the previous generation [15], substantially improving performance for register-intensive applications.

Shared Scratchpad Memory and Instruction Decoding

One of Fermi's key innovations was its configurable 64KB on-chip memory per SM that could be partitioned as either 48KB shared memory with 16KB L1 cache or vice versa [12]. This flexibility tripled the shared memory available to existing applications [12], while providing cache benefits for irregular access patterns. The entire GPU featured a unified 768KB L2 cache servicing all load, store, and texture requests [12]. Instruction decoding occurred in the dual front-end units, which could select and issue half of a warp every clock cycle [17].

Performance Metrics: GFLOPs and Core Count

The GTX 480 delivered impressive raw compute capability: its 480 CUDA cores produced roughly 1,345 GFLOPs of single-precision throughput. Throughout testing, Fermi demonstrated up to 4.2x faster performance than its GT200 predecessor in double-precision applications [12], especially for computational workloads requiring IEEE 754-2008 floating-point precision [12].
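That single-precision figure can be checked with simple arithmetic: each CUDA core retires one fused multiply-add (two floating-point operations) per shader clock. The 1.401 GHz shader clock below is the GTX 480's published hot clock, stated here as background rather than drawn from this article.

```python
# Worked check of the GTX 480's peak single-precision throughput.

cuda_cores = 480
shader_clock_ghz = 1.401       # GTX 480 shader ("hot") clock
flops_per_core_per_cycle = 2   # one fused multiply-add = 2 FLOPs

peak_gflops = cuda_cores * shader_clock_ghz * flops_per_core_per_cycle
print(round(peak_gflops))  # ~1345 GFLOPs single precision
```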


Conclusion

GPU architecture represents a marvel of modern computing, fundamentally different from traditional CPU design. Throughout this visual journey, we explored how these specialized processors transform mathematical data into vibrant screen images through massive parallelism. The graphics pipeline stands as the backbone of this process, systematically converting 3D model data into pixels through stages like vertex processing, rasterization, and fragment shading.

The core components we examined—SIMD execution units, context switching mechanisms, and specialized memory hierarchies—work together to achieve remarkable throughput. GPUs excel specifically because they handle thousands of simple calculations simultaneously rather than tackling complex sequential tasks. This parallel approach explains why a system with just 4-8 GPUs can effectively add 40,000 processing cores.

Thread-level parallelism emerged as a critical concept throughout our analysis. GPUs manage execution contexts with virtually zero overhead, switching between warps in approximately one nanosecond compared to microseconds on CPUs. This efficiency, coupled with clever latency hiding techniques, enables these processors to maintain computational throughput even when waiting for memory operations.

Our case study of NVIDIA's Fermi architecture highlighted these principles in action. With 480 CUDA cores organized into 15 Streaming Multiprocessors, the GTX 480 delivered 1,345 GFLOPs of single-precision performance—a remarkable achievement for 2010 technology. The configurable on-chip memory and dual warp schedulers demonstrated how hardware innovations addressed specific parallel computing challenges.

Looking beyond gaming applications, GPUs have become essential for artificial intelligence, scientific computing, and data analysis. Their ability to process vast amounts of data simultaneously makes them ideal for these computationally intensive tasks. Undoubtedly, as workloads continue to demand parallel processing capabilities, GPU architecture will remain a cornerstone of modern computing, transforming silicon into stunning visuals and powerful computation for years to come.


FAQs

Q1. What are the key components of GPU architecture? GPU architecture consists of several key components, including SIMD (Single Instruction Multiple Data) execution units, ALU (Arithmetic Logic Unit) arrays, shader cores, and a specialized memory hierarchy. These components work together to enable massive parallel processing of data, making GPUs ideal for graphics rendering and computational tasks.

Q2. How does GPU parallelism differ from CPU processing? While a high-end CPU might have 24 to 48 cores, a GPU can contain thousands of specialized cores designed for parallel processing. GPUs use a SIMT (Single Instruction Multiple Thread) execution model, allowing them to process multiple data elements simultaneously using identical instructions, which is particularly effective for graphics and AI workloads.

Q3. What is the graphics pipeline in GPU processing? The graphics pipeline is the core process in GPU architecture that transforms 3D model data into 2D pixels on your screen. It consists of several stages, including vertex processing, primitive assembly, rasterization, fragment shading, and pixel operations. This pipeline is designed for parallel operations, allowing GPUs to process millions of fragments per frame efficiently.

Q4. How do GPUs manage memory access and latency? GPUs employ sophisticated techniques to manage memory access and hide latency. They use a hierarchy of memory types, including register files, shared memory, and global memory. GPUs also utilize rapid context switching between warps (groups of threads) to hide memory latency. When one warp is waiting for data, the GPU can switch to another available warp, keeping the computational units busy.

Q5. What challenges do GPUs face in processing? One significant challenge in GPU processing is branch divergence. This occurs when threads within a warp take different code paths, forcing the GPU to execute both paths sequentially and potentially increasing execution time significantly. This is particularly problematic in applications like ray tracing. Developers must structure their algorithms to minimize divergent branches for optimal GPU performance.

References

[1] - https://www.anandtech.com/show/7793/imaginations-powervr-rogue-architecture-exposed/2
[2] - https://forums.developer.nvidia.com/t/cuda-context-switching-overhead-of-current-gpu/65918
[3] - https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
[4] - https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticstexture.htm
[5] - https://developer.nvidia.com/gpugems/gpugems2/part-iv-general-purpose-computation-gpus-primer/chapter-33-implementing-efficient
[6] - https://modal.com/gpu-glossary/device-hardware/warp-scheduler
[7] - https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
[8] - https://forums.developer.nvidia.com/t/how-to-understand-the-hide-latency/258938
[9] - https://www.sciencedirect.com/topics/computer-science/branch-divergence
[10] - https://www.ece.lsu.edu/koppel/gp/2020/lsli06-br-diverg.pdf
[11] - https://research.nvidia.com/publication/2022-01_gpu-subwarp-interleaving
[12] - https://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
[13] - https://www.techpowerup.com/gpu-specs/geforce-gtx-480.c268
[14] - https://www.anandtech.com/show/2849/3
[15] - https://www.tomshardware.com/reviews/geforce-gtx-480,2585-18.html
[16] - http://www.cs.cmu.edu/afs/cs/academic/class/15462-f12/www/lec_slides/462_gpus.pdf
[17] - https://www.anandtech.com/show/2977/nvidia-s-geforce-gtx-480-and-gtx-470-6-months-late-was-it-worth-the-wait-/3
