OpenCL API Extensions to achieve Multi-level Parallelism for Efficient Implementation of Strassen's Matrix Multiplication on GPUs
Abstract
This thesis addresses a fundamental limitation of GPU programming models: the inability to express and exploit multiple levels of parallelism simultaneously. We propose the K-G-I (Kernels-Groups-Items) classification model for categorizing parallel execution patterns in GPU applications and introduce API extensions for OpenCL to enable hybrid parallelism on NVIDIA's Kepler architecture with its Hyper-Q technology.
Using Strassen's Matrix Multiplication as a case study, we demonstrate that algorithms exhibiting natural multi-level parallelism can achieve significant speedups when all parallelism levels are exploited concurrently. Our proposed clEnqueueNDRangeHyperKernel API automates the distribution of independent kernels across Hyper-Q's 32 hardware queues while respecting data and control dependencies.
Experimental results on Tesla K20 (Kepler) show 1.4× speedup using concurrent kernel execution at recursion depth 1, with optimal performance at recursion depth 2 where 49 independent kernels execute concurrently.
1. Introduction
The transformation of graphics processing units (GPUs) from fixed-function rendering pipelines to fully programmable parallel processors represents one of the most significant shifts in computing architecture of the 21st century. When NVIDIA released CUDA 1.0 in February 2007, it initiated a revolution that would fundamentally reshape domains from scientific simulation to artificial intelligence.
However, GPU programming models have historically focused on single-level parallelism: expressing data parallelism at the work-item level with limited support for task-level parallelism across independent kernels. This limitation becomes problematic for algorithms that exhibit natural hierarchical parallelism.
1.1 The Problem
In 2013, the GPU programming landscape was characterized by:
- Single-level parallelism focus: Both CUDA and OpenCL primarily expressed data parallelism at the work-item level
- False dependencies: Pre-Kepler architectures used a single hardware queue, causing independent kernels to serialize unnecessarily
- Manual optimization burden: Exploiting concurrent kernel execution required intricate manual scheduling and deep hardware knowledge
- Portability vs. performance trade-off: OpenCL consistently underperformed CUDA by 5–15% on NVIDIA hardware
1.2 NDRange Limitations
The OpenCL NDRange execution model, while powerful for regular data parallelism, imposes significant constraints that limit exploitation of complex parallel patterns:
- Homogeneous work-items: All work-items in an NDRange execute the same kernel code, making heterogeneous computations difficult to express
- Uniform work-group sizes: All work-groups must have identical dimensions, preventing dynamic load balancing
- Single kernel granularity: One NDRange corresponds to one kernel invocation; independent computations require separate enqueues
- Limited inter-kernel parallelism: While command queues support out-of-order execution, hardware limitations (single queue pre-Kepler) caused serialization
1.3 Our Contribution
This thesis makes three main contributions:
- K-G-I Classification Model: A systematic framework for categorizing parallel execution patterns at three levels: Kernels (K), work-Groups (G), and work-Items (I)
- API Extension Proposal: The clEnqueueNDRangeHyperKernel function that automates multi-stream execution on Hyper-Q hardware
- Strassen Case Study: Demonstration of multi-level parallelism benefits using Strassen's Matrix Multiplication algorithm
3. GPU Architecture Background
3.1 Hardware Architecture
Modern GPUs consist of an array of Streaming Multiprocessors (SMXs), each containing multiple Streaming Processors (SPs) or CUDA cores. On Kepler K20, the architecture comprises 13 SMXs, each with 192 SPs, yielding 2496 total CUDA cores.
- Streaming Multiprocessor (SMX): Independent execution unit with its own register file, shared memory, L1 cache, and warp schedulers
- Streaming Processor (SP): Individual ALU/FPU unit executing one thread's instruction per cycle
- Warp: Group of 32 threads executing in SIMT fashion—all threads in a warp execute the same instruction
- Global Scheduler: Hardware unit that dispatches thread blocks to SMXs based on resource availability
3.2 CUDA Programming Model
NVIDIA's CUDA introduced foundational concepts that remain relevant today:
- Threads: Individual execution units, each with unique thread ID
- Thread blocks: Groups of threads that can synchronize via shared memory
- Grids: Collections of thread blocks executing the same kernel
- SIMT execution: Single Instruction, Multiple Thread model where threads in a warp (32 threads) execute the same instruction
3.3 OpenCL Correspondence
OpenCL provides equivalent abstractions with different terminology:
| CUDA Term | OpenCL Term | Description |
|---|---|---|
| Thread | Work-item | Single execution unit |
| Thread block | Work-group | Group with shared local memory |
| Grid | NDRange | N-dimensional index space |
| Warp | Wavefront/Subgroup | SIMD execution unit |
| Shared memory | Local memory | Per-block fast memory |
| Stream | Command queue | Asynchronous command submission |
3.4 Kepler Architecture and Hyper-Q
The Kepler architecture (2012) introduced Hyper-Q, which directly addressed the concurrent kernel execution limitations that motivated this research. Prior to Kepler, NVIDIA GPUs used a single hardware work queue. Even with multiple CUDA streams, kernels from different streams could serialize due to false dependencies at the hardware queue level.
Hyper-Q increased the number of hardware work queues from 1 to 32, enabling:
- True concurrent execution of independent kernels from different streams
- MPI applications with multiple ranks per GPU without serialization
- Reduced latency for small kernel launches through queue parallelism
4. Strassen's Matrix Multiplication Algorithm
Strassen's algorithm reduces the complexity of matrix multiplication from O(n³) to O(n^log₂7) ≈ O(n^2.807) by computing seven submatrix products instead of eight at each recursive level, at the cost of additional matrix additions.
4.1 Matrix Decomposition
For n×n matrices A and B, we partition each into four (n/2)×(n/2) submatrices:
A = ┌ A₁₁ A₁₂ ┐   B = ┌ B₁₁ B₁₂ ┐   C = ┌ C₁₁ C₁₂ ┐
    └ A₂₁ A₂₂ ┘       └ B₂₁ B₂₂ ┘       └ C₂₁ C₂₂ ┘

4.2 Intermediate Matrices
We first compute auxiliary matrices from A and B:
From matrix A:
- A₁ = A₁₁ + A₂₂
- A₂ = A₂₁ + A₂₂
- A₃ = A₁₁
- A₄ = A₂₂
- A₅ = A₁₁ + A₁₂
- A₆ = A₂₁ − A₁₁
- A₇ = A₁₂ − A₂₂
From matrix B:
- B₁ = B₁₁ + B₂₂
- B₂ = B₁₁
- B₃ = B₁₂ − B₂₂
- B₄ = B₂₁ − B₁₁
- B₅ = B₂₂
- B₆ = B₁₁ + B₁₂
- B₇ = B₂₁ + B₂₂
4.3 Seven Independent Products
The core insight: compute only 7 matrix products instead of 8:
- P₁ = A₁ × B₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂)
- P₂ = A₂ × B₂ = (A₂₁ + A₂₂)B₁₁
- P₃ = A₃ × B₃ = A₁₁(B₁₂ − B₂₂)
- P₄ = A₄ × B₄ = A₂₂(B₂₁ − B₁₁)
- P₅ = A₅ × B₅ = (A₁₁ + A₁₂)B₂₂
- P₆ = A₆ × B₆ = (A₂₁ − A₁₁)(B₁₁ + B₁₂)
- P₇ = A₇ × B₇ = (A₁₂ − A₂₂)(B₂₁ + B₂₂)
Key observation: P₁ through P₇ are completely independent—they share no data dependencies and can execute in parallel.
4.4 Result Recomposition
The final result C is assembled from the seven products:
- C₁₁ = P₁ + P₄ − P₅ + P₇
- C₁₂ = P₃ + P₅
- C₂₁ = P₂ + P₄
- C₂₂ = P₁ − P₂ + P₃ + P₆
4.5 Parallelism Opportunities
Strassen's algorithm exhibits natural multi-level parallelism:
- K-level (Kernels): The 7 intermediate products P₁–P₇ are completely independent at each recursion level
- G-level (Groups): Each sub-matrix multiplication can be tiled, with tiles processed by independent work-groups using shared memory
- I-level (Items): Element-wise additions and the innermost multiplication loop can be parallelized across work-items
4.6 Recursive Structure
At recursion depth d, the algorithm generates:
- Depth 0: 1 multiplication (base case)
- Depth 1: 7 independent multiplications
- Depth 2: 49 independent multiplications (7 × 7)
- Depth d: 7^d independent multiplications
The key insight is that with 32 Hyper-Q hardware queues, depth 2 (49 kernels) saturates the queues: the kernels execute in two waves of at most 32, keeping all queues occupied for near-optimal concurrent execution.
5. The K-G-I Classification Model
We propose a systematic framework for categorizing parallel execution patterns in GPU applications. The model captures three orthogonal dimensions of parallelism.
5.1 Parallelism Levels
- K-level (Kernels): Task-level parallelism across independent kernel invocations. Synchronization via inter-kernel barriers and events.
- G-level (Groups): Coarse-grained data parallelism across work-groups. Synchronization via work-group barriers with shared local memory access.
- I-level (Items): Fine-grained data parallelism across work-items within a work-group. Synchronization via memory fences and atomic operations.
5.2 Classification Modes
We classify parallel modes using the notation [K]-[G]-[I], where the K position indicates a Single (S) or Multiple (M) kernel launch, and the G and I positions indicate hoMogeneous (M) or heTerogeneous (T) behavior:
| Mode | K (Kernels) | G (Groups) | I (Items) | Description |
|---|---|---|---|---|
| S-M-T | Single | hoMogeneous | heTerogeneous | Classic data parallel: all groups same, items differ (e.g., element-wise ops) |
| S-T-M | Single | heTerogeneous | hoMogeneous | Group-specialized: different groups do different work (e.g., reduction phases) |
| S-M-M | Single | hoMogeneous | hoMogeneous | Embarrassingly parallel: all units do same work (e.g., Monte Carlo) |
| M-M-M | Multiple | hoMogeneous | hoMogeneous | Multi-kernel parallel: concurrent independent kernels (e.g., Strassen) |
5.3 Kernel Implementations
Each K-G-I mode requires different kernel structures. We present representative OpenCL kernel code for each mode.
Algorithm 1: S-M-T Mode Kernel (Classic Element-wise)
__kernel void smt_kernel(
    __global float* A,
    __global float* B,
    __global float* C,
    int N)
{
    int gid = get_global_id(0);
    // All work-groups execute the same code;
    // each work-item operates on a different data element
    if (gid < N) {
        C[gid] = A[gid] + B[gid];
    }
}

Algorithm 2: S-T-M Mode Kernel (Group-Heterogeneous)
__kernel void stm_kernel(
    __global float* data,
    __global float* result,
    int N)
{
    int group_id = get_group_id(0);
    int lid = get_local_id(0);
    int op = group_id % 4;
    __local float scratch[256];
    // All groups load the same way...
    scratch[lid] = data[group_id * 256 + lid];
    barrier(CLK_LOCAL_MEM_FENCE);
    // ...but perform a different reduction depending on group_id
    for (int s = 128; s > 0; s >>= 1) {
        if (lid < s) {
            switch (op) {
            case 0: scratch[lid] += scratch[lid + s]; break;                    // sum
            case 1: scratch[lid] = fmax(scratch[lid], scratch[lid + s]); break; // max
            case 2: scratch[lid] = fmin(scratch[lid], scratch[lid + s]); break; // min
            case 3: scratch[lid] *= scratch[lid + s]; break;                    // product
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) {
        result[group_id] = scratch[0];
    }
}

Algorithm 3: S-M-M Mode Kernel (Embarrassingly Parallel)
__kernel void smm_kernel(
    __global float* output,
    unsigned int seed,
    int N)
{
    int gid = get_global_id(0);
    // All work-items execute identical code
    // (e.g., a Monte Carlo simulation step)
    unsigned int state = seed + gid;
    float sum = 0.0f;
    for (int i = 0; i < 1000; i++) {
        // LCG PRNG iteration
        state = state * 1103515245u + 12345u;
        float r = (float)(state & 0x7FFFFFFF) / (float)0x7FFFFFFF;
        sum += r * r;
    }
    output[gid] = sum / 1000.0f;
}

5.4 Strassen in K-G-I Terms
Strassen's algorithm maps to the K-G-I model as follows:
- Recursion Level 0 (root): Single kernel launch (S-*-*)
- Recursion Level 1: 7 independent kernels → M-M-M mode
- Recursion Level 2: 49 independent kernels → M-M-M mode
- Base case (tiled GEMM): S-M-T mode with tiled shared memory access
6. Theoretical Cost Model
We develop a cost model to predict execution time under different parallelism configurations.
6.1 Architecture Parameters
Our analysis targets the NVIDIA Kepler K20 architecture, compared against the previous-generation Fermi:
| Parameter | Fermi (Tesla C2050) | Kepler (Tesla K20) |
|---|---|---|
| Global Memory | 3 GB GDDR5 | 5 GB GDDR5 |
| Local Memory (per SM/SMX) | 48 KB (shared) + 16 KB L1 | 48 KB (shared) + 16 KB L1 |
| Clock Frequency | 1.15 GHz | 706 MHz (base) |
| Warp Size | 32 | 32 |
| Max Work-items per SM/SMX | 1536 | 2048 |
| Number of SMXs | 14 | 13 |
| Streaming Processors (SPs) | 448 | 2496 |
| Hardware Queues | 1 | 32 (Hyper-Q) |
6.2 Model Parameters
- Tkernel(n): Execution time for a single kernel on n×n matrix
- Tlaunch: Kernel launch overhead (approximately 5–15 μs)
- Tsync: Stream synchronization overhead
- H: Number of hardware queues (32 for Kepler Hyper-Q, 1 for Fermi)
- d: Strassen recursion depth
- NSMX: Number of streaming multiprocessors
6.3 Pilot Program Methodology
To calibrate our cost model, we use pilot programs that measure fundamental timing parameters:
Homogeneous work-item pilot: Measures Tkernel(n) for simple element-wise operations where all work-items perform identical computation.
Heterogeneous work-item pilot: Measures overhead introduced when work-items follow different execution paths (branch divergence cost).
6.4 Execution Time Analysis
For Strassen at recursion depth d on an n×n matrix, with H concurrent hardware queues:
Sequential execution: Tseq = 7^d × Tkernel(n/2^d) + 7^d × Tlaunch
Parallel execution: Tpar = ⌈7^d/H⌉ × Tkernel(n/2^d) + (7^d × Tlaunch)/H
Speedup: S = Tseq/Tpar ≈ min(7^d, H) for compute-bound cases
The model predicts optimal recursion depth where:
- d = 1: 7 concurrent kernels, theoretical speedup up to 7×
- d = 2: 49 concurrent kernels, limited by H = 32 queues
- d ≥ 3: Launch overhead begins to dominate
7. Proposed API Extension
We propose an OpenCL API extension that automates the exploitation of Hyper-Q for applications with multi-kernel parallelism.
7.1 M-M-M Mode: Multi-Kernel Parallelism
The M-M-M (Multiple-hoMogeneous-hoMogeneous) mode represents true concurrent kernel execution, exploiting Kepler's Hyper-Q capability. This mode is essential for algorithms like Strassen where multiple independent kernels can execute simultaneously.
7.2 API Design
cl_int clEnqueueNDRangeHyperKernel(
cl_command_queue queue,
cl_uint num_kernels,
const cl_kernel* kernels,
const size_t** global_work_sizes,
const size_t** local_work_sizes,
const cl_kernel_dependency* dependencies,
cl_event* event
);
// Dependency specification
typedef struct {
cl_uint kernel_index; // Dependent kernel
cl_uint depends_on_index; // Dependency target
cl_dependency_type type; // DATA, CONTROL, or NONE
} cl_kernel_dependency;

7.3 Functionality
The proposed API:
- Accepts multiple kernels with explicit dependency specifications
- Analyzes the dependency DAG to identify independent kernel sets
- Automatically creates and manages CUDA streams (up to H = 32)
- Distributes independent kernels across available hardware queues
- Respects data and control dependencies through appropriate synchronization
- Returns a single event representing completion of all kernels
8. Experimental Results
We evaluated our approach on a Tesla K20 (Kepler GK110) with 2496 CUDA cores and 5 GB GDDR5 memory.
8.1 Performance Results
| Configuration | Matrix Size 2048 | Matrix Size 4096 | Matrix Size 8192 |
|---|---|---|---|
| Sequential (baseline) | 1.00× | 1.00× | 1.00× |
| 7 streams (depth 1) | 1.32× | 1.38× | 1.42× |
| 49 streams (depth 2) | 1.28× | 1.42× | 1.51× |
| Theoretical max | 7.00× | 7.00× | 7.00× |
8.2 Theoretical vs. Practical Results
Our cost model predictions compared against measured execution times:
| Configuration | Theoretical Speedup | Measured Speedup | Efficiency |
|---|---|---|---|
| 7 streams, 2048×2048 | 3.2× | 1.32× | 41% |
| 7 streams, 4096×4096 | 4.8× | 1.38× | 29% |
| 7 streams, 8192×8192 | 5.6× | 1.42× | 25% |
| 49 streams, 8192×8192 | 7.0× (limited by H=32) | 1.51× | 22% |
The gap between theoretical and practical speedup is attributed to:
- Memory bandwidth saturation: Concurrent kernels compete for limited global memory bandwidth
- SMX occupancy: Not all SMXs can be fully utilized when kernels have varying resource requirements
- Launch overhead accumulation: While Hyper-Q reduces serialization, launch costs still accumulate
8.3 Analysis
Key findings:
- Speedup increases with matrix size due to better compute/launch ratio
- Depth 2 outperforms depth 1 for large matrices despite exceeding 32 queues
- Diminishing returns beyond depth 2 due to kernel launch overhead
- OpenCL overhead vs. CUDA is 8–12% for this workload
9. Conclusion
This thesis has addressed the challenge of exploiting multi-level parallelism in GPU applications. Our contributions include:
- The K-G-I classification model provides a systematic framework for analyzing parallel execution patterns
- The proposed clEnqueueNDRangeHyperKernel API automates multi-stream execution on Hyper-Q hardware
- Experimental validation using Strassen's algorithm demonstrates 1.4× speedup through concurrent kernel execution
9.1 Future Work
Several directions warrant further investigation:
- Automatic parallelism extraction from sequential kernel code
- Dynamic load balancing across heterogeneous accelerators
- Integration with emerging GPU architectures beyond Kepler
- Application to other algorithms with natural multi-level parallelism
References
- Strassen, V. (1969). Gaussian elimination is not optimal. Numerische Mathematik, 13, 354–356.
- NVIDIA Corporation. (2012). NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Whitepaper.
- Khronos Group. (2012). OpenCL Specification, Version 1.2.
- Kirk, D. B., & Hwu, W. W. (2010). Programming Massively Parallel Processors. Morgan Kaufmann.
- Volkov, V., & Demmel, J. W. (2008). Benchmarking GPUs to tune dense linear algebra. SC '08: Proceedings of the ACM/IEEE Conference on Supercomputing.
- Nandy, S. K., et al. (2009). REDEFINE: Runtime Reconfigurable Polymorphic ASIC. ACM TRETS, 2(2).
- Buck, I., et al. (2004). Brook for GPUs: Stream Computing on Graphics Hardware. ACM SIGGRAPH.
- Owens, J. D., et al. (2007). A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1), 80–113.
- Cohen, H. (1976). On the Overhead of Strassen's Algorithm. SIAM Journal on Computing, 5(3), 403–413.
- West, E. A., & Grimshaw, A. S. (1995). Braid: Integrating Task and Data Parallelism. SIGPLAN Symposium on Principles and Practice of Parallel Programming.
- Bikshandi, G., et al. (2006). Programming for Parallelism and Locality with Hierarchically Tiled Arrays. PPoPP '06.
- Subhlok, J., & Vondran, G. (1997). Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines. SPAA '97.
- BenHassen, S., et al. (1996). The Orca Parallel Programming Model. IEEE TPDS, 7(10).
- Kambadur, P., et al. (2007). PFunc: Modern Task Parallelism for Modern High Performance Computing. SC '07.
- Yang, C. T., et al. (2011). Implementation of Parallel Linear Algebra Functions on Multi-GPU Clusters. IPDPS Workshops.
- Vejarano, G., & Gao, Y. (2011). GPU Kernel Optimization for Finite Element Computations. Parallel Computing, 37(8), 366–379.
- Elangovan, V. K., et al. (2013). Hybrid Task and Data Parallel Programming Using OmpSs and GPUs. Euro-Par 2013.