OpenCL API Extensions to achieve Multi-level Parallelism for Efficient Implementation of Strassen's Matrix Multiplication on GPUs
Abstract
This thesis addresses a fundamental limitation of GPU programming models: the inability to express and exploit multiple levels of parallelism simultaneously. We propose the K-G-I (Kernels-Groups-Items) classification model for categorizing parallel execution patterns in GPU applications and introduce API extensions for OpenCL to enable hybrid parallelism on NVIDIA's Kepler architecture with its Hyper-Q technology.
Using Strassen's Matrix Multiplication as a case study, we demonstrate that algorithms exhibiting natural multi-level parallelism can achieve significant speedups when all parallelism levels are exploited concurrently. Our proposed clEnqueueNDRangeHyperKernel API automates the distribution of independent kernels across Hyper-Q's 32 hardware queues while respecting data and control dependencies.
Experimental results on Tesla K20 (Kepler) show 1.4× speedup using concurrent kernel execution at recursion depth 1, with optimal performance at recursion depth 2 where 49 independent kernels execute concurrently.
1. Introduction
The transformation of graphics processing units (GPUs) from fixed-function rendering pipelines to fully programmable parallel processors represents one of the most significant shifts in computing architecture of the 21st century. When NVIDIA released CUDA 1.0 in February 2007, it initiated a revolution that would fundamentally reshape domains from scientific simulation to artificial intelligence.
However, GPU programming models have historically focused on single-level parallelism: expressing data parallelism at the work-item level with limited support for task-level parallelism across independent kernels. This limitation becomes problematic for algorithms that exhibit natural hierarchical parallelism.
1.1 The Problem
In 2013, the GPU programming landscape was characterized by:
- Single-level parallelism focus: Both CUDA and OpenCL primarily expressed data parallelism at the work-item level
- False dependencies: Pre-Kepler architectures used a single hardware queue, causing independent kernels to serialize unnecessarily
- Manual optimization burden: Exploiting concurrent kernel execution required intricate manual scheduling and deep hardware knowledge
- Portability vs. performance trade-off: OpenCL consistently underperformed CUDA by 5–15% on NVIDIA hardware
1.2 NDRange Limitations
The OpenCL NDRange execution model, while powerful for regular data parallelism, imposes significant constraints that limit exploitation of complex parallel patterns:
- Homogeneous work-items: All work-items in an NDRange execute the same kernel code, making heterogeneous computations difficult to express
- Uniform work-group sizes: All work-groups must have identical dimensions, preventing dynamic load balancing
- Single kernel granularity: One NDRange corresponds to one kernel invocation; independent computations require separate enqueues
- Limited inter-kernel parallelism: While command queues support out-of-order execution, hardware limitations (single queue pre-Kepler) caused serialization
1.3 Our Contribution
This thesis makes three main contributions:
- K-G-I Classification Model: A systematic framework for categorizing parallel execution patterns at three levels: Kernels (K), work-Groups (G), and work-Items (I)
- API Extension Proposal: The clEnqueueNDRangeHyperKernel function that automates multi-stream execution on Hyper-Q hardware
- Strassen Case Study: Demonstration of multi-level parallelism benefits using Strassen's Matrix Multiplication algorithm
3. GPU Architecture Background
3.1 Hardware Architecture
Modern GPUs consist of an array of Streaming Multiprocessors (SMXs), each containing multiple Streaming Processors (SPs) or CUDA cores. On Kepler K20, the architecture comprises 13 SMXs, each with 192 SPs, yielding 2496 total CUDA cores.
- Streaming Multiprocessor (SMX): Independent execution unit with its own register file, shared memory, L1 cache, and warp schedulers
- Streaming Processor (SP): Individual ALU/FPU unit executing one thread's instruction per cycle
- Warp: Group of 32 threads executing in SIMT fashion—all threads in a warp execute the same instruction
- Global Scheduler: Hardware unit that dispatches thread blocks to SMXs based on resource availability
3.2 CUDA Programming Model
NVIDIA's CUDA introduced foundational concepts that remain relevant today:
- Threads: Individual execution units, each with unique thread ID
- Thread blocks: Groups of threads that can synchronize via shared memory
- Grids: Collections of thread blocks executing the same kernel
- SIMT execution: Single Instruction, Multiple Thread model where threads in a warp (32 threads) execute the same instruction
3.3 OpenCL Correspondence
OpenCL provides equivalent abstractions with different terminology:
| CUDA Term | OpenCL Term | Description |
|---|---|---|
| Thread | Work-item | Single execution unit |
| Thread block | Work-group | Group with shared local memory |
| Grid | NDRange | N-dimensional index space |
| Warp | Wavefront/Subgroup | SIMD execution unit |
| Shared memory | Local memory | Per-block fast memory |
| Stream | Command queue | Asynchronous command submission |
3.4 Kepler Architecture and Hyper-Q
The Kepler architecture (2012) introduced Hyper-Q, which directly addressed the concurrent kernel execution limitations that motivated this research. Prior to Kepler, NVIDIA GPUs used a single hardware work queue. Even with multiple CUDA streams, kernels from different streams could serialize due to false dependencies at the hardware queue level.
Hyper-Q increased the number of hardware work queues from 1 to 32, enabling:
- True concurrent execution of independent kernels from different streams
- MPI applications with multiple ranks per GPU without serialization
- Reduced latency for small kernel launches through queue parallelism
4. Strassen's Matrix Multiplication Algorithm
Strassen's algorithm reduces the complexity of matrix multiplication from O(n³) to O(n^log₂7) ≈ O(n^2.807) by computing seven submatrix products instead of eight at each recursive level, at the cost of additional matrix additions.
4.1 Matrix Decomposition
For n×n matrices A and B, we partition each into four (n/2)×(n/2) submatrices:
A = ┌ A₁₁ A₁₂ ┐   B = ┌ B₁₁ B₁₂ ┐   C = ┌ C₁₁ C₁₂ ┐
    └ A₂₁ A₂₂ ┘       └ B₂₁ B₂₂ ┘       └ C₂₁ C₂₂ ┘

4.2 Intermediate Matrices
We first compute auxiliary matrices from A and B:
From matrix A:
- A₁ = A₁₁ + A₂₂
- A₂ = A₂₁ + A₂₂
- A₃ = A₁₁
- A₄ = A₂₂
- A₅ = A₁₁ + A₁₂
- A₆ = A₂₁ − A₁₁
- A₇ = A₁₂ − A₂₂
From matrix B:
- B₁ = B₁₁ + B₂₂
- B₂ = B₁₁
- B₃ = B₁₂ − B₂₂
- B₄ = B₂₁ − B₁₁
- B₅ = B₂₂
- B₆ = B₁₁ + B₁₂
- B₇ = B₂₁ + B₂₂
4.3 Seven Independent Products
The core insight: compute only 7 matrix products instead of 8:
- P₁ = A₁ × B₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂)
- P₂ = A₂ × B₂ = (A₂₁ + A₂₂)B₁₁
- P₃ = A₃ × B₃ = A₁₁(B₁₂ − B₂₂)
- P₄ = A₄ × B₄ = A₂₂(B₂₁ − B₁₁)
- P₅ = A₅ × B₅ = (A₁₁ + A₁₂)B₂₂
- P₆ = A₆ × B₆ = (A₂₁ − A₁₁)(B₁₁ + B₁₂)
- P₇ = A₇ × B₇ = (A₁₂ − A₂₂)(B₂₁ + B₂₂)
Key observation: P₁ through P₇ are completely independent—they share no data dependencies and can execute in parallel.
4.4 Result Recomposition
The final result C is assembled from the seven products:
- C₁₁ = P₁ + P₄ − P₅ + P₇
- C₁₂ = P₃ + P₅
- C₂₁ = P₂ + P₄
- C₂₂ = P₁ − P₂ + P₃ + P₆
4.5 Parallelism Opportunities
Strassen's algorithm exhibits natural multi-level parallelism:
- K-level (Kernels): The 7 intermediate products P₁–P₇ are completely independent at each recursion level
- G-level (Groups): Each sub-matrix multiplication can be tiled, with tiles processed by independent work-groups using shared memory
- I-level (Items): Element-wise additions and the innermost multiplication loop can be parallelized across work-items
4.6 Recursive Structure
At recursion depth d, the algorithm generates:
- Depth 0: 1 multiplication (base case)
- Depth 1: 7 independent multiplications
- Depth 2: 49 independent multiplications (7 × 7)
- Depth d: 7^d independent multiplications
The key insight is that with 32 Hyper-Q hardware queues, depth 2 (49 kernels) saturates the queues: the kernels execute in two waves of at most 32, keeping all queues occupied for near-optimal concurrent execution.
5. The K-G-I Classification Model
We propose a systematic framework for categorizing parallel execution patterns in GPU applications. The model captures three orthogonal dimensions of parallelism.
5.1 Parallelism Levels
- K-level (Kernels): Task-level parallelism across independent kernel invocations. Synchronization via inter-kernel barriers and events.
- G-level (Groups): Coarse-grained data parallelism across work-groups. Synchronization via work-group barriers with shared local memory access.
- I-level (Items): Fine-grained data parallelism across work-items within a work-group. Synchronization via memory fences and atomic operations.
5.2 Classification Modes
We classify parallel modes using the notation [K]-[G]-[I], where the K position indicates a Single (S) or Multiple (M) kernel launch, and the G and I positions indicate hoMogeneous (M) or heTerogeneous (T) behavior:
| Mode | K (Kernels) | G (Groups) | I (Items) | Description |
|---|---|---|---|---|
| S-M-T | Single | hoMogeneous | heTerogeneous | Classic data parallel: all groups same, items differ (e.g., element-wise ops) |
| S-T-M | Single | heTerogeneous | hoMogeneous | Group-specialized: different groups do different work (e.g., reduction phases) |
| S-M-M | Single | hoMogeneous | hoMogeneous | Embarrassingly parallel: all units do same work (e.g., Monte Carlo) |
| M-M-M | Multiple | hoMogeneous | hoMogeneous | Multi-kernel parallel: concurrent independent kernels (e.g., Strassen) |
5.3 Kernel Implementations
Each K-G-I mode requires different kernel structures. We present representative OpenCL kernel code for each mode.
Algorithm 1: S-M-T Mode Kernel (Classic Element-wise)
__kernel void smt_kernel(
    __global float* A,
    __global float* B,
    __global float* C,
    int N)
{
    int gid = get_global_id(0);
    // All work-groups execute the same code;
    // each work-item operates on a different data element
    if (gid < N) {
        C[gid] = A[gid] + B[gid];
    }
}

Algorithm 2: S-T-M Mode Kernel (Group-Heterogeneous)
__kernel void stm_kernel(
    __global float* data,
    __global float* result,
    int N)
{
    int group_id = get_group_id(0);
    int lid = get_local_id(0);
    int op = group_id % 4;
    __local float scratch[256];
    // All groups load the same way...
    scratch[lid] = data[group_id * 256 + lid];
    barrier(CLK_LOCAL_MEM_FENCE);
    // ...but perform a different reduction depending on group_id
    for (int s = 128; s > 0; s >>= 1) {
        if (lid < s) {
            switch (op) {
            case 0: scratch[lid] += scratch[lid + s]; break;                    // sum
            case 1: scratch[lid] = fmax(scratch[lid], scratch[lid + s]); break; // max
            case 2: scratch[lid] = fmin(scratch[lid], scratch[lid + s]); break; // min
            case 3: scratch[lid] *= scratch[lid + s]; break;                    // product
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) {
        result[group_id] = scratch[0];
    }
}

Algorithm 3: S-M-M Mode Kernel (Embarrassingly Parallel)
__kernel void smm_kernel(
    __global float* output,
    unsigned int seed,
    int N)
{
    int gid = get_global_id(0);
    // All work-items execute identical code
    // (e.g., a Monte Carlo simulation step)
    unsigned int state = seed + gid;
    float sum = 0.0f;
    for (int i = 0; i < 1000; i++) {
        // LCG PRNG iteration
        state = state * 1103515245u + 12345u;
        float r = (float)(state & 0x7FFFFFFF) / (float)0x7FFFFFFF;
        sum += r * r;
    }
    output[gid] = sum / 1000.0f;
}

5.4 Strassen in K-G-I Terms
Strassen's algorithm maps to the K-G-I model as follows:
- Recursion Level 0 (root): Single kernel launch (S-*-*)
- Recursion Level 1: 7 independent kernels → M-M-M mode
- Recursion Level 2: 49 independent kernels → M-M-M mode
- Base case (tiled GEMM): S-M-T mode with tiled shared memory access
6. Theoretical Cost Model
We develop a cost model to predict execution time under different parallelism configurations.
6.1 Architecture Parameters
Our analysis targets the NVIDIA Kepler K20 architecture, compared against the previous-generation Fermi:
| Parameter | Fermi (Tesla C2050) | Kepler (Tesla K20) |
|---|---|---|
| Global Memory | 3 GB GDDR5 | 5 GB GDDR5 |
| Local Memory (per SM/SMX) | 48 KB (shared) + 16 KB L1 | 48 KB (shared) + 16 KB L1 |
| Clock Frequency | 1.15 GHz | 706 MHz (base) |
| Warp Size | 32 | 32 |
| Max Work-items per SM/SMX | 1536 | 2048 |
| Number of SMXs | 14 | 13 |
| Streaming Processors (SPs) | 448 | 2496 |
| Hardware Queues | 1 | 32 (Hyper-Q) |
6.2 Model Parameters
- Tkernel(n): Execution time for a single kernel on n×n matrix
- Tlaunch: Kernel launch overhead (approximately 5–15 μs)
- Tsync: Stream synchronization overhead
- H: Number of hardware queues (32 for Kepler Hyper-Q, 1 for Fermi)
- d: Strassen recursion depth
- NSMX: Number of streaming multiprocessors
6.3 Pilot Program Methodology
To calibrate our cost model, we use pilot programs that measure fundamental timing parameters:
Homogeneous work-item pilot: Measures Tkernel(n) for simple element-wise operations where all work-items perform identical computation.
Heterogeneous work-item pilot: Measures overhead introduced when work-items follow different execution paths (branch divergence cost).
6.4 Execution Time Analysis
For Strassen at recursion depth d on an n×n matrix, with H concurrent hardware queues:
Sequential execution: Tseq = 7^d × Tkernel(n/2^d) + 7^d × Tlaunch
Parallel execution: Tpar = ⌈7^d/H⌉ × Tkernel(n/2^d) + (7^d × Tlaunch)/H
Speedup: S = Tseq/Tpar ≈ min(7^d, H) for compute-bound cases
The model predicts optimal recursion depth where:
- d = 1: 7 concurrent kernels, theoretical speedup up to 7×
- d = 2: 49 concurrent kernels, limited by H = 32 queues
- d ≥ 3: Launch overhead begins to dominate
7. Proposed API Extension
We propose an OpenCL API extension that automates the exploitation of Hyper-Q for applications with multi-kernel parallelism.
7.1 M-M-M Mode: Multi-Kernel Parallelism
The M-M-M (Multiple-hoMogeneous-hoMogeneous) mode represents true concurrent kernel execution, exploiting Kepler's Hyper-Q capability. This mode is essential for algorithms like Strassen where multiple independent kernels can execute simultaneously.
7.2 API Design
cl_int clEnqueueNDRangeHyperKernel(
cl_command_queue queue,
cl_uint num_kernels,
const cl_kernel* kernels,
const size_t** global_work_sizes,
const size_t** local_work_sizes,
const cl_kernel_dependency* dependencies,
cl_event* event
);
// Dependency specification
typedef struct {
cl_uint kernel_index; // Dependent kernel
cl_uint depends_on_index; // Dependency target
cl_dependency_type type; // DATA, CONTROL, or NONE
} cl_kernel_dependency;

7.3 Functionality
The proposed API:
- Accepts multiple kernels with explicit dependency specifications
- Analyzes the dependency DAG to identify independent kernel sets
- Automatically creates and manages CUDA streams (up to H = 32)
- Distributes independent kernels across available hardware queues
- Respects data and control dependencies through appropriate synchronization
- Returns a single event representing completion of all kernels
8. Experimental Results
We evaluated our approach on a Tesla K20 (Kepler GK110) with 2496 CUDA cores and 5 GB GDDR5 memory.
8.1 Performance Results
| Configuration | Matrix Size 2048 | Matrix Size 4096 | Matrix Size 8192 |
|---|---|---|---|
| Sequential (baseline) | 1.00× | 1.00× | 1.00× |
| 7 streams (depth 1) | 1.32× | 1.38× | 1.42× |
| 49 streams (depth 2) | 1.28× | 1.42× | 1.51× |
| Theoretical max | 7.00× | 7.00× | 7.00× |
8.2 Theoretical vs. Practical Results
Our cost model predictions compared against measured execution times:
| Configuration | Theoretical Speedup | Measured Speedup | Efficiency |
|---|---|---|---|
| 7 streams, 2048×2048 | 3.2× | 1.32× | 41% |
| 7 streams, 4096×4096 | 4.8× | 1.38× | 29% |
| 7 streams, 8192×8192 | 5.6× | 1.42× | 25% |
| 49 streams, 8192×8192 | 7.0× (limited by H=32) | 1.51× | 22% |
The gap between theoretical and practical speedup is attributed to:
- Memory bandwidth saturation: Concurrent kernels compete for limited global memory bandwidth
- SMX occupancy: Not all SMXs can be fully utilized when kernels have varying resource requirements
- Launch overhead accumulation: While Hyper-Q reduces serialization, launch costs still accumulate
8.3 Analysis
Key findings:
- Speedup increases with matrix size due to better compute/launch ratio
- Depth 2 outperforms depth 1 for large matrices despite exceeding 32 queues
- Diminishing returns beyond depth 2 due to kernel launch overhead
- OpenCL overhead vs. CUDA is 8–12% for this workload
9. Conclusion
This thesis has addressed the challenge of exploiting multi-level parallelism in GPU applications. Our contributions include:
- The K-G-I classification model provides a systematic framework for analyzing parallel execution patterns
- The proposed clEnqueueNDRangeHyperKernel API automates multi-stream execution on Hyper-Q hardware
- Experimental validation using Strassen's algorithm demonstrates 1.4× speedup through concurrent kernel execution
9.1 Future Work
Several directions warrant further investigation:
- Automatic parallelism extraction from sequential kernel code
- Dynamic load balancing across heterogeneous accelerators
- Integration with emerging GPU architectures beyond Kepler
- Application to other algorithms with natural multi-level parallelism
References
- Strassen, V. (1969). Gaussian elimination is not optimal. Numerische Mathematik, 13, 354–356.
- NVIDIA Corporation. (2012). NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Whitepaper.
- Khronos Group. (2012). OpenCL Specification, Version 1.2.
- Kirk, D. B., & Hwu, W. W. (2010). Programming Massively Parallel Processors. Morgan Kaufmann.
- Volkov, V., & Demmel, J. W. (2008). Benchmarking GPUs to tune dense linear algebra. SC '08: Proceedings of the ACM/IEEE Conference on Supercomputing.
- Nandy, S. K., et al. (2009). REDEFINE: Runtime Reconfigurable Polymorphic ASIC. ACM TRETS, 2(2).
- Buck, I., et al. (2004). Brook for GPUs: Stream Computing on Graphics Hardware. ACM SIGGRAPH.
- Owens, J. D., et al. (2007). A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1), 80–113.
- Cohen, H. (1976). On the Overhead of Strassen's Algorithm. SIAM Journal on Computing, 5(3), 403–413.
- West, E. A., & Grimshaw, A. S. (1995). Braid: Integrating Task and Data Parallelism. SIGPLAN Symposium on Principles and Practice of Parallel Programming.
- Bikshandi, G., et al. (2006). Programming for Parallelism and Locality with Hierarchically Tiled Arrays. PPoPP '06.
- Subhlok, J., & Vondran, G. (1997). Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines. SPAA '97.
- BenHassen, S., et al. (1996). The Orca Parallel Programming Model. IEEE TPDS, 7(10).
- Kambadur, P., et al. (2007). PFunc: Modern Task Parallelism for Modern High Performance Computing. SC '07.
- Yang, C. T., et al. (2011). Implementation of Parallel Linear Algebra Functions on Multi-GPU Clusters. IPDPS Workshops.
- Vejarano, G., & Gao, Y. (2011). GPU Kernel Optimization for Finite Element Computations. Parallel Computing, 37(8), 366–379.
- Elangovan, V. K., et al. (2013). Hybrid Task and Data Parallel Programming Using OmpSs and GPUs. Euro-Par 2013.