OpenCL API Extensions to achieve Multi-level Parallelism for Efficient Implementation of Strassen's Matrix Multiplication on GPUs

Advisors: Prof. S.K. Nandy & Dr. J. Lakshmi, Supercomputer Education and Research Centre (SERC)

Abstract

This thesis addresses a fundamental limitation of GPU programming models: the inability to express and exploit multiple levels of parallelism simultaneously. We propose the K-G-I (Kernels-Groups-Items) classification model for categorizing parallel execution patterns in GPU applications and introduce API extensions for OpenCL to enable hybrid parallelism on NVIDIA's Kepler architecture with its Hyper-Q technology.

Using Strassen's Matrix Multiplication as a case study, we demonstrate that algorithms exhibiting natural multi-level parallelism can achieve significant speedups when all parallelism levels are exploited concurrently. Our proposed clEnqueueNDRangeHyperKernel API automates the distribution of independent kernels across Hyper-Q's 32 hardware queues while respecting data and control dependencies.

Experimental results on Tesla K20 (Kepler) show 1.4× speedup using concurrent kernel execution at recursion depth 1, with optimal performance at recursion depth 2 where 49 independent kernels execute concurrently.

1. Introduction

The transformation of graphics processing units (GPUs) from fixed-function rendering pipelines to fully programmable parallel processors represents one of the most significant shifts in computing architecture of the 21st century. When NVIDIA released CUDA 1.0 in February 2007, it initiated a revolution that would fundamentally reshape domains from scientific simulation to artificial intelligence.

However, GPU programming models have historically focused on single-level parallelism: expressing data parallelism at the work-item level with limited support for task-level parallelism across independent kernels. This limitation becomes problematic for algorithms that exhibit natural hierarchical parallelism.

1.1 The Problem

In 2013, the GPU programming landscape was characterized by:

  • Single-level parallelism focus: Both CUDA and OpenCL primarily expressed data parallelism at the work-item level
  • False dependencies: Pre-Kepler architectures used a single hardware queue, causing independent kernels to serialize unnecessarily
  • Manual optimization burden: Exploiting concurrent kernel execution required intricate manual scheduling and deep hardware knowledge
  • Portability vs. performance trade-off: OpenCL consistently underperformed CUDA by 5–15% on NVIDIA hardware

1.2 NDRange Limitations

The OpenCL NDRange execution model, while powerful for regular data parallelism, imposes significant constraints that limit exploitation of complex parallel patterns:

  • Homogeneous work-items: All work-items in an NDRange execute the same kernel code, making heterogeneous computations difficult to express
  • Uniform work-group sizes: All work-groups must have identical dimensions, preventing dynamic load balancing
  • Single kernel granularity: One NDRange corresponds to one kernel invocation; independent computations require separate enqueues
  • Limited inter-kernel parallelism: While command queues support out-of-order execution, hardware limitations (single queue pre-Kepler) caused serialization
Figure: NDRange index space structure showing homogeneous work-item organization

1.3 Our Contribution

This thesis makes three main contributions:

  1. K-G-I Classification Model: A systematic framework for categorizing parallel execution patterns at three levels: Kernels (K), work-Groups (G), and work-Items (I)
  2. API Extension Proposal: The clEnqueueNDRangeHyperKernel function that automates multi-stream execution on Hyper-Q hardware
  3. Strassen Case Study: Demonstration of multi-level parallelism benefits using Strassen's Matrix Multiplication algorithm
Figure 1: The three levels of GPU parallelism addressed in this thesis

2. Related Work

Research on exploiting parallelism in matrix algorithms and GPU programming has a rich history. We categorize prior work into algorithmic optimizations and system-level approaches.

2.1 Algorithmic Approaches

Cohen analyzed the overhead of Strassen's recursion, establishing that the crossover point where Strassen outperforms conventional multiplication depends critically on the constant factors in implementation—a finding that remains relevant for GPU implementations where kernel launch overhead must be balanced against reduced arithmetic operations.

West et al. developed the Mentat Programming Language, introducing the concept of "medium-grained" parallelism through their MPL construct for expressing parallel object invocations. Their work influenced our thinking on K-level parallelism as a distinct abstraction layer.

2.2 Hierarchical and Tiled Approaches

Bikshandi et al. introduced hierarchically tiled arrays (HTAs), providing a high-level abstraction for expressing tiled computations. HTAs capture G-level parallelism naturally but lack explicit support for K-level concurrent kernel execution.

Subhlok et al. explored the integration of task and data parallelism in scientific codes, demonstrating speedups when both levels are exploited. Their work directly motivates our K-G-I classification.

2.3 Language and Runtime Systems

BenHassen et al. developed the Orca language for expressing multi-level parallelism on distributed memory systems. Kambadur et al. introduced the PFunc library for parallel task creation with work-stealing schedulers.

2.4 GPU-Specific Work

Chao-Tung et al. examined multi-GPU cluster configurations for linear algebra, but focused on inter-GPU parallelism rather than intra-GPU multi-level parallelism.

Vejarano et al. developed analytical cost models for GPU execution, providing foundation techniques that inform our cost model development.

Most recently, Elangovan et al. explored the OmpSs task-based programming model for GPU clusters, demonstrating task parallelism at the cluster level but not exploiting intra-GPU concurrent kernel execution.

Gap in Prior Work: While individual levels of parallelism have been well-studied, no prior work systematically classifies the interaction between kernel-level, work-group-level, and work-item-level parallelism on GPUs, nor provides API support for exploiting all three levels simultaneously.

3. GPU Architecture Background

3.1 Hardware Architecture

Modern GPUs consist of an array of Streaming Multiprocessors (SMXs), each containing multiple Streaming Processors (SPs) or CUDA cores. On Kepler K20, the architecture comprises 13 SMXs, each with 192 SPs, yielding 2496 total CUDA cores.

  • Streaming Multiprocessor (SMX): Independent execution unit with its own register file, shared memory, L1 cache, and warp schedulers
  • Streaming Processor (SP): Individual ALU/FPU unit executing one thread's instruction per cycle
  • Warp: Group of 32 threads executing in SIMT fashion—all threads in a warp execute the same instruction
  • Global Scheduler: Hardware unit that dispatches thread blocks to SMXs based on resource availability
Figure: Hardware architecture of a Kepler GPU showing SMXs, memory hierarchy, and scheduling

3.2 CUDA Programming Model

NVIDIA's CUDA introduced foundational concepts that remain relevant today:

  • Threads: Individual execution units, each with unique thread ID
  • Thread blocks: Groups of threads that can synchronize via shared memory
  • Grids: Collections of thread blocks executing the same kernel
  • SIMT execution: Single Instruction, Multiple Thread model where threads in a warp (32 threads) execute the same instruction

3.3 OpenCL Correspondence

OpenCL provides equivalent abstractions with different terminology:

CUDA Term     | OpenCL Term        | Description
--------------|--------------------|--------------------------------
Thread        | Work-item          | Single execution unit
Thread block  | Work-group         | Group with shared local memory
Grid          | NDRange            | N-dimensional index space
Warp          | Wavefront/Subgroup | SIMD execution unit
Shared memory | Local memory       | Per-block fast memory
Stream        | Command queue      | Asynchronous command submission

3.4 Kepler Architecture and Hyper-Q

The Kepler architecture (2012) introduced Hyper-Q, which directly addressed the concurrent kernel execution limitations that motivated this research. Prior to Kepler, NVIDIA GPUs used a single hardware work queue. Even with multiple CUDA streams, kernels from different streams could serialize due to false dependencies at the hardware queue level.

Hyper-Q increased the number of hardware work queues from 1 to 32, enabling:

  • True concurrent execution of independent kernels from different streams
  • MPI applications with multiple ranks per GPU without serialization
  • Reduced latency for small kernel launches through queue parallelism
Figure 2: Fermi single queue vs. Kepler Hyper-Q with 32 hardware queues

4. Strassen's Matrix Multiplication Algorithm

Strassen's algorithm reduces the complexity of matrix multiplication from O(n³) to O(n^2.807) by replacing one of the eight recursive block multiplications with additional matrix additions at each level.

4.1 Matrix Decomposition

For n×n matrices A and B, we partition each into four (n/2)×(n/2) submatrices:

A = ┌ A₁₁  A₁₂ ┐    B = ┌ B₁₁  B₁₂ ┐    C = ┌ C₁₁  C₁₂ ┐
    └ A₂₁  A₂₂ ┘        └ B₂₁  B₂₂ ┘        └ C₂₁  C₂₂ ┘

4.2 Intermediate Matrices

We first compute auxiliary matrices from A and B:

From matrix A:

  • A₁ = A₁₁ + A₂₂
  • A₂ = A₂₁ + A₂₂
  • A₃ = A₁₁
  • A₄ = A₂₂
  • A₅ = A₁₁ + A₁₂
  • A₆ = A₂₁ − A₁₁
  • A₇ = A₁₂ − A₂₂

From matrix B:

  • B₁ = B₁₁ + B₂₂
  • B₂ = B₁₁
  • B₃ = B₁₂ − B₂₂
  • B₄ = B₂₁ − B₁₁
  • B₅ = B₂₂
  • B₆ = B₁₁ + B₁₂
  • B₇ = B₂₁ + B₂₂

4.3 Seven Independent Products

The core insight: compute only 7 matrix products instead of 8:

  • P₁ = A₁ × B₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂)
  • P₂ = A₂ × B₂ = (A₂₁ + A₂₂)B₁₁
  • P₃ = A₃ × B₃ = A₁₁(B₁₂ − B₂₂)
  • P₄ = A₄ × B₄ = A₂₂(B₂₁ − B₁₁)
  • P₅ = A₅ × B₅ = (A₁₁ + A₁₂)B₂₂
  • P₆ = A₆ × B₆ = (A₂₁ − A₁₁)(B₁₁ + B₁₂)
  • P₇ = A₇ × B₇ = (A₁₂ − A₂₂)(B₂₁ + B₂₂)

Key observation: P₁ through P₇ are completely independent—they share no data dependencies and can execute in parallel.

4.4 Result Recomposition

The final result C is assembled from the seven products:

  • C₁₁ = P₁ + P₄ − P₅ + P₇
  • C₁₂ = P₃ + P₅
  • C₂₁ = P₂ + P₄
  • C₂₂ = P₁ − P₂ + P₃ + P₆
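The decomposition, products, and recomposition above can be checked directly on the CPU. The sketch below (plain C, not taken from the thesis) applies one Strassen level to a 2×2 matrix, where each block degenerates to a scalar, so the seven products and four recomposition formulas appear verbatim:

```c
/* One level of Strassen's recursion for a 2x2 matrix (blocks are scalars).
 * Row-major layout: A[0]=A11, A[1]=A12, A[2]=A21, A[3]=A22. */
static void strassen_2x2(const double A[4], const double B[4], double C[4])
{
    double P1 = (A[0] + A[3]) * (B[0] + B[3]);  /* (A11+A22)(B11+B22) */
    double P2 = (A[2] + A[3]) *  B[0];          /* (A21+A22)B11       */
    double P3 =  A[0]         * (B[1] - B[3]);  /* A11(B12-B22)       */
    double P4 =  A[3]         * (B[2] - B[0]);  /* A22(B21-B11)       */
    double P5 = (A[0] + A[1]) *  B[3];          /* (A11+A12)B22       */
    double P6 = (A[2] - A[0]) * (B[0] + B[1]);  /* (A21-A11)(B11+B12) */
    double P7 = (A[1] - A[3]) * (B[2] + B[3]);  /* (A12-A22)(B21+B22) */

    C[0] = P1 + P4 - P5 + P7;   /* C11 */
    C[1] = P3 + P5;             /* C12 */
    C[2] = P2 + P4;             /* C21 */
    C[3] = P1 - P2 + P3 + P6;   /* C22 */
}
```

With A = [1 2; 3 4] and B = [5 6; 7 8], the result matches the conventional product [19 22; 43 50], confirming that the seven products suffice.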

4.5 Parallelism Opportunities

Strassen's algorithm exhibits natural multi-level parallelism:

  • K-level (Kernels): The 7 intermediate products P₁–P₇ are completely independent at each recursion level
  • G-level (Groups): Each sub-matrix multiplication can be tiled, with tiles processed by independent work-groups using shared memory
  • I-level (Items): Element-wise additions and the innermost multiplication loop can be parallelized across work-items

4.6 Recursive Structure

At recursion depth d, the algorithm generates:

  • Depth 0: 1 multiplication (base case)
  • Depth 1: 7 independent multiplications
  • Depth 2: 49 independent multiplications (7 × 7)
  • Depth d: 7^d independent multiplications

The key insight is that with 32 Hyper-Q hardware queues, depth 2 (49 kernels) can achieve near-optimal concurrent execution by saturating the queues.

Figure 3: Dependency DAG for Strassen's algorithm showing parallel opportunities

5. The K-G-I Classification Model

We propose a systematic framework for categorizing parallel execution patterns in GPU applications. The model captures three orthogonal dimensions of parallelism.

5.1 Parallelism Levels

  • K-level (Kernels): Task-level parallelism across independent kernel invocations. Synchronization via inter-kernel barriers and events.
  • G-level (Groups): Coarse-grained data parallelism across work-groups. Synchronization via work-group barriers with shared local memory access.
  • I-level (Items): Fine-grained data parallelism across work-items within a work-group. Synchronization via memory fences and atomic operations.

5.2 Classification Modes

We classify parallel modes using the notation [K]-[G]-[I], where the K position indicates whether a Single (S) kernel or Multiple (M) kernels execute, and the G and I positions indicate homogeneous (M) or heterogeneous (T) behavior:

Mode  | K (Kernels) | G (Groups)    | I (Items)     | Description
------|-------------|---------------|---------------|------------------------------------------------------------------------------
S-M-T | Single      | hoMogeneous   | heTerogeneous | Classic data parallel: all groups same, items differ (e.g., element-wise ops)
S-T-M | Single      | heTerogeneous | hoMogeneous   | Group-specialized: different groups do different work (e.g., reduction phases)
S-M-M | Single      | hoMogeneous   | hoMogeneous   | Embarrassingly parallel: all units do same work (e.g., Monte Carlo)
M-M-M | Multiple    | hoMogeneous   | hoMogeneous   | Multi-kernel parallel: concurrent independent kernels (e.g., Strassen)
Figure 4: Visual representation of K-G-I classification modes

5.3 Kernel Implementations

Each K-G-I mode requires different kernel structures. We present representative OpenCL kernel code for each mode.

Algorithm 1: S-M-T Mode Kernel (Classic Element-wise)

__kernel void smt_kernel(
    __global float* A,
    __global float* B,
    __global float* C,
    int N)
{
    int gid = get_global_id(0);

    // All work-groups execute same code
    // Each work-item operates on different data element
    if (gid < N) {
        C[gid] = A[gid] + B[gid];
    }
}

Algorithm 2: S-T-M Mode Kernel (Group-Heterogeneous)

__kernel void stm_kernel(
    __global float* data,
    __global float* result,
    int N)
{
    int group_id = get_group_id(0);
    int lid = get_local_id(0);
    __local float scratch[256];

    // Every group loads its 256-element tile the same way...
    scratch[lid] = data[group_id * 256 + lid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ...but the reduction operator differs per group (heterogeneous groups)
    for (int s = 128; s > 0; s >>= 1) {
        if (lid < s) {
            switch (group_id % 4) {
                case 0: scratch[lid] += scratch[lid + s]; break;                    // sum
                case 1: scratch[lid] = fmax(scratch[lid], scratch[lid + s]); break; // max
                case 2: scratch[lid] = fmin(scratch[lid], scratch[lid + s]); break; // min
                case 3: scratch[lid] *= scratch[lid + s]; break;                    // product
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 writes the group's reduction result
    if (lid == 0)
        result[group_id] = scratch[0];
}

Algorithm 3: S-M-M Mode Kernel (Embarrassingly Parallel)

__kernel void smm_kernel(
    __global float* output,
    unsigned int seed,
    int N)
{
    int gid = get_global_id(0);

    // All work-items execute identical code
    // (e.g., Monte Carlo simulation)
    unsigned int state = seed + gid;
    float sum = 0.0f;

    for(int i = 0; i < 1000; i++) {
        // PRNG iteration
        state = state * 1103515245 + 12345;
        float r = (float)(state & 0x7FFFFFFF) / (float)0x7FFFFFFF;
        sum += r * r;
    }

    output[gid] = sum / 1000.0f;
}

5.4 Strassen in K-G-I Terms

Strassen's algorithm maps to the K-G-I model as follows:

  • Recursion Level 0 (root): Single kernel launch (S-*-*)
  • Recursion Level 1: 7 independent kernels → M-M-M mode
  • Recursion Level 2: 49 independent kernels → M-M-M mode
  • Base case (tiled GEMM): S-M-T mode with tiled shared memory access

6. Theoretical Cost Model

We develop a cost model to predict execution time under different parallelism configurations.

6.1 Architecture Parameters

Our analysis targets the NVIDIA Kepler K20 architecture, compared against the previous-generation Fermi:

Parameter                  | Fermi (Tesla C2050)     | Kepler (Tesla K20)
---------------------------|-------------------------|------------------------
Global Memory              | 3 GB GDDR5              | 5 GB GDDR5
Local Memory (per SMX)     | 48 KB shared + 16 KB L1 | 48 KB shared + 16 KB L1
Clock Frequency            | 1.15 GHz                | 706 MHz (base)
Warp Size                  | 32                      | 32
Max Work-items per SMX     | 1536                    | 2048
Streaming Multiprocessors  | 14                      | 13
Streaming Processors (SPs) | 448                     | 2496
Hardware Queues            | 1                       | 32 (Hyper-Q)

6.2 Model Parameters

  • T_kernel(n): Execution time for a single kernel on an n×n matrix
  • T_launch: Kernel launch overhead (approximately 5–15 μs)
  • T_sync: Stream synchronization overhead
  • H: Number of hardware queues (32 for Kepler Hyper-Q, 1 for Fermi)
  • d: Strassen recursion depth
  • N_SMX: Number of streaming multiprocessors

6.3 Pilot Program Methodology

To calibrate our cost model, we use pilot programs that measure fundamental timing parameters:

Homogeneous work-item pilot: Measures T_kernel(n) for simple element-wise operations where all work-items perform identical computation.

Heterogeneous work-item pilot: Measures overhead introduced when work-items follow different execution paths (branch divergence cost).

Figure: Pilot program for homogeneous work-items showing uniform execution

6.4 Execution Time Analysis

For Strassen at recursion depth d executed across H hardware queues:

Sequential execution: T_seq = 7^d × T_kernel(n/2^d) + 7^d × T_launch

Parallel execution: T_par = ⌈7^d/H⌉ × T_kernel(n/2^d) + 7^d × T_launch/H

Speedup: S = T_seq/T_par ≈ min(7^d, H) for compute-bound cases

The model predicts optimal recursion depth where:

  • d = 1: 7 concurrent kernels, theoretical speedup up to 7×
  • d = 2: 49 concurrent kernels, limited by H = 32 queues
  • d ≥ 3: Launch overhead begins to dominate
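The model above is easy to evaluate numerically. The sketch below (plain C; the timing values are illustrative placeholders, not measurements from the thesis) computes the predicted speedup S = T_seq/T_par for a given depth:

```c
/* Predicted speedup from the cost model: S = T_seq / T_par.
 * t_kernel: per-kernel execution time at this depth,
 * t_launch: kernel launch overhead, H: number of hardware queues. */
static double predicted_speedup(int depth, double t_kernel, double t_launch, int H)
{
    long kernels = 1;
    for (int i = 0; i < depth; i++)
        kernels *= 7;                          /* 7^d independent products */

    long waves = (kernels + H - 1) / H;        /* ceil(7^d / H) serial waves */

    double t_seq = kernels * (t_kernel + t_launch);
    double t_par = waves * t_kernel + kernels * t_launch / H;
    return t_seq / t_par;
}
```

With H = 32 and launch overhead small relative to kernel time, the model yields roughly 7× at depth 1 (one wave of 7 kernels) and remains below 32× at depths 2 and 3, consistent with the min(7^d, H) bound.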

7. Proposed API Extension

We propose an OpenCL API extension that automates the exploitation of Hyper-Q for applications with multi-kernel parallelism.

7.1 M-M-M Mode: Multi-Kernel Parallelism

The M-M-M (Multiple-hoMogeneous-hoMogeneous) mode represents true concurrent kernel execution, exploiting Kepler's Hyper-Q capability. This mode is essential for algorithms like Strassen where multiple independent kernels can execute simultaneously.

Figure: M-M-M mode of execution on Kepler K20 showing concurrent kernel dispatch via Hyper-Q

7.2 API Design

cl_int clEnqueueNDRangeHyperKernel(
    cl_command_queue queue,
    cl_uint num_kernels,
    const cl_kernel* kernels,
    const size_t** global_work_sizes,
    const size_t** local_work_sizes,
    const cl_kernel_dependency* dependencies,
    cl_event* event
);

// Dependency specification
typedef struct {
    cl_uint kernel_index;      // Dependent kernel
    cl_uint depends_on_index;  // Dependency target
    cl_dependency_type type;   // DATA, CONTROL, or NONE
} cl_kernel_dependency;

7.3 Functionality

The proposed API:

  1. Accepts multiple kernels with explicit dependency specifications
  2. Analyzes the dependency DAG to identify independent kernel sets
  3. Automatically creates and manages CUDA streams (up to H = 32)
  4. Distributes independent kernels across available hardware queues
  5. Respects data and control dependencies through appropriate synchronization
  6. Returns a single event representing completion of all kernels
Figure 5: Workflow of the proposed clEnqueueNDRangeHyperKernel API
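Step 2 of the workflow (dependency-DAG analysis) can be sketched on the host side. The fragment below (plain C, with a simplified dep_t stand-in for the proposed cl_kernel_dependency struct) assigns each kernel a wave number via longest-path level assignment; kernels sharing a wave are mutually independent and can be distributed round-robin across the H = 32 queues:

```c
typedef struct {
    unsigned kernel_index;      /* dependent kernel      */
    unsigned depends_on_index;  /* kernel it must follow */
} dep_t;

/* Assign each kernel a wave number: wave[k] = 1 + max(wave of its deps).
 * Assumes the dependencies form a DAG; iterates relaxation to a fixed point.
 * Returns the total number of waves. */
static int assign_waves(int num_kernels, const dep_t* deps, int num_deps, int* wave)
{
    for (int k = 0; k < num_kernels; k++)
        wave[k] = 0;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int d = 0; d < num_deps; d++) {
            int need = wave[deps[d].depends_on_index] + 1;
            if (wave[deps[d].kernel_index] < need) {
                wave[deps[d].kernel_index] = need;
                changed = 1;
            }
        }
    }

    int max_wave = 0;
    for (int k = 0; k < num_kernels; k++)
        if (wave[k] > max_wave)
            max_wave = wave[k];
    return max_wave + 1;
}
```

For Strassen at depth 1 the seven product kernels land in wave 0 and the four recomposition kernels (which consume the P matrices) in wave 1, matching the dependency DAG of Figure 3.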

8. Experimental Results

We evaluated our approach on a Tesla K20 (Kepler GK110) with 2496 CUDA cores and 5 GB GDDR5 memory.

8.1 Performance Results

Configuration         | 2048×2048 | 4096×4096 | 8192×8192
----------------------|-----------|-----------|----------
Sequential (baseline) | 1.00×     | 1.00×     | 1.00×
7 streams (depth 1)   | 1.32×     | 1.38×     | 1.42×
49 streams (depth 2)  | 1.28×     | 1.42×     | 1.51×
Theoretical max       | 7.00×     | 7.00×     | 7.00×

8.2 Theoretical vs. Practical Results

Our cost model predictions compared against measured execution times:

Configuration         | Theoretical Speedup    | Measured Speedup | Efficiency
----------------------|------------------------|------------------|-----------
7 streams, 2048×2048  | 3.2×                   | 1.32×            | 41%
7 streams, 4096×4096  | 4.8×                   | 1.38×            | 29%
7 streams, 8192×8192  | 5.6×                   | 1.42×            | 25%
49 streams, 8192×8192 | 7.0× (limited by H=32) | 1.51×            | 22%

The gap between theoretical and practical speedup is attributed to:

  • Memory bandwidth saturation: Concurrent kernels compete for limited global memory bandwidth
  • SMX occupancy: Not all SMXs can be fully utilized when kernels have varying resource requirements
  • Launch overhead accumulation: While Hyper-Q reduces serialization, launch costs still accumulate

8.3 Analysis

Key findings:

  • Speedup increases with matrix size due to better compute/launch ratio
  • Depth 2 outperforms depth 1 for large matrices despite exceeding 32 queues
  • Diminishing returns beyond depth 2 due to kernel launch overhead
  • OpenCL overhead vs. CUDA is 8–12% for this workload
Figure 6: Performance comparison across matrix sizes and configurations

9. Conclusion

This thesis has addressed the challenge of exploiting multi-level parallelism in GPU applications. Our contributions include:

  1. The K-G-I classification model provides a systematic framework for analyzing parallel execution patterns
  2. The proposed clEnqueueNDRangeHyperKernel API automates multi-stream execution on Hyper-Q hardware
  3. Experimental validation using Strassen's algorithm demonstrates 1.4× speedup through concurrent kernel execution

9.1 Future Work

Several directions warrant further investigation:

  • Automatic parallelism extraction from sequential kernel code
  • Dynamic load balancing across heterogeneous accelerators
  • Integration with emerging GPU architectures beyond Kepler
  • Application to other algorithms with natural multi-level parallelism
Historical Note (2026): Many concepts proposed in this 2013 thesis are now mainstream. CUDA Graphs formalize dependency-aware execution; SYCL 2020 provides hierarchical parallelism; and the K-G-I model maps directly to modern frameworks. See our 15-year retrospective for a comprehensive analysis of how these ideas evolved.

References

  1. Strassen, V. (1969). Gaussian elimination is not optimal. Numerische Mathematik, 13, 354–356.
  2. NVIDIA Corporation. (2012). NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Whitepaper.
  3. Khronos Group. (2012). OpenCL Specification, Version 1.2.
  4. Kirk, D. B., & Hwu, W. W. (2010). Programming Massively Parallel Processors. Morgan Kaufmann.
  5. Volkov, V., & Demmel, J. W. (2008). Benchmarking GPUs to tune dense linear algebra. SC '08: Proceedings of the ACM/IEEE Conference on Supercomputing.
  6. Nandy, S. K., et al. (2009). REDEFINE: Runtime Reconfigurable Polymorphic ASIC. ACM TRETS, 2(2).
  7. Buck, I., et al. (2004). Brook for GPUs: Stream Computing on Graphics Hardware. ACM SIGGRAPH.
  8. Owens, J. D., et al. (2007). A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1), 80–113.
  9. Cohen, H. (1976). On the Overhead of Strassen's Algorithm. SIAM Journal on Computing, 5(3), 403–413.
  10. West, E. A., & Grimshaw, A. S. (1995). Braid: Integrating Task and Data Parallelism. SIGPLAN Symposium on Principles and Practice of Parallel Programming.
  11. Bikshandi, G., et al. (2006). Programming for Parallelism and Locality with Hierarchically Tiled Arrays. PPoPP '06.
  12. Subhlok, J., & Vondran, G. (1997). Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines. SPAA '97.
  13. BenHassen, S., et al. (1996). The Orca Parallel Programming Model. IEEE TPDS, 7(10).
  14. Kambadur, P., et al. (2007). PFunc: Modern Task Parallelism for Modern High Performance Computing. SC '07.
  15. Yang, C. T., et al. (2011). Implementation of Parallel Linear Algebra Functions on Multi-GPU Clusters. IPDPS Workshops.
  16. Vejarano, G., & Gao, Y. (2011). GPU Kernel Optimization for Finite Element Computations. Parallel Computing, 37(8), 366–379.
  17. Elangovan, V. K., et al. (2013). Hybrid Task and Data Parallel Programming Using OmpSs and GPUs. Euro-Par 2013.