PGI CUDA C x86 Compiler


The NVIDIA CUDA architecture was developed to enable offloading computationally intensive kernels to massively parallel GPUs. Through API function calls and language extensions, CUDA gives developers explicit control over the mapping of general-purpose computational kernels to GPUs, as well as the placement and movement of data between an x86 processor and the GPU. First introduced in 2007, CUDA is the most popular GPGPU parallel programming model.

The PGI CUDA C compiler for x86 platforms will allow developers to compile and optimize their CUDA applications to run on x86-based workstations, servers and clusters, with or without an NVIDIA GPU accelerator. When run on x86-based systems without a GPU, PGI CUDA C applications will use the multiple cores and streaming SIMD (Single Instruction Multiple Data) capabilities of Intel and AMD CPUs for parallel execution.

PGI CUDA C for Multi-core x86


The PGI CUDA C compiler will implement the current NVIDIA CUDA C language for GPUs, and it will closely track the evolution of CUDA C going forward. The PGI CUDA C for x86 implementation will proceed in phases:

  1. Prototype demonstration at SC10 in New Orleans (November 2010)
  2. First production release in Q2 2011 with most CUDA C functionality; this will not be a performance release
  3. Performance release in Q4 2011 leveraging multi-core and SSE/AVX to implement low-overhead native parallel/SIMD execution

Longer term, the PGI CUDA C for x86 compiler will support execution of device kernels on NVIDIA CUDA-enabled GPUs. In addition, PGI Unified Binary technology will enable developers to build a single binary that uses NVIDIA GPUs when present and defaults to multi-core x86 execution when no GPU is available.

Implementation Overview

The PGI CUDA C for x86 compiler processes CUDA C as a native parallel programming language for multi-core x86, including:

  • Inlining device kernel functions
  • Translating chevron syntax to parallel/vector loops
  • Using multiple cores and SSE/AVX instructions

At run time, CUDA C programs compiled for x86 will execute each CUDA thread block on a single host core, eliminating synchronization where possible. CUDA host code will benefit from all PGI optimizations for Intel/AMD processors. PGI believes that well-structured CUDA C programs compiled for multi-core x86 can approach the efficiency and performance of the same algorithm written in OpenMP.