Publications

Analysis of the RISC-V Vector Extension for Vulkan Graphics Kernels

Authors: Martin Troiber, Blaise Tine, Martin Schulz, Hyesoon Kim
ISPASS poster (2025)

SoftCUDA: Running CUDA on Softcore GPU

Authors: Chihyo Ahn, Ruobing Han, Udit Subramanya, Jisheng Zhao, Hyesoon Kim
FCCM (2025)

SparseWeaver: Converting Sparse Operations as Dense Operations on GPUs for Graph Workloads

Authors: Shinnung Jeong, Liam Paul Cooper, Ju Min Lee, Heelim Choi, Nicholas Parnenzini, Chihyo Ahn, Yongwoo Lee, Hanjun Kim, Hyesoon Kim
HPCA (2025)

Abstract: Thanks to their scalable parallel processing capability, GPUs are promising computing resources for graph processing, in which identical operations are applied to a large number of edges and vertices. However, the sparsity and skewness of real-world graphs cause imbalanced workloads across GPU threads within the same warp, thus impeding efficient processing on the GPU. To mitigate this workload imbalance problem, existing works propose workload balancing hardware and software schemes. However, these solutions often suffer from additional memory overhead or increased computations and communication overheads during inter-warp and intra-warp synchronization. This work proposes a new hardware-software collaborative graph processing framework, SparseWeaver, that converts sparse operations in graph processing into dense operations using graph topology and makes the workloads balanced across GPU threads. Based on the analysis of common patterns in software schemes, we propose Weaver, a new lightweight GPU functional unit microarchitecture that fully leverages the benefits of the GPU architecture and exploits memory access locality. We prototype SparseWeaver on the open-source RISC-V Vortex GPU and demonstrate 2.36 times faster execution time compared to state-of-the-art schemes while incurring a low area overhead of 0.045% from increased dedicated logic registers.

See Publication

Comparative Analysis of Executing GPU Applications on FPGA: HLS vs. Soft GPU Approaches

Authors: Chihyo Ahn, Shinnung Jeong, Liam Cooper, Nicholas Parnenzini, Hyesoon Kim
IPDPS CGRA4HPC Workshop (2024)

Abstract: With the development of the GPU, parallel languages are widely used for developing modern parallel applications. Given its low energy cost and programmable hardware, the FPGA emerges as a promising candidate to run GPU applications. Therefore, executing applications described in GPU programming languages on FPGA can offer new opportunities in terms of performance and energy efficiency. However, the gap between GPU programming languages and hardware description languages (HDL) poses a significant challenge for this transition. To overcome this problem, existing works have attempted to bridge this gap through high-level synthesis (HLS) or soft GPU. In this paper, we examine how HLS and soft GPU compile GPU languages for FPGA by discussing the detailed compilation and execution flow of two representative works: Intel FPGA SDK for OpenCL and Vortex. This paper also evaluates the coverage of both approaches and discusses methods for addressing the challenges each approach faces. Consequently, this paper explores the challenges HLS and soft GPU encounter, aiming to identify new problems and opportunities each approach introduces.

See Publication

Skybox: Open-Source Graphic Rendering on Programmable RISC-V GPUs

Authors: Blaise Tine, Varun Saxena, Santosh Srivatsan, Joshua R. Simpson, Fadi Alzammar, Liam Cooper, and Hyesoon Kim
ASPLOS (2023)

Abstract: Graphics rendering remains one of the most compute-intensive and memory-bound applications of GPUs and has been driving their push for performance and energy efficiency since its inception. Early GPU architectures focused only on accelerating graphics rendering and implemented dedicated fixed-function rasterizer hardware to speed up their rendering pipeline. As GPUs have become more programmable and ubiquitous in other application domains such as scientific computing, machine learning, graph analytics, and cryptocurrency, generalizing GPU microarchitectures for area and power efficiency becomes necessary, especially for mobile and IoT devices. In this work, we present Skybox, a full-stack open-source GPU architecture with integrated software, compiler, hardware, and simulation environment, that enables end-to-end GPU research. Using Skybox, we explore the design space of software versus hardware graphics rendering and propose a hybrid microarchitecture that accelerates the state-of-the-art Vulkan graphics API. Skybox also introduces novel compiler and system optimizations to support its unique RISC-V ISA baseline. We evaluated Skybox on high-end Altera and Xilinx FPGAs. We were able to generate and execute a 32-core (512-thread) Skybox graphics processor on an Altera Stratix 10 FPGA, delivering a peak fill rate of 3.7 GPixels at 230 MHz. Skybox is the first open-source full-stack GPU software and hardware implementation that supports the Vulkan API.

See Publication

The Tip of Iceberg in Open-Source Hardware GPU

Authors: Blaise Tine, Ruobing Han and Hyesoon Kim
OSCAR (2022)

A workshop presentation.

See Publication

Implementing Hardware Extensions for Multicore RISC-V GPUs

Authors: Blaise Tine, Hyesoon Kim
CARRV (2022)

Abstract: As silicon technology scaling is approaching its limits, the semiconductor industry has adopted hardware specialization as a workaround to continue improving the performance of processors, creating new brands of multi-core processor architectures with specialized execution units and fixed-function hardware. For instance, modern GPUs today have dedicated custom hardware for image processing, 3D graphics, graph analytics, and machine learning acceleration. In recent years, the RISC-V ISA adoption has increased, and several implementations of GPUs based on the RISC-V ISA were introduced that integrate multiple cores [5] [12] [6]. Extending a RISC-V-based GPU to support custom hardware acceleration while still maintaining compatibility with the RISC-V ISA is not a trivial task. Part of the challenge involves extending the instruction set and register file, but the other part is to figure out how the hardware addition will interface with the existing processor pipeline. In this work, we present a generalized methodology for implementing hardware extensions for multi-core RISC-V-based GPUs. We discuss the various hardware extension architectures on GPUs and propose possible implementations on RISC-V. Our generalized solution addresses both the ISA and microarchitecture changes. We also provide a generalized solution for supporting hardware performance monitoring counters for platforms with multiple custom accelerators onboard. We showcase some applications of our methodology with a custom hardware extension implementation on RISC-V-based GPUs.

See Publication

CuPBoP: CUDA for Parallelized and Broad-range Processors

Authors: Ruobing Han, Jun Chen, Bhanu Garg, Jeffrey Young, Jaewoong Sim, Hyesoon Kim
arXiv preprint (2022)

Abstract: CUDA is one of the most popular choices for GPU programming, but it can only be executed on NVIDIA GPUs. Executing CUDA on non-NVIDIA devices not only benefits the hardware community, but also allows data-parallel computation in heterogeneous systems. To make CUDA programs portable, some researchers have proposed using source-to-source translators to translate CUDA to portable programming languages that can be executed on non-NVIDIA devices. However, most CUDA translators require additional manual modifications on the translated code, which imposes a heavy workload on developers. In this paper, CuPBoP is proposed to execute CUDA on non-NVIDIA devices without relying on any portable programming languages. Compared with existing work that executes CUDA on non-NVIDIA devices, CuPBoP does not require manual modification of the CUDA source code, but it still achieves the highest coverage (69.6%), much higher than existing frameworks (56.6%) on the Rodinia benchmark. In particular, for CPU backends, CuPBoP supports several ISAs (e.g., X86, RISC-V, AArch64) and has close or even higher performance compared with other projects. We also compare and analyze the performance among CuPBoP, manually optimized OpenMP/MPI programs, and CUDA programs on the latest Ampere architecture GPU, and show future directions for supporting CUDA programs on non-NVIDIA devices with high performance.

See Publication

COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs

Authors: Ruobing Han, Jaewon Lee, Jaewoong Sim, Hyesoon Kim
arXiv preprint (2021)

Abstract: As CUDA programs become the de facto program among data parallel applications such as high-performance computing or machine learning applications, running CUDA on other platforms has been a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, the extra steps in the translation mean that support always lags a few years behind CUDA's latest features. Examples are DPC and Hipfy, where CUDA source code has to be translated to the tool's natively supported language before it can be supported. In particular, the new CUDA programming model exposes the warp concept in the programming language, which greatly changes the way the CUDA code should be mapped to CPU programs. In this paper, hierarchical collapsing that correctly supports CUDA warp-level functions on CPUs is proposed. Based on hierarchical collapsing, a framework, COX, is developed that allows CUDA programs with the latest features to be executed efficiently on CPU platforms. COX consists of a compiler IR transformation (new LLVM pass) and a runtime system to execute the transformed programs on CPU devices. COX can support the most recent CUDA features, and the application coverage is much higher (90%) than for previous frameworks (68%) with comparable performance. We also show that the warp-level functions in CUDA can be efficiently executed by utilizing CPU SIMD (AVX) instructions.

See Publication

Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics Research

Authors: Blaise Tine, Fares Elsabbagh, Krishna Yalamarthy, Hyesoon Kim
MICRO (2021)

Abstract: The importance of open-source hardware and software has been increasing. However, despite GPUs being one of the more popular accelerators across various applications, there is very little open-source GPU infrastructure in the public domain. We argue that one of the reasons for the lack of open-source infrastructure for GPUs is rooted in the complexity of their ISA and software stacks. In this work, we first propose an ISA extension to RISC-V that supports GPGPUs and graphics. The main goal of the ISA extension proposal is to minimize the ISA changes so that the corresponding changes to the open-source ecosystem are also minimal, which makes for a sustainable development ecosystem. To demonstrate the feasibility of the minimally extended RISC-V ISA, we implemented the complete software and hardware stacks of Vortex on FPGA. Vortex is a PCIe-based soft GPU that supports OpenCL and OpenGL. Vortex can be used in a variety of applications, including machine learning, graph analytics, and graphics rendering. Vortex can scale up to 32 cores on an Altera Stratix 10 FPGA, delivering a peak performance of 25.6 GFlops at 200 MHz.

See Publication

A Scalable Multicore RISC-V GPGPU Accelerator for High-End FPGAs

Authors: Blaise Tine, Fares Elsabbagh, Apurve Chawda, Will Gulian, Yaotian Feng, Da Eun Shim, Priyadarshini Roshan, Ethan Lyons, Lingjun Zhu, Sung Kyu Lim, Seyong Lee, Jeff Vetter, Hyesoon Kim
DAC (2021)

Abstract: The recent generation of high-end FPGAs provides high computational capability and energy efficiency, which, together with their reconfigurability, makes them attractive for high-performance computation and architecture research. Exploring the implementation of a high-bandwidth accelerator on these large FPGAs presents a unique challenge in how to best utilize the available variety of computing resources to achieve the maximum performance. In this work, we implemented Vortex, a full-scale PCIe-based GPGPU accelerator on modern high-end Intel FPGAs. Vortex implements the RISC-V ISA with an extension to support the Single-Instruction, Multiple-Thread (SIMT) execution model. Vortex implements a multi-core architecture with high-bandwidth, fully pipelined non-blocking caches and a scratchpad shared memory to achieve maximum throughput. We leveraged the hardened integer arithmetic and floating-point DSPs available on the FPGA to achieve maximum computational efficiency. The Vortex platform is highly customizable and scalable with a complete open-source compiler, driver, and runtime software stack with OpenCL support to enable research in GPU architectures. We managed to fit a 16-core processor configuration with high-bandwidth caches on an Intel Arria 10 FPGA, clocking at 203-234 MHz, making Vortex a practical framework for GPU hardware research.

Bringing OpenCL to Commodity RISC-V CPUs

Authors: Blaise Tine, Seyong Lee, Jeff Vetter, Hyesoon Kim
Fifth Workshop on Computer Architecture Research with RISC-V (2021)

Abstract: The importance of open-source hardware has been increasing in recent years with the introduction of the RISC-V open ISA. This has also accelerated the push for support of the open-source software stack from compiler tools to full-blown operating systems. Parallel computing with today’s Application Programming Interfaces such as OpenCL has proven to be effective at leveraging the parallelism in commodity multi-core processors and programmable parallel accelerators. However, to the best of our knowledge, there is currently no publicly available implementation of OpenCL targeting commodity RISC-V processors that is accessible to the open-source community. Besides opening RISC-V to the existing rich variety of scientific parallel applications, OpenCL also provides access to a unique genre of benchmarks useful in computer architecture research. In this work, we extended an open-source implementation of OpenCL to target RISC-V CPUs. Our work not only covers commodity multi-core RISC-V processors, but also a plethora of low-profile embedded RISC-V CPUs that often do not support atomic instructions or multi-threading.

[video](https://carrv.github.io/2021/videos/CARRV2021_full_36_Blaise.html)

[slides](https://carrv.github.io/2021/slides/CARRV2021_slides_36_Blaise.pdf)

See Publication

Supporting CUDA for an extended RISC-V GPU architecture

Authors: Ruobing Han, Blaise Tine, Jaewon Lee, Jaewoong Sim, Hyesoon Kim
Fifth Workshop on Computer Architecture Research with RISC-V (2021)

Abstract: With the rapid development of scientific computation, more and more researchers and developers are committed to implementing various workloads/operations on different devices. Among all these devices, the NVIDIA GPU is the most popular choice due to its comprehensive documentation and excellent development tools. As a result, there are abundant resources for hand-writing high-performance CUDA codes. However, CUDA is supported mainly by commercial products, and there has been no support for open-source H/W platforms. RISC-V is the most popular choice for hardware ISA, thanks to its elegant design and open-source license. In this project, we aim to utilize these existing CUDA codes with RISC-V devices. More specifically, we design and implement a pipeline that can execute CUDA source code on a RISC-V GPU architecture. We have succeeded in executing CUDA kernels with several important features, like multi-thread and atomic instructions, on a RISC-V GPU architecture.

[video](https://carrv.github.io/2021/videos/CARRV2021_full_66_Han.html)

[slides](https://carrv.github.io/2021/slides/CARRV2021_slides_66_Han.pdf)

See Publication

Cryptography Acceleration in a RISC-V GPGPU

Authors: Austin Adams, Pulkit Gupta, Blaise Tine, Hyesoon Kim
Fifth Workshop on Computer Architecture Research with RISC-V (2021)

Abstract: AES and the SHA family are popular cryptographic algorithms for symmetric encryption and hashing, respectively. Highly parallel use cases for calling both AES and SHA exist, making hardware accelerated execution on a GPGPU appealing. We extend an existing GPGPU with a cryptography execute unit that will accelerate key elements of these algorithms. We implement a subset of the RISC-V cryptography extensions draft specification on the Vortex GPGPU, seeing 1.6× speedup for SHA-256 and 6.6× speedup for AES-256 on average over pure software implementations on Vortex.

[video](https://carrv.github.io/2021/videos/CARRV2021_full_87_Adas.html)

[slides](https://carrv.github.io/2021/slides/CARRV2021_slides_87_Adams.pdf)

See Publication

Vortex: OpenCL Compatible RISC-V GPGPU

Authors: Fares Elsabbagh, Blaise Tine, Priyadarshini Roshan, Ethan Lyons, Euna Kim, Da Eun Shim, Lingjun Zhu, Sung Kyu Lim, Hyesoon Kim
arXiv:2002.12151 (2020)

Abstract: The current challenges in technology scaling are pushing the semiconductor industry towards hardware specialization, creating a proliferation of heterogeneous systems-on-chip, delivering orders of magnitude performance and power benefits compared to traditional general-purpose architectures. This transition is getting a significant boost with the advent of RISC-V with its unique modular and extensible ISA, allowing a wide range of low-cost processor designs for various target applications. In addition, OpenCL is currently the most widely adopted programming framework for heterogeneous platforms available on mainstream CPUs, GPUs, as well as FPGAs and custom DSPs. In this work, we present Vortex, a RISC-V General-Purpose GPU that supports OpenCL. Vortex implements a SIMT architecture with a minimal ISA extension to RISC-V that enables the execution of OpenCL programs. We also extended the OpenCL runtime framework to use the new ISA. We evaluate this design using 15nm technology. We also show the performance and energy numbers of running a subset of benchmarks from the Rodinia Benchmark suite.

See Publication

Vortex: An Open Source Reconfigurable RISC-V GPGPU Accelerator for Architecture Research

Authors: Fares Elsabbagh, Blaise Tine, Apurve Chawda, Will Gulian, Yaotian Feng, Da Eun Shim, Priyadarshini Roshan, Ethan Lyons, Lingjun Zhu, Sung Kyu Lim, Hyesoon Kim
Hot Chips 32 (2020)

Abstract: The emergence of data parallel architectures has enabled new opportunities to address the power limitations and scalability of multi-core processors, allowing new ways to exploit the abundant data parallelism present in emerging big-data and machine learning applications. This transition is getting a significant boost with the advent of RISC-V with its unique modular and extensible ISA, allowing a wide range of low-cost processor designs. In this work, we present Vortex, a full-stack RISC-V GPGPU processor with OpenCL support. The Vortex platform is highly customizable and scalable with a complete open-source compiler, driver, and runtime software stack to enable research in GPU architectures. We evaluated this design using 15 nm technology. We also show the preliminary performance and energy numbers of running a subset of benchmarks from the Rodinia Benchmark suite.

See Publication

Vortex RISC-V GPGPU System: Extending the ISA, Synthesizing the Microarchitecture, and Modeling the Software Stack

Authors: Fares Elsabbagh, Bahar Asgari, Hyesoon Kim and Sudhakar Yalamanchili
Third Workshop on Computer Architecture Research with RISC-V (2019)

Abstract: The open-source RISC-V instruction set architecture (ISA) has enabled computer architects to propose several innovative processors for a wide range of applications. One of the domains of processor design that can take advantage of RISC-V, but has not seen enough attention, is general-purpose GPU (GPGPU) design. To support the development of open-source GPGPU systems, we present Vortex, our solution for building single instruction, multiple thread (SIMT) execution using RISC-V. In addition to a synthesizable microarchitecture model, we propose a GPU ISA extension to RISC-V and a software model, in the form of a runtime kernel, which makes Vortex practical and easy to integrate. As a result, Vortex does not require any modifications to the current RISC-V compilers.

See Publication