Abstract: Graphics rendering remains one of the most compute-intensive and memory-bound applications of GPUs and has been driving their push for performance and energy efficiency since their inception. Early GPU architectures focused only on accelerating graphics rendering and implemented dedicated fixed-function rasterizer hardware to speed up their rendering pipeline. As GPUs have become more programmable and ubiquitous in other application domains such as scientific computing, machine learning, graph analytics, and cryptocurrency, generalizing GPU microarchitectures for area and power efficiency becomes necessary, especially for mobile and IoT devices. In this work, we present Skybox, a full-stack open-source GPU architecture with an integrated software, compiler, hardware, and simulation environment that enables end-to-end GPU research. Using Skybox, we explore the design space of software versus hardware graphics rendering and propose a hybrid micro-architecture that accelerates the state-of-the-art Vulkan graphics API. Skybox also introduces novel compiler and system optimizations to support its unique RISC-V ISA baseline. We evaluated Skybox on high-end Altera and Xilinx FPGAs. We were able to generate and execute a 32-core (512-thread) Skybox graphics processor on an Altera Stratix 10 FPGA, delivering a peak fill rate of 3.7 GPixels/s at 230 MHz. Skybox is the first open-source full-stack GPU software and hardware implementation that supports the Vulkan API.
Abstract: As silicon technology scaling approaches its limits, the semiconductor industry has adopted hardware specialization as a workaround to continue improving processor performance, creating new classes of multi-core processor architectures with specialized execution units and fixed-function hardware. For instance, modern GPUs today have dedicated custom hardware for image processing, 3D graphics, graph analytics, and machine learning acceleration. In recent years, adoption of the RISC-V ISA has increased, and several implementations of GPUs based on the RISC-V ISA that integrate multiple cores were introduced [5] [12] [6]. Extending a RISC-V-based GPU to support custom hardware acceleration while still maintaining compatibility with the RISC-V ISA is not a trivial task. Part of the challenge involves extending the instruction set and register file, but the other part is figuring out how the hardware addition will interface with the existing processor pipeline. In this work, we present a generalized methodology for implementing hardware extensions for multi-core RISC-V-based GPUs. We discuss the various hardware extension architectures on GPUs and propose possible implementations on RISC-V. Our generalized solution addresses both the ISA and microarchitecture changes. We also provide a generalized solution for supporting hardware performance monitoring counters on platforms with multiple custom accelerators onboard. We showcase applications of our methodology with a custom hardware extension implementation on RISC-V-based GPUs.
Abstract: CUDA is one of the most popular choices for GPU programming, but it can only be executed on NVIDIA GPUs. Executing CUDA on non-NVIDIA devices not only benefits the hardware community, but also enables data-parallel computation in heterogeneous systems. To make CUDA programs portable, some researchers have proposed using source-to-source translators to translate CUDA to portable programming languages that can be executed on non-NVIDIA devices. However, most CUDA translators require additional manual modifications to the translated code, which imposes a heavy workload on developers. In this paper, CuPBoP is proposed to execute CUDA on non-NVIDIA devices without relying on any portable programming language. Compared with existing work that executes CUDA on non-NVIDIA devices, CuPBoP does not require manual modification of the CUDA source code, yet it still achieves the highest coverage (69.6%), much higher than existing frameworks (56.6%), on the Rodinia benchmark suite. In particular, for CPU backends, CuPBoP supports several ISAs (e.g., x86, RISC-V, AArch64) and achieves comparable or even higher performance than other projects. We also compare and analyze the performance of CuPBoP, manually optimized OpenMP/MPI programs, and CUDA programs on the latest Ampere-architecture GPU, and show future directions for supporting CUDA programs on non-NVIDIA devices with high performance.
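The core idea behind frameworks like CuPBoP is mapping the CUDA thread hierarchy onto CPU execution. A minimal sketch of that mapping, in Python for illustration only (the function names and the flattened loop structure are assumptions, not CuPBoP's actual code):

```python
# Hypothetical sketch: executing a CUDA-style kernel on a CPU by iterating
# over the grid and block dimensions, replacing the implicit GPU thread
# indices (blockIdx, threadIdx) with explicit loop variables.

def vec_add(a, b, c, n, blockIdx, blockDim, threadIdx):
    # Body of a CUDA vector-add kernel, written as a plain function.
    i = blockIdx * blockDim + threadIdx
    if i < n:
        c[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    # A translator wraps the kernel body in loops over the grid and block;
    # each iteration plays the role of one GPU thread.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(*args, blockIdx=b, blockDim=block_dim, threadIdx=t)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [0] * 4
launch(vec_add, 2, 2, a, b, c, 4)  # c becomes [11, 22, 33, 44]
```

In practice, thread blocks would be distributed across OS threads and the inner loop vectorized, but the loop transformation above is the essence of running grid-style parallelism on a CPU.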
Abstract: As CUDA has become the de facto programming model for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms has been a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to the extra translation steps involved, their support always lags a few years behind CUDA's latest features. Examples include DPC++ and Hipify, where CUDA source code has to be translated into their natively supported languages before it can be executed. In particular, the new CUDA programming model exposes the warp concept in the programming language, which greatly changes the way CUDA code should be mapped to CPU programs. In this paper, hierarchical collapsing, which *correctly* supports CUDA warp-level functions on CPUs, is proposed. Based on hierarchical collapsing, a framework, COX, is developed that allows CUDA programs with the latest features to be executed efficiently on CPU platforms. COX consists of a compiler IR transformation (a new LLVM pass) and a runtime system to execute the transformed programs on CPU devices. COX supports the most recent CUDA features, and its application coverage is much higher (90%) than that of previous frameworks (68%), with comparable performance. We also show that warp-level functions in CUDA can be executed efficiently by utilizing CPU SIMD (AVX) instructions.
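To see why warp-level functions are the hard case, consider a warp-wide reduction built from shuffles, as CUDA kernels do with `__shfl_down_sync`. A hedged Python sketch (the emulation strategy and names are illustrative assumptions; COX implements this as an LLVM IR transformation, and a real CUDA warp has 32 lanes):

```python
WARP_SIZE = 4  # illustrative only; real CUDA warps have 32 lanes

def shfl_down(vals, delta):
    # Emulates a warp-wide __shfl_down_sync: lane i reads the value held
    # by lane i + delta; out-of-range lanes keep their own value.
    return [vals[i + delta] if i + delta < len(vals) else vals[i]
            for i in range(len(vals))]

def warp_reduce_sum(vals):
    # Classic shuffle-based reduction: after log2(WARP_SIZE) steps,
    # lane 0 holds the sum of all lanes in the warp.
    delta = WARP_SIZE // 2
    while delta > 0:
        shifted = shfl_down(vals, delta)
        vals = [v + s for v, s in zip(vals, shifted)]
        delta //= 2
    return vals[0]

print(warp_reduce_sum([1, 2, 3, 4]))  # prints 10
```

The key point is that all lanes of the warp must advance in lockstep through each shuffle step, which is why a naive one-thread-per-CUDA-thread CPU mapping breaks, and why the whole-warp list operations above (vectorizable with AVX) are the natural CPU encoding.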
Abstract: The importance of open-source hardware and software has been increasing. However, despite GPUs being one of the more popular accelerators across various applications, there is very little open-source GPU infrastructure in the public domain. We argue that one of the reasons for the lack of open-source infrastructure for GPUs is rooted in the complexity of their ISA and software stacks. In this work, we first propose an ISA extension to RISC-V that supports GPGPUs and graphics. The main goal of the ISA extension proposal is to minimize the ISA changes so that the corresponding changes to the open-source ecosystem are also minimal, which makes for a sustainable development ecosystem. To demonstrate the feasibility of the minimally extended RISC-V ISA, we implemented the complete software and hardware stacks of Vortex on FPGA. Vortex is a PCIe-based soft GPU that supports OpenCL and OpenGL. Vortex can be used in a variety of applications, including machine learning, graph analytics, and graphics rendering. Vortex can scale up to 32 cores on an Altera Stratix 10 FPGA, delivering a peak performance of 25.6 GFlops at 200 MHz.
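The main thing a SIMT extension to a scalar ISA like RISC-V must add is control-flow divergence handling, typically split/join instructions that manage a stack of thread masks. A simplified Python model of that mechanism (the function names and single-level stack are illustrative assumptions, not the exact Vortex semantics):

```python
# Hypothetical sketch of SIMT divergence handling: on a divergent branch,
# 'split' computes taken/not-taken thread masks and pushes the deferred
# path; 'join' pops it, runs the other side, and reconverges the warp.

def run_divergent(cond, then_fn, else_fn, tmask):
    # tmask: which lanes of the warp are currently active.
    then_mask = [m and c for m, c in zip(tmask, cond)]
    else_mask = [m and not c for m, c in zip(tmask, cond)]
    stack = [else_mask]              # split: save the deferred path's mask
    results = [None] * len(tmask)
    for i, active in enumerate(then_mask):
        if active:                   # lanes masked off simply idle
            results[i] = then_fn(i)
    deferred = stack.pop()           # join: switch to the other path
    for i, active in enumerate(deferred):
        if active:
            results[i] = else_fn(i)
    return results                   # warp reconverged under tmask
```

Real hardware nests these stacks for arbitrary branch nesting, but the mask push/pop discipline above is the core of what the ISA extension exposes to the compiler.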
Abstract: The recent generation of high-end FPGAs provides high computational capability and energy efficiency, which, together with their reconfigurability, makes them attractive for high-performance computing and architecture research. Exploring the implementation of a high-bandwidth accelerator on these large FPGAs presents a unique challenge: how to best utilize the available variety of computing resources to achieve maximum performance. In this work, we implemented Vortex, a full-scale PCIe-based GPGPU accelerator, on modern high-end Intel FPGAs. Vortex implements the RISC-V ISA with an extension to support the Single-Instruction Multiple-Thread (SIMT) execution model. Vortex implements a multi-core architecture with high-bandwidth, fully pipelined non-blocking caches and a scratchpad shared memory to achieve maximum throughput. We leveraged the hardened integer arithmetic and floating-point DSPs available on the FPGA to achieve maximum computational efficiency. The Vortex platform is highly customizable and scalable, with a complete open-source compiler, driver, and runtime software stack with OpenCL support to enable research in GPU architectures. We managed to fit a 16-core processor configuration with high-bandwidth caches on an Intel Arria 10 FPGA, clocking at 203-234 MHz, making Vortex a practical framework for GPU hardware research.
Abstract: The importance of open-source hardware has been increasing in recent years with the introduction of the RISC-V open ISA. This has also accelerated the push for support of the open-source software stack, from compiler tools to full-blown operating systems. Parallel computing with today's Application Programming Interfaces such as OpenCL has proven to be effective at leveraging the parallelism in commodity multi-core processors and programmable parallel accelerators. However, to the best of our knowledge, there is currently no publicly available implementation of OpenCL targeting commodity RISC-V processors that is accessible to the open-source community. Besides opening RISC-V to the existing rich variety of scientific parallel applications, OpenCL also provides access to a unique genre of benchmarks useful in computer architecture research. In this work, we extended an open-source implementation of OpenCL to target RISC-V CPUs. Our work covers not only commodity multi-core RISC-V processors, but also a plethora of low-profile embedded RISC-V CPUs that often do not support atomic instructions or multi-threading.
[video](https://carrv.github.io/2021/videos/CARRV2021_full_36_Blaise.html)
[slides](https://carrv.github.io/2021/slides/CARRV2021_slides_36_Blaise.pdf)
Abstract: With the rapid development of scientific computation, more and more researchers and developers are committed to implementing various workloads/operations on different devices. Among all these devices, the NVIDIA GPU is the most popular choice due to its comprehensive documentation and excellent development tools. As a result, there are abundant resources of hand-written high-performance CUDA code. However, CUDA is mainly supported only by commercial products, and there has been no support for open-source hardware platforms. RISC-V is the most popular open-source hardware ISA, thanks to its elegant design and open-source license. In this project, we aim to utilize this existing CUDA code on RISC-V devices. More specifically, we design and implement a pipeline that can execute CUDA source code on a RISC-V GPU architecture. We have succeeded in executing CUDA kernels with several important features, such as multi-threading and atomic instructions, on a RISC-V GPU architecture.
[video](https://carrv.github.io/2021/videos/CARRV2021_full_66_Han.html)
[slides](https://carrv.github.io/2021/slides/CARRV2021_slides_66_Han.pdf)
Abstract: AES and the SHA family are popular cryptographic algorithms for symmetric encryption and hashing, respectively. Highly parallel use cases for calling both AES and SHA exist, making hardware-accelerated execution on a GPGPU appealing. We extend an existing GPGPU with a cryptography execute unit that accelerates key elements of these algorithms. We implement a subset of the RISC-V cryptography extensions draft specification on the Vortex GPGPU, seeing 1.6× speedup for SHA-256 and 6.6× speedup for AES-256 on average over pure software implementations on Vortex.
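To illustrate the kind of operation such an extension hardens, the RISC-V scalar crypto draft defines instructions like `sha256sig0`, which computes the SHA-256 σ0 function in one instruction instead of several shifts, rotates, and XORs in software. A reference sketch in Python (the helper names are ours; the bit-level definition is the standard SHA-256 σ0):

```python
def ror32(x, n):
    # 32-bit rotate right
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def sha256sig0(x):
    # SHA-256 sigma0: ROR(x,7) XOR ROR(x,18) XOR SHR(x,3).
    # In software this costs ~5 ALU ops per call; the crypto extension
    # exposes it as a single instruction on the execute unit.
    return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3)

print(hex(sha256sig0(1)))  # prints 0x2004000
```

Because SHA-256's message schedule calls σ0 and σ1 on every one of 48 expansion steps per block, collapsing each into one instruction is where much of the reported speedup comes from.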
[video](https://carrv.github.io/2021/videos/CARRV2021_full_87_Adas.html)
[slides](https://carrv.github.io/2021/slides/CARRV2021_slides_87_Adams.pdf)
Abstract: The current challenges in technology scaling are pushing the semiconductor industry towards hardware specialization, creating a proliferation of heterogeneous systems-on-chip that deliver orders-of-magnitude performance and power benefits compared to traditional general-purpose architectures. This transition is getting a significant boost with the advent of RISC-V, with its unique modular and extensible ISA allowing a wide range of low-cost processor designs for various target applications. In addition, OpenCL is currently the most widely adopted programming framework for heterogeneous platforms, available on mainstream CPUs and GPUs as well as FPGAs and custom DSPs. In this work, we present Vortex, a RISC-V general-purpose GPU that supports OpenCL. Vortex implements a SIMT architecture with a minimal ISA extension to RISC-V that enables the execution of OpenCL programs. We also extended the OpenCL runtime framework to use the new ISA. We evaluate this design using a 15 nm technology node. We also report performance and energy numbers for a subset of benchmarks from the Rodinia benchmark suite.
Abstract: The emergence of data-parallel architectures has enabled new opportunities to address the power limitations and scalability of multi-core processors, allowing new ways to exploit the abundant data parallelism present in emerging big-data and machine learning applications. This transition is getting a significant boost with the advent of RISC-V, with its unique modular and extensible ISA allowing a wide range of low-cost processor designs. In this work, we present Vortex, a full-stack RISC-V GPGPU processor with OpenCL support. The Vortex platform is highly customizable and scalable, with a complete open-source compiler, driver, and runtime software stack to enable research in GPU architectures. We evaluated this design using a 15 nm technology node. We also present preliminary performance and energy numbers for a subset of benchmarks from the Rodinia benchmark suite.
Abstract: The open-source RISC-V instruction set architecture (ISA) has enabled computer architects to propose several innovative processors for a wide range of applications. One domain of processor design that can take advantage of RISC-V, but has not seen enough attention, is general-purpose GPU (GPGPU) design. To support the development of open-source GPGPU systems, we present Vortex, our solution for building single-instruction, multiple-thread (SIMT) execution using RISC-V. In addition to a synthesizable microarchitecture model, we propose a GPU ISA extension to RISC-V and a software model, in the form of a runtime kernel, which makes Vortex practical and easy to integrate. As a result, Vortex does not require any modifications to the current RISC-V compilers.