SimX Implementation of the Cache in Vortex and a HW Prefetcher Design

Author: Vora Mihir Ketan (undergraduate intern at BITS Pilani)

I have been working in the Vortex group as a remote research intern at the Georgia Institute of Technology. This blog post describes my work during the internship. It explains the current cache design for Vortex and the SimX implementation of the cache. It also covers a prefetcher design and my progress on its implementation.
Current Cache Design
The current Vortex memory hierarchy is as follows: each core has a private data cache (L1 cache), all the cores in a cluster share an L2 cache, and all the clusters together share an L3 cache. DRAM sits behind the L3 cache.
A prefetcher can be implemented at any of these levels of the memory hierarchy. This blog post covers the implementation of a hardware prefetcher in the data cache. The cache is divided into multiple banks that process memory requests in parallel, so multiple requests can be serviced at a given time. Each bank has an arbiter, a tag RAM, a data RAM, and an MSHR (Miss Status Holding Register). Incoming memory requests from the core are routed to the appropriate bank based on the requested memory address, sent through the arbiter, and checked for a hit. On a hit, the requested data is sent back as a core response. On a miss, a DRAM request is sent out while an entry for it is recorded in the MSHR. Once the DRAM response for an MSHR entry arrives, the entry is evicted and the data is sent as a core response.
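To make the bank routing concrete, here is a minimal sketch of how a request address can select a bank, set, and tag. The field widths (64-byte lines, 4 banks, 32 sets) and helper names are illustrative assumptions, not the actual Vortex parameters.

```cpp
#include <cstdint>

// Hypothetical address split: | tag | set | bank | line offset |
// With a 64-byte line and 4 banks, the bank index comes from the bits
// just above the line offset, so consecutive cache lines map to
// different banks and can be serviced in parallel.
constexpr uint32_t LINE_BITS = 6;   // 64-byte cache line
constexpr uint32_t BANK_BITS = 2;   // 4 banks
constexpr uint32_t SET_BITS  = 5;   // 32 sets per bank

inline uint32_t bank_id(uint64_t addr) {
    return (addr >> LINE_BITS) & ((1u << BANK_BITS) - 1);
}

inline uint32_t set_id(uint64_t addr) {
    return (addr >> (LINE_BITS + BANK_BITS)) & ((1u << SET_BITS) - 1);
}

inline uint64_t tag_of(uint64_t addr) {
    return addr >> (LINE_BITS + BANK_BITS + SET_BITS);
}
```

With this split, four back-to-back cache lines land on four different banks, which is exactly what lets the banks work on several core requests at once.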
Current Cache Implementation
The most basic struct in the implementation is block_t, consisting of the fields valid, dirty, tag, and an LRU counter. set_t is implemented as a vector of blocks. The struct bank_req_t holds the fields used to create a new memory access request at a bank, and an MSHR entry is made up of a block id plus the bank_req_t fields. The MSHR has been implemented as a separate class with five functions:
1. Lookup – find an existing valid MSHR entry corresponding to a particular memory request.
2. Allocate – allocate a new MSHR entry for a memory request.
3. Replay – mark a valid entry for replay.
4. Pop – evict a valid entry marked for replay.
5. Clear – invalidate all MSHR entries.
The bank_t struct consists of a set_t and an MSHR. Memory requests are created by the tick function inside the Cache class, which executes once every clock cycle. A vector named pipeline_req, of type bank_req_t and sized to the number of banks, is created. First, the MSHR entries marked for replay are popped. Next, the memory response port is checked and any fill requests are processed; once a fill request is processed, the corresponding MSHR entry is marked for replay. After these operations complete, new core requests are checked for and a new memory request is created using the fields in bank_req_t. Then a check ensures that no pending action for a memory request already occupies pipeline_req at that bank_id. If the slot is free, the new request is inserted into the pipeline and the bank request is processed by the function processBankRequest.
In processBankRequest, the pipeline_req is first checked for an MSHR replay. If the MSHR replay flag is set, the fill request for this memory location has already been processed, so a core response can be sent directly. Otherwise the cache is checked for a hit or miss. On a read miss, a DRAM request is initiated for the memory location and an MSHR entry is recorded for it using the lookup and allocate functions.
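The decision order above can be condensed into a small sketch. The enum and function names are hypothetical; the point is only the priority of the three outcomes.

```cpp
// Simplified outcome of processing one bank request: either the core
// gets its data now, or a DRAM fill is issued and tracked in the MSHR.
enum class BankAction {
    CoreResponse,   // MSHR replay or cache hit: respond to the core
    DramFill        // read miss: lookup/allocate an MSHR entry, fetch line
};

inline BankAction process_bank_request(bool mshr_replay, bool cache_hit) {
    if (mshr_replay)
        return BankAction::CoreResponse;  // fill already processed
    if (cache_hit)
        return BankAction::CoreResponse;  // data is in the bank's data RAM
    return BankAction::DramFill;          // miss: go to DRAM
}
```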
Prefetcher Design
A stride prefetcher has been implemented in the data cache. It can detect a single sequence with a constant stride, and it prefetches once that stride has been maintained for three or more memory requests. The flowchart shows the prefetcher attached to one of the banks; unlike the tag access or the MSHR, all the banks share one common prefetcher. The prefetch buffer can store one prefetch request per bank, so if a prefetch request is already pending at a bank, no new prefetch request is created for it. Likewise, if a bank's MSHR is already full handling core requests, it is not burdened further and no prefetch request is created for that bank. On every cache read miss, the possibility of a prefetch is checked: if a constant stride has been maintained for three or more memory requests and the conditions above are satisfied, a prefetch request is created and added to the prefetch buffer. The prefetch buffer issues a prefetch request for a bank only when that bank has no core request. This way core requests take priority over prefetch requests, and performance is not significantly hampered even in the case of a false prefetch.
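The stride-detection rule can be sketched as a small class. The names and the exact counting convention are illustrative assumptions, not the actual SimX code; here a prefetch fires once the same nonzero stride has been observed three times in a row.

```cpp
#include <cstdint>

// Minimal single-sequence stride detector: tracks the last miss address
// and how many consecutive misses kept the same stride.
class StrideDetector {
    uint64_t last_addr_ = 0;
    int64_t  stride_    = 0;
    uint32_t count_     = 0;
    bool     first_     = true;
public:
    // Called on every cache read miss. Returns true and writes the
    // predicted next address once the stride is confirmed (>= 3 times).
    bool update(uint64_t addr, uint64_t* prefetch_addr) {
        if (first_) {
            first_ = false;
            last_addr_ = addr;
            return false;
        }
        int64_t s = static_cast<int64_t>(addr) -
                    static_cast<int64_t>(last_addr_);
        last_addr_ = addr;
        if (s != 0 && s == stride_) {
            ++count_;                 // stride repeated
        } else {
            stride_ = s;              // new candidate stride
            count_  = 1;
        }
        if (count_ >= 3) {
            *prefetch_addr = addr + stride_;
            return true;
        }
        return false;
    }
};
```

For example, misses at 0, 64, 128, 192 confirm a stride of 64 on the fourth miss and predict 256 as the prefetch target.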
Prefetcher Implementation
The prefetcher has been implemented as a separate class. A new field called prefetch has been added to bank_req_t: it is 1 if the memory request being created is a prefetch request, and 0 otherwise. The prefetcher->update function is called on every cache read miss. This function checks the stride and the stride count and, if the conditions are satisfied, creates a prefetch request using the fields in bank_req_t with prefetch set to 1. The prefetch buffer has been implemented as a vector of bank_req_t, similar to pipeline_req.
A created prefetch request for a bank is added to the buffer only after confirming that the bank's MSHR is not full and that no prefetch request is already pending for that bank.
While processing bank requests, it is checked whether the bank has no valid core request but does have a valid prefetch request. If so, the invalid pipeline_req entry at that bank is replaced by the valid prefetch request, and the prefetch request is marked invalid to make room for the next one. The conditions for sending core responses have also been modified to ensure that a core response is never sent for a prefetch request.
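The two gating rules from the last two paragraphs can be sketched together. The names are hypothetical: a prefetch is buffered only if the bank's MSHR has room and its buffer slot is free, and a buffered prefetch enters the bank's pipeline slot only when no core request occupies it, so core traffic always wins.

```cpp
#include <cstdint>

// Illustrative request record with the new prefetch flag.
struct bank_req_t {
    bool     valid    = false;
    bool     prefetch = false;  // 1 for prefetch requests, 0 for core
    uint64_t addr     = 0;
};

// Buffer a new prefetch request, dropping it if the bank is busy.
inline bool try_buffer_prefetch(bank_req_t& buffer_slot, bool mshr_full,
                                uint64_t prefetch_addr) {
    if (mshr_full || buffer_slot.valid)
        return false;               // MSHR full or prefetch already pending
    buffer_slot.valid    = true;
    buffer_slot.prefetch = true;
    buffer_slot.addr     = prefetch_addr;
    return true;
}

// Move a buffered prefetch into an empty pipeline slot and free the
// buffer for the next prefetch. Returns true if it was injected.
inline bool inject_prefetch(bank_req_t& pipeline_slot,
                            bank_req_t& buffer_slot) {
    if (!pipeline_slot.valid && buffer_slot.valid) {
        pipeline_slot = buffer_slot;    // prefetch takes the idle slot
        buffer_slot.valid = false;      // make room for the next prefetch
        return true;
    }
    return false;                       // core request present: wait
}
```

Downstream, the prefetch flag on the injected request is what suppresses the core response for it.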
Testing and Debugging
The existing benchmarks in Vortex can be used to verify the implementation. It should first be tested with the basic tests under regression and debugged there on failures, then verified with the more complex OpenCL tests. All the tests under a given folder can be executed with a single command (e.g. "make -C tests/opencl run-simx"). As of now the prefetcher implementation does not pass all the benchmarks and will have to be debugged.
For debugging, the Run and Debug feature of Visual Studio Code is used. Running a visual debug requires the C/C++ extension for VS Code, and the debug flag (-g) must be appended when running make. In the launch.json file, an entry has to be added for the test you want to debug, with stopAtEntry set to true. This stops execution at the entry point of the main function for the associated test. From there you can step through, set breakpoints, analyze the values of different variables, and debug.
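A launch.json entry for such a session might look roughly like the following. The program path and arguments are placeholders for your own build, while "type": "cppdbg", "request": "launch", and "stopAtEntry": true are standard attributes of the VS Code C/C++ extension.

```json
{
    "name": "simx debug (example)",
    "type": "cppdbg",
    "request": "launch",
    "program": "${workspaceFolder}/path/to/test_binary",
    "args": [],
    "cwd": "${workspaceFolder}",
    "stopAtEntry": true
}
```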