GSoC '22: Enhancing AnyCore, a superscalar RISC-V processor

Project introduction

This year I participated in Google Summer of Code, a mentorship program for newcomers to open source development. I contributed to the Free and Open Source Silicon Foundation (FOSSi). This post summarizes my experience and the work I’ve done.

AnyCore

AnyCore is a 64 bit processor implementing the RISC-V ISA. What sets AnyCore apart from other open source RISC-V cores is that it is a superscalar core with out-of-order execution capabilities. This alone is a rare trait, as such a core is more challenging to implement than an in-order one.

AnyCore is highly configurable: the code base allows configuring traits like the Fetch Width, Issue Width, length of Load/Store Queues, and more. With certain configurations, the core has been measured to reach IPC (instructions executed per clock cycle) values above 3, the highest measured being 3.67. The same options can also be used to configure the core to use fewer resources.

Another interesting feature of AnyCore is dynamic adaptivity. This means that the core can disable clock signals and power to certain parts of the pipeline, to better suit the needs of software being executed at the moment. In fact, one of the goals of the research project behind AnyCore was studying the overheads of dynamic adaptivity. AnyCore was released by NC State University, where R. Basu Roy Chowdhury worked on his PhD thesis titled AnyCore: Design, Fabrication, and Evaluation of Comprehensively Adaptive Superscalar Processors.

AnyCore layout

OpenPiton integration

OpenPiton is the world’s first open source, general purpose, multithreaded manycore processor. It is a tiled manycore framework scalable from one to 1/2 billion cores.

Since its release, many different processors have been integrated into OpenPiton in place of the original OpenSPARC T1 core, AnyCore being one of them. This simplifies development of AnyCore, completes the core with the needed uncore components, and enables AnyCore to be used in a manycore system.

OpenPiton system

OpenPiton tile with a modified OpenSPARC core

OpenPiton tools, workflow

OpenPiton comes with scripts that simplify development. Simulations are run with sims, which supports many commercial and open source simulators. I used Verilator. At some points Jonathan Balkind also tested the design with VCS, which found some errors that Verilator didn’t. With the right flags passed to sims, Verilator can generate a VCD file containing the waveforms of the simulation. I used GTKWave to view these.

To verify the correct execution of RISC-V instructions, the RISC-V Foundation maintains the riscv-tests repository. In most cases, the simulations are targeting one of these tests.

Identified issues, goals for the project

While AnyCore has an interesting microarchitecture, and good performance, it also has certain limitations:

Besides fixing the mentioned issues, the other, more long-term goal of the project is to develop AnyCore into an application class processor, which means having the ability to boot a complex operating system, for example Linux. To work towards this goal, we made the following decisions during my project:

I will write about these in more detail in the later paragraphs.

RV64I: 64 bit Base Integer Instruction Set

RISC-V has a modular ISA, and software generally targets RV32G or RV64G, which are combinations of a 32 bit or 64 bit base ISA plus the standard extensions: A for atomics, M for integer multiplication/division, F and D for single and double precision floating point arithmetic, Zicsr for CSR instructions, and Zifencei for the FENCE.I instruction.
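One place where this modularity becomes visible to software is the misa CSR, which reports the base ISA width and the single-letter extensions. The sketch below computes the misa value for a hypothetical RV64IMAFD hart with S and U modes; it is only an illustration of the encoding in the Privileged Specification, not AnyCore's actual misa value, and the module and function names are made up for this example.

```systemverilog
module misa_sketch;
  // Each single-letter extension owns one bit in misa[25:0]:
  // 'A' maps to bit 0, 'B' to bit 1, and so on up to 'Z' at bit 25.
  function automatic logic [63:0] ext_bit(input byte letter);
    return 64'd1 << (letter - "A");
  endfunction

  initial begin
    logic [63:0] misa;
    misa = (64'd2 << 62)                 // MXL = 2 selects a 64 bit base ISA
         | ext_bit("I") | ext_bit("M") | ext_bit("A")
         | ext_bit("F") | ext_bit("D")
         | ext_bit("S") | ext_bit("U");  // supervisor and user mode present
    assert (misa == 64'h8000_0000_0014_1129);
    $display("misa = %h", misa);
    $finish;
  end
endmodule
```

Zicsr and Zifencei have no bits of their own in misa, so they do not appear in the value.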

The first and easiest fix was for three failing RV64I instructions: SRAIW, SRLIW, and SRLW. These are 32 bit shift instructions that operate on the lower word of a 64 bit register. The fix was a minor change in Simple_ALU.sv: the >> (logical shift) and >>> (arithmetic shift) operators behave slightly differently on a 32 bit part-select of a 64 bit wire than on a 32 bit wire.
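To illustrate the kind of pitfall involved, here is a minimal sketch of word shifts written so that the operand width and signedness are explicit. It is not the code from Simple_ALU.sv; the module and function names are hypothetical. The key points are that a part-select such as rs1[31:0] is always unsigned, so >>> only shifts in the sign bit after a $signed() cast, and that shifting the full 64 bit value would let bits 63:32 leak into the result.

```systemverilog
module word_shift_sketch;
  // SRLW/SRLIW: logical right shift of the low word, then sign-extend.
  function automatic logic [63:0] srlw(input logic [63:0] rs1,
                                       input logic [4:0]  shamt);
    logic [31:0] res;
    res = rs1[31:0] >> shamt;            // only the low 32 bits take part
    return {{32{res[31]}}, res};
  endfunction

  // SRAW/SRAIW: arithmetic right shift of the low word, then sign-extend.
  function automatic logic [63:0] sraw(input logic [63:0] rs1,
                                       input logic [4:0]  shamt);
    logic [31:0] res;
    // Without $signed() the part-select is unsigned and >>> shifts in zeros.
    res = $signed(rs1[31:0]) >>> shamt;
    return {{32{res[31]}}, res};
  endfunction

  initial begin
    // The upper word is all ones on purpose: it must not affect the result.
    assert (srlw(64'hFFFF_FFFF_8000_0000, 5'd4) == 64'h0000_0000_0800_0000);
    assert (sraw(64'hFFFF_FFFF_8000_0000, 5'd4) == 64'hFFFF_FFFF_F800_0000);
    $display("word shift checks done");
    $finish;
  end
endmodule
```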

MRET is used to return from a trap handler executing in M mode. It turned out that this instruction was missing from AnyCore: only SRET, its S mode counterpart, was implemented. Adding it was relatively straightforward, as I basically copied how SRET worked. Both are decoded in the Decode_RISCV module, their flags are forwarded down the pipeline, and they generate a pulse in ActiveList, which is the module responsible for committing instructions. These pulses are used in SupRegFile to know when to return from a trap (as seen below).

Another failing instruction was SH (store halfword), which stores a halfword into memory. After some debugging, I realized that the data read back from memory did not have the correct length: instead of 4 bytes, only the last 2 bytes were actually written. It turned out that the memory system of OpenPiton used different codes for the load-store sizes than AnyCore, and the translation between the two was incorrect.

// DCache_Controller.sv

case (stSize_i)
  `LDST_BYTE:
  begin
-   piton_stSize = {1'b0, `LDST_BYTE};
+   piton_stSize = `MSG_DATA_SIZE_1B;
  end

  `LDST_HALF_WORD:
  begin
-   piton_stSize = {1'b0, `LDST_HALF_WORD};
+   piton_stSize = `MSG_DATA_SIZE_2B;
  end

  `LDST_WORD:
  begin
-   piton_stSize = {1'b0, `LDST_WORD};
+   piton_stSize = `MSG_DATA_SIZE_4B;
  end

  `LDST_DOUBLE_WORD:
  begin
-   piton_stSize = {1'b0, `LDST_DOUBLE_WORD};
+   piton_stSize = `MSG_DATA_SIZE_8B;
  end
endcase

The MRET and SH fixes are part of a bigger PR that is linked below.

RISC-V M extension: Standard Extension for Integer Multiplication and Division

Besides the base instruction set, the M extension is also mandatory for running an OS. AnyCore supports it, but as I mentioned, the implementation relied on proprietary DesignWare modules. It also turned out that even with the DesignWare license, not all tests passed. For example, some 32 bit results were not sign-extended to 64 bits, and the divisor was not checked for 0. Another issue was that the Complex_ALU module had two multipliers instantiated: one for signed and one for unsigned multiplication. This design made it impossible to correctly execute MULHSU, which multiplies a signed and an unsigned operand.

The multiplication was fairly simple, as it is a synthesizable operation. The division and modulo operations should not be synthesized with the “/” and “%” operators, as the tools would generate a large block of combinational logic with a long critical path, which would limit the clock frequency. These operations are normally pipelined or computed over several cycles instead.
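One common way to cover MULH, MULHSU and MULHU with a single multiplier is to widen both operands by one bit and always multiply as signed numbers. The sketch below shows this idea; it is an illustration under that assumption, not necessarily how the multiplier in AnyCore's Complex_ALU was rewritten, and all names are hypothetical.

```systemverilog
module mulh_sketch;
  // Upper 64 bits of a 64x64 product. a_signed / b_signed select how each
  // operand is interpreted: MULH = (1,1), MULHSU = (1,0), MULHU = (0,0).
  function automatic logic [63:0] mul_high(input logic [63:0] a,
                                           input logic [63:0] b,
                                           input logic        a_signed,
                                           input logic        b_signed);
    logic signed [64:0]  a_ext, b_ext;
    logic signed [129:0] product;
    // Bit 64 copies the sign bit for signed operands and is 0 for unsigned
    // ones, so a single signed multiply handles all three instructions.
    a_ext   = {a_signed & a[63], a};
    b_ext   = {b_signed & b[63], b};
    product = a_ext * b_ext;
    return product[127:64];
  endfunction

  initial begin
    // MULHSU: (-1) * (2^64 - 1) is a small negative number, upper half all ones.
    assert (mul_high('1, '1, 1'b1, 1'b0) == 64'hFFFF_FFFF_FFFF_FFFF);
    // MULHU: (2^64 - 1)^2 has 0xFFFF...FFFE in its upper half.
    assert (mul_high('1, '1, 1'b0, 1'b0) == 64'hFFFF_FFFF_FFFF_FFFE);
    // MULH: (-1) * (-1) = 1, upper half is zero.
    assert (mul_high('1, '1, 1'b1, 1'b1) == 64'h0);
    $display("mul_high checks done");
    $finish;
  end
endmodule
```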

I used a divider module that I found on GitHub. As it was designed only for unsigned operands, I extended it with the logic to handle signed values correctly. My wrapper module converts the input operands to unsigned, does the operation using the divider, and converts the result back to a signed value afterwards.
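The wrapper idea can be sketched roughly as follows. This is not the actual wrapper from the pull request: the names are hypothetical, and the unsigned divider is replaced by the "/" and "%" operators purely as a simulation stand-in to keep the example short. The special cases follow the RISC-V M extension: division by zero returns an all-ones quotient and the dividend as remainder, and the overflow case (most negative value divided by -1) returns the dividend as quotient and 0 as remainder.

```systemverilog
module signed_div_sketch;
  // Simulation-only stand-ins for the multi-cycle unsigned divider.
  function automatic logic [63:0] udiv(input logic [63:0] n, input logic [63:0] d);
    return n / d;
  endfunction
  function automatic logic [63:0] urem(input logic [63:0] n, input logic [63:0] d);
    return n % d;
  endfunction

  // Absolute value of a two's complement number, as an unsigned value.
  function automatic logic [63:0] magnitude(input logic [63:0] x);
    return x[63] ? (~x + 64'd1) : x;
  endfunction

  // DIV: the quotient is negative when the operand signs differ.
  function automatic logic [63:0] div_signed(input logic [63:0] a, input logic [63:0] b);
    logic [63:0] q_mag;
    if (b == 64'd0) return '1;                       // division by zero
    q_mag = udiv(magnitude(a), magnitude(b));
    return (a[63] ^ b[63]) ? (~q_mag + 64'd1) : q_mag;
  endfunction

  // REM: the remainder takes the sign of the dividend.
  function automatic logic [63:0] rem_signed(input logic [63:0] a, input logic [63:0] b);
    logic [63:0] r_mag;
    if (b == 64'd0) return a;                        // division by zero
    r_mag = urem(magnitude(a), magnitude(b));
    return a[63] ? (~r_mag + 64'd1) : r_mag;
  endfunction

  initial begin
    // -7 / 2 = -3 with remainder -1 (rounding towards zero)
    assert (div_signed(64'hFFFF_FFFF_FFFF_FFF9, 64'd2) == 64'hFFFF_FFFF_FFFF_FFFD);
    assert (rem_signed(64'hFFFF_FFFF_FFFF_FFF9, 64'd2) == 64'hFFFF_FFFF_FFFF_FFFF);
    // Division by zero
    assert (div_signed(64'd5, 64'd0) == 64'hFFFF_FFFF_FFFF_FFFF);
    assert (rem_signed(64'd5, 64'd0) == 64'd5);
    // Overflow: the most negative value divided by -1
    assert (div_signed(64'h8000_0000_0000_0000, '1) == 64'h8000_0000_0000_0000);
    assert (rem_signed(64'h8000_0000_0000_0000, '1) == 64'd0);
    $display("signed division checks done");
    $finish;
  end
endmodule
```

Note that the overflow case needs no special handling here: the magnitude of the most negative value is 2^63 as an unsigned number, and since both operands are negative the quotient is returned without negation.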

Pull Request at the AnyCore repository

Pull Request with small changes in the BYOC repository

Privilege levels in RISC-V, and Control Status Registers

The current RISC-V Privileged Specification defines three privilege levels: Machine mode (M), Supervisor mode (S), and User mode (U). The encoding between S and M is reserved; earlier drafts of the specification used it for a Hypervisor mode (H). Simpler processors might only implement M mode. These are generally used in embedded projects. For running a Unix-like OS, generally Linux, the M, S and U modes are required.

Control Status Registers (CSRs) hold information related to the state of the processor, which includes things like the current privilege level, the return address after exiting a trap handler, etc. These registers are inside the SupRegFile of AnyCore. Some were implemented before, but the implementation was outdated, so most had to be redone. Some registers have side effects on writes, or have special masks for reads. This part of the project consisted of going through the Privileged Specification and adding each needed register one by one, with its read or write masks. Not all registers had to be implemented, as some are optional. For example, according to the openSBI GitHub page, Physical Memory Protection registers could be left out:

The PMP CSRs are optional. If PMP CSRs are not implemented then OpenSBI cannot protect M-mode firmware and secured memory regions.

I ended up implementing mhartid, fcsr, frm, fflags, mtvec, stvec, mie, sie, mideleg, medeleg, mstatus, sstatus, mepc, sepc, mip, sip, mcause, scause, mscratch, mtime, minstret, mvendorid, marchid, mimpid, misa, mconfigptr, menvcfg, mtval and stval.
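As an example of what a write mask looks like, sstatus is a restricted view of mstatus, so a write to sstatus may only change a subset of the mstatus bits. The following is a minimal sketch of that idea, with bit positions taken from the Privileged Specification; the mask contents and names are simplified assumptions for illustration (vector and endianness fields are left out) and are not copied from AnyCore's SupRegFile.

```systemverilog
module csr_mask_sketch;
  // mstatus bits that S mode software may change through sstatus.
  localparam logic [63:0] SSTATUS_WRITE_MASK =
      (64'd1 << 1)    // SIE
    | (64'd1 << 5)    // SPIE
    | (64'd1 << 8)    // SPP
    | (64'd3 << 13)   // FS
    | (64'd1 << 18)   // SUM
    | (64'd1 << 19);  // MXR

  // A write to sstatus updates only the masked bits of mstatus.
  function automatic logic [63:0] write_sstatus(input logic [63:0] mstatus,
                                                input logic [63:0] wdata);
    return (mstatus & ~SSTATUS_WRITE_MASK) | (wdata & SSTATUS_WRITE_MASK);
  endfunction

  initial begin
    // Writing all ones sets only the bits covered by the mask.
    assert (write_sstatus(64'h0, '1) == SSTATUS_WRITE_MASK);
    // MIE (bit 3) and MPP (bits 12:11) survive a write of zero to sstatus.
    assert (write_sstatus(64'h1808, 64'h0) == 64'h1808);
    $display("sstatus mask checks done");
    $finish;
  end
endmodule
```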

The SupRegFile also handles traps: I had to implement the logic to jump to a trap handler, and to return from a trap with the SRET or MRET instructions.

  // Returning from a trap
  if (mretFlag_i) begin
    // get the previous machine interrupt enable flag
    csr_mstatus_next.mie  = csr_mstatus.mpie;
    // restore the previous privilege level
    priv_lvl_next  = csr_mstatus.mpp;
    // set mpp to user mode
    csr_mstatus_next.mpp  = USER_PRIVILEGE;
    csr_mstatus_next.mpie = 1'b1;
  end
  else if (sretFlag_i) begin
    // return the previous supervisor interrupt enable flag
    csr_mstatus_next.sie  = csr_mstatus.spie;
    // restore the previous privilege level
    priv_lvl_next = {1'b0, csr_mstatus.spp}; //spp is 1 bit
    // set spp to user mode
    csr_mstatus_next.spp  = 1'b0;
    csr_mstatus_next.spie = 1'b1;
  end
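For comparison, the opposite direction (entering a trap handler in M mode) can be sketched as below. This follows the steps described in the Privileged Specification and is not the code from SupRegFile: the struct, the function and the constant are hypothetical, trap delegation is ignored, and mtvec is assumed to be in direct mode.

```systemverilog
module trap_entry_sketch;
  typedef struct packed {
    logic [1:0]  priv;    // current privilege level
    logic        mie;     // mstatus.MIE
    logic        mpie;    // mstatus.MPIE
    logic [1:0]  mpp;     // mstatus.MPP
    logic [63:0] mepc;
    logic [63:0] mcause;
    logic [63:0] pc;
  } hart_state_t;

  localparam logic [1:0] MACHINE_PRIVILEGE = 2'b11;

  // Take a trap into M mode: save the old context, then jump to the handler.
  function automatic hart_state_t take_trap_m(input hart_state_t s,
                                              input logic [63:0] cause,
                                              input logic [63:0] mtvec);
    hart_state_t n;
    n        = s;
    n.mepc   = s.pc;                  // where MRET will return to
    n.mcause = cause;
    n.mpie   = s.mie;                 // remember the interrupt enable flag
    n.mie    = 1'b0;                  // interrupts are off inside the handler
    n.mpp    = s.priv;                // remember the previous privilege level
    n.priv   = MACHINE_PRIVILEGE;
    n.pc     = {mtvec[63:2], 2'b00};  // direct mode: jump to the base address
    return n;
  endfunction

  initial begin
    hart_state_t s0, s1;
    s0      = '0;
    s0.priv = 2'b00;                  // running in U mode
    s0.mie  = 1'b1;
    s0.pc   = 64'h8000_0010;
    s1 = take_trap_m(s0, 64'd8, 64'h8000_0100);  // cause 8: ecall from U mode
    assert (s1.priv == MACHINE_PRIVILEGE && s1.mpp == 2'b00);
    assert (s1.mie == 1'b0 && s1.mpie == 1'b1);
    assert (s1.mepc == 64'h8000_0010 && s1.pc == 64'h8000_0100);
    $display("trap entry checks done");
    $finish;
  end
endmodule
```

The MRET branch shown earlier undoes exactly these steps: it restores MIE from MPIE and the privilege level from MPP.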

OpenPiton includes most of the tests from the riscv-tests repository. All tests include the riscv_test.h header file, which includes useful macros. The reset_vector included below is used in all tests.

As we did not have working CSRs before, the following part from the riscv_test.h file was commented out, and after the MRET, we just had a j test_2 to jump to the test cases. After adding all the needed CSRs, the tests could also run with this part included, which was a way to test some of the SupRegFile functionality.

reset_vector:                                                           \
        INIT_XREG;                                                      \
        RISCV_MULTICORE_DISABLE;                                        \
        INIT_SATP;                                                      \
        INIT_PMP;                                                       \
        DELEGATE_NO_TRAPS;                                              \
        li TESTNUM, 0;                                                  \
        la t0, trap_vector;                                             \
        csrw mtvec, t0;                                                 \
        CHECK_XLEN;                                                     \
        la t0, stvec_handler;                                           \
        beqz t0, 1f;                                                    \
        csrw stvec, t0;                                                 \
        li t0, (1 << CAUSE_LOAD_PAGE_FAULT) |                           \
               (1 << CAUSE_STORE_PAGE_FAULT) |                          \
               (1 << CAUSE_FETCH_PAGE_FAULT) |                          \
               (1 << CAUSE_MISALIGNED_FETCH) |                          \
               (1 << CAUSE_USER_ECALL) |                                \
               (1 << CAUSE_BREAKPOINT);                                 \
        csrw medeleg, t0;                                               \
1:      csrwi mstatus, 0;                                               \
        init;                                                           \
        EXTRA_INIT;                                                     \
        EXTRA_INIT_TIMER;                                               \
        la t0, 1f;                                                      \
        csrw mepc, t0;                                                  \
        csrr a0, mhartid;                                               \
        mret;                                                           \

Besides uncommenting the reset_vector code and making sure the tests still pass, I also tested the SupRegFile with the rv64si tests, which test the Supervisor ISA. As not every CSR from the specification was implemented, some of these tests were expected to fail. With the scall and sbreak tests, which cover the instructions for environment calls and breakpoints, I managed to find and fix a bug in ActiveList, where the incoming exceptionFlag was only checked if its delayed version, exceptionFlag_reg, was 0. As both can be true at the same time, I had to move the exceptionFlag check before the exceptionFlag_reg check, to set recoverPC to the correct value.

This was fixed in the PR that adds the new TRI Interface to AnyCore.

Zifencei - FENCE.I instruction

FENCE.I was originally part of the base instruction set architecture, but it was moved to its own ISA extension in the RISC-V standard. As it is part of RV64G, it also had to be implemented. The Unprivileged ISA specification has tips on implementing FENCE.I:

A simple implementation can flush the local instruction cache and the instruction pipeline when the FENCE.I is executed.

This is what I did. The pipeline flush is done in the ActiveList, similarly to how a branch mispredict behaves. This means that fenceFlag is used in a lot of places in ActiveList (for example to reset certain queues), so to keep it simple, I’m only including the most important parts here.

recoverFlag_o signals to the Fetch stage to fetch from recoverPC.

// ActiveList.sv

assign recoverFlag_o    = violateFlag_reg | mispredFlag_reg
  | exceptionFlag_reg | fenceFlag_reg;
assign recoverPC_o      = (mispredFlag_reg) ? targetPC : recoverPC;

In the case of FENCE.I, mispredFlag_reg is 0, so recoverPC_o takes the value of recoverPC. This means recoverPC has to be set to the address of the instruction following the FENCE.I.

  if (fenceFlag[0] & ~stallStCommit_i)
  begin
    fenceFlag_reg       <= 1'b1;
    // targetAddr is pc_p4 from Ctrl ALU
    recoverPC           <= targetAddr;
  end

And finally, the signal to flush the Icache is set by the fenceFlag_reg.

assign icFlush_o = fenceFlag_reg;

The I$ is flushed by invalidating the entries.

// ICache_controller.sv

  always_ff @(posedge clk or posedge reset)
  begin
    if(reset | icFlush_i)
    begin
      int i;
      for(i = 0; i < `ICACHE_NUM_LINES;i++)
        valid_array[i] <= 1'b0;
    end
    else if(mem2icInv_i)
    begin
      valid_array[mem2icInvInd_i] <= 1'b0;
    end
    else if(fillValid)
    begin
      valid_array[fillIndex] <= 1'b1;
    end
  end

Pull Request with CSRs, FENCE.I and SH fix

Running openSBI on AnyCore

openSBI is an open source implementation of the RISC-V SBI Specification, where SBI stands for Supervisor Binary Interface. openSBI is essentially an interface between a platform-specific firmware running in M mode, and a bootloader or the kernel running in S mode, so running Linux on a RISC-V processor requires the ability to run openSBI first.

While this part of the project produced very little code, as it was more about testing than development, I think what we found can be interesting to include here.

We have made a few changes to the openSBI code. First of all, openSBI requires RV32IMA or RV64IMA support. AnyCore doesn’t support the Atomic extension yet, so we had to emulate the atomic instructions. This was done by my mentor, Jonathan Balkind. See this change here.

openSBI supports different platforms, out of which we used generic and fpga-openpiton. Since the newest openSBI version didn’t work with OpenPiton + Ariane, which should be a more stable and better verified platform, we also had to try using the generic version, or going back to version 0.9. These changes are made by selecting the appropriate build flags.

There is a $display function in the ActiveList of AnyCore, which prints the currentCommitPC, which is the program counter of the committed instruction. We could compare these from the log file of sims with the memory addresses of instructions from the disassembled binary. Following the committed PCs, we found that the execution stops while parsing the flattened device tree (FDT) each time.

Version 0.9 ran for 7k instructions with the generic platform, and 12k with fpga-openpiton. The newest version ran for 33k on fpga-openpiton and 385k with generic. With VCS, the generic platform committed 637k instructions.

This part requires further debugging, as it is a mandatory step towards being application class. This could be a great project for a future GSoC contributor!

New interface to AnyCore

The core is instantiated in a tile in OpenPiton. Due to differences between the AnyCore top module interface and the generic OpenPiton core-L1.5 cache interface (called the Transaction-Response Interface, or TRI), AnyCore was first connected to the anycore_decoder and anycore_encoder modules, which translated between the two. I refactored these two modules into a new one called anycore_tri_transducer, and moved it into AnyCore. This way, the top level interface of AnyCore is similar to that of other cores connected to OpenPiton.

Link to AnyCore PR

Link to BYOC PR

Personal closing notes

I’ve learnt a lot during this year’s GSoC. I was interested in computer architecture, and I’m really glad I could contribute to an open source core, and actually see how some of the ideas are implemented.

This was my first time contributing to a bigger project, and I used to think that one has to understand every little detail of a project to be able to contribute. I’ve learnt that this is not the case, and that understanding a bigger codebase is a very incremental process that everyone has to go through.

Previously, I’ve had many positive experiences interacting with open source projects: when asking questions about the tools I’ve used, I’ve got very helpful and quick answers. When reporting bugs I’ve felt that people are passionate about the projects they’re maintaining, so my reports aren’t getting lost in some ticket system, but are looked into. It was a great experience to be on the other side, and push a project forward, which would hopefully be useful for somebody else.

I’ve found GSoC reports from previous FOSSi contributors very interesting and inspiring, so if you read this, make sure to check those out as well: