whensungoesdown: January 2022

Monday, January 24, 2022

Instruction-fetch signals in chiplab

Chiplab ifu

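In chiplab, the CPU exposes a group of instruction-fetch signals to the outer layer, which is responsible for implementing the cache and the AXI communication with DRAM.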

In the ifu:

    // group inst
    output wire [31 :0] inst_addr    ,
    input  wire         inst_addr_ok ,
    output wire         inst_cancel  ,
    input  wire [1  :0] inst_count   ,
    input  wire         inst_ex      ,
    input  wire [5  :0] inst_exccode ,
    input  wire [127:0] inst_rdata   ,
    output wire         inst_req     ,
    input  wire         inst_uncache ,
    input  wire         inst_valid   ,

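Here inst_addr is the fetch address and inst_req issues the fetch request. For example, the value of pcbf is assigned to inst_addr and inst_req is then set to 1. Once the instruction has been fetched successfully, the inst_valid input becomes 1. Since chiplab currently has no cache, inst_addr_ok and inst_valid return 1 at the same time, and for the same reason inst_uncache also returns 1. The data comes back on inst_rdata, which can carry up to 128 bits (16 bytes), but at the moment each fetch returns only 4 bytes and the upper bits are empty. I don't yet know which signal controls this.

I haven't yet looked at what inst_ex and inst_exccode are for.

inst_cancel is interesting. After it is set to 1, the current fetch does not stop: inst_rdata and inst_addr_ok still arrive as usual, but inst_valid no longer goes to 1. If a new inst_addr and inst_req are issued after inst_cancel, the next inst_valid marks the newly fetched instruction.

I currently use inst_valid together with other signals to control the transfer from pc_bf to pc_f. When a branch instruction is encountered and no delay slot is used, the pipeline registers have to be flushed. br_cancel comes from brucancel_ex2 in ex2. By then the branch is already deep in the pipeline, having passed through de, is and ex1. If the IPC were 1, the if stage would already have moved on by 3 instructions, so de, is and ex1 would have to be flushed. However, since chiplab currently has no cache and fetches directly over AXI, each instruction fetch takes 6 cycles.

Starting from inst_valid, de_port0_valid, de_port0_pc and de_port0_inst become valid and are ready to enter the de pipeline register. On the next clock they enter the de pipeline and are decoded, after which is_port0_valid, is_port0_pc, is_port0_inst and is_port0_op become valid. At this point pc_bf and pc_f have each been incremented by 4 because of the earlier inst_valid. One clock later, br_cancel arrives (it still puzzles me why br_cancel seems to arrive this early; shouldn't the branch have only just reached ex1?). pcbf has already been incremented by 4, so the next sequential instruction is being fetched. But because of the br_cancel feedback, pcbf must immediately switch to br_target. As a result, when the next inst_valid arrives, pc_f and the instruction no longer match.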

Thursday, January 6, 2022

How does OpenSPARC T1 flush the pipeline?

I found this passage in ifu/rtl/sparc_ifu_fcl.v:

//-------------------------
// Rollback
//-------------------------

   // 04/05/02
   // Looks like we made a mistake with rollback. Should never
   // rollback to S. In the event of a dmiss or mul contention, just
   // kill all the instructions and rollback to F. This adds one
   // cycle to the dmiss penalty and to the mul latency if we have to
   // wait, both not a very high price to pay. This would have saved
   // lots of hours of design and verif time.
   //
   assign rb2_inst_d = thr_match_dw & inst_vld_d & dtu_fcl_rollback_g;
   assign rb1_inst_s = thr_match_fw & inst_vld_s & dtu_fcl_rollback_g;
   assign rb0_inst_bf = thr_match_nw & switch_bf & dtu_fcl_rollback_g;

   assign retract_iferr_d1 = erb_dtu_ifeterr_d1 & inst_vld_d1;

   assign retract_inst_d = retract_iferr_d1 & thr_match_de &
                           fcl_dtu_inst_vld_d |
                           mark4rb_d |
                           dtu_fcl_retract_d;

   assign rt1_inst_s = thr_match_fd & inst_vld_s & dtu_fcl_retract_d |
                       mark4rb_s;

So it appears there is both rollback (rb) and kill.

   // determine rollback amount
   assign rb_frome = {4{(rb2_inst_e | rt2_inst_e) &
                        (inst_vld_e | intr_vld_e)}} & thr_e;
   assign rb_fromd = {4{(rb1_inst_d | rt1_inst_d) &
                        (inst_vld_d | intr_vld_d)}} & thr_d;
   assign rb_froms = {4{rb_stg_s & inst_vld_s_crit}} & thr_f;
   assign rb_w2 = rb_frome | rb_fromd;
   assign rb_for_iferr_e = {4{retract_iferr_e}} & thr_e;

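I suspect there is a part that controls the pipeline registers, but I haven't found it yet.

Previously I flushed the pipeline by resetting the pipeline registers, clearing them to 0. This is equivalent to inserting a pipeline bubble, with NOP encoded as 00000000.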
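The reset-style flush described here can be sketched as a pipeline register with a synchronous flush input. This is only a minimal illustration, not code from chiplab or OpenSPARC T1: the module name `pipe_reg_flush`, its ports, and the assumption that an all-zero word decodes as a harmless NOP are all hypothetical.

```verilog
// Hypothetical pipeline register: a flush overwrites the held instruction
// with 32'h00000000, which this sketch assumes decodes as a NOP (a bubble).
module pipe_reg_flush (
    input  wire        clk,
    input  wire        reset,    // synchronous reset
    input  wire        flush,    // squash the instruction entering this stage
    input  wire        stall,    // hold the current contents
    input  wire [31:0] inst_in,  // instruction from the previous stage
    output reg  [31:0] inst_out  // instruction seen by this stage
);
    always @(posedge clk) begin
        if (reset || flush)
            inst_out <= 32'h00000000;  // insert a bubble
        else if (!stall)
            inst_out <= inst_in;       // normal advance
        // stalled and not flushed: keep the old value
    end
endmodule
```

Giving flush priority over stall is a design choice in this sketch: a squashed instruction is dropped even while the stage is held.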

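I found that OpenSPARC T1 takes a different approach. An instruction is ultimately "executed" only in the sense that it modifies a register, modifies memory, or otherwise affects system state. Take an add instruction, add x2 x0 x3 (operands written source, source, destination, with x0 taken to read as zero), which copies the value of x2 into x3. This instruction can be cleared to 0 in the inst pipeline register, or it can simply not write the regfile at the end, in which case it is also as if it had never executed (it did execute, but had no effect).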
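The second option, letting the instruction flow on but suppressing its effect, can be sketched as a write-enable bit that travels down the pipeline and is cleared by a kill. This is a simplified illustration under my own assumptions, not the OpenSPARC T1 logic: the module name `wen_kill_chain` and the signals `wen_e`, `kill_e` and `rf_wen_w` are hypothetical.

```verilog
// Hypothetical E -> M -> W write-enable chain. The instruction itself is never
// cleared; a kill in E only stops its write enable from reaching the regfile.
module wen_kill_chain (
    input  wire clk,
    input  wire reset,     // synchronous reset
    input  wire wen_e,     // instruction in E wants to write a register
    input  wire kill_e,    // squash the instruction currently in E
    output wire rf_wen_w   // register-file write enable in W
);
    reg wen_m, wen_w;

    always @(posedge clk) begin
        if (reset) begin
            wen_m <= 1'b0;
            wen_w <= 1'b0;
        end else begin
            wen_m <= wen_e & ~kill_e;  // a killed instruction loses its write
            wen_w <= wen_m;            // and stays write-less down the pipe
        end
    end

    assign rf_wen_w = wen_w;
endmodule
```

Any other architectural side effect (stores, condition codes, traps) would need the same kind of gating; the sketch covers only the register write.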

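OpenSPARC presumably works this way. Take the ifu_exu_kill_e signal, for instance; the comment makes it quite clear

(generated by the ifu and sent to the exu to kill the instruction currently in pipeline stage E)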

input ifu_exu_kill_e; // kill instruction in e-stage

It goes into sparc_exu()->sparc_exu_ecl()

From there ifu_exu_kill_e fans out into

sparc_exu_eclccr()
sparc_exu_ecl_wb()
sparc_exu_eclbyplog_rs1()
sparc_exu_eclbyplog byplog_rs2()
sparc_exu_eclbyplog byplog_rs3()
sparc_exu_eclbyplog byplog_rs3h()

The register bypassing logic all needs this signal; basically every place that updates a register should need it.

Look at the main one, sparc_exu_ecl_wb(), the writeback control logic:

// Module Name: sparc_exu_ecl_wb
//      Description: Implements the writeback logic for the exu.
//              This includes the control signals for the w1 and w2 input
//      muxes as well as keeping track of the wen signal for ALU ops.

"keeping track of the wen signal for ALU ops" is probably referring to this ifu_exu_kill_e.

   assign wen_w_inst_vld = valid_w | inst_vld_noflush_wen_w;
   assign ecl_irf_wen_w = ifu_exu_inst_vld_w & wen_w_inst_vld | wen_no_inst_vld_w;

   // bypass valid logic and flops
   dff_s dff_wb_d2e(.din(ifu_exu_wen_d), .clk(clk), .q(wb_e), .se(se),
                  .si(), .so());
   dff_s dff_wb_e2m(.din(valid_e), .clk(clk), .q(wb_m), .se(se),
                  .si(), .so());
   dffr_s dff_wb_m2w(.din(valid_m), .clk(clk), .q(wb_w), .se(se),
                  .si(), .so(), .rst(reset));
   assign valid_e = wb_e & ~ifu_exu_kill_e & ~restore_e & ~wrsr_e;// restore doesn't finish on time
   assign bypass_m = wb_m;// bypass doesn't need to check for traps or sehold
   assign valid_m = bypass_m & ~rml_ecl_kill_m & ~sehold;// sehold turns off writes from this path
   assign valid_w = (wb_w & ~early_flush_w & ~ifu_tlu_flush_w);// check inst_vld later
   // don't check flush for bypass
   assign bypass_w = wb_w | inst_vld_noflush_wen_w | wen_no_inst_vld_w;

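Eventually ifu_exu_kill_e, combined with other signals and passed through several pipeline stages, ends up affecting ecl_irf_wen_w. This signal is an output of sparc_exu_ecl() and goes into bw_r_irf irf(), the integer register file.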

// Module Name: bw_r_irf
//      Description: Register file with 3 read ports and 2 write ports. Has
//                              32 registers per thread with 4 threads. Reading and writing
//                              the same register concurrently produces x.

module bw_r_irf (/*AUTOARG*/
   // Outputs
   so, irf_byp_rs1_data_d_l, irf_byp_rs2_data_d_l,
   irf_byp_rs3_data_d_l, irf_byp_rs3h_data_d_l,
   // Inputs
   rclk, reset_l, si, se, sehold, rst_tri_en, ifu_exu_tid_s2,
   ifu_exu_rs1_s, ifu_exu_rs2_s, ifu_exu_rs3_s, ifu_exu_ren1_s,
   ifu_exu_ren2_s, ifu_exu_ren3_s, ecl_irf_wen_w, ecl_irf_wen_w2,
   ecl_irf_rd_m, ecl_irf_rd_g, byp_irf_rd_data_w, byp_irf_rd_data_w2,
   ecl_irf_tid_m, ecl_irf_tid_g, rml_irf_old_lo_cwp_e,
   rml_irf_new_lo_cwp_e, rml_irf_old_e_cwp_e, rml_irf_new_e_cwp_e,
   rml_irf_swap_even_e, rml_irf_swap_odd_e, rml_irf_swap_local_e,
   rml_irf_kill_restore_w, rml_irf_cwpswap_tid_e, rml_irf_old_agp,
   rml_irf_new_agp, rml_irf_swap_global, rml_irf_global_tid
   ) ;

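This register file is somewhat complex and has a lot of ports... but it is clear that without this wen (ecl_irf_wen_w), the register file is not written.

But why do it this way rather than with a pipeline bubble? I haven't figured that out yet...

However, ifu_exu_kill_e does not seem to reach the lsu. For example: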

ld [%L1],%L2
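ld invalid address (raises an exception)

When an exception or interrupt occurs, the pipeline has to be flushed. With the OpenSPARC T1 approach, the instruction in front is still sent to the lsu, and only the final writeback of the result to the register is suppressed. But by then this step would presumably already have triggered cache-related activity.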

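Wednesday, January 5, 2022

In a single-issue pipeline, a branch instruction with a delay slot does not need to flush the decode-stage pipeline register

For example, suppose the pipeline is divided into F D E M W.

The branch logic sits in D, including extracting the imm and regidx.

The computation is done in E, and the computed target address is fed straight back to F or BF.

In effect, while the branch outcome is still unknown in F, the fetch defaults to pc+4. When the branch instruction reaches E, the next instruction has already reached D, and that is exactly the delay slot instruction.

So if the instruction set has delay slots, as MIPS and SPARC do, a taken branch does not require flushing the earlier pipeline stages; it just changes the pc in F. (With a BF stage things might be different.)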
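The fetch-address selection for such a delay-slot pipeline can be sketched as follows. This is a minimal illustration under my own assumptions rather than code from any of the cores discussed: the module name `fetch_pc_delay_slot`, the signals `br_taken_e`, `br_target_e`, `fetch_addr` and `pc_seq`, and the reset address 0 are hypothetical. The target from E overrides the F-stage address in the same cycle, so the fetch order is branch, delay slot, target.

```verilog
// Hypothetical F-stage address logic for an F D E M W pipeline whose branches
// resolve in E. When the branch is in E, the delay-slot instruction is in D and
// must execute, so nothing is flushed; only the fetch address is redirected.
module fetch_pc_delay_slot (
    input  wire        clk,
    input  wire        reset,        // synchronous reset
    input  wire        br_taken_e,   // branch in E resolved as taken
    input  wire [31:0] br_target_e,  // target address computed in E
    output wire [31:0] fetch_addr    // address fetched in F this cycle
);
    reg [31:0] pc_seq;  // fall-through (sequential) address for this cycle

    // A taken branch overrides the sequential address in the same cycle.
    assign fetch_addr = br_taken_e ? br_target_e : pc_seq;

    always @(posedge clk) begin
        if (reset)
            pc_seq <= 32'h00000000;        // assumed reset address
        else
            pc_seq <= fetch_addr + 32'd4;  // continue after whatever was fetched
    end
endmodule
```

Without a delay slot, the instruction sitting in D at that moment is on the wrong path, so the same design would also need a squash signal for D that is asserted whenever `br_taken_e` is 1.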

While reading chiplab, I saw that gs232c_front() passes 3 ports to the next stage, each with roughly these signals:

    .o_allow          (de2_accept         ),
    .o_valid          ({de1_port2_valid,de1_port1_valid,de1_port0_valid}),
    .o_port0_pc       (de1_port0_pc       ),// O, 32
    .o_port0_inst     (de1_port0_inst     ),// O, 32
    .o_port0_taken    (de1_port0_br_taken ),// O, 1
    .o_port0_target   (de1_port0_br_target),// O, 30
    .o_port0_ex       (de1_port0_exception),// O, 1
    .o_port0_exccode (de1_port0_exccode ),// O, 5
    .o_port0_hint     (de1_port0_hint     ),
    .o_port1_pc       (de1_port1_pc       ),// O, 32
    .o_port1_inst     (de1_port1_inst     ),// O, 32
    .o_port1_taken    (de1_port1_br_taken ),// O, 1
    .o_port1_target   (de1_port1_br_target),// O, 30
    .o_port1_ex       (de1_port1_exception),// O, 1
    .o_port1_exccode (de1_port1_exccode ),// O, 5
    .o_port1_hint     (de1_port1_hint     ),
    .o_port2_pc       (de1_port2_pc       ),// O, 32
    .o_port2_inst     (de1_port2_inst     ),// O, 32
    .o_port2_taken    (de1_port2_br_taken ),// O, 1
    .o_port2_target   (de1_port2_br_target),// O, 30
    .o_port2_ex       (de1_port2_exception),// O, 1
    .o_port2_exccode (de1_port2_exccode ),// O, 5
    .o_port2_hint     (de1_port2_hint     ),

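There is no signal for flushing the D-stage pipeline; instructions are basically handed to the downstream logic right after fetch, with at most three issued at once. The 3-bit o_valid[2:0] indicates which ports are valid.

LoongArch, however, has no delay slot, so I spent a long time here looking for the pipeline flush signal.

br_cancel being 1 means the branch is taken (the name seems exactly backwards), and br_target is the target address. These two signals are fed back to the F stage from the later stages. This code is in gs232c_front()->gs232c_pipe_pc().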

assign pc_next = wb_cancel ? wb_target :
                 br_cancel ? br_target :
                 pr_cancel ? pr_target :
                 bt_cancel ? bt_target : {pc_seq,2'h0};

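I couldn't see how instructions could simply be sent onward like this, without any pipeline flush, and then forgotten.

Then I wanted to see how OpenSPARC T1 flushes its pipeline, but I haven't found that yet. It turns out SPARC also has a delay instruction, with rules somewhat more complex than MIPS, plus an annul bit.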

The instruction following a delayed control-transfer instruction is called a
delay instruction. Setting the annul bit in a conditional delayed control-
transfer instruction causes the delay instruction to be annulled (that is, to have no effect) if and only if the branch is not taken. Setting the annul bit in an
unconditional delayed control-transfer instruction (“branch always”) causes
the delay instruction to be always annulled.
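The rule quoted above reduces to a single combinational expression. The sketch below is only an illustration of that rule: the module name `annul_rule` and its ports are hypothetical and are not OpenSPARC T1 signals.

```verilog
// Hypothetical encoding of the quoted annul rule for a delayed
// control-transfer instruction.
module annul_rule (
    input  wire a_bit,          // annul bit of the control-transfer instruction
    input  wire unconditional,  // 1 for an unconditional ("branch always") transfer
    input  wire taken,          // branch outcome, used only when conditional
    output wire annul_delay     // 1: the delay instruction has no effect
);
    // Conditional:   annulled only if the annul bit is set and the branch is not taken.
    // Unconditional: annulled whenever the annul bit is set.
    assign annul_delay = a_bit & (unconditional | ~taken);
endmodule
```

With the annul bit clear the delay instruction always executes, whatever the branch does.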

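The tlu (trap logic unit) flushes the entire pipeline behind it when it receives an interrupt or exception, but I haven't looked at this part closely yet.

To see how the annul bit works, trace anull_next_e.

As an aside, I came across this bug fix: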

//bug6838,bug6989 - interrupt issued in annulled delay slot resets wm_other mask in e-stage; this
//                  reset causes switch logic to lose a long latency op(div) which set the wm_other mask
//                  in s-stage. Note that the div is issued to FPU. the ifu re-issues the interrupt -
//                  which results in flush. this kills the long latency op and div is lost
//
//                  fix is to detect interrupt in anulled delay slot followed by long latency op and
//                  not reset the wm_other mask.
//
//       10/07/04 - fix changed to delay setting of wm_other mask from d-cycle to e-cycle. hence
//                  removing the kill in killed_inst_done_e
//
//   assign killed_inst_done_e = (fcl_dtu_inst_vld_e & swc_e | //sw inst
//                                fcl_dtu_intr_vld_e) & // any intr
//                                 dtu_inst_anull_e;

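So what is going on with chiplab? It turns out this CPU does not run at one instruction per cycle; specifically, in the waveform I looked at, o_port0_pc advances by one instruction every 12 clocks. I don't know whether this is due to the configuration.

In that case the pipeline is effectively pointless, and it doesn't matter that ex-stage signals are fed back to f, because it still takes a long time before the instruction port of the is stage is updated. Branch prediction has no effect either, since the ex result is always available first...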
