6745 Tutorial 13: DesignWare and Retiming

Synopsys Design Compiler (DC) includes the DesignWare (DW) library, which is a collection of hardware components implementing arbiters, integer arithmetic units, floating-point arithmetic units, and memories. The Synopsys DW components also have optimized gate-level implementations that Synopsys DC can use when synthesizing your design. This tutorial will describe how these components can be used either through automatic inference or explicit instantiation. You can see a list of all of the available Synopsys DW components in the user guide here:

The user guide shows which units can be automatically inferred from an operator or function and which can only be used through explicit instantiation. Since most of the arithmetic units are combinational, the tutorial will also discuss how you can use register retiming to automatically pipeline these units so they can operate at higher clock frequencies. This tutorial assumes you have already completed the tutorials on Linux, Git, PyMTL, Verilog, ASIC front-end flow, ASIC back-end flow, and ASIC automated flow.

The first step is to access ecelinux. Use VS Code to log into a specific ecelinux server. Once you are at the ecelinux prompt, source the setup script, clone this repository from GitHub, and define an environment variable to keep track of the top directory for the project.

% source setup-ece6745.sh
% mkdir -p $HOME/ece6745
% cd $HOME/ece6745
% git clone git@github.com:cornell-ece6745/ece6745-tut13-dw tut13
% cd tut13
% export TOPDIR=$PWD

1. Synopsys DesignWare Automatic Inference

Let's start by exploring how Synopsys DC can automatically infer the use of Synopsys DW components by reviewing the sort unit from earlier tutorials. Recall the sort unit is implemented using a three-stage pipelined, bitonic sorting network and the datapath is shown below.

Let's look at the min/max unit:

module tut3_verilog_sort_MinMaxUnit
#(
  parameter p_nbits = 1
)(
  input  logic [p_nbits-1:0] in0,
  input  logic [p_nbits-1:0] in1,
  output logic [p_nbits-1:0] out_min,
  output logic [p_nbits-1:0] out_max
);

  always_comb begin

    // Find min/max

    if ( in0 >= in1 ) begin
      out_max = in0;
      out_min = in1;
    end
    else if ( in0 < in1 ) begin
      out_max = in1;
      out_min = in0;
    end

    // Handle case where there is an X in the input

    else begin
      out_min = 'x;
      out_max = 'x;
    end

  end

endmodule

Notice how this unit uses two comparison operators, one for greater-than-or-equal and one for less-than. We will see how Synopsys DC is able to automatically infer the use of two Synopsys DW components for these operators.

First, we need to run the tests and interactive simulator to create the Verilog test benches which we can use for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.

% mkdir -p $TOPDIR/sim/build
% cd $TOPDIR/sim/build
% pytest ../tut3_verilog/sort/test --test-verilog --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input random --stats --translate --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input zeros  --stats --translate --dump-vtb

Now let's use the ASIC automated flow to push the sort unit through synthesis and place-and-route.

% mkdir -p $TOPDIR/asic/build-tut08-sort
% cd $TOPDIR/asic/build-tut08-sort
% pyhflow ../designs/tut08-sort.yml
% ./run-flow

Then you can look at the resources report generated by Synopsys DC to see what Synopsys DW components were inferred.

% cd $TOPDIR/asic/build-tut08-sort
% cat 02-synopsys-dc-synth/resources.rpt

You should see something like this.

================================================================
| Cell     | Module  | Parameters | Contained Operations       |
================================================================
| gte_x_1  | DW_cmp  | width=8    | gte_30 (MinMaxUnit.v:30)   |
| lt_x_2   | DW_cmp  | width=8    | lt_34 (MinMaxUnit.v:34)    |
================================================================

Implementation Report
========================================
|          |         | Current         |
| Cell     | Module  | Implementation  |
========================================
| gte_x_1  | DW_cmp  | apparch (area)  |
| lt_x_2   | DW_cmp  | apparch (area)  |
========================================

The report shows how Synopsys DC was able to infer the use of a Synopsys DW comparator (DW_cmp). You can learn more about this component from its datasheet here:

You will see that the component includes three different microarchitectures:

  • rpl: Ripple carry
  • pparch: Delay-optimized flexible parallel-prefix
  • apparch: Area-optimized flexible architecture

Since the clock constraint is relatively generous (140ps of positive slack), Synopsys DC has decided to use a more area-optimized implementation.

2. Synopsys DesignWare Explicit Instantiation

Synopsys DC will do its best to infer Synopsys DW components whenever possible, but many components can only be used by explicitly instantiating the component in your Verilog. In this section, we will look at two examples: (1) instantiating a six-function comparator in the sort unit; and (2) instantiating a floating-point adder.

2.1. Explicitly Instantiating Six-Function Comparator

To illustrate how explicit instantiation works, let's use a six-function comparator to implement the min/max unit. Review the corresponding datasheet here:

Now go ahead and modify the min/max unit to explicitly instantiate and use this six-function comparator as shown below.

module tut3_verilog_sort_MinMaxUnit
#(
  parameter p_nbits = 1
)(
  input  logic [p_nbits-1:0] in0,
  input  logic [p_nbits-1:0] in1,
  output logic [p_nbits-1:0] out_min,
  output logic [p_nbits-1:0] out_max
);

  logic lt;
  logic gt;
  logic eq;
  logic le;
  logic ge;
  logic ne;

  DW01_cmp6#(p_nbits) cmp_gt
  (
    .A  (in0),
    .B  (in1),
    .TC (1'b0),
    .LT (lt),
    .GT (gt),
    .EQ (eq),
    .LE (le),
    .GE (ge),
    .NE (ne)
  );

  assign out_max = gt ? in0 : in1;
  assign out_min = lt ? in0 : in1;

endmodule
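As a sanity check on the two assign statements above, the following Python sketch models the six comparison outputs for unsigned operands (TC tied to 0) and exhaustively checks, for a small bit width, that selecting with GT and LT gives the same result as max and min, including when the inputs are equal. The function names cmp6_unsigned and min_max_unit are hypothetical. They only mirror the behavior described here and are not part of the tutorial's source tree.

```python
def cmp6_unsigned(a, b):
    """Model the six comparison outputs for unsigned operands (TC = 0)."""
    return {
        "LT": a < b, "GT": a > b, "EQ": a == b,
        "LE": a <= b, "GE": a >= b, "NE": a != b,
    }


def min_max_unit(in0, in1):
    """Mirror the two assign statements: GT selects the max, LT selects the min."""
    flags = cmp6_unsigned(in0, in1)
    out_max = in0 if flags["GT"] else in1
    out_min = in0 if flags["LT"] else in1
    return out_min, out_max


# Exhaustive check for 4-bit operands, including the equal-input case.
for in0 in range(16):
    for in1 in range(16):
        assert min_max_unit(in0, in1) == (min(in0, in1), max(in0, in1))
```

When the inputs are equal, both GT and LT are false, so both outputs select in1, which is still correct because the two values are the same.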

Now try to rerun the tests.

% cd $TOPDIR/sim/build
% pytest ../tut3_verilog/sort/test -x --tb=long

The tests will fail because Verilator cannot find the implementation of the Synopsys DW component. Add the following include directive at the top of the implementation of the min/max unit:

`include "/opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW01_cmp6.v"

Now Verilator will be able to find the implementation of the Synopsys DW component, but it produces a warning about an implicit static function. We will need to disable this warning when processing the Synopsys DW component using Verilator's special linting comments.

/* verilator lint_off IMPLICITSTATIC */
`include "/opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW01_cmp6.v"
/* verilator lint_on IMPLICITSTATIC */

Now the tests should all pass so we can now regenerate the Verilog test benches for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.

% cd $TOPDIR/sim/build
% pytest ../tut3_verilog/sort/test --test-verilog --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input random --stats --translate --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input zeros  --stats --translate --dump-vtb

Now let's push the sort unit through the ASIC automated flow again. We will start by just running the first two steps and looking at the resources report.

% cd ${TOPDIR}/asic/build-tut08-sort
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
% cat 02-synopsys-dc-synth/resources.rpt

You should see something like this:

==================================================================
| Cell     | Module    | Parameters | Contained Operations       |
==================================================================
| cmp_gt   | DW01_cmp6 | width=8    | cmp_gt (MinMaxUnit.v:38)   |
==================================================================

Implementation Report
==========================================
|          |           | Current         |
| Cell     | Module    | Implementation  |
==========================================
| cmp_gt   | DW01_cmp6 | apparch (area)  |
==========================================

This clearly indicates that Synopsys DC is now using the explicitly instantiated six-function comparator instead of automatically inferring a two-function comparator.

Let's go ahead and push the sort unit through the rest of the ASIC automated flow.

% cd $TOPDIR/asic/build-tut08-sort
% ./03-synopsys-vcs-ffglsim/run
% ./04-cadence-innovus-pnr/run
% ./05-synopsys-vcs-baglsim/run
% ./06-synopsys-pt-pwr/run
% ./07-summarize-results/run

Since the implementation now depends on Verilog code outside the source tree, your tests will no longer work on GitHub Actions. You can solve this by copying the Verilog corresponding to the explicitly instantiated components into your source tree. For example, we can copy the Verilog for the six-function comparator into a dw subdirectory.

% mkdir -p $TOPDIR/sim/dw
% cd $TOPDIR/sim/dw
% cp /opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW01_cmp6.v .

Then modify the include directive at the top of the implementation of the min/max unit appropriately.

`include "dw/DW01_cmp6.v"

Note that since the Verilog provided by Synopsys DW is copyrighted, you should not make it public.

2.2. Explicitly Instantiating Floating-Point Adder

This section will further illustrate how to use Synopsys DW components by explicitly instantiating a floating-point adder. You can learn more about the Synopsys DW component for a floating-point adder from its datasheet here:

We have already shown how to explicitly instantiate this Synopsys DW component along with input registers to create a single-stage floating-point adder. Look at the implementation provided in FPAdd1stage.v.

% cd $TOPDIR/sim/tut13_dw
% code FPAdd1stage.v

The implementation is shown below.

module tut13_dw_FPAdd1stage
(
  input  logic        clk,
  input  logic        reset,

  input  logic        in_val,
  input  logic [31:0] in0,
  input  logic [31:0] in1,

  output logic        out_val,
  output logic [31:0] out
);

  // pipeline registers

  logic        val_X0;
  logic [31:0] in0_X0;
  logic [31:0] in1_X0;

  always_ff @(posedge clk) begin
    if ( reset )
      val_X0 <= 1'b0;
    else
      val_X0 <= in_val;

    in0_X0 <= in0;
    in1_X0 <= in1;
  end

  // floating-point adder

  logic [7:0]  status_X0;
  logic [31:0] out_X0;

  DW_fp_add
  #(
    .sig_width       (23),
    .exp_width       (8),
    .ieee_compliance (1)
  )
  fp_add
  (
    .a      (in0_X0),
    .b      (in1_X0),
    .rnd    (3'b000),
    .z      (out_X0),
    .status (status_X0)
  );

  // output logic

  assign out_val = val_X0;
  assign out = out_X0 & {32{val_X0}};

endmodule

We configure the floating-point adder to support 32-bit floating point in standard single-precision IEEE format. The Synopsys DW component supports disabling IEEE compliance, different rounding modes, and status flags. We need to also explicitly include the Synopsys DW behavioral Verilog files. Let's go ahead and copy them into a dw directory in the source tree.

% mkdir -p $TOPDIR/sim/dw
% cd $TOPDIR/sim/dw
% cp /opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW_fp_addsub.v .
% cp /opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW_fp_add.v .

Notice how we have to copy two files since DW_fp_add.v uses the module defined in DW_fp_addsub.v. You may need to experiment to ensure you have copied all of the files required for the desired Synopsys DW component.

Now add the following include directives at the top of the FPAdd1stage.v file.

/* verilator lint_off LATCH */
`include "dw/DW_fp_addsub.v"
`include "dw/DW_fp_add.v"
/* verilator lint_on LATCH */

Here we are using Verilator's special linting comments to turn off linting checks for inferred latches. You may need to experiment to ensure you have turned off the right linting checks so that Verilator can use the Synopsys DW behavioral Verilog component.

Examine the simple basic test we have provided for the floating-point adder.

% cd $TOPDIR/sim/tut13_dw/test
% code FPAdd1stage_test.py

The basic test case along with some helper functions is shown below.

def fp2bits( fp ):
  if fp == '?':
    return '?'
  else:
    return Bits32(int.from_bytes( pack( '>f', fp ), byteorder='big' ))

def row( in_val, in0, in1, out_val, out ):
  return [ in_val, fp2bits(in0), fp2bits(in1), out_val, fp2bits(out) ]

def test_basic( cmdline_opts ):
  run_test_vector_sim( FPAdd1stage(), [
       ( 'in_val in0   in1   out_val* out*'   ),
    row( 0,      0.00, 0.00, 0,       '?'     ),
    row( 1,      1.00, 1.00, 0,       '?'     ),
    row( 1,      1.50, 1.50, 1,       2.00    ),
    row( 1,      1.25, 2.50, 1,       3.00    ),
    row( 0,      0.00, 0.00, 1,       3.75    ),
    row( 0,      0.00, 0.00, 0,       '?'     ),
  ], cmdline_opts )
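The expected outputs in the test vectors lag the inputs by one row because of the input registers. Here is a minimal cycle-level sketch of that behavior. It uses Python floats instead of bit patterns and assumes the outputs of each row are checked before the clock edge that captures that row's inputs. The function name simulate_fpadd_1stage is hypothetical and is not part of the test harness.

```python
def simulate_fpadd_1stage(rows):
    """Cycle-level model: inputs are registered, then added combinationally."""
    val_reg, in0_reg, in1_reg = 0, 0.0, 0.0   # state after reset
    outputs = []
    for in_val, in0, in1 in rows:
        # Outputs depend only on the values captured on the previous edge.
        out_val = val_reg
        out = (in0_reg + in1_reg) if val_reg else 0.0   # out is masked by valid
        outputs.append((out_val, out))
        # Clock edge: capture this row's inputs.
        val_reg, in0_reg, in1_reg = in_val, in0, in1
    return outputs


# Same input columns as test_basic (exactly representable values).
rows = [(0, 0.00, 0.00), (1, 1.00, 1.00), (1, 1.50, 1.50),
        (1, 1.25, 2.50), (0, 0.00, 0.00), (0, 0.00, 0.00)]

for row, (out_val, out) in zip(rows, simulate_fpadd_1stage(rows)):
    print(row, "->", out_val, out)
```

Rows where out_val is 0 correspond to the '?' don't-care entries in the test vectors.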

We can use the Python struct package to convert a Python floating-point variable into 32-bit IEEE single-precision format. Here is an example:

% python
>>> from struct import pack
>>> pack( '>f', 1.5 ).hex()
'3fc00000'

The encoding of 0x3fc00000 matches what we expect when using an IEEE-754 floating-point converter such as this:

We need to use int.from_bytes to convert a byte array into an integer which is required when creating a Bits32 object.
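To connect this encoding to the sig_width and exp_width parameters used earlier, here is a small sketch that splits the 32-bit pattern into its sign, biased exponent, and fraction fields. The helper name fp32_fields is hypothetical and is not part of the provided test code.

```python
from struct import pack


def fp32_fields(value):
    """Split a float into IEEE single-precision sign, exponent, and fraction."""
    bits = int.from_bytes(pack(">f", value), byteorder="big")
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent (exp_width = 8)
    fraction = bits & 0x7FFFFF       # 23-bit fraction (sig_width = 23)
    return bits, sign, exponent, fraction


bits, sign, exponent, fraction = fp32_fields(1.5)
print(f"{bits:08x} sign={sign} exponent={exponent} fraction={fraction:06x}")
# prints: 3fc00000 sign=0 exponent=127 fraction=400000
```

For 1.5 the biased exponent is 127 (an unbiased exponent of 0) and the fraction has only its top bit set, which encodes 1.1 in binary.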

Let's go ahead and run this basic test.

% cd $TOPDIR/sim/build
% pytest ../tut13_dw/test/FPAdd1stage_test.py -sv

Now we are ready to generate a Verilog test bench which we can use for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.

% cd $TOPDIR/sim/build
% pytest ../tut13_dw/test/FPAdd1stage_test.py --test-verilog --dump-vtb

Now let's push the 1-stage floating-point adder through the ASIC automated flow. We will start by just running the first two steps and looking at the synthesis reports.

% mkdir -p $TOPDIR/asic/build-tut13-fpadd-1stage
% cd $TOPDIR/asic/build-tut13-fpadd-1stage
% pyhflow ../designs/tut13-fpadd-1stage.yml
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run

Let's first check the resources report to confirm that Synopsys DC is indeed using the Synopsys DW component for the floating-point adder as expected.

% cd $TOPDIR/asic/build-tut13-fpadd-1stage
% cat 02-synopsys-dc-synth/resources.rpt

The resources report shows how Synopsys DC ultimately ended up using not just one Synopsys DW component, but many components which together implement the floating-point addition. For example, consider this part of the resources report.

===============================================================================
| Cell      | Module     | Parameters            | Contained Operations       |
===============================================================================
| lt_x_1    | DW_cmp     | width=31              | lt_189                     |
| sub_x_6   | DW01_sub   | width=8               | sub_230                    |
| ashr_7    | DW_rightsh | A_width=26,SH_width=8 | srl_235_lsb_trim           |
| ash_8     | DW_leftsh  | A_width=26,SH_width=8 | sll_237                    |
| gt_x_10   | DW_cmp     | width=8               | gt_253                     |
| ash_12    | DW_leftsh  | A_width=27,SH_width=5 | sll_264                    |
| add_x_16  | DW01_inc   | width=23              | add_301                    |
| U1        | DW_lzd     | a_width=27            | U1                         |
| DP_OP_54J1| DP_OP_54J1 |                       |                            |
| DP_OP_55J1| DP_OP_55J1 |                       |                            |
===============================================================================

Here we can see that Synopsys DC is using Synopsys DW components for comparators, subtractors, shifters, incrementers, and leading-zero detectors. The bottom two rows tell us that Synopsys DC has also created some custom components by unmerging and merging Synopsys DW components. You can learn more about these custom operators later in the report.

Datapath Report for DP_OP_54J1_124_7007
==============================================================================
| Cell                 | Contained Operations                                |
==============================================================================
| DP_OP_54J1_124_7007  | add_247 add_247_2                                   |
==============================================================================

==============================================================================
|       |      | Data     |       |                                          |
| Var   | Type | Class    | Width | Expression                               |
==============================================================================
| I1    | PI   | Unsigned | 27    |                                          |
| I2    | PI   | Unsigned | 28    |                                          |
| I3    | PI   | Unsigned | 1     |                                          |
| O1    | PO   | Unsigned | 28    | I1 + I2 + I3                             |
==============================================================================

Datapath Report for DP_OP_55J1_125_9206
==============================================================================
| Cell                 | Contained Operations                                |
==============================================================================
| DP_OP_55J1_125_9206  | add_304 sub_305                                     |
==============================================================================

==============================================================================
|       |      | Data     |       |                                          |
| Var   | Type | Class    | Width | Expression                               |
==============================================================================
| I1    | PI   | Unsigned | 8     |                                          |
| I2    | PI   | Unsigned | 5     |                                          |
| O1    | PO   | Unsigned | 9     | I1 + $unsigned(1'b1)                     |
| O2    | PO   | Signed   | 10    | O1 - I2                                  |
==============================================================================

The DP_OP_54J1 custom component implements a three-input adder which adds a 27-bit, a 28-bit, and a 1-bit input to produce a 28-bit output. The DP_OP_55J1 custom component implements a kind of addition/subtraction operation: it increments an 8-bit input and then subtracts a 5-bit input from the result.
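The expressions in the two datapath reports can be modeled with a short Python sketch. The function names are hypothetical, and the sketch assumes that any result wider than the reported output width is simply truncated. The report itself only gives the widths and expressions.

```python
def mask(value, width):
    """Keep the low `width` bits, as a fixed-width bus would."""
    return value & ((1 << width) - 1)


def dp_op_54(i1, i2, i3):
    """O1 = I1 + I2 + I3 with 27-, 28-, and 1-bit inputs and a 28-bit output."""
    total = mask(i1, 27) + mask(i2, 28) + mask(i3, 1)
    return mask(total, 28)   # assumption: overflow beyond 28 bits is dropped


def dp_op_55(i1, i2):
    """O1 = I1 + 1 (9-bit unsigned), O2 = O1 - I2 (10-bit signed)."""
    o1 = mask(mask(i1, 8) + 1, 9)
    o2 = o1 - mask(i2, 5)    # range -30..256 always fits in 10 signed bits
    return o1, o2


print(dp_op_54(5, 6, 1))     # 12
print(dp_op_55(255, 3))      # (256, 253)
print(dp_op_55(0, 31))       # (1, -30)
```

The 9-bit width of O1 is what lets the increment of an all-ones 8-bit input produce 256 instead of wrapping to zero.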

Now let's check the timing report.

% cd $TOPDIR/asic/build-tut13-fpadd-1stage
% cat 02-synopsys-dc-synth/timing.rpt

The timing report should look similar to what is shown below.

  Startpoint: v/in1_reg_reg[7]
              (rising edge-triggered flip-flop clocked by ideal_clock1)
  Endpoint: out[20] (output port clocked by ideal_clock1)
  Path Group: ideal_clock1
  Path Type: max

  Des/Clust/Port     Wire Load Model       Library
  ------------------------------------------------
  FPAdd1stage_noparam
                     5K_hvratio_1_1        NangateOpenCellLibrary

  Point                        Fanout      Incr       Path
  -----------------------------------------------------------
  clock ideal_clock1 (rise edge)         0.0000     0.0000
  clock network delay (ideal)            0.0000     0.0000
  v/in1_X0_reg[7]/CK (DFF_X1)            0.0000     0.0000 r
  v/in1_X0_reg[7]/Q (DFF_X1)             0.0790     0.0790 f
  v/in1_reg[7] (net)             1       0.0000     0.0790 f
  v/U44/ZN (OR2_X2)                      0.0525     0.1315 f
  v/n246 (net)                   2       0.0000     0.1315 f
  v/U493/ZN (OAI211_X1)                  0.0368     0.1683 r
  ...
  v/U442/ZN (XNOR2_X1)                   0.0528     2.8226 f
  v/n213 (net)                   1       0.0000     2.8226 f
  v/U474/ZN (NOR2_X1)                    0.0359     2.8584 r
  v/n1483 (net)                  1       0.0000     2.8584 r
  v/U39/ZN (OR2_X2)                      0.0452     2.9036 r
  v/out[20] (net)                1       0.0000     2.9036 r
  v/out[20] (tut13_dw_FPAdd1stage)       0.0000     2.9036 r
  out[20] (net)                          0.0000     2.9036 r
  out[20] (out)                          0.0456     2.9492 r
  data arrival time                                 2.9492

  clock ideal_clock1 (rise edge)         3.0000     3.0000
  clock network delay (ideal)            0.0000     3.0000
  output external delay                 -0.0500     2.9500
  data required time                                2.9500
  -----------------------------------------------------------
  data required time                                2.9500
  data arrival time                                -2.9492
  -----------------------------------------------------------
  slack (MET)                                       0.0008

The clock period constraint was set to be 3ns. The design is able to meet this constraint with a critical path that passes through almost 60 logic gates.
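The bottom of the timing report is simple arithmetic: the data required time is the clock period minus the output external delay, and the slack is the required time minus the data arrival time. The sketch below reproduces the numbers from the report above. The function name output_path_slack is hypothetical, and Decimal is used only so the printed values match the report's four decimal places.

```python
from decimal import Decimal


def output_path_slack(clock_period, output_external_delay, data_arrival):
    """Return (data required time, slack) in ns for a path ending at an output port."""
    required = Decimal(clock_period) - Decimal(output_external_delay)
    slack = required - Decimal(data_arrival)
    return required, slack


required, slack = output_path_slack("3.0000", "0.0500", "2.9492")
print(required, slack)   # 2.9500 0.0008
```

A positive slack means the constraint is met. Here the path has less than 1ps to spare.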

Let's go ahead and push the 1-stage floating-point adder through the rest of the ASIC automated flow.

% cd $TOPDIR/asic/build-tut13-fpadd-1stage
% ./03-synopsys-vcs-ffglsim/run
% ./04-cadence-innovus-pnr/run
% ./05-synopsys-vcs-baglsim/run
% ./06-synopsys-pt-pwr/run
% ./07-summarize-results/run

The final summary is shown below.

 timestamp           = 2025-04-06 11:17:28
 design_name         = FPAdd1stage_noparam
 clock_period        = 3.0
 rtlsim              = 1/1 passed
 synth_setup_slack   = 0.0008 ns
 synth_num_stdcells  = 1713
 synth_area          = 1985.956 um^2
 ffglsim             = 1/1 passed
 pnr_setup_slack     = 0.2676 ns
 pnr_hold_slack      = 0.0100 ns
 pnr_clk_ins_src_lat = 0 ns
 pnr_num_stdcells    = 1760
 pnr_area            = 2005.108 um^2
 baglsim             = 1/1 passed

3. Synopsys Design Compiler for Register Retiming

While it can be very useful to leverage Synopsys DW components, what do we do if the provided component does not meet timing? In the previous section, our floating-point adder met the 3ns clock period constraint, but what if our target constraint is 1.5ns? Normally, we would consider pipelining the floating-point adder, but this is not possible since we did not implement the floating-point adder ourselves. Even if we did implement the floating-point adder ourselves, pipelining complex arithmetic units can be quite tedious. To address this issue, we can use a powerful technique called register retiming where the synthesis tool will automatically move pipeline registers to try to balance the pipeline stages. If we add an extra stage of pipeline registers at the end of the floating-point adder, then the synthesis tool can push these registers into the combinational logic to reduce the critical path.

To illustrate register retiming, we have provided a 2-stage floating-point adder in FPAdd2stage.v.

% cd $TOPDIR/sim/tut13_dw
% code FPAdd2stage.v

This implementation is similar to the 1-stage floating-point adder except for the extra set of retiming registers shown below.

  // retiming registers

  logic        val_X1;
  logic [31:0] out_X1;

  always_ff @(posedge clk) begin
    if ( reset )
      val_X1 <= 1'b0;
    else
      val_X1 <= val_X0;

    out_X1 <= out_X0;
  end

  // output logic

  assign out_val = val_X1;
  assign out = out_X1 & {32{val_X1}};

This looks strange since we are adding a set of pipeline registers after the floating-point adder. Without register retiming this would make no sense since these extra retiming registers will not actually reduce the critical path. The key idea, though, is that register retiming will enable the synthesis tool to move these retiming registers into the middle of the combinational logic for the floating-point adder.
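The intuition behind moving a register into the middle of the logic can be shown with a toy model: given a chain of gates, try every position for a single register and keep the one that minimizes the longer of the two resulting stages. This is only an illustrative sketch with made-up gate delays and a hypothetical function name. It is not the algorithm Synopsys DC uses, which works on the full netlist and also accounts for effects such as setup time and register count.

```python
def best_register_cut(gate_delays):
    """Return (cut_index, critical_delay) for one register placed in a gate chain.

    A cut at index k puts gate_delays[:k] before the register and
    gate_delays[k:] after it.
    """
    best = None
    for cut in range(len(gate_delays) + 1):
        critical = max(sum(gate_delays[:cut]), sum(gate_delays[cut:]))
        if best is None or critical < best[1]:
            best = (cut, critical)
    return best


# Hypothetical gate delays in picoseconds for a six-gate chain.
delays_ps = [400, 600, 500, 300, 700, 500]

# Register placed after all of the logic: one stage holds the whole chain.
print(max(sum(delays_ps), 0))          # 3000
# Register moved to the best position inside the chain.
print(best_register_cut(delays_ps))    # (3, 1500)
```

With the register at the end of the chain the critical path is the full 3000ps, while the balanced cut halves it to 1500ps.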

Let's run the tests for our 2-stage floating point adder.

% cd $TOPDIR/sim/build
% pytest ../tut13_dw/test/FPAdd2stage_test.py -sv

The trace output is shown in part below.

../tut13_dw/test/FPAdd2stage_test.py::test_basic
  1r in_val=0, in0=00000000, in1=00000000, out=00000000, out_val=0
  2r in_val=0, in0=00000000, in1=00000000, out=00000000, out_val=0
  3: in_val=0, in0=00000000, in1=00000000, out=00000000, out_val=0
  4: in_val=1, in0=3f800000, in1=3f800000, out=00000000, out_val=0
  5: in_val=1, in0=3fc00000, in1=3fc00000, out=00000000, out_val=0
  6: in_val=1, in0=3fa00000, in1=40200000, out=40000000, out_val=1
  7: in_val=0, in0=00000000, in1=00000000, out=40400000, out_val=1
  8: in_val=0, in0=00000000, in1=00000000, out=40700000, out_val=1
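The hex values in the trace are IEEE single-precision bit patterns. A small helper can convert them back into floating-point values so the additions are easier to check by eye. The name bits2fp is hypothetical and is simply the inverse of the fp2bits helper in the test file.

```python
from struct import unpack


def bits2fp(hex_bits):
    """Convert an 8-digit hex string into the single-precision value it encodes."""
    return unpack(">f", bytes.fromhex(hex_bits))[0]


# Input and output bit patterns taken from the trace above.
for hex_bits in ["3f800000", "3fc00000", "3fa00000", "40200000",
                 "40000000", "40400000", "40700000"]:
    print(hex_bits, bits2fp(hex_bits))
```

The three output patterns 40000000, 40400000, and 40700000 decode to 2.0, 3.0, and 3.75, matching the sums of the three input pairs.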

We can now see that a transaction takes two cycles instead of one. The first transaction goes into the floating-point adder on cycle 4 and the result is valid on cycle 6. Let's run the tests to create the Verilog test benches which we can use for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.

% cd $TOPDIR/sim/build
% pytest ../tut13_dw/test/FPAdd2stage_test.py --test-verilog --dump-vtb

Now let's push the 2-stage floating-point adder through the first two steps of the ASIC automated flow and look at the synthesis timing reports.

% mkdir -p $TOPDIR/asic/build-tut13-fpadd-2stage
% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% pyhflow ../designs/tut13-fpadd-2stage.yml
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
% cat ./02-synopsys-dc-synth/timing.rpt

The timing report should look similar to what is shown below.

  Startpoint: v/in1_X0_reg[0]
              (rising edge-triggered flip-flop clocked by ideal_clock1)
  Endpoint: v/out_X1_reg[22]
            (rising edge-triggered flip-flop clocked by ideal_clock1)
  Path Group: ideal_clock1
  Path Type: max

  Des/Clust/Port     Wire Load Model       Library
  ------------------------------------------------
  FPAdd2stage_noparam
                     5K_hvratio_1_1        NangateOpenCellLibrary

  Point                                       Fanout      Incr       Path
  --------------------------------------------------------------------------
  clock ideal_clock1 (rise edge)                        0.0000     0.0000
  clock network delay (ideal)                           0.0000     0.0000
  v/in1_X0_reg[0]/CK (DFF_X1)                           0.0000     0.0000 r
  v/in1_X0_reg[0]/Q (DFF_X1)                            0.0929     0.0929 r
  v/in1_X0[0] (net)                             4       0.0000     0.0929 r
  v/U318/ZN (AND2_X1)                                   0.0445     0.1373 r
  v/n217 (net)                                  1       0.0000     0.1373 r
  v/U580/ZN (NAND2_X1)                                  0.0246     0.1619 f
  ...
  v/U368/ZN (AND3_X2)                                   0.0677     2.3291 f
  v/n2090 (net)                                11       0.0000     2.3291 f
  v/U2004/ZN (NAND2_X1)                                 0.0377     2.3667 r
  v/n1854 (net)                                 1       0.0000     2.3667 r
  v/U2005/ZN (NAND2_X1)                                 0.0254     2.3922 f
  v/n2221 (net)                                 1       0.0000     2.3922 f
  v/out_X1_reg[22]/D (DFF_X1)                           0.0086     2.4007 f
  data arrival time                                                2.4007

  clock ideal_clock1 (rise edge)                        1.5000     1.5000
  clock network delay (ideal)                           0.0000     1.5000
  v/out_X1_reg[22]/CK (DFF_X1)                          0.0000     1.5000 r
  library setup time                                   -0.0395     1.4605
  data required time                                               1.4605
  --------------------------------------------------------------------------
  data required time                                               1.4605
  data arrival time                                               -2.4007
  --------------------------------------------------------------------------
  slack (VIOLATED)                                                -0.9402

Since we are using a 2-stage floating-point adder, we reduced the clock period constraint to 1.5ns, but we are not able to meet timing. This is because even though we added a set of retiming registers, we have not actually enabled retiming, so the critical path will still be through the entire floating-point adder. The tools try hard but miss timing with a negative slack of 940ps.

We need to use the set_optimize_registers command in the TCL script for Synopsys DC to enable register retiming for specific modules in our design. The command would look like this:

set_optimize_registers true \
  -check_design -verbose -print_critical_loop \
  -design FPAdd2stage_noparam \
  -clock ideal_clock1 \
  -delay_threshold 1.5

We have support for register retiming in the ASIC automated flow. You can see it in the synthesis step template.

% cd $TOPDIR/asic/steps/02-synopsys-dc-synth
% code run.tcl

Search through the TCL file to find the part related to register retiming which should be similar to what is shown below.

{% for module in retiming | default([]) -%}
set_optimize_registers true \
  -design {{module}}  \
  -check_design -verbose -print_critical_loop \
  -clock ideal_clock1 -delay_threshold {{clock_period}}
{% endfor %}

The retiming variable in the YAML design file is used to specify a list of module names that should be retimed. Modify tut13-fpadd-2stage.yml using VS Code.

% cd $TOPDIR/asic/designs
% code tut13-fpadd-2stage.yml

Add the following to indicate that the FPAdd2stage_noparam module should be retimed.

retiming:
  - FPAdd2stage_noparam
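To make the effect of this list concrete, the following plain-Python sketch mimics what the template loop shown earlier would emit for a given list of modules and clock period. It is only an approximation for illustration: the function name render_retiming is hypothetical, and the real flow generates run.tcl through pyhflow's own template processing.

```python
def render_retiming(modules, clock_period):
    """Build one set_optimize_registers command per module to be retimed."""
    commands = []
    for module in modules:
        commands.append(
            "set_optimize_registers true \\\n"
            f"  -design {module} \\\n"
            "  -check_design -verbose -print_critical_loop \\\n"
            f"  -clock ideal_clock1 -delay_threshold {clock_period}\n"
        )
    return "".join(commands)


# One module listed under `retiming`, with the 1.5ns clock period.
print(render_retiming(["FPAdd2stage_noparam"], 1.5))

# An empty list produces no commands, like the template's default([]).
assert render_retiming([], 1.5) == ""
```

Each listed module gets its own command, so a design with several retimed modules would simply list them all under retiming.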

Now rerun pyhflow and verify the run scripts for synthesis now include the set_optimize_registers command.

% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% pyhflow ../designs/tut13-fpadd-2stage.yml
% less ./02-synopsys-dc-synth/run.tcl

Assuming everything looks good, let's rerun synthesis and look at the timing report again.

% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
% cat ./02-synopsys-dc-synth/timing.rpt

The timing report should look similar to what is shown below.

  Startpoint: v/ideal_clock1_r_REG58_S1
              (rising edge-triggered flip-flop clocked by ideal_clock1)
  Endpoint: v/ideal_clock1_r_REG13_S2
            (rising edge-triggered flip-flop clocked by ideal_clock1)
  Path Group: ideal_clock1
  Path Type: max

  Des/Clust/Port     Wire Load Model       Library
  ------------------------------------------------
  FPAdd2stage_noparam
                     5K_hvratio_1_1        NangateOpenCellLibrary

  Point                                       Fanout      Incr       Path
  --------------------------------------------------------------------------
  clock ideal_clock1 (rise edge)                        0.0000     0.0000
  clock network delay (ideal)                           0.0000     0.0000
  v/ideal_clock1_r_REG58_S1/CK (DFF_X1)                 0.0000     0.0000 r
  v/ideal_clock1_r_REG58_S1/Q (DFF_X1)                  0.0803     0.0803 f
  v/n1594 (net)                                 2       0.0000     0.0803 f
  v/U14/ZN (OR2_X1)                                     0.0691     0.1494 f
  v/n34 (net)                                   4       0.0000     0.1494 f
  v/U57/ZN (NOR2_X1)                                    0.0900     0.2395 r
  ...
  v/U943/ZN (AND3_X1)                                   0.0366     1.3132 f
  v/n774 (net)                                  1       0.0000     1.3132 f
  v/U944/ZN (AND2_X1)                                   0.0373     1.3505 f
  v/n776 (net)                                  1       0.0000     1.3505 f
  v/U945/ZN (NOR4_X1)                                   0.0863     1.4368 r
  v/n1692 (net)                                 1       0.0000     1.4368 r
  v/ideal_clock1_r_REG13_S2/D (DFF_X1)                  0.0090     1.4458 r
  data arrival time                                                1.4458

  clock ideal_clock1 (rise edge)                        1.5000     1.5000
  clock network delay (ideal)                           0.0000     1.5000
  v/ideal_clock1_r_REG13_S2/CK (DFF_X1)                 0.0000     1.5000 r
  library setup time                                   -0.0400     1.4600
  data required time                                               1.4600
  --------------------------------------------------------------------------
  data required time                                               1.4600
  data arrival time                                               -1.4458
  --------------------------------------------------------------------------
  slack (MET)                                                      0.0142

We are now able to meet timing. The synthesis tool has retimed both the input and output registers which is why the critical path starts and ends at registers with new names.

Let's go ahead and push the 2-stage floating-point adder through the rest of the ASIC automated flow.

% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% ./03-synopsys-vcs-ffglsim/run
% ./04-cadence-innovus-pnr/run
% ./05-synopsys-vcs-baglsim/run
% ./06-synopsys-pt-pwr/run
% ./07-summarize-results/run

The final summary is shown below.

 timestamp           = 2025-04-06 11:57:03
 design_name         = FPAdd2stage_noparam
 clock_period        = 1.5
 rtlsim              = 1/1 passed
 synth_setup_slack   = 0.0142 ns
 synth_num_stdcells  = 1759
 synth_area          = 2272.970 um^2
 ffglsim             = 1/1 passed
 pnr_setup_slack     = 0.2948 ns
 pnr_hold_slack      = 0.0108 ns
 pnr_clk_ins_src_lat = 0 ns
 pnr_num_stdcells    = 1851
 pnr_area            = 2372.454 um^2
 baglsim             = 1/1 passed

Compare these results to the results for the 1-stage floating-point adder. The 1-stage design met timing at 3ns, while the 2-stage design is able to meet timing at 1.5ns. The trade-off is area. The 2-stage design requires 2372um^2 while the 1-stage design only required 2005um^2 (18% increase). The energy for the 2-stage design would also likely be higher.
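The comparison can be worked through with the numbers from the two summaries. The sketch below assumes both designs accept one new transaction per cycle, so throughput is one result per clock period and latency is the number of stages times the clock period. The dictionary layout and names are only for illustration.

```python
designs = {
    "1-stage": {"clock_period_ns": 3.0, "stages": 1, "pnr_area_um2": 2005.108},
    "2-stage": {"clock_period_ns": 1.5, "stages": 2, "pnr_area_um2": 2372.454},
}

for name, design in designs.items():
    period = design["clock_period_ns"]
    latency_ns = period * design["stages"]   # time from input to valid output
    results_per_ns = 1.0 / period            # assumes one transaction per cycle
    print(f"{name}: latency = {latency_ns} ns, throughput = {results_per_ns:.3f} results/ns")

area_increase = (designs["2-stage"]["pnr_area_um2"]
                 / designs["1-stage"]["pnr_area_um2"] - 1.0)
print(f"area increase: {area_increase:.1%}")   # area increase: 18.3%
```

Under these assumptions the latency of a single addition stays at 3ns in both designs, while the 2-stage design doubles the throughput at the cost of roughly 18% more area.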