6745 Tutorial 13: DesignWare and Retiming
Synopsys Design Compiler (DC) includes the DesignWare (DW) library which is a collection of hardware components implementing arbiters, integer arithmetic units, floating-point arithmetic units, and memories. The Synopsys DW components also have optimized gate-level implementations that Synopsys DC can use when synthesizing your design. This tutorial will describe how these components can be used either through automatic inference or explicit instantiation. You can see a list of all of the available Synopsys DW components in the user guide here:
The user guide shows which units can be automatically inferred from an operator or function and which can only be used through explicit instantiation. Since most of the arithmetic units are combinational, the tutorial will also discuss how you can use register retiming to automatically pipeline these units so they can operate at higher clock frequencies. This tutorial assumes you have already completed the tutorials on Linux, Git, PyMTL, Verilog, ASIC front-end flow, ASIC back-end flow, and ASIC automated flow.
The first step is to access ecelinux. Use VS Code to log into a specific ecelinux server. Once you are at the ecelinux prompt, source the setup script, clone this repository from GitHub, and define an environment variable to keep track of the top directory for the project.
% source setup-ece6745.sh
% mkdir -p $HOME/ece6745
% cd $HOME/ece6745
% git clone git@github.com:cornell-ece6745/ece6745-tut13-dw tut13
% cd tut13
% export TOPDIR=$PWD
1. Synopsys DesignWare Automatic Inference
Let's start by exploring how Synopsys DC can automatically infer the use of Synopsys DW components by reviewing the sort unit from earlier tutorials. Recall the sort unit is implemented using a three-stage pipelined, bitonic sorting network and the datapath is shown below.
Let's look at the min/max unit:
module tut3_verilog_sort_MinMaxUnit
#(
  parameter p_nbits = 1
)(
  input  logic [p_nbits-1:0] in0,
  input  logic [p_nbits-1:0] in1,
  output logic [p_nbits-1:0] out_min,
  output logic [p_nbits-1:0] out_max
);
  always_comb begin
    // Find min/max
    if ( in0 >= in1 ) begin
      out_max = in0;
      out_min = in1;
    end
    else if ( in0 < in1 ) begin
      out_max = in1;
      out_min = in0;
    end
    // Handle case where there is an X in the input
    else begin
      out_min = 'x;
      out_max = 'x;
    end
  end
endmodule
Notice how this unit uses two comparison operators, one for greater-than-or-equal and one for less-than. We will see how Synopsys DC is able to automatically infer the use of two Synopsys DW components for these operators.
First, we need to run the tests and interactive simulator to create the Verilog test benches which we can use for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.
% mkdir -p $TOPDIR/sim/build
% cd $TOPDIR/sim/build
% pytest ../tut3_verilog/sort/test --test-verilog --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input random --stats --translate --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input zeros --stats --translate --dump-vtb
Now let's use the ASIC automated flow to push the sort unit through synthesis and place-and-route.
% mkdir -p $TOPDIR/asic/build-tut08-sort
% cd $TOPDIR/asic/build-tut08-sort
% pyhflow ../designs/tut08-sort.yml
% ./run-flow
Then you can look at the resources report generated by Synopsys DC to see what Synopsys DW components were inferred.
You should see something like this.
================================================================
| Cell | Module | Parameters | Contained Operations |
================================================================
| gte_x_1 | DW_cmp | width=8 | gte_30 (MinMaxUnit.v:30) |
| lt_x_2 | DW_cmp | width=8 | lt_34 (MinMaxUnit.v:34) |
================================================================
Implementation Report
========================================
| | | Current |
| Cell | Module | Implementation |
========================================
| gte_x_1 | DW_cmp | apparch (area) |
| lt_x_2 | DW_cmp | apparch (area) |
========================================
The report shows how Synopsys DC was able to infer the use of a Synopsys DW comparator (DW_cmp). You can learn more about this component from its datasheet here:
You will see that the component includes three different microarchitectures:
rpl: Ripple carry
pparch: Delay-optimized flexible parallel-prefix
apparch: Area-optimized flexible architecture
Since the clock constraint is relatively generous (140ps of positive slack), Synopsys DC has decided to use a more area-optimized implementation.
2. Synopsys DesignWare Explicit Instantiation
Synopsys DC will do its best to infer Synopsys DW components whenever possible, but many components can only be used by explicitly instantiating the component in your Verilog. In this section, we will look at two examples: (1) instantiating a six-function comparator in the sort unit; and (2) instantiating a floating-point adder.
2.1. Explicitly Instantiating Six-Function Comparator
To illustrate how explicit instantiation works, let's use a six-function comparator to implement the min/max unit. Review the corresponding datasheet here:
Now go ahead and modify the min/max unit to explicitly instantiate and use this six-function comparator as shown below.
module tut3_verilog_sort_MinMaxUnit
#(
  parameter p_nbits = 1
)(
  input  logic [p_nbits-1:0] in0,
  input  logic [p_nbits-1:0] in1,
  output logic [p_nbits-1:0] out_min,
  output logic [p_nbits-1:0] out_max
);
  logic lt;
  logic gt;
  logic eq;
  logic le;
  logic ge;
  logic ne;
  DW01_cmp6#(p_nbits) cmp_gt
  (
    .A  (in0),
    .B  (in1),
    .TC (1'b0),
    .LT (lt),
    .GT (gt),
    .EQ (eq),
    .LE (le),
    .GE (ge),
    .NE (ne)
  );
  assign out_max = gt ? in0 : in1;
  assign out_min = lt ? in0 : in1;
endmodule
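Before rerunning the hardware tests, it can help to convince yourself that the two multiplexers above really produce the minimum and maximum, including when the inputs are equal. The following is a minimal behavioral sketch in Python, not the Synopsys DW implementation: the function name minmax_cmp6 is made up for illustration, and it models only the unsigned GT and LT outputs (TC tied to 0) for 8-bit inputs.

```python
# Behavioral sketch only: models the GT and LT outputs of an unsigned
# six-function comparator and the two muxes from the min/max unit.
# The function name is hypothetical and not part of the tutorial sources.

def minmax_cmp6(in0, in1):
    gt = in0 > in1                   # GT output (unsigned comparison)
    lt = in0 < in1                   # LT output (unsigned comparison)
    out_max = in0 if gt else in1     # assign out_max = gt ? in0 : in1;
    out_min = in0 if lt else in1     # assign out_min = lt ? in0 : in1;
    return out_min, out_max

# Exhaustively compare against Python's min/max for all 8-bit input pairs.
mismatches = [
    (a, b)
    for a in range(256)
    for b in range(256)
    if minmax_cmp6(a, b) != (min(a, b), max(a, b))
]
print("mismatches:", len(mismatches))
```

When the inputs are equal, both GT and LT are false, so both outputs select in1, which is still a correct answer.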
Now try to rerun the tests.
The tests will fail because Verilator cannot find the implementation of the Synopsys DW component. Add the following include directive at the top of the implementation of the min/max unit:
Now Verilator will be able to find the implementation of the Synopsys DW component, but it produces a warning about an implicit static function. We will need to disable this warning when processing the Synopsys DW component using Verilator's special linting comments.
/* verilator lint_off IMPLICITSTATIC */
`include "/opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW01_cmp6.v"
/* verilator lint_on IMPLICITSTATIC */
Now the tests should all pass so we can now regenerate the Verilog test benches for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.
% cd $TOPDIR/sim/build
% pytest ../tut3_verilog/sort/test --test-verilog --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input random --stats --translate --dump-vtb
% ../tut3_verilog/sort/sort-sim --impl rtl-struct --input zeros --stats --translate --dump-vtb
Now let's push the sort unit through the ASIC automated flow again. We will start by just running the first two steps and looking at the resources report.
% cd ${TOPDIR}/asic/build-tut08-sort
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
% cat 02-synopsys-dc-synth/resources.rpt
You should see something like this:
==================================================================
| Cell | Module | Parameters | Contained Operations |
==================================================================
| cmp_gt | DW01_cmp6 | width=8 | cmp_gt (MinMaxUnit.v:38) |
==================================================================
Implementation Report
==========================================
| | | Current |
| Cell | Module | Implementation |
==========================================
| cmp_gt | DW01_cmp6 | apparch (area) |
==========================================
This clearly indicates that Synopsys DC is now using the explicitly instantiated six-function comparator instead of automatically inferring a two-function comparator.
Let's go ahead and push the sort unit through the rest of the ASIC automated flow.
% cd $TOPDIR/asic/build-tut08-sort
% ./03-synopsys-vcs-ffglsim/run
% ./04-cadence-innovus-pnr/run
% ./05-synopsys-vcs-baglsim/run
% ./06-synopsys-pt-pwr/run
% ./07-summarize-results/run
Since the implementation now depends on Verilog code outside the source tree, your tests will no longer work on GitHub Actions. You can solve this by copying the Verilog corresponding to the explicitly instantiated components into your source tree. For example, we can copy the Verilog for the six-function comparator into a dw subdirectory.
% mkdir -p $TOPDIR/sim/dw
% cd $TOPDIR/sim/dw
% cp /opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW01_cmp6.v .
Then modify the include directive at the top of the implementation of the min/max unit appropriately.
Note that since the Verilog provided by Synopsys DW is copyrighted, you should not make it public.
2.2. Explicitly Instantiating Floating-Point Adder
This section will further illustrate how to use Synopsys DW components by explicitly instantiating a floating-point adder. You can learn more about the Synopsys DW component for a floating-point adder from its datasheet here:
We have already shown how to explicitly instantiate this Synopsys DW component along with input registers to create a single-stage floating-point adder. Look at the implementation provided in FPAdd1stage.v.
The implementation is shown below.
module tut13_dw_FPAdd1stage
(
  input  logic        clk,
  input  logic        reset,
  input  logic        in_val,
  input  logic [31:0] in0,
  input  logic [31:0] in1,
  output logic        out_val,
  output logic [31:0] out
);
  // pipeline registers
  logic        val_X0;
  logic [31:0] in0_X0;
  logic [31:0] in1_X0;
  always_ff @(posedge clk) begin
    if ( reset )
      val_X0 <= 1'b0;
    else
      val_X0 <= in_val;
    in0_X0 <= in0;
    in1_X0 <= in1;
  end
  // floating-point adder
  logic [7:0]  status_X0;
  logic [31:0] out_X0;
  DW_fp_add
  #(
    .sig_width       (23),
    .exp_width       (8),
    .ieee_compliance (1)
  )
  fp_add
  (
    .a      (in0_X0),
    .b      (in1_X0),
    .rnd    (3'b000),
    .z      (out_X0),
    .status (status_X0)
  );
  // output logic
  assign out_val = val_X0;
  assign out     = out_X0 & {32{val_X0}};
endmodule
We configure the floating-point adder to support 32-bit floating point in standard single-precision IEEE format. The Synopsys DW component supports disabling IEEE compliance, different rounding modes, and status flags. We also need to explicitly include the Synopsys DW behavioral Verilog files. Let's go ahead and copy them into a dw directory in the source tree.
% mkdir -p $TOPDIR/sim/dw
% cd $TOPDIR/sim/dw
% cp /opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW_fp_addsub.v .
% cp /opt/synopsys/syn/V-2023.12-SP5/dw/sim_ver/DW_fp_add.v .
Notice how we have to copy two files since DW_fp_add.v uses the module defined in DW_fp_addsub.v. You may need to experiment to ensure you have copied all of the files required for the desired Synopsys DW component.
Now add the following include directives at the top of the FPAdd1stage.v file.
/* verilator lint_off LATCH */
`include "dw/DW_fp_addsub.v"
`include "dw/DW_fp_add.v"
/* verilator lint_on LATCH */
Here we are using Verilator's special linting comments to turn off linting checks for inferred latches. You may need to experiment to ensure you have turned off the right linting checks so that Verilator can use the Synopsys DW behavioral Verilog component.
Examine the simple basic test we have provided for the floating-point adder.
The basic test case along with some helper functions is shown below.
def fp2bits( fp ):
  if fp == '?':
    return '?'
  else:
    return Bits32(int.from_bytes( pack( '>f', fp ), byteorder='big' ))
def row( in_val, in0, in1, out_val, out ):
  return [ in_val, fp2bits(in0), fp2bits(in1), out_val, fp2bits(out) ]
def test_basic( cmdline_opts ):
  run_test_vector_sim( FPAdd1stage(), [
    ( 'in_val in0 in1 out_val* out*' ),
    row( 0, 0.00, 0.00, 0, '?' ),
    row( 1, 1.00, 1.00, 0, '?' ),
    row( 1, 1.50, 1.50, 1, 2.00 ),
    row( 1, 1.25, 2.50, 1, 3.00 ),
    row( 0, 0.00, 0.00, 1, 3.75 ),
    row( 0, 0.00, 0.00, 0, '?' ),
  ], cmdline_opts )
We can use the Python struct module to convert a Python floating-point variable into 32-bit IEEE single-precision format. Here is an example:
The encoding of 0x3fc00000 matches what we expect when using an IEEE-754 floating-point converter such as this:
We need to use int.from_bytes to convert a byte array into an integer, which is required when creating a Bits32 object.
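The conversion can be tried on its own without PyMTL. The sketch below uses only the Python standard library; the helper name float_to_bits32 is made up for illustration, and it returns a plain integer rather than a Bits32 object.

```python
from struct import pack

# Hypothetical helper for illustration; it mirrors the conversion used in
# fp2bits but returns a plain int instead of a PyMTL Bits32 object.
def float_to_bits32(value):
    raw = pack('>f', value)                       # 4 bytes, big-endian, IEEE single precision
    return int.from_bytes(raw, byteorder='big')   # bytes -> unsigned integer

for value in (1.0, 1.5, 2.0, 3.75):
    print(f"{value:5.2f} -> 0x{float_to_bits32(value):08x}")
```

The value 1.5 encodes to 0x3fc00000: a sign bit of 0, a biased exponent of 127, and a fraction whose most significant bit is set.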
Let's go ahead and run this basic test.
Now we are ready to generate a Verilog test bench which we can use for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.
Now let's push the 1-stage floating-point adder through the ASIC automated flow again. We will start by just running the first two steps and looking at the synthesis reports.
% mkdir -p $TOPDIR/asic/build-tut13-fpadd-1stage
% cd $TOPDIR/asic/build-tut13-fpadd-1stage
% pyhflow ../designs/tut13-fpadd-1stage.yml
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
Let's first check the resources report to confirm that Synopsys DC is indeed using the Synopsys DW component for the floating-point adder as expected.
The resources report shows how Synopsys DC ultimately ended up using not just one Synopsys DW component, but many components which together implement the floating-point addition. For example, consider this part of the resources report.
===============================================================================
| Cell | Module | Parameters | Contained Operations |
===============================================================================
| lt_x_1 | DW_cmp | width=31 | lt_189 |
| sub_x_6 | DW01_sub | width=8 | sub_230 |
| ashr_7 | DW_rightsh | A_width=26,SH_width=8 | srl_235_lsb_trim |
| ash_8 | DW_leftsh | A_width=26,SH_width=8 | sll_237 |
| gt_x_10 | DW_cmp | width=8 | gt_253 |
| ash_12 | DW_leftsh | A_width=27,SH_width=5 | sll_264 |
| add_x_16 | DW01_inc | width=23 | add_301 |
| U1 | DW_lzd | a_width=27 | U1 |
| DP_OP_54J1| DP_OP_54J1 | | |
| DP_OP_55J1| DP_OP_55J1 | | |
===============================================================================
Here we can see that Synopsys DC is using Synopsys DW components for comparators, subtractors, shifters, incrementers, and leading-zero detectors. The bottom two rows tell us that Synopsys DC has also created some custom components by unmerging and merging Synopsys DW components. You can learn more about these custom operators later in the report.
Datapath Report for DP_OP_54J1_124_7007
==============================================================================
| Cell | Contained Operations |
==============================================================================
| DP_OP_54J1_124_7007 | add_247 add_247_2 |
==============================================================================
==============================================================================
| | | Data | | |
| Var | Type | Class | Width | Expression |
==============================================================================
| I1 | PI | Unsigned | 27 | |
| I2 | PI | Unsigned | 28 | |
| I3 | PI | Unsigned | 1 | |
| O1 | PO | Unsigned | 28 | I1 + I2 + I3 |
==============================================================================
Datapath Report for DP_OP_55J1_125_9206
==============================================================================
| Cell | Contained Operations |
==============================================================================
| DP_OP_55J1_125_9206 | add_304 sub_305 |
==============================================================================
==============================================================================
| | | Data | | |
| Var | Type | Class | Width | Expression |
==============================================================================
| I1 | PI | Unsigned | 8 | |
| I2 | PI | Unsigned | 5 | |
| O1 | PO | Unsigned | 9 | I1 + $unsigned(1'b1) |
| O2 | PO | Signed | 10 | O1 - I2 |
==============================================================================
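The DP_OP_54J1 custom component implements a three-input adder which adds a 27-bit, 28-bit, and 1-bit input to produce a 28-bit output. The DP_OP_55J1 custom component implements a combined addition/subtraction operation: it increments an 8-bit input to produce a 9-bit intermediate value (O1 = I1 + 1) and then subtracts a 5-bit input from it to produce a signed 10-bit output (O2 = O1 - I2).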
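The output widths reported for DP_OP_55J1 can be sanity checked with a little range arithmetic. The sketch below is only an illustration under the assumption that both inputs can take any value their width allows; Synopsys DC derives widths from the RTL and may reason differently. The helper name signed_bits_needed is made up for this example.

```python
# Input ranges taken from the widths in the datapath report, assuming the
# inputs can take any value of that width (an assumption for illustration).
I1_MAX = 2**8 - 1   # I1 is 8-bit unsigned
I2_MAX = 2**5 - 1   # I2 is 5-bit unsigned

def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that can hold every value in [lo, hi]."""
    width = 1
    while not (-(2**(width - 1)) <= lo and hi <= 2**(width - 1) - 1):
        width += 1
    return width

# O1 = I1 + 1
o1_min, o1_max = 0 + 1, I1_MAX + 1
# O2 = O1 - I2 (smallest when O1 is minimal and I2 is maximal)
o2_min, o2_max = o1_min - I2_MAX, o1_max - 0

print("O1 range:", (o1_min, o1_max), "unsigned bits:", o1_max.bit_length())
print("O2 range:", (o2_min, o2_max), "signed bits:", signed_bits_needed(o2_min, o2_max))
```

Under these assumptions O1 can reach 256, which needs 9 unsigned bits, and O2 can be negative while also reaching 256, which needs a signed 10-bit result. Both agree with the widths in the report.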
Now let's check the timing report.
The timing report should look similar to what is shown below.
Startpoint: v/in1_reg_reg[7]
(rising edge-triggered flip-flop clocked by ideal_clock1)
Endpoint: out[20] (output port clocked by ideal_clock1)
Path Group: ideal_clock1
Path Type: max
Des/Clust/Port Wire Load Model Library
------------------------------------------------
FPAdd1stage_noparam
5K_hvratio_1_1 NangateOpenCellLibrary
Point Fanout Incr Path
-----------------------------------------------------------
clock ideal_clock1 (rise edge) 0.0000 0.0000
clock network delay (ideal) 0.0000 0.0000
v/in1_X0_reg[7]/CK (DFF_X1) 0.0000 0.0000 r
v/in1_X0_reg[7]/Q (DFF_X1) 0.0790 0.0790 f
v/in1_reg[7] (net) 1 0.0000 0.0790 f
v/U44/ZN (OR2_X2) 0.0525 0.1315 f
v/n246 (net) 2 0.0000 0.1315 f
v/U493/ZN (OAI211_X1) 0.0368 0.1683 r
...
v/U442/ZN (XNOR2_X1) 0.0528 2.8226 f
v/n213 (net) 1 0.0000 2.8226 f
v/U474/ZN (NOR2_X1) 0.0359 2.8584 r
v/n1483 (net) 1 0.0000 2.8584 r
v/U39/ZN (OR2_X2) 0.0452 2.9036 r
v/out[20] (net) 1 0.0000 2.9036 r
v/out[20] (tut13_dw_FPAdd1stage) 0.0000 2.9036 r
out[20] (net) 0.0000 2.9036 r
out[20] (out) 0.0456 2.9492 r
data arrival time 2.9492
clock ideal_clock1 (rise edge) 3.0000 3.0000
clock network delay (ideal) 0.0000 3.0000
output external delay -0.0500 2.9500
data required time 2.9500
-----------------------------------------------------------
data required time 2.9500
data arrival time -2.9492
-----------------------------------------------------------
slack (MET) 0.0008
The clock period constraint was set to be 3ns. The design is able to meet this constraint with a critical path that passes through almost 60 logic gates.
Let's go ahead and push the 1-stage floating-point adder through the rest of the ASIC automated flow.
% cd $TOPDIR/asic/build-tut13-fpadd-1stage
% ./03-synopsys-vcs-ffglsim/run
% ./04-cadence-innovus-pnr/run
% ./05-synopsys-vcs-baglsim/run
% ./06-synopsys-pt-pwr/run
% ./07-summarize-results/run
The final summary is shown below.
timestamp = 2025-04-06 11:17:28
design_name = FPAdd1stage_noparam
clock_period = 3.0
rtlsim = 1/1 passed
synth_setup_slack = 0.0008 ns
synth_num_stdcells = 1713
synth_area = 1985.956 um^2
ffglsim = 1/1 passed
pnr_setup_slack = 0.2676 ns
pnr_hold_slack = 0.0100 ns
pnr_clk_ins_src_lat = 0 ns
pnr_num_stdcells = 1760
pnr_area = 2005.108 um^2
baglsim = 1/1 passed
3. Synopsys Design Compiler for Register Retiming
While it can be very useful to leverage Synopsys DW components, what do we do if the provided component does not meet timing? In the previous section, our floating-point adder met the 3ns clock period constraint, but what if our target constraint is 1.5ns? Normally, we would consider pipelining the floating-point adder, but this is not possible since we did not implement the floating-point adder ourselves. Even if we did implement the floating-point adder, pipelining complex arithmetic units can be quite tedious. To address this issue, we can use a powerful technique called register retiming where the synthesis tool will automatically move pipeline registers to try to balance the pipeline stages. If we add an extra stage of pipeline registers at the end of the floating-point adder, then the synthesis tool can push these registers into the combinational logic to reduce the critical path.
To illustrate register retiming, we have provided a 2-stage floating-point adder in FPAdd2stage.v.
This implementation is similar to the 1-stage floating-point adder except for the extra set of retiming registers shown below.
// retiming registers
logic        val_X1;
logic [31:0] out_X1;
always_ff @(posedge clk) begin
  if ( reset )
    val_X1 <= 1'b0;
  else
    val_X1 <= val_X0;
  out_X1 <= out_X0;
end
// output logic
assign out_val = val_X1;
assign out     = out_X1 & {32{val_X1}};
This looks strange since we are adding a set of pipeline registers after the floating-point adder. Without register retiming this would make no sense since these extra retiming registers will not actually reduce the critical path. The key idea, though, is that register retiming will enable the synthesis tool to move these retiming registers into the middle of the combinational logic for the floating-point adder.
Let's run the tests for our 2-stage floating point adder.
The trace output is shown in part below.
../tut13_dw/test/FPAdd1stage_test.py::test_basic
1r in_val=0, in0=00000000, in1=00000000, out=00000000, out_val=0
2r in_val=0, in0=00000000, in1=00000000, out=00000000, out_val=0
3: in_val=0, in0=00000000, in1=00000000, out=00000000, out_val=0
4: in_val=1, in0=3f800000, in1=3f800000, out=00000000, out_val=0
5: in_val=1, in0=3fc00000, in1=3fc00000, out=00000000, out_val=0
6: in_val=1, in0=3fa00000, in1=40200000, out=40000000, out_val=1
7: in_val=0, in0=00000000, in1=00000000, out=40400000, out_val=1
8: in_val=0, in0=00000000, in1=00000000, out=40700000, out_val=1
We can now see that a transaction takes two cycles instead of one. The first transaction goes into the floating-point adder on cycle 4 and the result is valid on cycle 6. Let's run the tests to create the Verilog test benches which we can use for four-state RTL, fast-functional gate-level, and back-annotated gate-level simulation.
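The latency difference between the two designs can be reproduced with a small cycle-level model. This is a rough Python sketch, not the tutorial's test harness: the names f32 and simulate are made up for illustration, the model stores the sum in its first register rather than the operands (which gives the same cycle-level behavior), and it reports masked outputs as 0.0.

```python
from struct import pack, unpack

def f32(x):
    """Round a Python float to IEEE single precision."""
    return unpack('>f', pack('>f', x))[0]

def simulate(nstages, inputs):
    """Cycle-level sketch of an nstages-deep adder pipeline.

    inputs is a list of (in_val, in0, in1) tuples, one per cycle. Returns
    the (out_val, out) pair visible during each cycle before the clock edge.
    """
    pipe = [(0, 0.0)] * nstages     # reset state: every valid bit cleared
    outputs = []
    for in_val, in0, in1 in inputs:
        val, result = pipe[-1]      # the last register drives the outputs
        outputs.append((val, result if val else 0.0))   # AND mask on out
        # Clock edge: shift the pipeline and capture the new transaction.
        pipe = [(in_val, f32(in0 + in1))] + pipe[:-1]
    return outputs

# Same stimulus as the basic test case.
stimulus = [
    (0, 0.00, 0.00),
    (1, 1.00, 1.00),
    (1, 1.50, 1.50),
    (1, 1.25, 2.50),
    (0, 0.00, 0.00),
    (0, 0.00, 0.00),
]

for nstages in (1, 2):
    print(nstages, "stage(s):", simulate(nstages, stimulus))
```

With one stage, the sum 2.0 appears one cycle after the operands are presented; with two stages it appears two cycles later, matching the trace above.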
Now let's push the 2-stage floating-point adder through the first two steps of the ASIC automated flow and look at the synthesis timing reports.
% mkdir -p $TOPDIR/asic/build-tut13-fpadd-2stage
% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% pyhflow ../designs/tut13-fpadd-2stage.yml
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
% cat ./02-synopsys-dc-synth/timing.rpt
The timing report should look similar to what is shown below.
Startpoint: v/in1_X0_reg[0]
(rising edge-triggered flip-flop clocked by ideal_clock1)
Endpoint: v/out_X1_reg[22]
(rising edge-triggered flip-flop clocked by ideal_clock1)
Path Group: ideal_clock1
Path Type: max
Des/Clust/Port Wire Load Model Library
------------------------------------------------
FPAdd2stage_noparam
5K_hvratio_1_1 NangateOpenCellLibrary
Point Fanout Incr Path
--------------------------------------------------------------------------
clock ideal_clock1 (rise edge) 0.0000 0.0000
clock network delay (ideal) 0.0000 0.0000
v/in1_X0_reg[0]/CK (DFF_X1) 0.0000 0.0000 r
v/in1_X0_reg[0]/Q (DFF_X1) 0.0929 0.0929 r
v/in1_X0[0] (net) 4 0.0000 0.0929 r
v/U318/ZN (AND2_X1) 0.0445 0.1373 r
v/n217 (net) 1 0.0000 0.1373 r
v/U580/ZN (NAND2_X1) 0.0246 0.1619 f
...
v/U368/ZN (AND3_X2) 0.0677 2.3291 f
v/n2090 (net) 11 0.0000 2.3291 f
v/U2004/ZN (NAND2_X1) 0.0377 2.3667 r
v/n1854 (net) 1 0.0000 2.3667 r
v/U2005/ZN (NAND2_X1) 0.0254 2.3922 f
v/n2221 (net) 1 0.0000 2.3922 f
v/out_X1_reg[22]/D (DFF_X1) 0.0086 2.4007 f
data arrival time 2.4007
clock ideal_clock1 (rise edge) 1.5000 1.5000
clock network delay (ideal) 0.0000 1.5000
v/out_X1_reg[22]/CK (DFF_X1) 0.0000 1.5000 r
library setup time -0.0395 1.4605
data required time 1.4605
--------------------------------------------------------------------------
data required time 1.4605
data arrival time -2.4007
--------------------------------------------------------------------------
slack (VIOLATED) -0.9402
Since we are using a 2-stage floating-point adder, we reduced the clock period constraint to 1.5ns, but we are not able to meet timing. This is because even though we added a set of retiming registers, we have not actually enabled retiming, so the critical path still goes through the entire floating-point adder. The tools try hard but miss timing with a negative slack of 940ps.
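The slack values in the timing reports follow from simple arithmetic: the data required time is the clock period minus a margin, and the slack is the required time minus the data arrival time. The Python sketch below redoes that arithmetic with numbers copied from the two reports shown so far; the function name setup_slack is made up for illustration.

```python
def setup_slack(clock_period, margin, arrival):
    """Slack = (clock period - margin) - data arrival time, all in ns.

    The margin is the library setup time when the endpoint is a register,
    or the output external delay when the endpoint is an output port.
    """
    required = clock_period - margin
    return round(required - arrival, 4)

# 1-stage design: the path ends at the out[20] output port.
slack_1stage = setup_slack(3.0, 0.0500, 2.9492)
# 2-stage design without retiming: the path ends at out_X1_reg[22].
slack_2stage = setup_slack(1.5, 0.0395, 2.4007)

for name, slack in (("1-stage", slack_1stage), ("2-stage", slack_2stage)):
    status = "MET" if slack >= 0 else "VIOLATED"
    print(f"{name}: slack = {slack:+.4f} ns ({status})")
```

Note that the arrival time of the 2-stage design (about 2.4ns) is still close to the full delay of the floating-point adder, which is why halving the clock period alone cannot work.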
We need to use the set_optimize_registers command in the TCL script for Synopsys DC to enable register retiming for specific modules in our design. The command would look like this:
set_optimize_registers true \
-check_design -verbose -print_critical_loop \
-design FPAdd2stage_noparam \
-clock ideal_clock1 \
-delay_threshold 1.5
We have support for register retiming in the ASIC automated flow. You can see it in the synthesis step template.
Search through the TCL file to find the part related to register retiming which should be similar to what is shown below.
{% for module in retiming | default([]) -%}
set_optimize_registers true \
-design {{module}} \
-check_design -verbose -print_critical_loop \
-clock ideal_clock1 -delay_threshold {{clock_period}}
{% endfor %}
The retiming variable in the YAML design file is used to specify a list of module names that should be retimed. Modify tut13-fpadd-2stage.yml using VS Code.
Add the following to indicate that the FPAdd2stage_noparam module should be retimed.
Now rerun pyhflow and verify the run scripts for synthesis now include the set_optimize_registers command.
% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% pyhflow ../designs/tut13-fpadd-2stage.yml
% less ./02-synopsys-dc-synth/run.tcl
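To get a feel for what you should find in the generated TCL script, the loop in the synthesis step template can be mimicked in plain Python. This is only a sketch of the template expansion, not how pyhflow is implemented: the function name retiming_commands and the dictionary standing in for the parsed YAML file are assumptions made for illustration, with key names taken from the template variables retiming and clock_period.

```python
def retiming_commands(config):
    """Emit one set_optimize_registers command per module to be retimed."""
    commands = []
    # A missing key yields no commands, mirroring `default([])` in the template.
    for module in config.get("retiming", []):
        commands.append(
            "set_optimize_registers true \\\n"
            f"  -design {module} \\\n"
            "  -check_design -verbose -print_critical_loop \\\n"
            f"  -clock ideal_clock1 -delay_threshold {config['clock_period']}"
        )
    return commands

# Hypothetical, already-parsed design configuration.
config = {"clock_period": 1.5, "retiming": ["FPAdd2stage_noparam"]}
print("\n".join(retiming_commands(config)))
```

If the list of modules is absent, no command is emitted, which is why the first synthesis run above did not retime anything.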
Assuming everything looks good, let's rerun synthesis and look at the timing report again.
% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% ./01-synopsys-vcs-rtlsim/run
% ./02-synopsys-dc-synth/run
% cat ./02-synopsys-dc-synth/timing.rpt
The timing report should look similar to what is shown below.
Startpoint: v/ideal_clock1_r_REG58_S1
(rising edge-triggered flip-flop clocked by ideal_clock1)
Endpoint: v/ideal_clock1_r_REG13_S2
(rising edge-triggered flip-flop clocked by ideal_clock1)
Path Group: ideal_clock1
Path Type: max
Des/Clust/Port Wire Load Model Library
------------------------------------------------
FPAdd2stage_noparam
5K_hvratio_1_1 NangateOpenCellLibrary
Point Fanout Incr Path
--------------------------------------------------------------------------
clock ideal_clock1 (rise edge) 0.0000 0.0000
clock network delay (ideal) 0.0000 0.0000
v/ideal_clock1_r_REG58_S1/CK (DFF_X1) 0.0000 0.0000 r
v/ideal_clock1_r_REG58_S1/Q (DFF_X1) 0.0803 0.0803 f
v/n1594 (net) 2 0.0000 0.0803 f
v/U14/ZN (OR2_X1) 0.0691 0.1494 f
v/n34 (net) 4 0.0000 0.1494 f
v/U57/ZN (NOR2_X1) 0.0900 0.2395 r
...
v/U943/ZN (AND3_X1) 0.0366 1.3132 f
v/n774 (net) 1 0.0000 1.3132 f
v/U944/ZN (AND2_X1) 0.0373 1.3505 f
v/n776 (net) 1 0.0000 1.3505 f
v/U945/ZN (NOR4_X1) 0.0863 1.4368 r
v/n1692 (net) 1 0.0000 1.4368 r
v/ideal_clock1_r_REG13_S2/D (DFF_X1) 0.0090 1.4458 r
data arrival time 1.4458
clock ideal_clock1 (rise edge) 1.5000 1.5000
clock network delay (ideal) 0.0000 1.5000
v/ideal_clock1_r_REG13_S2/CK (DFF_X1) 0.0000 1.5000 r
library setup time -0.0400 1.4600
data required time 1.4600
--------------------------------------------------------------------------
data required time 1.4600
data arrival time -1.4458
--------------------------------------------------------------------------
slack (MET) 0.0142
We are now able to meet timing. The synthesis tool has retimed both the input and output registers which is why the critical path starts and ends at registers with new names.
Let's go ahead and push the 2-stage floating-point adder through the rest of the ASIC automated flow.
% cd $TOPDIR/asic/build-tut13-fpadd-2stage
% ./03-synopsys-vcs-ffglsim/run
% ./04-cadence-innovus-pnr/run
% ./05-synopsys-vcs-baglsim/run
% ./06-synopsys-pt-pwr/run
% ./07-summarize-results/run
The final summary is shown below.
timestamp = 2025-04-06 11:57:03
design_name = FPAdd2stage_noparam
clock_period = 1.5
rtlsim = 1/1 passed
synth_setup_slack = 0.0142 ns
synth_num_stdcells = 1759
synth_area = 2272.970 um^2
ffglsim = 1/1 passed
pnr_setup_slack = 0.2948 ns
pnr_hold_slack = 0.0108 ns
pnr_clk_ins_src_lat = 0 ns
pnr_num_stdcells = 1851
pnr_area = 2372.454 um^2
baglsim = 1/1 passed
Compare these results to the results for the 1-stage floating-point adder. The 1-stage design met timing at 3ns, while the 2-stage design is able to meet timing at 1.5ns. The trade-off is area. The 2-stage design requires 2372um^2 while the 1-stage design only required 2005um^2 (18% increase). The energy for the 2-stage design would also likely be higher.
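The comparison can be reproduced directly from the two summaries. The short Python sketch below uses the post-place-and-route areas and clock periods reported above; the dictionary layout is just for illustration.

```python
# Values copied from the two final summaries above.
results = {
    "1-stage": {"clock_period_ns": 3.0, "pnr_area_um2": 2005.108},
    "2-stage": {"clock_period_ns": 1.5, "pnr_area_um2": 2372.454},
}

area_ratio = results["2-stage"]["pnr_area_um2"] / results["1-stage"]["pnr_area_um2"]
freq_ratio = results["1-stage"]["clock_period_ns"] / results["2-stage"]["clock_period_ns"]

print(f"area increase:   {100 * (area_ratio - 1):.1f}%")
print(f"clock frequency: {freq_ratio:.1f}x")
```

In other words, retiming buys a clock frequency that is twice as high for roughly 18% more area.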