regional clock gate splitting algorithm for clock tree ...repository.um.edu.my/86272/1/regional...

ICSE2010 Proc. 2010, Melaka, Malaysia

Regional Clock Gate Splitting Algorithm for Clock Tree Synthesis

Siong Kiong Teng Intel Microelectronics (M) Sdn Bhd

Penang, Malaysia

Email:[email protected]

Norhayati Soin Dept. of Electrical Engineering

University of Malaya

Kuala Lumpur, Malaysia

Email:[email protected]

Abstract: This paper presents a new clock distribution design flow that addresses clock gate enable signal timing violations. The clock gate components in a clock tree are prone to setup timing violations because the clock gate skew is normally large, as the clock gates are located near the beginning of the clock tree. Effectively splitting the clock gates down to lower levels of the clock tree reduces the clock gate skew and thus improves the setup margin.

I. INTRODUCTION

Clock gating is widely used in VLSI clock distribution as a method to reduce clock network power dissipation [1]. The clock gating components are part of the clock tree distribution network built during the clock tree synthesis (CTS) process. Therefore, the physical location of the clock gating components has a large impact on the overall clock tree power consumption and on the timing convergence of the clock gate enable signal during performance verification (PV).

In high-speed VLSI designs, meeting the clock gate enable signal setup timing requirement during PV is always a major challenge. A clock gate is normally built using an AND gate [1]. The most commonly used clock gate is an integrated, latch-based clock gate. Latch-based clock gates prevent glitches on the enable signal from propagating to the gated clock output [1]. Fig. 1 shows a commonly used latch-based clock gate design. In this design, the EN signal needs to arrive before the clock to ensure a clean clock gating event.

Figure 1. Latch-based Clock Gate [1] (signals shown: EN, CLK, Gated CLK)

As shown in Fig. 2, in a typical VLSI design implementation the clock arrives at the clock gate earlier than at Flop A and Flop B. The clock gate is not treated as a sink pin or end point during the CTS process. Thus, the clock arrival time at the clock gate is not balanced against the rest of the sequential elements. The setup margin on the clock gate is a function of the clock gate skew relative to Flop A. The setup margin can be written as (1).

$S = T_{period} + (T_2 - T_1) - T_3 - T_4 - T_{setup}$    (1) [2]

where

S = Setup margin of the clock gate.
Tperiod = Clock period.
T1 = Clock tree insertion delay to Flop A.
T2 = Clock tree insertion delay to clock gate.
T3 = Combinational gate delay.
T4 = Flop A clock to Q delay.
Tsetup = Clock gate setup requirement.

Figure 2. Clock Gate Timing Constraint (elements shown: Clock, Clock Gate, Flop A, Flop B; delays T1, T2, T3, T4)

The problem becomes more severe in high-speed designs, where Tperiod becomes smaller and the overall timing window becomes tighter. Designers often have to drop the clock gate implementation due to timing constraints. This increases the overall clock tree power and yields an unoptimized design. In this research, we propose an effective clock gate splitting algorithm that improves the setup margin and creates more opportunities for clock gating to save power.

Design communities focusing on clock tree distribution research have explored various approaches to improve register placement, clock gate placement and clock routing. Refs. [3] and [4] proposed a clock tree planning methodology that constructs a gated clock tree topology considering both switching activities and cell locations using a formulated cost function. This approach is effective, but its power reduction efficiency is limited because the gated clock tree is constructed after conventional cell placement. Moreover, the research assumed an ideal binary clock tree, which is not common in industrial EDA CTS tools. In [5], the clock gates are split based on the total fanout count of each individual clock gate. This approach does not account for the physical information of the clock gate fanout: when the fanout of a clock gate is physically spread far apart, the clock gate is still not split. Therefore, the physical location is important to solve the problem. Refs. [6] and [7] proposed zero-skew optimization during gated clock tree construction. However, after the clock gate removal step, the zero skew is no longer honored. In [8], the authors built the gated clock tree network from scratch with an algorithm that inserts the clock gating enable structure together with the clock buffers. Their algorithm uses a recursive approach to calculate the effective switched capacitance directly during clock tree network construction. None of the previous approaches addressed the clock gate timing issue during clock tree construction.

978-1-4244-6609-2/10/$26.00 ©2010 IEEE

II. REGIONAL CLOCK GATE SPLITTING ALGORITHM

A. Clock Gate Marking

The clock gate splitting algorithm runs after physical cell placement and optimization but before the common industrial CTS flow. The algorithm first goes through the clock gate marking process, which identifies all the clock gates in the design and associates each clock gate with its design attributes. Each clock gate is associated with its respective fanout. The fanout is then bounded within a box based on its physical location: the fanout region of a clock gate is the bounding box that encompasses all the loads of the clock gate. The area of the bounding box, the product of its length and width, is then calculated.
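The bounding box and area calculation described above can be sketched as follows. This is only an illustration: the data layout (a dictionary from clock gate name to load coordinates), the function names and the coordinates are assumptions, not the paper's TCL implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bounding_box(loads: List[Point]) -> Box:
    """Smallest axis-aligned box that encloses all load locations."""
    xs = [x for x, _ in loads]
    ys = [y for _, y in loads]
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box: Box) -> float:
    """Area = length in x multiplied by length in y."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


# Hypothetical fanout locations for two clock gates (arbitrary length units).
fanout: Dict[str, List[Point]] = {
    "clock_gate_A": [(10.0, 20.0), (40.0, 25.0), (30.0, 60.0)],
    "clock_gate_B": [(5.0, 5.0), (8.0, 9.0)],
}

# Marking step: associate each clock gate with its fanout bounding box area.
marked_area = {gate: box_area(bounding_box(loads)) for gate, loads in fanout.items()}
print(marked_area)
```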

Fig. 3 shows the bounding boxes of the clock gates and their respective areas. Using the clock gate marking algorithm, Clock Gate A, Clock Gate B and Clock Gate C are associated with their respective bounding boxes, Area A, Area B and Area C.

Figure 3. Fanout load area associations. (elements shown: CLK; Clock Gate A, Clock Gate B and Clock Gate C driving flops bounded by Area A, Area B and Area C)

The clock gate marking process determines which clock gates need to go through the clock gate splitting process. When the fanout area increases, the clock gate skew with respect to the flops increases. This indicates that the clock gate needs to be split to reduce the clock skew.

B. Clock Gate Splitting

The clock gate splitting algorithm splits a clock gate based on the targeted area. If the area calculated for an individual clock gate during the clock gate marking process is bigger than the targeted area, the regional clock gate splitting process is triggered. In addition, the clock gate marking process takes the physical synthesis report as input to determine which clock gates actually have a problem meeting setup timing. If a clock gate has no setup timing violation, it does not go through the clock gate splitting process even though its area might be bigger than the targeted area. This ensures that the clock gate can gate off more clock tree branches during a clock gating event.
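The two conditions above can be expressed as a single predicate. The sketch below assumes that a setup violation is reported as a negative worst slack in the physical synthesis report; the function and parameter names are hypothetical.

```python
def needs_split(area: float, target_area: float, worst_setup_slack_ps: float) -> bool:
    """A clock gate is split only if its fanout area exceeds the target
    AND it actually violates setup timing (assumed: negative slack)."""
    return area > target_area and worst_setup_slack_ps < 0


# Large fanout area but timing is met: keep the gate whole so it can
# still gate off as many clock tree branches as possible.
print(needs_split(area=1200.0, target_area=500.0, worst_setup_slack_ps=35.0))
# Large fanout area and a setup violation: candidate for regional splitting.
print(needs_split(area=1200.0, target_area=500.0, worst_setup_slack_ps=-80.0))
```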

The targeted area is a function of the individual clock frequency. When the clock frequency increases, the targeted area decreases, because the timing window becomes tighter at higher clock frequency. Thus, the clock gate skew needs to be smaller to ensure that setup timing is met.

When the area is bigger than the targeted area, the clock gate splitting algorithm first determines whether the bounding box geometry is x dominant or y dominant. If the length in the x-direction is bigger than in the y-direction, the bounding box is cut at the x/2 point. If the length in the y-direction is bigger than in the x-direction, the bounding box is cut at the y/2 point. Fig. 4 shows the bounding box cutting algorithm used during the clock gate splitting flow.

Figure 4. Bounding box cutting based on longer length. (BOX A: X > Y, cut at X/2; BOX B: Y > X, cut at Y/2)

After the bounding box is cut and divided, the bounding box areas are re-calculated based on the locations of the loads in each new box. This yields an equal or smaller bounding box. The splitting and re-bounding process continues on the divided boxes until the area of every box is smaller than the targeted area. During the process, the placement of the flops and registers is kept fixed to avoid timing perturbation due to flop movement.
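A minimal sketch of the cut and re-bound loop is given below. It is not the paper's TCL code. It assumes that a box whose area equals the target is not split further, that a tie between the x and y lengths is cut in x, and that loads lying exactly on the cut line go to the lower half; the paper does not specify these cases.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def split_region(loads: List[Point], target_area: float) -> List[List[Point]]:
    """Recursively cut the fanout bounding box along its longer side until
    every re-bounded box has an area no larger than target_area."""
    if target_area < 0:
        raise ValueError("target_area must be non-negative")
    xs = [x for x, _ in loads]
    ys = [y for _, y in loads]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    width, height = x_max - x_min, y_max - y_min
    if width * height <= target_area:
        return [loads]
    if width >= height:  # x dominant (ties cut in x: an assumption)
        cut = x_min + width / 2
        low = [p for p in loads if p[0] <= cut]
        high = [p for p in loads if p[0] > cut]
    else:  # y dominant
        cut = y_min + height / 2
        low = [p for p in loads if p[1] <= cut]
        high = [p for p in loads if p[1] > cut]
    if not low or not high:  # numerical safety guard, not expected in practice
        return [loads]
    # Each half is re-bounded from its own loads on the recursive call.
    return split_region(low, target_area) + split_region(high, target_area)


# Hypothetical flop locations; the flops themselves are never moved.
flops = [(0.0, 0.0), (8.0, 0.0), (0.0, 8.0), (8.0, 8.0), (2.0, 2.0)]
regions = split_region(flops, target_area=10.0)
print(regions)
```

In this made-up example the 8 x 8 box is first cut in x, and the resulting left box (2 x 8 after re-bounding) is then cut in y, giving three regions.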

After all the physical bounding boxes have been resolved to meet the targeted area constraint, the algorithm starts moving and duplicating the clock gates. The original clock gate is moved to the center of the first bounding box. For each of the remaining new bounding boxes, the clock gate is duplicated and the copy is placed at the center of that bounding box. This is the ideal location for the clock gate relative to its loads. The cell placement is then legalized incrementally using a standard industrial EDA tool.

After that, the netlist connectivity is updated. The newly created clock gates have the same logical connections as the master clock gate, and the connections are made on all the pins of the clock gate. A duplication mapping file is generated for use during the logical formal verification process. Fig. 5 shows the results of regional clock gate splitting based on the flop placement in Fig. 3. As shown in Fig. 5, all the new bounding boxes have smaller area and less fanout.
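The move, duplicate and reconnect steps can be sketched as below. The naming scheme for duplicated gates (`_split1`, `_split2`, ...), the pin and net names, and the returned data structures are hypothetical; placement legalization is left to the EDA tool and is not modeled here.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def place_split_gates(
    master: str, master_inputs: Dict[str, str], regions: List[List[Point]]
):
    """Place the master gate at the center of the first region and one
    duplicate at the center of each remaining region."""
    placements = {}
    duplication_map = {}  # duplicate name -> master name, for formal verification
    for index, loads in enumerate(regions):
        xs = [x for x, _ in loads]
        ys = [y for _, y in loads]
        center = ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
        name = master if index == 0 else f"{master}_split{index}"
        placements[name] = {
            "location": center,
            "inputs": dict(master_inputs),  # same logical inputs as the master
            "loads": list(loads),  # each gate drives only its own region
        }
        if index > 0:
            duplication_map[name] = master
    return placements, duplication_map


regions = [[(0.0, 0.0), (2.0, 2.0)], [(0.0, 8.0)], [(8.0, 0.0), (8.0, 8.0)]]
placements, duplication_map = place_split_gates(
    "clock_gate_A", {"CLK": "clk_root", "EN": "enable_a"}, regions
)
for name, info in placements.items():
    print(name, info["location"])
```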

Figure 5. Regional clock gate splitting. (elements shown: CLK; Clock Gate A, Clock Gate B and Clock Gate C with their flops; Area A, Area B and Area C)

III. INTEGRATION OF THE ALGORITHM INTO AN INDUSTRIAL FLOW

This section describes how the regional clock gate splitting methodology is integrated into an industrial design flow that relies on commercial EDA design tools for physical optimization and clock distribution synthesis.

The regional clock gate splitting flow can be used with standard EDA software. The algorithm presented is implemented in TCL and is easily integrated into the normal industrial physical implementation design flow before CTS execution, as shown in Fig. 6.

Starting from the high-level register transfer level (RTL) description, logic synthesis is performed to translate the behavioral model into a structural netlist. Physical synthesis is then run to perform the coarse placement and optimization of the design. After that, the design runs through the CTS flow.

The clock gate timing report is then generated to determine all the violating clock gates that need to go through the clock gate splitting process. If no clock gate has a setup timing violation, the flow is complete. If some clock gates violate setup timing, the regional split clock gate flow is triggered to split those clock gates, and the CTS flow is rerun. The process is repeated until there are no clock gate timing violations.
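The iteration described above can be sketched as a simple control loop. The three callables stand in for tool steps (CTS, timing report parsing, regional splitting) and are hypothetical, as is the iteration cap, which the paper does not mention. The toy stand-ins below simply pretend that each split recovers 100 ps of slack so that the loop can be run end to end.

```python
from typing import Callable, Dict, List


def run_until_enable_timing_met(
    run_cts: Callable[[], None],
    find_violating_gates: Callable[[], List[str]],
    split_clock_gates: Callable[[List[str]], None],
    max_iterations: int = 10,  # safety cap: an assumption, not from the paper
) -> int:
    """Repeat CTS and regional splitting until no clock gate violates setup.

    Returns the number of CTS runs that were needed.
    """
    for cts_runs in range(1, max_iterations + 1):
        run_cts()
        violators = find_violating_gates()
        if not violators:
            return cts_runs
        split_clock_gates(violators)
    raise RuntimeError("clock gate enable timing not met within iteration cap")


# Toy stand-ins for the tool steps (purely illustrative numbers).
slack_ps: Dict[str, int] = {"clock_gate_A": -150, "clock_gate_B": 20}


def toy_cts() -> None:
    pass  # a real flow would invoke the CTS tool here


def toy_violators() -> List[str]:
    return [gate for gate, slack in slack_ps.items() if slack < 0]


def toy_split(gates: List[str]) -> None:
    for gate in gates:
        slack_ps[gate] += 100  # pretend each split recovers 100 ps


cts_runs = run_until_enable_timing_met(toy_cts, toy_violators, toy_split)
print(cts_runs, slack_ps)
```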

Figure 6. Flow Chart of Integration of Regional Split Clock Gate Flow into Standard EDA Flow. (steps shown: RTL codes -> Logic Synthesis -> Physical Placement and Logic Optimization -> Area Based Split Clock Gate Flow -> CTS Flow -> decision "Clock Gate Enable Timing Met?" with branches Yes, leading to End, and No)

IV. PHYSICAL CLOCK TREE SYNTHESIS RESULTS

In this section, the CTS results on a design consisting of 3 million gates and close to 500K flops are compared between the traditional CTS approach [9] and the regional split clock gate CTS approach. For each approach, the CTS results on clock tree area, number of clock buffers and clock skew are collected.

The results are tabulated in Table I. With the regional split clock gate approach, the overall clock tree area is reduced by 1.17%, with fewer clock buffers used during the CTS flow. The overall clock gate count increases, as expected, because of the duplication of the clock gate components. However, the savings in clock buffer count compensate for the increase in clock gates and thus improve the total clock tree area. The results also show that this approach maintains a final clock skew equal to or less than that of the conventional approach.

TABLE I. CTS RESULTS COMPARISON WITH REGIONAL CLOCK GATE SPLITTING APPROACH

                             Conventional    Regional Clock Gate
                             Approach        Splitting Approach
Total Flops                  500000          500000
Total Clock Buffers          31018           30602
Total Clock Gate             20389           20630
Clock Skew (picosecond)      318             317
Total Clock Tree Area        156881          155041
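The relative changes behind Table I, including the 1.17% area reduction quoted in the text, can be reproduced with a few lines of arithmetic. The dictionary layout and function name are illustrative only; the numbers are copied from Table I.

```python
# (conventional, regional split) values copied from Table I
table_i = {
    "Total Clock Buffers": (31018, 30602),
    "Total Clock Gate": (20389, 20630),
    "Clock Skew (picosecond)": (318, 317),
    "Total Clock Tree Area": (156881, 155041),
}


def percent_change(conventional: float, regional: float) -> float:
    """Change of the regional result relative to the conventional one, in %."""
    return (regional - conventional) / conventional * 100.0


changes = {name: percent_change(*values) for name, values in table_i.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

The buffer count drops by 416 cells while the clock gate count grows by 241, which is consistent with the net area saving reported above.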

V. PERFORMANCE AND POWER VERIFICATION

In this section, we compare the clock gate enable setup timing margin using the same design case study discussed in Section IV. The post-CTS design is optimized again to model the effect of clock tree synthesis on the design timing. The clock tree dynamic power consumption is also compared against the traditional approach by studying the overall clock tree capacitances.



TABLE II. CLOCK GATE ENABLE TIMING PERFORMANCE

                                  Conventional    Regional Clock Gate
                                  Approach        Splitting Approach
Worst Setup Slack of
  Clock Gate A (picosecond)       -407            -191
  Clock Gate B (picosecond)       -336            -182
  Clock Gate C (picosecond)       -22             +32
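For reference, the relative slack improvements for clock gates A and B follow directly from the Table II values, taking the improvement as the recovered slack divided by the magnitude of the original violation (this definition is our reading of the reported range):

```latex
\begin{align*}
\text{Clock Gate A:}\quad & \frac{-191 - (-407)}{|-407|} = \frac{216}{407} \approx 53.1\% \\
\text{Clock Gate B:}\quad & \frac{-182 - (-336)}{|-336|} = \frac{154}{336} \approx 45.8\% \\
\text{Clock Gate C:}\quad & 32 - (-22) = 54~\text{ps recovered, slack becomes positive}
\end{align*}
```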

In terms of performance, we observed a significant improvement in the clock gate enable timing slack. Table II shows the clock gate enable timing violations after running the regional clock gate splitting algorithm. As shown, the clock gate enable timing improved by 46% to 53% for clock gate A and clock gate B. This shows that our algorithm is able to improve the clock gate enable setup timing by effectively reducing the clock gate skew. The total number of setup violations on the clock gates will increase, because the original clock gate and the split or duplicated clock gates all show up as violators. However, the worst negative slack on the clock gates improved, as shown in Table II. In the clock gate C case, the violation is resolved by the split clock gate algorithm: the clock gate enable timing slack becomes positive after the split clock gate flow.

The proposed methodology shows promising results not only on the clock gate enable timing, but also on the clock tree dynamic power consumption. As shown in Table III, the overall clock network capacitance is lower in the regional clock gate splitting test case when the clock gates are ON. The clock network capacitance is proportional to the overall clock tree power. Two clock gate operation scenarios are compared in the experiment. The clock gate ON result is tabulated with all the clock gates enabled so that all registers are clocked actively. The clock gate OFF result is tabulated with all the clock gates disabled to achieve gating.

A reduction in capacitance is observed for clock gate ON operation with the regional clock gate splitting approach, while an increase in total clock network capacitance is observed for clock gate OFF operation. The reduction in the clock gate ON case arises because the clock gates are now located closer to the flops and there are fewer overlapping clock trees on gated and ungated clocks. The increase in the clock gate OFF case is ascribed to the increase in pre-clock-gate buffer stages, brought about by splitting the clock gates down to lower levels of the buffer tree. This is the tradeoff between power and the clock gate enable timing convergence effort.

TABLE III. CLOCK TREE POWER PERFORMANCE

                                  Conventional    Regional Clock Gate
                                  Approach        Splitting Approach
Clock Gate ON Power
  (Capacitance in femtofarad)     381419          380736
Clock Gate OFF
  (Capacitance in femtofarad)     19910           21878
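The size of the tradeoff can be read off Table III as relative capacitance changes. Since the text takes clock tree power as proportional to clock network capacitance, the same percentages can be taken as a first-order estimate of the power change, assuming supply voltage, frequency and switching activity are unchanged between the two runs:

```latex
\begin{align*}
\Delta C_{\mathrm{ON}} &= 380736 - 381419 = -683~\text{fF}, &
\frac{\Delta C_{\mathrm{ON}}}{381419} &\approx -0.18\% \\
\Delta C_{\mathrm{OFF}} &= 21878 - 19910 = +1968~\text{fF}, &
\frac{\Delta C_{\mathrm{OFF}}}{19910} &\approx +9.9\%
\end{align*}
```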

VI. CONCLUSION

In this paper, a new design flow for gated clock tree synthesis has been proposed that takes clock gate enable timing performance into account. After physical cell placement, the regional clock gate splitting algorithm effectively reduces the post-clock-gate clock tree insertion delay and thus reduces the skew of the clock gate relative to the rest of the clock tree. The paper also shows that with this approach the overall number of clock tree buffers is reduced, and the clock tree distribution is constructed with less area and less power. With this algorithm, we trade off gated clock tree power to improve the clock gate enable timing, and thus enable clock gating opportunities in high-speed designs. Therefore, the proposed algorithm is an effective optimization technique for both power and performance.

REFERENCES

[1] Low-Power Methodology Manual For System-on-Chip Design. Springer, ISBN 978-0-387-71818-7.

[2] [...] IEEE Trans. on CAD, vol. CAD-6, no. 4, pp. 650-665, July 1987.

[3] [...]-aware clock tree [...] Design. 2004.

[4] [...]-tree power optimization based on RTL clock-[...] Automation. 2003.

[5] [...] SNUG San Jose 2006 Proceedings.

[6] [...] chips by [...]-DAC, pp. 313-318, 1998.

[7] [...]-power [...] 715-722, June 2001.

[8] Wei-Chung Chao and Wai-[...] [...]-Power Gated and Buffered [...] Automation of Electronic Systems, Vol. 13, No. 1, Article 20, January 2008.

[9] [...] Using IC-Compiler CTS Tool. SNUG Singapore 2009 Proceedings.
