ICSE2010 Proc. 2010, Melaka, Malaysia
Regional Clock Gate Splitting Algorithm for Clock
Tree Synthesis
Siong Kiong Teng
Intel Microelectronics (M) Sdn Bhd
Penang, Malaysia
Email:[email protected]
Norhayati Soin
Dept. of Electrical Engineering
University of Malaya
Kuala Lumpur, Malaysia
Email:[email protected]
Abstract: This paper presents a new clock distribution design flow that addresses clock gate enable signal timing violations. The clock gate components in a clock tree are exposed to setup timing violations because they are located near the beginning of the clock tree, so their skew relative to the rest of the sequential elements is normally large. Effectively splitting a clock gate and pushing it to a lower level of the clock tree reduces the clock gate skew and thus improves the setup margin.
I. INTRODUCTION
Clock gating is widely used in VLSI clock distribution as a method to reduce clock network power dissipation [1]. The clock gating components are part of the clock tree distribution components during the clock tree synthesis (CTS) process. Therefore, the physical location of the clock gating components has a large impact on the overall clock tree power consumption and on the clock gate enable signal timing convergence during performance verification (PV).
In high speed VLSI designs, meeting the clock gate enable signal setup timing requirement during PV is always a big challenge. A clock gate is normally built using an AND gate [1]. The most commonly used clock gate is an integrated, latch-based clock gate. Latch-based clock gates prevent glitches on the enable from propagating to the gated clock output [1]. Fig. 1 shows a commonly used latch-based clock gate design. In this design, the EN signal needs to arrive before the clock to ensure a clean clock gating event.
Figure 1. Latch-based Clock Gate [1] (inputs EN and CLK; output Gated CLK)
As shown in Fig. 2, in a normal VLSI design implementation, the clock arrives at the clock gate earlier than at Flop A and Flop B. The clock gate is not treated as a sink pin or end point during the CTS process. Thus, the clock gate skew is not matched with the rest of the sequential elements. The timing constraint of the setup margin on the clock gate is a function of the clock gate skew relative to Flop A. The setup margin can be written as (1).
$$S = T_{period} + (T_2 - T_1 - T_3 - T_4) - T_{setup} \qquad (1)\ [2]$$
where
$S$ = Setup margin of the clock gate.
$T_{period}$ = Clock period.
$T_1$ = Clock tree insertion delay to Flop A.
$T_2$ = Clock tree insertion delay to clock gate.
$T_3$ = Combinational gate delay.
$T_4$ = Flop A clock to Q delay.
$T_{setup}$ = Clock gate setup requirement.
Figure 2. Clock Gate Timing Constraint (schematic of Clock, Clock Gate, Flop A and Flop B, annotated with the delays T1 to T4)
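As a rough numerical illustration of (1), the sketch below evaluates the setup margin as the capture edge at the clock gate (Tperiod + T2) minus the arrival time of the enable (T1 + T4 + T3), minus Tsetup. The function name and all delay values are hypothetical and were chosen only to show how the margin responds when T2 moves closer to T1.

```python
def clock_gate_setup_margin(t_period, t1, t2, t3, t4, t_setup):
    """Setup margin S of the clock gate enable; all times in picoseconds."""
    enable_arrival = t1 + t4 + t3   # launch at Flop A, clock-to-Q, combinational logic
    capture_edge = t_period + t2    # next clock edge as seen at the clock gate
    return capture_edge - enable_arrival - t_setup


# Hypothetical delays (ps): the clock reaches the gate much earlier than Flop A.
early_gate = clock_gate_setup_margin(
    t_period=800, t1=600, t2=200, t3=350, t4=100, t_setup=50
)

# Same path with the gate deeper in the tree, so T2 is closer to T1.
deeper_gate = clock_gate_setup_margin(
    t_period=800, t1=600, t2=450, t3=350, t4=100, t_setup=50
)

print(early_gate, deeper_gate)
```

With these made-up numbers the margin moves from -100 ps to +150 ps purely because the clock gate skew (T1 - T2) shrinks from 400 ps to 150 ps.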
The problem becomes more severe in high speed designs, where Tperiod becomes smaller and the overall timing window becomes tighter. Designers often have to drop the clock gate implementation because of timing constraints. This increases the overall clock tree power and yields an unoptimized design. In this research, we propose an effective clock gate splitting algorithm that improves the setup margin and creates more opportunities for clock gating to save power.
Design communities focusing on clock tree distribution research have explored various approaches to improve register placement, clock gate placement and clock routing. Refs. [3] and [4] proposed a clock tree planning methodology that constructs a gated clock tree topology with consideration of both switching activities and cell locations using a formulated cost function. This approach is effective, but its power reduction is limited because the gated clock tree is constructed after conventional cell placement. Moreover, that research assumed an ideal binary clock tree, which is not common in industrial EDA CTS tools. In [5], the clock gates are split based on the total fanout count of the individual clock gate. This approach does not account for the physical information of the clock gate fanout. When the loads of a clock gate are physically far apart, it will still not split the clock gate. Therefore, the physical location is important to solve the problem. Refs. [6] and [7] proposed zero skew optimization during gated clock tree construction. However, after the clock gate removal step, the zero skew is no longer honored. In [8], the authors built the gated clock tree network from scratch with an algorithm that inserts the clock gating enable structure together with the clock buffers. Their algorithm uses a recursive approach to calculate the effective switched capacitance directly during clock tree network construction. None of the previous approaches addressed the clock gate timing issue during clock tree construction.
978-1-4244-6609-2/10/$26.00 ©2010 IEEE
II. REGIONAL CLOCK GATE SPLITTING ALGORITHM
A. Clock Gate Marking
The clock gate splitting algorithm runs after physical cell placement and optimization but before the common industrial CTS flow. The algorithm first goes through the clock gate marking process, which identifies all the clock gates in the design and associates each clock gate with its design attributes. Each clock gate is associated with its respective fanout. After that, the fanout of each clock gate is bounded within a box based on physical location. The fanout region of a clock gate is calculated as the bounding box that encompasses all the loads of the clock gate. The area of the bounding box is then calculated as the product of its length and width.
Fig. 3 shows the bounding boxes of the clock gates and their respective areas. Using the clock gate marking algorithm, Clock Gate A, Clock Gate B and Clock Gate C are associated with their respective bounding boxes, Area A, Area B and Area C.
Figure 3. Fanout load area associations (Clock Gate A, Clock Gate B and Clock Gate C on CLK, with their flop loads enclosed by bounding boxes Area A, Area B and Area C)
The clock gate marking process determines which clock gates need to go through the clock gate splitting process. When the fanout area increases, the clock gate skew with respect to the flops increases. A large fanout area therefore indicates that the clock gate needs to go through clock gate splitting to reduce the clock skew.
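The marking step can be sketched as follows. This is a minimal illustration, assuming the placement is available as a mapping from each clock gate name to the (x, y) locations of its loads; the names `bounding_box`, `box_area`, `mark_clock_gates` and the coordinates are hypothetical and are not taken from the authors' Tcl implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def bounding_box(loads: List[Point]) -> Box:
    """Smallest axis-aligned box that encloses every load location."""
    xs = [x for x, _ in loads]
    ys = [y for _, y in loads]
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box: Box) -> float:
    """Area = length in x multiplied by length in y."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


def mark_clock_gates(fanout: Dict[str, List[Point]]) -> Dict[str, float]:
    """Associate each clock gate with the area of its fanout bounding box."""
    return {gate: box_area(bounding_box(loads)) for gate, loads in fanout.items()}


# Hypothetical placement: gate name -> load coordinates.
fanout = {
    "cg_a": [(0.0, 0.0), (40.0, 10.0), (20.0, 30.0)],
    "cg_b": [(100.0, 100.0), (110.0, 120.0)],
}
areas = mark_clock_gates(fanout)
print(areas)
```

Here `cg_a` gets a 40 by 30 box (area 1200) and `cg_b` a 10 by 20 box (area 200), so `cg_a` would be the stronger candidate for splitting.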
B. Clock Gate Splitting
The clock gate splitting algorithm splits a clock gate based on the targeted area. If the area calculated for an individual clock gate during the clock gate marking process is bigger than the targeted area, the regional clock gate splitting process is triggered. In addition, the clock gate marking process takes input from the physical synthesis report to determine which clock gates actually have a problem meeting setup timing. If a clock gate has no setup timing violations, it does not go through the clock gate splitting process even though its area might be bigger than the targeted area. This ensures that the clock gate can gate off more clock tree branches during a clock gating event.
The targeted area is a function of the individual clock frequency. When the clock frequency increases, the targeted area decreases, because the timing window becomes tighter. Thus, the clock gate skew needs to be smaller to ensure that the setup timing is met.
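The two triggering conditions described above (area bigger than the targeted area, and an actual setup violation in the timing report) can be combined as in the sketch below. The paper does not give the function that maps clock frequency to targeted area, so the targeted area is simply passed in per clock gate; the function name, the dictionary layout and the numbers are assumptions made for illustration.

```python
from typing import Dict, List


def gates_to_split(
    areas: Dict[str, float],
    worst_setup_slack: Dict[str, float],
    target_area: Dict[str, float],
) -> List[str]:
    """Select clock gates that violate setup timing AND exceed their targeted area."""
    selected = []
    for gate, area in areas.items():
        # Assumption: a gate missing from the timing report is treated as non-violating.
        violates = worst_setup_slack.get(gate, 0.0) < 0.0
        if violates and area > target_area[gate]:
            selected.append(gate)
    return selected


# Hypothetical inputs: areas from the marking step, slacks in picoseconds.
areas = {"cg_a": 1200.0, "cg_b": 200.0, "cg_c": 900.0}
worst_setup_slack = {"cg_a": -120.0, "cg_b": -40.0, "cg_c": 15.0}
target_area = {"cg_a": 500.0, "cg_b": 500.0, "cg_c": 500.0}

print(gates_to_split(areas, worst_setup_slack, target_area))
```

Only `cg_a` is selected: `cg_b` violates timing but is already within the targeted area, and `cg_c` is large but meets timing, so it is left intact to gate off as much of the tree as possible.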
When the area is bigger than the targeted area, the clock gate splitting algorithm first determines whether the bounding box geometry is x dominant or y dominant. If the length in the x-direction is bigger than in the y-direction, the bounding box is cut at the x/2 point. If the length in the y-direction is bigger than in the x-direction, the bounding box is cut at the y/2 point. Fig. 4 shows the bounding box cutting algorithm used during the clock gate splitting flow.
Figure 4. Bounding box cutting based on longer length (two example boxes, BOX A and BOX B: when X > Y the box is cut at X/2; when Y > X it is cut at Y/2)
After cutting and dividing the bounding boxes, the bounding box areas are recalculated based on the locations of the loads in each new box. This yields an equal or smaller bounding box. The splitting and rebounding process continues on the divided boxes until the area of every box is smaller than the targeted area. During the process, the placement of the flops and register cells is kept fixed to avoid timing perturbation due to flop movement.
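The cut-and-rebound loop can be sketched as a short recursion. This is an illustrative reading of the description above, not the authors' Tcl code: the function names are hypothetical, a tie between the x and y lengths is resolved in favor of x (the text does not specify this case), and loads lying exactly on the cut line are assigned to the lower half.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(loads: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all loads."""
    xs = [x for x, _ in loads]
    ys = [y for _, y in loads]
    return (min(xs), min(ys), max(xs), max(ys))


def split_loads(loads: List[Point], target_area: float) -> List[List[Point]]:
    """Recursively split loads until every group's bounding box is below target_area."""
    if target_area <= 0:
        raise ValueError("target_area must be positive")
    x_min, y_min, x_max, y_max = bounding_box(loads)
    length_x, length_y = x_max - x_min, y_max - y_min
    if length_x * length_y < target_area:
        return [loads]
    # Cut the longer side at its midpoint (tie goes to x: an assumption).
    axis = 0 if length_x >= length_y else 1
    cut = (x_min + length_x / 2) if axis == 0 else (y_min + length_y / 2)
    low = [p for p in loads if p[axis] <= cut]
    high = [p for p in loads if p[axis] > cut]
    if not low or not high:  # defensive guard against a degenerate cut
        return [loads]
    # Each recursive call recomputes the box from the loads it actually contains.
    return split_loads(low, target_area) + split_loads(high, target_area)


# Hypothetical load locations; the flop positions themselves are never moved.
loads = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (100.0, 0.0), (110.0, 10.0)]
groups = split_loads(loads, target_area=200.0)
print(groups)
```

In this example the original 110 by 10 box (area 1100) is x dominant, so it is cut at x = 55. Rebounding the two halves gives two 10 by 10 boxes (area 100 each), both below the targeted area of 200, so the recursion stops after one cut.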
After all the physical bounding boxes have been resolved to meet the targeted area constraint, the algorithm starts moving and duplicating the clock gates. The original clock gate is moved to the center of the first bounding box. For each of the remaining new bounding boxes, the clock gate is duplicated and the duplicate is placed at the center of that bounding box. This is the ideal location for the clock gate relative to its loads. The cell placement is then legalized incrementally using a standard industrial EDA tool.
After that, the netlist connectivity is updated. The newly created clock gates have the same logical connections as the master clock gate, and the connections are made on all the pins of the clock gate. A duplication mapping file is generated for use during the logical formal verification process. Fig. 5 shows the result of regional clock gate splitting based on the flop placement of Fig. 3. As shown in Fig. 5, all the respective new bounding boxes have smaller area and less fanout.
Figure 5. Regional clock gate splitting (the flop placement of Fig. 3 on CLK, with Clock Gate A, Clock Gate B and Clock Gate C and the bounding boxes Area A, Area B and Area C after splitting)
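The move-and-duplicate step and the duplication mapping can be sketched as below. The sketch assumes the loads have already been divided into groups that meet the targeted area; the `_split<n>` naming of duplicates, the function name and the dictionary-based mapping are hypothetical stand-ins for whatever the real netlist edit and mapping file format look like, and legalization is left to the EDA tool.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def place_split_gates(
    master: str, groups: List[List[Point]]
) -> Tuple[Dict[str, Point], Dict[str, str]]:
    """Place one clock gate at the center of each load group's bounding box.

    Returns (placements, duplicate_to_master). The master gate serves the first
    group; every further group gets a duplicate recorded in the mapping.
    """
    placements: Dict[str, Point] = {}
    duplicate_to_master: Dict[str, str] = {}
    for index, loads in enumerate(groups):
        xs = [x for x, _ in loads]
        ys = [y for _, y in loads]
        center = ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
        # Hypothetical naming scheme for the duplicated cells.
        name = master if index == 0 else f"{master}_split{index}"
        placements[name] = center
        if index > 0:
            duplicate_to_master[name] = master
    return placements, duplicate_to_master


# Hypothetical load groups produced by a previous splitting step.
groups = [
    [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)],
    [(100.0, 0.0), (110.0, 10.0)],
]
placements, mapping = place_split_gates("cg_a", groups)
print(placements)
print(mapping)
```

The master `cg_a` lands at (5, 5), the duplicate `cg_a_split1` at (105, 5), and the mapping records that the duplicate is logically equivalent to `cg_a`, which is the kind of information a formal verification tool needs to match the split gates back to the original.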
III. INTEGRATION OF THE ALGORITHM INTO AN INDUSTRIAL FLOW
This section describes how the regional clock gate splitting methodology is integrated into an industrial design flow that relies on commercial EDA design tools for physical optimization and clock distribution synthesis.
The regional clock gate splitting flow can be used with standard EDA software. The algorithm is implemented in Tcl and is easily integrated into the normal industrial physical implementation design flow before CTS execution, as shown in Fig. 6.
Starting with the high level register transfer level (RTL) code, logic synthesis is performed to translate the behavioral model into a structural netlist. Physical synthesis is then run to do the coarse placement and optimization of the design. After that, the design runs through the CTS flow.
The clock gate timing report is then generated to identify all the violating clock gates that need to go through the clock gate splitting process. If no clock gates have setup timing violations, the flow is complete. If some clock gates violate setup timing, the regional split clock gate flow is triggered to split those clock gates and the CTS flow is rerun. The process is repeated until there are no clock gate timing violations.
Figure 6. Flow Chart of Integration of Regional Split Clock Gate Flow into Standard EDA Flow (steps: RTL codes, Logic Synthesis, Physical Placement and Logic Optimization, Area Based Split Clock Gate Flow, CTS Flow, then the decision "Clock Gate Enable Timing Met?"; Yes leads to End, No repeats the loop)
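The iteration described in this section can be summarized as a small driver loop. The sketch below is only a schematic: the three callbacks stand in for the real tool steps (CTS, clock gate timing report, regional split), the `max_iterations` safety limit is an added assumption, and the toy slack model with a fixed 100 ps gain per split is invented purely so that the loop has something to run on.

```python
from typing import Callable, Dict, List


def run_split_cts_flow(
    run_cts: Callable[[], None],
    report_violating_gates: Callable[[], List[str]],
    split_gates: Callable[[List[str]], None],
    max_iterations: int = 10,
) -> int:
    """Repeat CTS and clock gate splitting until no clock gate violates setup timing.

    Returns the number of CTS runs that were needed.
    """
    for iteration in range(1, max_iterations + 1):
        run_cts()
        violators = report_violating_gates()
        if not violators:
            return iteration
        split_gates(violators)
    raise RuntimeError("clock gate enable timing not met within the iteration limit")


# Toy stand-ins for the tool steps (hypothetical numbers, in picoseconds).
slack: Dict[str, float] = {"cg_a": -120.0, "cg_c": -20.0}
cts_runs: List[int] = []


def run_cts() -> None:
    cts_runs.append(len(cts_runs) + 1)


def report_violating_gates() -> List[str]:
    return [gate for gate, value in slack.items() if value < 0.0]


def split_gates(gates: List[str]) -> None:
    for gate in gates:
        slack[gate] += 100.0  # invented improvement per split, for illustration only


iterations = run_split_cts_flow(run_cts, report_violating_gates, split_gates)
print(iterations, slack)
```

In a real flow the callbacks would wrap the corresponding tool commands, and the exit condition would come from the actual clock gate enable timing report rather than from a dictionary.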
IV. PHYSICAL CLOCK TREE SYNTHESIS RESULTS
In this section, the CTS results for a design consisting of 3 million gates and close to 500K flops are compared between the traditional CTS approach [9] and the regional split clock gate CTS approach. For each approach, the clock tree area, number of clock buffers and clock skew are collected.
The results are tabulated in Table I. With the regional split clock gate approach, the overall clock tree area is reduced by 1.17%, with fewer clock buffers used during the CTS flow. The overall clock gate count increases, as expected, because of the duplication of the clock gate components. However, the clock buffer count savings compensate for the clock gate increase and thus improve the total clock tree area. The results also show that this approach keeps the final clock skew equal to or less than that of the conventional approach.
TABLE I.
CTS RESULTS COMPARISON WITH REGIONAL CLOCK GATE SPLITTING APPROACH
                            Conventional    Regional Clock Gate
                            Approach        Splitting Approach
Total Flops                 500000          500000
Total Clock Buffers         31018           30602
Total Clock Gates           20389           20630
Clock Skew (picosecond)     318             317
Total Clock Tree Area       156881          155041
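The relative changes implied by Table I can be re-derived with a few lines of arithmetic. The values below are copied from the table; the helper name `percent_change` is just a convenience for this check.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from the conventional to the split approach, in percent."""
    return 100.0 * (after - before) / before


# (conventional, regional clock gate splitting) pairs taken from Table I.
table_i = {
    "Total Clock Buffers": (31018, 30602),
    "Total Clock Gates": (20389, 20630),
    "Clock Skew (ps)": (318, 317),
    "Total Clock Tree Area": (156881, 155041),
}

changes = {name: percent_change(before, after) for name, (before, after) in table_i.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

This gives roughly -1.34% for clock buffers, +1.18% for clock gates, -0.31% for clock skew and -1.17% for clock tree area, the last of which matches the area reduction quoted in the text.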
V. PERFORMANCE AND POWER VERIFICATION
In this section, we compare the clock gate enable setup timing margin using the same design case study discussed in Section IV. The post-CTS design is optimized again to model the effect of clock tree synthesis on the design timing. The clock tree dynamic power consumption is also compared against the traditional approach by studying the overall clock tree capacitances.
TABLE II.
CLOCK GATE ENABLE TIMING PERFORMANCE
                                                  Conventional    Regional Clock Gate
                                                  Approach        Splitting Approach
Worst Setup Slack of Clock Gate A (picosecond)    -407            -191
Worst Setup Slack of Clock Gate B (picosecond)    -336            -182
Worst Setup Slack of Clock Gate C (picosecond)    -22             +32
In terms of performance, we observed a significant improvement in the clock gate enable timing slack. Table II shows the clock gate enable timing violations before and after running the regional clock gate splitting algorithm. As shown, the worst enable setup slack improved by 46% to 53% for clock gate A and clock gate B. This shows that our algorithm is able to improve the clock gate enable setup timing by effectively reducing the clock gate skew. The total number of setup violations on the clock gates will increase, because both the original clock gate and the split or duplicated clock gates show up as violators. However, the worst negative slack on the clock gates improved, as shown in Table II. In the clock gate C case study, the violation is resolved by the split clock gate algorithm: the clock gate enable slack becomes positive after the split clock gate flow.
The proposed methodology shows promising results not only on the clock gate enable timing, but also on the clock tree dynamic power consumption. As shown in Table III, the overall clock network capacitance is lower in the regional clock gate splitting test case when the clock gates are ON. The overall clock tree dynamic power is proportional to the clock network capacitance. Two clock gate operation scenarios are compared in the experiment. The clock gate ON result is tabulated with all the clock gates open, so that all registers are clocked actively. The clock gate OFF result is tabulated with all the clock gates disabled to achieve gating.
A reduction in capacitance is observed for clock gate ON operation using the regional clock gate splitting approach. Meanwhile, an increase in total clock network capacitance is observed for clock gate OFF operation. The reduction in the clock gate ON case arises because the clock gates are now located closer to the flops and there are fewer overlapping clock trees on gated and ungated clocks. The increase in the clock gate OFF case is ascribed to the increase in pre-clock gate buffer stages, brought about by pushing the split clock gates to a lower level of the buffer tree. This is the tradeoff between power and the clock gate enable timing convergence effort.
TABLE III.
CLOCK TREE POWER PERFORMANCE
                                             Conventional    Regional Clock Gate
                                             Approach        Splitting Approach
Clock Gate ON (capacitance in femtofarad)    381419          380736
Clock Gate OFF (capacitance in femtofarad)   19910           21878
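The slack improvements and capacitance changes discussed in this section can be checked against Tables II and III with the short calculation below. The numbers are copied from the tables; the helper names are hypothetical, and the slack improvement is expressed relative to the magnitude of the conventional worst slack, which is one plausible reading of the quoted 46% to 53% range.

```python
def slack_improvement_percent(before_ps: float, after_ps: float) -> float:
    """Slack gained, as a percentage of the magnitude of the original worst slack."""
    return 100.0 * (after_ps - before_ps) / abs(before_ps)


def percent_change(before: float, after: float) -> float:
    """Relative change from the conventional to the split approach, in percent."""
    return 100.0 * (after - before) / before


# Table II: worst setup slack in picoseconds (conventional, split).
gate_a = slack_improvement_percent(-407, -191)
gate_b = slack_improvement_percent(-336, -182)

# Table III: clock network capacitance in femtofarad (conventional, split).
cap_on = percent_change(381419, 380736)
cap_off = percent_change(19910, 21878)

print(f"Clock gate A slack improvement: {gate_a:.1f}%")
print(f"Clock gate B slack improvement: {gate_b:.1f}%")
print(f"Capacitance change, gates ON:  {cap_on:+.2f}%")
print(f"Capacitance change, gates OFF: {cap_off:+.2f}%")
```

The result is about 53.1% for clock gate A and 45.8% for clock gate B, consistent with the range quoted in the text. The capacitance drops by about 0.18% with the gates ON and rises by about 9.88% with the gates OFF, which quantifies the power tradeoff described above.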
VI. CONCLUSION
In this paper, a new design flow for gated clock tree synthesis has been proposed that takes clock gate enable timing performance into account. After physical cell placement, the regional clock gate splitting algorithm effectively reduces the post clock gate clock tree insertion delay and thus reduces the clock skew of the clock gate relative to the rest of the clock tree. The paper also shows that, with this approach, the number of clock tree buffers is reduced and the clock tree distribution is constructed with less area and, when the clock gates are ON, less power. With this algorithm, we trade off gated clock tree power to improve the clock gate enable timing and thus create clock gating opportunities in high speed designs. Therefore, the proposed algorithm is an effective optimization technique for both power and performance.
REFERENCES
[1] Low-Power Methodology Manual For System-on-Chip Design. Springer. ISBN 978-0-387-71818-7.
[2] IEEE Trans. on CAD, vol. CAD-6, no. 4, pp. 650-665, July 1987.
[3] -aware clock tree Design. 2004.
[4] -tree power optimization based on RTL clock-Automation. 2003.
[5] SNUG San Jose 2006 Proceeding.
[6] chips by -DAC, pp. 313-318, 1998.
[7] -power 715-722, June 2001.
[8] Wei-Chung Chao and Wai- -Power Gated and Buffered Automation of Electronic Systems, Vol. 13, No. 1, Article 20, Pub. date: January 2008.
[9] Using IC-Compiler CTS Tool. SNUG Singapore 2009 Proceeding.