
Implementation of Multi-Class Shared Buffer with Finite Memory Size

A. A. Abdul Rahman
System Technology, Telekom Research & Development (TMRND), Selangor, Malaysia.

K. Seman, K. Saadan
Faculty of Science & Technology, Universiti Sains Islam Malaysia (USIM), Negeri Sembilan, Malaysia.

A. Azman
Faculty of Computer Science & IT, Universiti Putra Malaysia (UPM), Selangor, Malaysia.

Abstract—High-speed packet networks have become essential in modern multimedia communication. A shared buffer is commonly used to make better use of the buffer memory within a switch. In this paper, we analyse the performance of a shared buffer with different memory sizes. The multi-class shared buffer architecture is developed for a 16x16 port switch targeted at a Xilinx FPGA. The performance of the multi-class shared buffer switch is analysed in terms of throughput and mean delay. Based on simulations with different memory sizes, it is observed that the optimum memory size under uniform traffic depends on the maximum traffic load of the switch.

Keywords-Shared buffer; multi-class; finite memory size; architecture design; FPGA.

I. INTRODUCTION

The buffering scheme plays an important role in obtaining a high-performance switch. A shared buffer is used to utilize the buffer memory within the switch [1]. A shared buffer switch fabric can emulate an output-queued architecture [2], [3].

In general, an input-buffered switch suffers from Head-of-Line (HoL) blocking, which limits its throughput to 58% [4]. The virtual output queue (VOQ) was developed to overcome this problem. Although the VOQ successfully removes HoL blocking, it increases the complexity of the design.

Meanwhile, an output-buffered switch has to deal with cell contention, which occurs when arriving cells are delivered simultaneously to the same output port and increases the number of cells lost in the switch. To overcome this problem, a speed-up of N is used for an N-port switch [5], [6], which makes the approach impractical for switches with a large number of ports.

The shared buffer was developed to combine the best properties of input and output buffering: it eliminates both the HoL blocking effect and the cell contention problem. However, it requires sufficient read/write access bandwidth within a given time slot [7]. By using a dual-port memory as the shared memory, the performance of the shared buffer switch can be improved.

The multi-class scheme is introduced to provide a variety of service classes offering different performance guarantees (Quality of Service, QoS) [4]. In this paper, priority classes are used to differentiate the service requirements of each class.

In [8], the design of a shared buffer architecture is explored, but it concentrates only on a single-class buffer. In [9], a shared memory with two discard levels is analysed. Its shared buffer is divided into two sections: one for high priority traffic (committed and excess traffic) and the other for low priority traffic (committed traffic). This guarantees the throughput and minimizes the traffic loss, but reduces memory utilization, because high priority traffic cannot be accommodated in the low priority buffer when the high priority section is full, even though the low priority allocation still has many empty spaces.

This paper analyzes the performance of a multi-class shared buffer under uniform traffic, taking into account the size of the shared memory. The switch performance is analyzed in terms of mean delay and throughput.

This paper is organized as follows. In section II, the system architecture is discussed in detail. Section III presents the simulation model used in this research and its analysis in terms of total mean delay and throughput. In section IV, the synthesis results of the proposed multi-class shared buffer switch architecture are shown. Section V is devoted to simulation results with different sizes of shared memory. Lastly, section VI provides a brief conclusion.

II. SYSTEM ARCHITECTURE

A. Multi-class shared buffer architecture

The multi-class shared buffer uses a priority setting to differentiate each class. The priority is assigned based on QoS requirements in terms of delay and cell loss.

Figure 1 illustrates the concept of the multi-class shared buffer for an NxN switch. The architecture consists of four main blocks: the idle memory controller, the write port controller, the read port controller and the shared memory.

The idle memory controller holds the idle locations of the shared memory. The write port controller forwards incoming cells to the shared memory. The read port controller holds the memory addresses at which the cells are stored; these addresses are kept in a separate FIFO for each class within each port. This separation of FIFOs is preferred because it eases the control of priority cell departures, which simplifies the scheduler design.

Figure 1: Multi-class shared buffer architecture

Traditionally, buffering strategies were fixed to the 53-byte packet size of ATM networks. Today, because IP networks carry variable-size packets, each packet must be segmented into fixed-size units called cells [10]. An incoming IP packet is segmented into smaller cells as it arrives and reassembled at the output. In this architecture the cell size is set to 32 bytes; based on [10], this size gives the best memory utilization.

B. Idle memory controller

The idle memory controller stores and keeps track of the idle memory locations in the shared memory. When a cell arrives, the controller provides an idle location in which to store it. When the cell is read out, the location becomes available again and its address is sent back to the idle memory controller for future use. Figure 2 shows the architecture of the idle memory controller. Initially, the Address Generator generates all available address locations and saves them in the FIFO. The address at the FIFO output (idle_address_out) is then used to store a newly arrived cell and is returned after that cell is read for departure.

Figure 2: Idle memory controller

The depth of this FIFO depends on the total number of shared memory locations.
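For illustration, the following behavioural Verilog sketch shows one way such an idle-address FIFO could look. It is not the authors' RTL: the module and signal names (alloc, free, ret_addr) are ours, and the single-cycle reset fill stands in for the Address Generator, which in a real design would fill the FIFO over several clock cycles.

module idle_memory_controller #(
    parameter ADDR_W = 8,               // 2^8 = 256 cell locations
    parameter DEPTH  = (1 << ADDR_W)
) (
    input  wire              clk,
    input  wire              rst,
    input  wire              alloc,         // a cell arrives: pop an idle address
    input  wire              free,          // a cell departs: push its address back
    input  wire [ADDR_W-1:0] ret_addr,      // address returned by the read port controller
    output wire [ADDR_W-1:0] idle_addr_out, // idle address given to the write port controller
    output wire              empty          // no free location: incoming cells are dropped
);
    reg [ADDR_W-1:0] fifo [0:DEPTH-1];
    reg [ADDR_W-1:0] rd_ptr, wr_ptr;
    reg [ADDR_W:0]   count;
    integer i;

    assign idle_addr_out = fifo[rd_ptr];
    assign empty         = (count == 0);

    always @(posedge clk) begin
        if (rst) begin
            for (i = 0; i < DEPTH; i = i + 1)  // behavioural stand-in for the Address Generator
                fifo[i] <= i;
            rd_ptr <= 0;
            wr_ptr <= 0;
            count  <= DEPTH;
        end else begin
            if (alloc && !empty)
                rd_ptr <= rd_ptr + 1'b1;       // hand out the next idle address
            if (free) begin
                fifo[wr_ptr] <= ret_addr;      // recycle the address of a departed cell
                wr_ptr       <= wr_ptr + 1'b1;
            end
            count <= count + (free ? 1'b1 : 1'b0) - ((alloc && !empty) ? 1'b1 : 1'b0);
        end
    end
endmodule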

C. Write port controller

The write port controller writes arriving cells into the available memory location, which is defined by the idle memory controller.

Figure 3: Write port controller.

Figure 3 shows the write port controller block diagram. The write buffer control generates the select signal for MUX1 to choose the available cell that is destined for the port defined by port_addr. During the initialization process the write buffer control remains idle. When a matching cell is detected, the wr_pointer signal enables the write of that cell into the shared memory.
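The data path of this block can be pictured with a small hedged sketch: MUX1 realised as an indexed part-select over the N flattened input cells, with wr_pointer raised only when the selected input actually holds a cell. Signal names other than Data_RAM_in and wr_pointer are our own, and the matching logic of the write buffer control against port_addr is deliberately left out.

module write_port_mux #(
    parameter N      = 16,
    parameter DATA_W = 256                      // one 32-byte cell
) (
    input  wire [N*DATA_W-1:0]  cell_in,        // flattened cells from the N input ports
    input  wire [N-1:0]         cell_valid,     // which inputs hold a cell in this time slot
    input  wire [$clog2(N)-1:0] sel,            // select generated by the write buffer control
    output wire [DATA_W-1:0]    Data_RAM_in,    // cell forwarded to the shared memory
    output wire                 wr_pointer      // write enable towards the shared memory
);
    assign Data_RAM_in = cell_in[sel*DATA_W +: DATA_W];  // MUX1
    assign wr_pointer  = cell_valid[sel];
endmodule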

D. Read port controller

The read port controller reads cells from the shared memory for departure. Figure 4 shows its architecture. There are two dedicated FIFOs, one for the low priority class and one for the high priority class: FIFO_L stores the locations of low priority cells and FIFO_H stores the locations of high priority cells. Each time a cell is written into the shared memory, its location is stored in the corresponding FIFO. The size of each FIFO depends on the size of the shared memory address bus.


Figure 4: Read port controller.

The read buffer control generates a select signal to choose the read address from either FIFO_L or FIFO_H. The rd_pointer signal is used to read the cell from the shared memory. Figure 5 shows the pseudo code used for the selection decision.

IF (FIFO_H not empty) {
    rd_sel = 1; rd_pointer_H = 1; rd_pointer_L = 1; }
ELSE IF (FIFO_H empty & FIFO_L not empty) {
    rd_sel = 0; rd_pointer_H = 0; rd_pointer_L = 1; }
ELSE {
    rd_sel = X; rd_pointer_H = 0; rd_pointer_L = 0; }
ENDIF

Figure 5: Pseudo code for read buffer control.
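As an illustration, the selection of Figure 5 can be written as a small combinational block in Verilog. This is only a sketch, not the authors' RTL: the module and port names are ours, and it assumes that only the selected FIFO is popped in a given time slot.

module read_buffer_control (
    input  wire fifo_h_empty,
    input  wire fifo_l_empty,
    output reg  rd_sel,        // 1: read address taken from FIFO_H, 0: from FIFO_L
    output reg  rd_pointer_h,  // pop FIFO_H
    output reg  rd_pointer_l,  // pop FIFO_L
    output wire rd_pointer     // read enable towards the shared memory (port B)
);
    assign rd_pointer = rd_pointer_h | rd_pointer_l;

    always @* begin
        if (!fifo_h_empty) begin            // high priority cells waiting
            rd_sel       = 1'b1;
            rd_pointer_h = 1'b1;
            rd_pointer_l = 1'b0;
        end else if (!fifo_l_empty) begin   // only low priority cells waiting
            rd_sel       = 1'b0;
            rd_pointer_h = 1'b0;
            rd_pointer_l = 1'b1;
        end else begin                      // nothing to send in this time slot
            rd_sel       = 1'bx;            // don't care
            rd_pointer_h = 1'b0;
            rd_pointer_l = 1'b0;
        end
    end
endmodule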

E. Shared memory

The shared memory is the main component of the shared buffer architecture. It stores arriving cells while they wait for departure.

Figure 6: Shared memory.

Figure 6 shows the architecture of the shared memory. It uses a dual-port RAM to enable simultaneous read and write. The port A write address (wr_addrA) comes from the idle memory controller (Idle_addr_out). The input data (Data_RAM_in) is forwarded from the write port controller. MUX2 selects the data from whichever input port contains a cell. Signal weA is asserted by the wr_pointer of any port that has a cell to be stored in the memory.

Once the cell data are stored in the memory, they are read out using the port B read address (rd_addrB). The read address gives priority to high class traffic over low class traffic. MUX1 chooses the read address from each port, and the enable signal for port B (enB) can be asserted by the rd_pointer signal from any port.

The output is then demultiplexed (DEMUX1) to its destination based on the destination address assigned to the cell.
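The shared memory itself maps naturally onto an FPGA block RAM in simple dual-port mode (write on port A, read on port B). The sketch below is a generic inferred-RAM template with illustrative widths (32-byte cells, 256 locations), not the authors' exact code; the surrounding MUX2/MUX1/DEMUX1 logic is omitted.

module shared_memory #(
    parameter DATA_W = 256,   // one 32-byte cell
    parameter ADDR_W = 8      // 256 cell locations
) (
    input  wire              clk,
    // Port A: write side, driven by the write port controller
    input  wire              weA,
    input  wire [ADDR_W-1:0] wr_addrA,      // from the idle memory controller (Idle_addr_out)
    input  wire [DATA_W-1:0] Data_RAM_in,
    // Port B: read side, driven by the read port controller
    input  wire              enB,
    input  wire [ADDR_W-1:0] rd_addrB,
    output reg  [DATA_W-1:0] Data_RAM_out
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    always @(posedge clk) begin
        if (weA)
            mem[wr_addrA] <= Data_RAM_in;    // store the arriving cell
        if (enB)
            Data_RAM_out <= mem[rd_addrB];   // read the departing cell
    end
endmodule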

III. SIMULATION MODEL

A simulation model was developed to evaluate the performance of the proposed shared buffer. In this simulation the architecture uses two priority classes for 16x16 ports, labelled Class1 for the high priority class and Class0 for the low priority class. The proposed shared buffer switch operates in time-slotted transmission; each time slot consists of three phases: arrival, scheduling and departure.

In the arrival phase, cells arrive at the beginning of each time slot according to an independent Bernoulli distribution. For uniform traffic, the arrival rate at any individual queue does not exceed 1/N of the traffic load. Figure 7 shows the destination address generation in the arrival process.



FOR (destination address = 1 to N) DO {
    IF (random number < traffic load) THEN
        Destination address = random number * N;
    ELSE
        No Destination address;
    ENDIF }

Figure 7: Pseudo code for address cell generator.
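For reference, the arrival process of Figure 7 can be mimicked in a Verilog testbench along the following lines. This is a sketch under our own naming (arrival_gen, cell_valid, dest_addr), not the authors' simulation code, and priority class assignment is not modelled here.

module arrival_gen #(
    parameter N            = 16,
    parameter LOAD_PERCENT = 90                 // offered load of 0.9
) (
    input  wire            clk,
    output reg [N-1:0]     cell_valid,          // at most one cell per port per time slot
    output reg [4*N-1:0]   dest_addr            // 4-bit destination address per port (N = 16)
);
    integer p;
    always @(posedge clk) begin
        for (p = 0; p < N; p = p + 1) begin
            if (({$random} % 100) < LOAD_PERCENT) begin
                cell_valid[p]       <= 1'b1;
                dest_addr[4*p +: 4] <= {$random} % N;   // uniformly distributed destination
            end else begin
                cell_valid[p]       <= 1'b0;            // no cell for this port in this slot
            end
        end
    end
endmodule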

All arriving cells are stored in the shared memory. When the shared memory is full, incoming cells are dropped.

In the scheduling phase, the read buffer controller decides which cell should be served first. In this simulation, high priority cells are always served before low priority cells.

In the departure phase, cells are read from the shared buffer and sent to the output ports according to their destination address. The memory locations that have been read are then returned to the idle memory controller.

To evaluate the performance of the proposed switch architecture, the total mean delay and throughput are analysed for different shared memory size settings. The delay is calculated under the assumption that cell arrivals in a slot are independent and identically distributed Bernoulli processes. The total delay of each cell, D, can be expressed as in (1), where T_q is the waiting time of the cell in the queue, T_HoL is the waiting time of the cell at the head of line and T_s is the service delay.

D = T_q + T_{HoL} + T_s    (1)

In the hardware analysis, this total delay of each cell is the time the cell spends in the system from arrival to departure, as shown in (2), where T_in is the arrival time of the cell and T_out is its departure time.

D = T_{out} - T_{in}    (2)

Then, the mean delay for a single port and Class C priority, D_PC (where Class C is either Class1 or Class0), can be calculated as expressed in (3), where L_c is the total number of cells of Class C priority.

D_{PC} = \frac{1}{L_c} \sum_{i=1}^{L_c} D_i    (3)

Using the results of (3), the total mean delay of the whole system with N nodes, E(D), can be calculated as in (4).

E(D) = \frac{1}{N} \sum_{i=1}^{N} D_{PC,i}    (4)

The total throughput (TP) is calculated as the ratio of the total number of output cells over all classes (OC) to the total number of input cells (IC_Total), as expressed in (5).

TP = \frac{1}{IC_{Total}} \sum_{i=1}^{N} OC_i    (5)

The total throughput result indicates the rate of cell loss in the switch.

IV. SYNTHESIS AND HARDWARE ANALYSIS

The synthesis of the 16x16 shared buffer with a memory size of 256 is done using the Xilinx ISE software. The synthesis results are shown in Table I.

TABLE I. XILINX FPGA SYNTHESIS RESULTS

FPGA           Clock speed    LUT     Slice
XC6VCX240T     251.820 MHz    14924   3053
XC5VTX240T     212.895 MHz    15121   3085
XC6SLX150T     135.251 MHz    15109   3218

The syntheses are done for Virtex-5, Virtex-6 and Spartan-6 Xilinx FPGA devices. The performance of the design differs depending on the targeted FPGA chip. On the Virtex-6 device (XC6VCX240T), the design achieves a clock speed of 251.820 MHz, so the architecture can support a bandwidth of up to 64.4 Gbps.
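As a rough sanity check of this figure (our own calculation, assuming one 32-byte cell is transferred per clock cycle):

251.820\ \text{MHz} \times 32\ \text{bytes} \times 8\ \text{bits/byte} \approx 64.47\ \text{Gbps},

which is consistent with the quoted 64.4 Gbps.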

V. RESULTS

A hardware design written in Verilog is used to evaluate the performance of the proposed architecture. The hardware simulation results provide the total delay of each cell as in (2); these results are then processed in Matlab to calculate the total mean delay in (4) and the throughput in (5), which together indicate the performance of the switch.

The simulations are done by adjusting the size of the shared memory.



Figure 8: Total mean delay (in time slots) versus load for Class1 and Class0, for memory sizes of 32, 64, 128 and 256.

In Figure 8, the simulation curves for Class1 and Class0 with different memory sizes (32, 64, 128 and 256) are plotted. The total mean delay for the memory size of 32 appears better than the others because cells are dropped when the memory is full, which also reduces the number of contentions between cells.

The difference between the Class1 and Class0 delay increases as the memory size grows. This reflects the number of contentions occurring in the switch: when more cells in the memory fail to depart because they lose the contention, the delay increases.

Figure 9: Total throughput versus load for memory sizes of 32, 64, 128 and 256 locations.

In Figure 9, the throughput for the different memory sizes (32, 64, 128 and 256) is plotted. The throughput for the memory size of 256 is better than the others because the larger memory reduces the number of cells lost in the switch.

For a traffic load of 0.9, a memory size of 128 is sufficient to give 100% throughput. With a memory size of 64, 100% throughput can be achieved up to a traffic load of 0.8. The throughput for a memory size of 32 starts to degrade beyond a traffic load of 0.7.

VI. CONCLUSION

In this paper, we proposed and designed a multi-class shared buffer switch architecture. This preliminary research studies the effect of the shared memory size on mean delay and throughput. The switch is designed in Verilog and its performance is tested with different memory sizes. Based on the simulation results, the optimum memory size under uniform traffic depends on the maximum traffic load of the switch. Determining the optimum memory size under nonuniform traffic with some level of priority control is still in progress.

REFERENCES

[1] D. Seidel, A. Raju, and M. A. Bayoumi, "A new ATM switch architecture: scalable shared buffer," in Proc. Third IEEE Int. Conf. on Electronics, Circuits, and Systems (ICECS '96), 1996, pp. 772-775, vol. 2.

[2] H. J. Chao, C. H. Lam, and E. Oki, Broadband Packet Switching Technologies: A Practical Guide to ATM Switches and IP Routers, John Wiley & Sons, Inc., 2001.

[3] M. Arpaci and J. A. Copeland, "Buffer management for shared-memory ATM switches," IEEE Communications Surveys & Tutorials, 2000, pp. 2-10.

[4] M. Karol, M. Hluchyj, and S. Morgan, "Input versus output queueing on a space-division packet switch," IEEE Transactions on Communications, 1987, pp. 1347-1356.

[5] S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching output queueing with a combined input/output queued switch," IEEE Journal on Selected Areas in Communications, 1999, pp. 1030-1039.

[6] H. Zhang, "Service disciplines for guaranteed performance service in packet-switching networks," Proceedings of the IEEE, 1995, pp. 1374-1396.

[7] H. J. Chao, C. H. Lam, and E. Oki, Broadband Packet Switching Technologies: A Practical Guide to ATM Switches and IP Routers, John Wiley & Sons, Inc., 2001.

[8] S. O'Kane, S. Sezer, and C. Toal, "Design and implementation of a shared buffer architecture for a gigabit Ethernet packet switch," in Proc. IEEE International SOC Conference, 2005, pp. 283-286.

[9] S. Bergida and Y. Shavitt, "Analysis of shared memory priority queues with two discard levels," IEEE Network, 2007, pp. 46-50.

[10] S. O'Kane, S. Sezer, and L. Lum Soong, "A study of shared buffer memory segmentation for packet switched networks," in Proc. Advanced International Conference on Telecommunications / International Conference on Internet and Web Applications and Services (AICT-ICIW '06), 2006, p. 55.
