2010 kes · title: 2010_kes author: ï¿½ï¿½josï¿½a. mtz. created date: 1/20/2011 6:36:52 am

R. Setchi et al. (Eds.): KES 2010, Part II, LNAI 6277, pp. 173–182, 2010. © Springer-Verlag Berlin Heidelberg 2010

Vertical Fragmentation Design of Distributed Databases Considering the Nonlinear Nature of

Roundtrip Response Time

Rodolfo A. Pazos R.1, Graciela Vázquez A.2, José A. Martínez F.1, and Joaquín Pérez O.3

1 Instituto Tecnológico de Cd. Madero 2 ESIME, Instituto Politécnico Nacional

3 Centro Nacional de Investigación y Desarrollo Tecnológico [email protected], [email protected],

[email protected], [email protected]

Abstract. One of the challenges of applications of distributed database (DDB) systems is the possibility of expanding through the use of the Internet, so widespread nowadays. One of the most difficult problems in DDB systems de-ployment is distribution design. Additionally, existing models for optimizing the data distribution design have only aimed at optimizing query transmission and processing costs overlooking the delays incurred by query transmission and processing times, which is a major concern for Internet-based systems. In this paper a mathematical programming model is presented, which describes the be-havior of a DDB with vertical fragmentation and permits to optimize its design taking into account the nonlinear nature of roundtrip response time (query transmission delay, query processing delay, and response transmission delay). This model was solved using two metaheuristics: the threshold accepting algorithm (a variant of simulated annealing) and tabu search, and comparative experiments were conducted with these algorithms in order to assess their effec-tiveness for solving this problem.

Keywords: Mathematical programming, distributed databases, vertical frag-mentation, metaheuristic algorithms.

1 Introduction

Computing, from its beginning, has undergone many changes: from the massive main-frames that carried out limited tasks whose use was exclusive for a few organizations, up to modern computers (either PCs or laptops) that have even more processing power than the first ones, and which are more involved in the daily tasks of people. Organiza-tions have embraced the use of computer communication networks, including the Internet, making easier the implementation of Distributed Database (DDB) systems.

DDB systems managers have the responsibility of carrying out the fragmentation, allocation and monitoring of data at the network sites. In most DDB systems its de-sign greatly affects the overall system performance, mainly because the time required

174 R.A. Pazos R. et al.

for query transmission and processing and the transmission of query responses depend considerably on the sites where data resides, sites where queries are issued, data access frequencies, query processing capacity of database (DB) servers, and transmis-sion speed of communication lines.

Traditionally, the DDB distribution design problem has been defined as finding re-lations (tables) fragments and their allocation such that the overall costs incurred by query transmission and processing are minimized.

However, it has been recognized for many years the importance of considering response time in DDB modeling [1]. Unfortunately, a survey of the specialized litera-ture revealed that roundtrip response time has not been considered. Therefore, existing models for optimizing DDB design have only aimed at minimizing the transmission and processing costs of queries, overlooking the delays incurred by transmission and processing times (including queuing delays), which are one of the major concerns for Internet-based systems.

We define roundtrip response time as the time elapsed between the arrival of a query to a network site (query source) and the time when the last bit of the response to the query is received at the query source. This definition implies that roundtrip con-sists of: transmitting a query from its source site to a DB server site (where the query is processed), processing the query and generating a response at the server site, and transmitting the response back to the source site.

Considering only transmission and processing costs yields a linear optimization expression with respect to the problem decision variables [2, 3, 4], which is easier to deal with than the expression that involves roundtrip response time (presented in Sub-section 3.1). Some investigators have assumed that transmission and processing costs are directly proportional to roundtrip response time; however, this assumption is not always true. For example, when considering a transmission line, the transmission cost of a query has been defined as being directly proportional to the transmission time of

Fig. 1. Variation of processing+queueing time with respect to query load

Vertical Fragmentation Design of DDBs Considering the Nonlinear Nature 175

the query; therefore, the overall transmission cost (incurred by the entire query load) is directly proportional to the amount of query data transmitted on the line.

Unfortunately, this simplifying assumption ignores queuing time; i.e., the time a query needs to wait before being transmitted because the transmission line queue is not empty when the query arrives. Similarly, queuing processes should also be con-sidered for query processing at server sites, as well as for transmission of query responses from servers back to source sites. For illustrating this situation, Fig. 1 pre-sents simulation results that show the behavior of processing+queuing time of queries at a server site with respect to query load. In fact, this behavior has been studied by queuing theory; thus, the average processing+queueing delay of queries at a server can be approximated by the following expression [5]:

λ−=

CT

1

where C represents the processing capacity of the server and λ is the arrival rate of queries to the server. As the figure and the preceding formula show, when the query load approaches the server processing capacity, the processing+queuing time spent by queries grows steeply and tends to infinity, because the number of queries waiting to be processed grows larger and larger.

Finally, in this paper, the DDB distribution design problem is defined as finding re-lations fragments and their allocation such that the average roundtrip response time incurred by transmitting and processing all the queries and transmitting query responses is minimized. This problem is modeled using a zero-one mathematical pro-gramming model, which makes possible to solve it using the large assortment of metaheuristic algorithms available.

2 Related Work

A survey of the specialized literature revealed that modeling roundtrip response time has not been considered, as shown in Table 1, which summarizes the most recent and relevant works on DDB fragmentation and allocation.

The second, third and fourth columns of Table 1 show that some works have ad-dressed only the fragmentation problem, most works have only dealt with the frag-ment allocation problem and some others have integrally addressed both problems. The fifth column shows that all the previous works have considered transmission, access or processing costs. The seventh column shows that most works have proposed heuristic algorithms for addressing DDB fragmentation and allocation, and only a few have proposed mathematical programming formulations.

This comparative survey reveals that our approach has a distinctive feature that sets it apart from others: the incorporation of roundtrip response time into a mathematical programming formulation of the DDB fragmentation+allocation problem.


Table 1. Related works on fragmentation and allocation for DDBs

Problems Addressed Value to Minimize

Solution Approach

W o r k s

Fra

gmen

tati

on

Allo

cati

on

Inte

grat

ed

Fra

gmen

tati

on

+ A

lloca

tion

Tra

nsm

issi

on

or p

roce

ssin

g co

sts

Rou

ndtr

ip r

e-sp

onse

tim

e

Heu

rist

ic

algo

rith

m

Mat

hem

atic

al

prog

ram

min

g fo

rmul

atio

n

Chakravarthy, 94 [1] ! ! !

Tamhankar, 98 [6] ! ! !

Golfarelli, 99 [2] * ! !

Huang, 01 [7] ! ! !

Pérez, 03 [3] ! ! !

Ulus, 03 [8] ! ! !

Menon, 05 [4] ! ! !

Basseda, 06 [9] ! ! !

Ma, 06 [10] ! ! !

Hababeh, 07 [11] ! ! ! !

Gorla, 08 [12] * ! !

Tambulea, 08 [13] ! ! !

Our approach ! ! !

* for centralized databases.

3 Mathematical Model

The DDB distribution design problem is defined as follows: given a set of attributes (data-objects) O = {o1, o2, ..., oL}, a computer communication network that consists of a set of sites S = {s1, s2, ..., sI}, a set of queries Q = {q1, q2, ..., qK} generated at the network sites, the attributes required by each query, and the generation rates of que-ries at each site; the problem consists of finding relations fragments and their alloca-tion that minimizes the average roundtrip response time.

This section describes a (binary) integer-programming model for the vertical frag-mentation and allocation of data that involves roundtrip response time. We refer to this model as the VFA-RT model. In this model, the decision about storing an attrib-ute l in site j is represented by a binary variable xlj. Thus xlj = 1 if l is stored in j, and xlj

= 0 otherwise. Thus, the problem consists of finding an assignment of values to the x's, such that the value assignment satisfies certain restrictions and minimizes an objective function.

3.1 Objective Function

The objective function below models the average roundtrip response time using three terms (inside the parenthesis): delay of queries incurred by their transmission


(including queuing delay) on the communication lines from their sources to the serv-ers where they are processed, delay of queries incurred by their processing (including queuing delay) at the servers, and delay of queries responses incurred by their trans-mission (including queuing delay) on the communication lines from the servers back to the sources of their corresponding queries.

!!!!!!

"

#

$$$$$$

%

&

−+

−+

−= '

'

'

''

'

'''' ij

kjkki

ijRj

k ijkki

jij

kjkki

ijj k i

jkki

yf

C

yf

C

yf

Cyfz

1

1

1

1

1

11min µµQ

(1)

where z: average roundtrip response time, yjk: a dependent decision variable, such that yjk = 1 if one or more attributes used

by query k are stored in site j, and yjk = 0 otherwise, fki: arrival rate of query k to source node i (expressed in queries/sec.), Cj: processing capacity of server j (expressed in queries/sec.), Cij: transmission speed of the communication line from node i to node j (ex-

pressed in bits/sec.), 1/µQ: mean length of queries (expressed in bits/query), and 1/µR: mean length of query responses (expressed in bits/response).

It is not easy to see that expression (1) models the average roundtrip response time. Unfortunately, its mathematical derivation and explanation is rather complex and can not be included here due to space limitations; however, inquisitive readers can find the derivation and explanation in [14].

3.2 Restrictions

The model involves three groups of restrictions, shown below. Since replication of attributes is not considered, the first group of restrictions (2) specifies that each attrib-ute must be allocated to only one site. Additionally, the second set of restrictions (3) specifies that each attribute should be allocated to a site where at least one query that involves the attribute is generated. Finally, the third group of restrictions (4) is used for restricting the values of the dependent variables y's according to the values of the x's. These restrictions are expressed as follows:

1='

jljx

∀ l

each attribute must be stored in only one site;

(2)

'≤k

kiklli qx ϕ

∀ l,i

where

()*

=>

=0if,0

0if,1

ki

kiki f

fϕ

each attribute l must be allocated to a site i where at least one query that in-volves the attribute is generated,

(3)


0≥−'l

ljkljk xqyt

∀ j,k

this restriction forces the value of yjk to 1 when some product qklxlj equals 1 (i.e., some attribute l involved in query k is located in server j), and induces yjk to 0 otherwise, where !t = number of attributes.

(4)

4 Solution Methods

Given that the problem described in [15] is NP-hard (as shown in the reference) and considering that the VFA-RT model is similar to the first one (the major difference being the objective function), we informally conclude that VFA-RT must also be NP-hard. When dealing with NP-hard problems, the practical approach is to use metaheuristic algorithms for solving them. In this case, we decided to use a variant of simulated annealing called threshold accepting (TA), since we have successfully ap-plied it to a problem similar to VFA-RT [16]. Additionally, we also decided to use tabu search (TS) –since it has been applied to a wide variety of combinatorial optimi-zation problems–, in order to determine how it compares versus TA for this problem.

Since the implemented version of the TA algorithm is not as widely known as TS, a brief description is presented next.

4.1 Threshold Accepting Algorithm

This is a simplification of the original simulated annealing, which spends less com-puting time than the classic algorithm [17].

In this algorithm, given a current feasible solution x = {xlj | l=1, ..., L; j=1, ..., J}, a neighbor feasible solution y is randomly generated. If ∆z = z(y) – z(x) < T, y is ac-cepted as the new current solution (where z represents the objective function and T is an algorithm parameter, called temperature, that has a suitably chosen value); other-wise, the current solution remains unchanged. Notice that T replaces the expression exp(–∆E/k T) in the classic algorithm.

The temperature value T is reduced each time until thermal equilibrium is reached (line 14). The value of T is decremented multiplying it by a cooling factor µ < 1, and a new search cycle (lines 7-14) is carried out again using the new value for T. These cycles are repeated inside an outer cycle (lines 6-16) with ever decreasing values of T until the system reaches freezing. This version of the algorithm has a third cycle (lines 8-13) for generating batches of solutions, which helps improve its performance.

THRESHOLD ACCEPTING PSEUDOCODE 1) begin 2) real T0, µ 3) integer i, n 4) x = initial feasible solution


5) T = T0 high value of initial temperature 6) repeat 7) repeat 8) for i = 1 to n 9) y = randomly generated neighbor solution of x 10) if z(y) – z(x) < T then 11) x = y 12) end if 13) end for 14) until thermal equilibrium is reached 15) T = µT 16) until freezing is reached 17) end Like most mataheuristic algorithms, the performance of simulated annealing depends on the values assigned to the parameters (T0, µ, n) and stop criteria (thermal equilib-rium and freezing). Usually, the values for these parameters (and stop criteria) that yield the best performance are determined by trial and error; additionally, the best values vary from problem to problem. In these circumstances, for our problem we used the values and criteria proposed in [16].

5 Experimental Results

Experiments were carried out for 12 randomly generated problems (P1 through P12). Table 2 shows the parameters used for generating the problems data; i.e., variable coefficients for expressions (1)-(4). It is worth pointing out that TA and TS are ran-dom algorithms; i.e., they move randomly from the current solution to the next one (line 9 of the TA pseudocode); therefore, they usually generate different solutions in different executions for the same problem. Thus, in order to obtain statistically sig-nificant results, 30 different instances were generated for each problem.

The experiment consisted of solving each of the instances (for problems P1-P12) using TA and TS. Table 3 shows the results obtained from this experiment, which was carried out on a laptop with the following characteristics: Intel Core Duo CPU T5450 at 1.66 GHz, with 2038 MB RAM, and 32 bit Vista Home Premium operating system. The first column indicates the problem. Table columns two though four hold the minimal, average and maximal values of the objective function for the solutions ob-tained by TA, and column five holds its average execution time in seconds. Columns six through nine show the corresponding values for TS.

Since for problems P1 through P4 the optimal values are known, an asterisk indi-cates when an algorithm was able to find the optimal solution. The results show that TA always finds the optimum for problems P1-P3 and sometimes finds it for problem P4; however, TS always finds the optimum only for problem P3, sometimes finds it for problems P1 and P2, and never finds it for problem P4. By looking at the average values of the objective function, it is clear that TA obtains better results than TS usu-ally with a smaller amount of execution time.


Table 2. Input data for generating test cases

Prob

lem

No.

Attr

ibut

es L

(=t)

No.

Site

s I

No.

Que

ries

K

Attr

ibut

e U

sage

qkm

Arr

ival

Rat

e f ki

Proc

essi

ng C

apac

ity

Cj

Tra

nsm

issi

on S

peed

C

ij

Mea

n Q

uery

Len

gth

1/µ Q

Mea

n R

espo

nse

Len

gth

1/µ R

P1 2 3 2 0-1 6-21 50 100,000 1,000 5,600 P2 10 2 3 0-1 0-19 50 100,000 1,000 5,600 P3 3 4 5 0-1 1-25 500 1,000,000 1,000 5,600 P4 10 3 4 0-1 2-21 500 1,000,000 1,000 5,600 P5 23 6 5 0-1 0-24 500 1,000,000 1,000 5,600 P6 14 5 13 0-1 0-25 500 1,000,000 1,000 5,600 P7 12 9 10 0-1 1-25 5,000 1,000,000 1,000 5,600 P8 23 5 10 0-1 0-25 5,000 1,000,000 1,000 5,600 P9 18 11 9 0-1 0-25 5,000 1,000,000 1,000 5,600 P10 26 11 7 0-1 0-25 5,000 1,000,000 1,000 5,600 P11 17 13 10 0-1 0-25 5,000 1,000,000 1,000 5,600 P12 28 9 12 0-1 0-25 5,000 1,000,000 1,000 5,600

Table 3. Comparative results for threshold accepting and tabu search

Threshold Accepting Tabu Search Objective Function Objective Function

Prob

.

Min. Avg. Max. Time Min. Avg. Max. Time P1 0.228208* 0.228208* 0.228208* 3 0.228208* 0.240492 0.289628 9 P2 0.118173* 0.118173* 0.118173* 15 0.118173* 0.260277 0.446106 12 P3 0.007264* 0.007264* 0.007264* 1 0.007264* 0.007264* 0.007264* 42 P4 0.006210* 0.006294 0.006746 2 0.006681 0.006943 0.007398 148 P5 0.008271 0.008935 0.010108 24 0.009887 0.010799 0.011491 1,755 P6 0.009362 0.011862 0.013149 63 0.018224 0.028068 0.098507 21 P7 0.006539 0.006673 0.006962 39 0.007648 0.007895 0.008104 1,300 P8 0.006598 0.007814 0.010415 24 0.008836 0.011892 0.014403 1,255 P9 0.006715 0.007009 0.007309 80 0.007867 0.008422 0.008749 3,961 P10 0.006745 0.007123 0.008014 100 0.008163 0.008463 0.008744 7,615 P11 0.006895 0.007252 0.008322 111 0.008088 0.008653 0.009213 4,968 P12 0.007369 0.010280 0.012339 91 0.011947 0.014928 0.020861 908

6 Final Remarks

This paper describes a mathematical programming model for the DDB distribution design problem, which is defined as finding relations fragments and their allocation such that the average roundtrip response time incurred by transmitting and processing all the queries and transmitting query responses is minimized. The optimal design problem was formulated as a zero-one integer programming model (called VFA-RT), which consists of an objective function (1) that aims at minimizing the overall


roundtrip query response time, and three groups of constraints (2)-(4). Unlike previ-ous works, the objective function used in our formulation permits modeling the nonlinear nature of query response time.

Formulating the DDB distribution design problem as a mathematical programming model makes possible to solve it using the large assortment of metaheuristic algo-rithms available. Therefore, we used two metaheuristic algorithms for solving the VFA-RT model: threshold accepting and tabu seach. Comparative experiments were conducted with these algorithms in order to assess their effectiveness for solving this problem. The experimental results show that threshold accepting obtains better results than tabu search usually with a smaller amount of execution time. However, it is im-portant to point out that the implemented tabu search is a basic version that does not include sophisticated intensification and diversification strategies; therefore, perhaps a more sophisticated implementation could yield better results.

References

1. Chakravarthy, S., Muthuraj, J., Varadarajan, R., et al.: An Objective Function for Verti-cally Partitioning Relations in Distributed Databases and its Analysis. Distributed and Parallel Databases 2(2), 183–207 (1994)

2. Golfarelli, M., Maio, D., Rizzi, S.: Vertical Fragmentation of Views in Relational Data Warehouses. In: Proc. Settimo Convegno Nazionale Sistemi Evoluti per Basi di Dati (SEBD 1999), Villa Olmo, Italy, pp. 23–25 (1999)

3. Pérez, J., Pazos, R., Santaolaya, R., et al.: Data-Object Replication, Distribution, and Mo-bility in Network Environments. In: Broy, M., Zamulin, A.V. (eds.) PSI 2003. LNCS, vol. 2890, pp. 539–545. Springer, Heidelberg (2004)

4. Menon, S.: Allocating Fragments in Distributed Databases. IEEE Trans. Parallel and Distributed Systems 16(7), 577–585 (2005)

5. Kleinrock, L.: Communication Nets: Stochastic Message Flow and Delay. Dover Publica-tions, USA (2007)

6. Tamhankar, A.M., Ram, S.: Database Fragmentation and Allocation: an Integrated Meth-odology and Case Study. IEEE Trans. Systems, Man and Cybernetics Part A 28(3), 288–305 (1998)

7. Huang, Y.F., Chen, J.H.: Fragment Allocation in Distributed Database Design. J. of Infor-mation Science and Engineering 17(3), 491–506 (2001)

8. Ulus, T., Uysal, M.: Heuristic Approach to Dynamic Data Allocation in Distributed Data-base Systems. Pakistan J. of Information and Technology 2(3), 231–239 (2003)

9. Basseda, R.: Fragment Allocation in Distributed Database Systems. Technical Report. Faculty of Electrical and Computer Eng., School of Engineering, Database Research Group, University of Tehran, Iran (2006)

10. Ma, H., Schewe, K.-D., Kirchberg, M.: A Heuristic Approach to Vertical Fragmentation Incorporating Query Information. In: Proc. 7th Int. Baltic Conf. on Databases and Informa-tion Systems, pp. 69–76 (2006)

11. Hababeh, I., Ramachandran, M., Bowring, N.: Application Design for Data Fragmentation and Allocation in Distributed Database Systems. In: Distributed Database Systems, Re-search and Practice Conference, Leeds, UK (2007)

12. Gorla, N., Wing-yan, B.P.: Vertical Fragmentation in Databases Using Data-Mining Tech-nique. Int. J. of Data Warehousing and Mining 4(3), 35–53 (2008)


13. Tambulea, L., Horvat-Petrescu, M.: Redistributing Fragments into a Distributed Database. Int. J. of Computers, Communications & Control 3(4), 384–394 (2008)

14. Pazos, R.A., Vázquez, G., Pérez, J., Martínez, J.A.: Modeling the Nonlinear Nature of Re-sponse Time in the Vertical Fragmentation Design of Distributed Databases. Advances in Soft Computing 50, 605–612 (2009)

15. Lin, X., Orlowska, M.E., Zhang, Y.: On Data Allocation with the Minimum Overall Communication Costs in Distributed Database Design. In: Proc. 5th International Confer-ence on Computing and Information, pp. 539–544 (1993)

16. Pérez, J., Pazos, R., Velez, L., Rodríguez, G.: Automatic Generation of Control Parameters for the Threshold Accepting Algorithm. In: Coello Coello, C.A., de Albornoz, Á., Sucar, L.E., Battistutti, O.C. (eds.) MICAI 2002. LNCS (LNAI), vol. 2313, pp. 125–144. Springer, Heidelberg (2002)

17. Dueck, G., Scheuer, T.: Threshold Accepting: a General Purpose Optimization Algorithm Appearing Superior to Simulated Annealing. Journal of Computational Physics 90(1), 161–175 (1990)

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

2010 kes · title: 2010_kes author: ï¿½ï¿½josï¿½a. mtz. created date: 1/20/2011 6:36:52 am

Documents