selem trabelsi & walha amal

Upload: salem-trabelsi

Post on 04-Apr-2018

230 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/29/2019 Selem Trabelsi & Walha Amal

    1/25

    A novel intrusion detection system based on

    hierarchical clustering and

    support vector machines

    Prsent par:

    proposed by :Mr Abbes Tarek

  • 7/29/2019 Selem Trabelsi & Walha Amal

    2/25

    PLAN

  • 7/29/2019 Selem Trabelsi & Walha Amal

    3/25

    Introduction

  • 7/29/2019 Selem Trabelsi & Walha Amal

    4/25

    NIDS is a network intrusion detection system that attempts to detect

    malicious activitiesA problem of NIDS is that detects only known network attacks.

    Many methods proposed on the design of NIDS such as:

    Introduction

    1

    This study proposes a SVM-based intrusion detection system based on ahierarchical clustering algorithm to preprocess the KDD Cup 1999

    dataset before SVM training.

    decision tree based on C5 used by the KDD Cup 1999

    Fuzzy rough C-means (FRCM) based on the fuzzy set theory.

    support vector machine (SVM)

  • 7/29/2019 Selem Trabelsi & Walha Amal

    5/25

  • 7/29/2019 Selem Trabelsi & Walha Amal

    6/25

    Designed for very large data sets

    Time and memory are limitedIncremental and dynamic clustering of incoming objectsOnly one scan of data is necessaryConstructs a tree called a clustering feature (CF) treeDoes not need the whole data set in advanceAble to handle noise effectively.

    BIRCH hierarchical clustering alghorithm

    Two key phases:Scans the database to build an in-memory treeApplies clustering algorithm to cluster the leaf nodes

    2

  • 7/29/2019 Selem Trabelsi & Walha Amal

    7/25

    Clustering feature (CF)Node in the CF tree composed of clustering feature

    A CF is a triplet summarizes the information of a cluster.

    Given a cluster of instances we define:

    Centroid:

    Radius:

    Euclidean distance :

    Theorem to merge sub-clusters:

    CF1 + CF2= (n1+n2, LS1+LS2, SS1+SS2)

    NP

    i

    NP

    i

    i

    i

    PSS

    PLS

    2

    CF = (n, LS, SS)

    3

  • 7/29/2019 Selem Trabelsi & Walha Amal

    8/25

    4

    A CF tree is a compact representation of a dataset

    The insertion procedure of CF tree has three steps.

    leaf node represents a cluster that absorbs many data point and must

    satisfy the threshold requirement T.

    Each non-leaf node contains B entries of the form (CFi, child i )

  • 7/29/2019 Selem Trabelsi & Walha Amal

    9/25

    5

    Step 1: Identify the appropriate leaf

    It starts from the root and traverses the CF tree recursively down to the

    leaf level by choosing the child node, whose centroid is closest at each

    level to the new entry.

    The algorithm computes the distances between CFx and each entry

  • 7/29/2019 Selem Trabelsi & Walha Amal

    10/25

    6

    three possible options:

    leaf absorbs the new entry without violating the radius threshold T

    condition. add a new leaf entry for the new entry.

    If adding a new entry violates the branching factor B threshold the

    leaf node has to split by choosing the farthest pair of entries as seeds.

    Step 2: Modify the leaf

  • 7/29/2019 Selem Trabelsi & Walha Amal

    11/25

    7

    Step 3: Modify entries on the path to the leaf

    oAfter inserting the new entry into a leaf node, the algorithm

    needs to update the CF information for each non-leaf entry alongthe path backwards to the root.

    oIf no leaf node splitting is invoked, the algorithm simply adds

    the CF to reflect the addition of the new entry. If leaf nodesplitting is invoked, the algorithm checks whether the parent node

    meets the branching factor constraint.

    o

    If the parent node violates the B threshold, it is split andrecursively traversed back to the root, while performing the

    same checks.

  • 7/29/2019 Selem Trabelsi & Walha Amal

    12/25

  • 7/29/2019 Selem Trabelsi & Walha Amal

    13/25

    A SVM is a supervised learning method. It performs classification by

    constructing an N -dimensional hyperplane that optimally separates the data

    into different categories.

    Support vector machines (SVM)

    8

    The training complexity is very dependent on the size of the dataset

    SVM have shown good results in data classification , but it is unable tooperate at such a large dataset due to system failures caused by

    insufficient memory

  • 7/29/2019 Selem Trabelsi & Walha Amal

    14/25

    Is a combination of support vector machine and hierarchical clustering to

    build an intrusion detection system.The BIRCH algorithm has to transform dataset to a smaller sized and used

    to produce a reduced and high quality dataset before SVM training.

    The process of the proposed system is described as follows:

    (1) Transform and scale data

    (2) Construct CF trees for attacks and normal packets

    (3) Do feature selection for each type of attacks.

    (4) Train four SVM classifiers by the centroid of all entries in leaf nodes of

    CF trees.(5) Combine the four SVMs classifiers to build an intrusion detection system

    9

    SVM with hierarchical clustering

    f d l

  • 7/29/2019 Selem Trabelsi & Walha Amal

    15/25

    SVM requires each data point to be represented as a vector of real numbers.

    every non-numerical attribute has to be transformed into numerical data first.

    For example, the protocol_type attribute in KDD Cup 1999, tcp is changed

    with 0, udp with 1, and icmp with 2.

    Data transformation and scaling

    Data scaling can avoid attributes with greater values dominating those

    attributes with smaller values.

    each attribute is called with linear scaling to the range of [0, 1] by

    dividing every attribute value by its own maximum value.

    10

    i

  • 7/29/2019 Selem Trabelsi & Walha Amal

    16/25

    CF trees construction

    The CF trees can be constructed with a single scan of the dataset.

    four kinds of attacks in KDD Cup:

    DoS: Denial of ServiceR2L: illegitimate access from a remote machine

    U2R: Acquire the privileges of a super user

    Probing: scan of port

    One CF tree for normal taffic

    In the KDD Cup 1999 data set there are 41 features

    LS and SS are contain the sum and square sum value of each of the features 10

    l i

  • 7/29/2019 Selem Trabelsi & Walha Amal

    17/25

    Feature selection

    11

    not all features are needed in the design of a network intrusion

    detection system.

    It is critical to identify important features of networktraffic

    data.

    a feature is important is determined based on the accuracy and

    the number of false positives of the system, with and without the

    feature.

  • 7/29/2019 Selem Trabelsi & Walha Amal

    18/25

  • 7/29/2019 Selem Trabelsi & Walha Amal

    19/25

    12

    The datasets contained 24 training attack types classified into

    four kinds of attacks, DoS, U2R, R2L, and Probe.

    The best performance of this system in terms of accuracy was 95.72%

    with only a 0.73% false positive rate.

    this system showed superior performance in DoS and Probe attacks and

    suffered from both U2R and R2L attacks because the numbers of

    instances for these two attacks were too small in the original KDD Cup

    1999 dataset.

  • 7/29/2019 Selem Trabelsi & Walha Amal

    20/25

    13

    This table presents the original numbers of instance for each kind of attack

    and the numbers of instances by CF trees with different threshold T

    The hierarchical clustering reduces the number of instances for datasets.For exemple with a CF tree (T=0.2) the number of instances of Dos is very

    small compared to original data set (only 271)

  • 7/29/2019 Selem Trabelsi & Walha Amal

    21/25

    14

    This table compares the KDD Cup 1999 winners system and other

    researches, this system provided the best detection rate for DoS andProbe attacks.

    The ESC-IDS showed the best detection rate for the R2L attack, and

    Multi-classifier showed the best detection rate for the U2R attack.

    in terms of accuracy, this system could achieve the best performance.

  • 7/29/2019 Selem Trabelsi & Walha Amal

    22/25

    15

    As shown in Table , the

    IDS detection rate of this

    study for new attacks is

    only 39.04%, and theworst detection is on

    new R2L attacks.

  • 7/29/2019 Selem Trabelsi & Walha Amal

    23/25

  • 7/29/2019 Selem Trabelsi & Walha Amal

    24/25

    CONCLUSIONET

    PERSPECTIVES

    Many researches concerning NIDSs applied SVMs because SVMs are wellknown for their generalization performances.

    this study proposed an SVM-based network intrusion detection system with

    BIRCH hierarchical clustering for data preprocessing.

    The BIRCH hierarchical clustering could provide highly qualified, abstracted

    and reduced datasets to the SVM training

    the resultant SVM classifiers showed better performance than the SVM

    classifiers using the originally redundant dataset.

    16

  • 7/29/2019 Selem Trabelsi & Walha Amal

    25/25