Datasets
1. Methods and Materials
1.1 Introduction
Researchers use many types of methodology to achieve specific goals. Most commonly, methodological approaches are either qualitative or quantitative. Table 1 shows the features of qualitative and quantitative research.
To satisfy the objectives of the capstone, quantitative research is used. This type of methodology deals with numerical and statistical data, which supports objective analysis and accurate results. The research design is experimental, because it involves allocating samples of data to various groups. It may include matched sets, independent groups, and repeated measures.
Table 1: Features of Qualitative & Quantitative Research
| Qualitative Research | Quantitative Research |
| The aim is a complete, detailed description. | The aim is to classify features, count them, and construct statistical models to explain what is observed. |
| The researcher may only know roughly in advance what he/she is looking for. | The researcher knows clearly in advance what he/she is looking for. |
| Recommended during earlier phases of research projects. | Recommended during latter phases of research projects. |
| The design emerges as the study unfolds. | |
| The researcher is the data gathering instrument. | |
1.2 Datasets
A dataset is a collection of data assembled in a particular order. It can hold any data from a sequence, an array, or a database table (Nakajima & Bui, 2016). A dataset is tabular, with rows and columns. Several datasets are used in this work. The Wireless Sensor Network Dataset (WSN-DS) supports intrusion detection in wireless sensor networks; it was gathered using Network Simulator 2 (NS-2) and later refined to generate 23 features (Almomani, 2016). The IoT Intrusion Dataset 2020 (IoTID20) was produced using an MQTT-based network design, and it helps to operate, train, and assess newly constructed IoT intrusion detection systems (IDS). The IoT Botnet Dataset covers a wide-ranging network in which functions with various characteristics run on IoT devices (Putchala, 2017). The BoT-IoT dataset was generated by constructing a realistic network environment in the cyber range lab; the environment comprises normal and botnet traffic. The features, classes, rows, and columns of a dataset depend on what is being examined and on the researcher's needs. Datasets are typically created by artificial intelligence teams with broad knowledge of data algorithms and machine learning. Freely available sources include Microsoft datasets and Kaggle Datasets.
1.2.1 Wireless Sensor Network Dataset (Dataset 1)
WSN-DS is designed to support effective intrusion detection in wireless sensor networks (WSNs). It was gathered using Network Simulator 2 (NS-2) and later refined to generate 23 features. The dataset has 374,661 instances. WSNs can be deployed without many routers, and they are used to monitor physical and environmental conditions such as movement, blasts, and temperature. An intrusion detection system (IDS) therefore needs to be in place to secure WSN services in a network. The IDS should also suit the characteristics of a WSN and be able to detect security threats in the network. WSN-DS categorizes several DoS attacks, namely scheduling, flooding, grey hole, and black hole attacks (Almomani, 2016).
| id | time | Is_CH | Who CH | Dist_To_CH | ADV_R | JOIN_R | SCH_R | Rank | Consumed Er | Attack Type |
| 101000 | 50 | 1 | 101000 | 0 | 0 | 25 | 1 | 0 | 2.4694 | Normal |
| 101010 | 50 | 1 | 101010 | 0 | 0 | 30 | 0 | 0 | 2.3611 | Blackhole |
| 104001 | 200 | 0 | 104077 | 0 | 2 | 0 | 1 | 1 | 0.03509 | Grey |
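As a minimal sketch of how records like the three sample rows above can be handled in code, the following parses them into dictionaries and tallies the attack labels. The column names and values are copied from the table; the names `SAMPLE` and `parse_rows` are illustrative and are not part of any WSN-DS tooling.

```python
from collections import Counter

# Sample rows copied from the table above (pipe-delimited).
SAMPLE = """\
id | time | Is_CH | Who CH | Dist_To_CH | ADV_R | JOIN_R | SCH_R | Rank | Consumed Er | Attack Type
101000 | 50 | 1 | 101000 | 0 | 0 | 25 | 1 | 0 | 2.4694 | Normal
101010 | 50 | 1 | 101010 | 0 | 0 | 30 | 0 | 0 | 2.3611 | Blackhole
104001 | 200 | 0 | 104077 | 0 | 2 | 0 | 1 | 1 | 0.03509 | Grey
"""


def parse_rows(text):
    """Split pipe-delimited text into a list of {column: value} dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_rows(SAMPLE)
# Count how many records carry each attack label.
label_counts = Counter(row["Attack Type"] for row in rows)
print(label_counts)
```

All values are kept as strings here; a real pipeline would convert the numeric columns before training.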
1.2.2 IoT Intrusion Dataset 2020 (Dataset 2)
The dataset was produced using an MQTT-based network design. IoTID20 helps to operate, train, and assess newly constructed IoT intrusion detection systems. IoT networks need an adequately constructed dataset in order to develop sound detection approaches and methods for IoT devices (Putchala, 2017).
1.2.3 IoT Botnet Dataset (Dataset 3)
It covers a wide-ranging network in which functions with various characteristics run on IoT devices. The dataset helps evaluate how accurately a detection scheme identifies abnormal activity in IoT networks.
It was generated by constructing a realistic network environment in the cyber range lab. The environment comprises normal and botnet traffic. The dataset's source files are provided in different formats, such as CSV. The files are then separated by attack category, which makes intrusions on IoT devices and networks easier to identify.
1.2.4 DDoS Botnet on IoT Devices (Dataset 4)
Large and powerful botnets are being built from IoT devices. This has increased the number of DDoS attacks and made them more potent in the network. Such botnet attacks are slow to be detected, and the IoT devices involved have insufficient authentication.
1.3 Balanced or imbalanced
A balanced dataset is one in which the number of positive instances is roughly equal to the number of negative instances. An imbalanced dataset is one in which one class has many more instances than the other. Binary datasets have two class values, one and zero. Multiclass datasets have more than two classes. One technique for dealing with imbalance is to change the evaluation metric: the model is trained on the dataset, it makes predictions for data not used during training, and those predictions are compared with the expected values using a metric that is not dominated by the majority class. Another technique is the confusion matrix, which summarizes how well a model is performing on the dataset. From it, precision, recall, and F1 are computed.
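A simple way to check whether a dataset is balanced is to compare class frequencies. The sketch below assumes binary labels stored in a plain list; the function names and the example labels are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of instances} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical binary labels: 1 = attack, 0 = normal.
balanced = [0, 1, 0, 1, 0, 1]
imbalanced = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

print(imbalance_ratio(balanced))       # 1.0
print(imbalance_ratio(imbalanced))     # 4.0
print(class_distribution(imbalanced))  # {0: 0.8, 1: 0.2}
```

A ratio well above 1.0 is a signal that accuracy alone may be misleading and that precision, recall, or F1 should be reported as well.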
1.4 Feature selection
Subset selection evaluates a subset of features as a group for suitability (Liu, 2017). The gain ratio is a modification of information gain that reduces its bias. It reduces the bias toward multivalued features by taking the number and size of branches into account when choosing a feature. Principal component analysis (PCA) is an unsupervised technique used for dimensionality reduction in machine learning. It is also used to ease interpretation while minimizing information loss. Noise insertion, on the class labels or on the features, may or may not be applied to the training data.
Noise insertion enlarges the dataset. When noise is inserted into a dataset and the model is not trained on it, it can cause problems in the data. The instance reduction technique groups features that have similar coefficient values. For each feature, it also computes a similarity coefficient from the feature's initial values in the dataset.
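The gain ratio can be illustrated with a short sketch. It assumes the standard definition, information gain divided by split information, and categorical feature values; the feature names and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits, written as sum of p * log2(1/p)."""
    total = len(values)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a categorical feature divided by its split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information grows with the number of distinct feature values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["attack", "attack", "normal", "normal"]
protocol = ["tcp", "tcp", "udp", "udp"]   # two values, separates the classes
record_id = ["a", "b", "c", "d"]          # unique per row, also separates them

print(gain_ratio(protocol, labels))   # 1.0
print(gain_ratio(record_id, labels))  # 0.5
```

Both features have the same information gain, but the identifier-like feature is penalized because it splits the data into many small branches.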
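One possible reading of noise insertion that enlarges the data is to append perturbed copies of the training rows. The sketch below assumes numeric features and Gaussian noise; the function name, the `sigma` value, and the example rows are hypothetical and are not taken from the datasets described earlier.

```python
import random


def add_feature_noise(rows, sigma=0.05, seed=0):
    """Return the original rows plus one Gaussian-perturbed copy of each row."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    noisy = [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]
    return rows + noisy


# Hypothetical numeric feature vectors.
train = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
augmented = add_feature_noise(train)

print(len(train), len(augmented))  # 3 6
```

Class noise could be sketched in the same way by changing a fraction of the labels instead of the feature values.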
1.5 Noise removal on training
In noise suppression, strong noise is suppressed by combining the noisy data with a clean representation of the dataset. Spectral noise gating applies when the signal level is higher than that of the unwanted noise in the datasets (Wolterink, 2017). Instance reduction removes noise by choosing representative features for every cluster and discarding the remaining, noisy features. Conversely, assigning to one class features that resemble those of another group in the dataset leads to noise insertion.
1.6 ML Techniques
Machine learning techniques train computers to do what comes naturally to humans and animals (Bhavitha, 2017). Machine learning is a broad field that has produced a computational and statistical theory of learning processes and has designed learning algorithms. Instructors also use these techniques to analyze students' skills and knowledge and to find new ways of teaching them. The following ML techniques are used.
1.6.1 Neural Network
A neural network is a set of algorithms designed to recognize patterns. It interprets sensory data through a kind of machine perception, labeling or clustering the input, and so helps to classify and group datasets. It consists of artificial neurons arranged in three kinds of layers: the input, hidden, and output layers. The multilayer perceptron (MLP) is the neural network implemented in this paper.
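The layered structure can be made concrete with a forward pass through a very small multilayer perceptron. This is only a sketch under stated assumptions: the sigmoid activation, the layer sizes, and the weights are hypothetical, the network is untrained, and it is not the configuration used in this paper.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]


def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Forward pass through the input, hidden, and output layers."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, output_w, output_b)


# Hypothetical weights: 3 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
hidden_b = [0.0, 0.1]
output_w = [[0.7, -0.6]]
output_b = [0.05]

score = mlp_forward([1.0, 0.5, -1.0], hidden_w, hidden_b, output_w, output_b)
print(score)  # a list with one value between 0 and 1
```

Training would adjust the weights and biases, typically by backpropagation, which is outside the scope of this sketch.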
1.6.2 Decision Trees
A decision tree (DT) is a set of rules used to classify data according to feature values. The classification is represented as a tree structure, where the branches represent the chosen input features. While the tree is being built, the training dataset is split into subgroups. The best split is determined by child impurity, which is measured by entropy:
$$\text{entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i$$
where $c$ is the number of classes and $p_i$ is the proportion of instances in the node that belong to class $i$.
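To illustrate how child impurity selects a split, the sketch below searches for the threshold on one numeric feature that minimizes the weighted entropy of the two children. It assumes Shannon entropy in bits as the impurity measure; the function names and the example values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits, written as sum of p * log2(1/p)."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted child entropy) of the best binary split."""
    best = (None, float("inf"))
    total = len(labels)
    # Splitting at the largest value would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        impurity = (len(left) * entropy(left) + len(right) * entropy(right)) / total
        if impurity < best[1]:
            best = (t, impurity)
    return best


# Hypothetical values of a single numeric feature with their class labels.
values = [0, 2, 25, 30, 1, 28]
labels = ["normal", "normal", "attack", "attack", "normal", "attack"]

print(best_threshold(values, labels))  # (2, 0.0)
```

A full decision tree repeats this search over every feature and then recurses into each child until a stopping rule is met.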
1.6.3 Random Forest
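A random forest merges several decision trees, where every tree is built from the values of an independent random vector. The group of trees as a whole determines the result. It is known as an ensemble learning method because it combines the outputs of the decision trees. As the number of trees grows, the generalization error converges to the following limit:
$$\text{Generalization error} = P_{X,Y}\left( P_{\Theta}\big(h(X,\Theta) = Y\big) - \max_{j \neq Y} P_{\Theta}\big(h(X,\Theta) = j\big) < 0 \right)$$
where $h(X,\Theta)$ is the prediction of a tree built from the random vector $\Theta$ for an input $X$ whose true class is $Y$.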
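The way an ensemble combines the outputs of its trees can be sketched with a majority vote. The sketch covers only the aggregation step: the three "trees" are hypothetical stand-in rules, whereas a real random forest would train each tree on a random sample of the data and features.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Collect one prediction per tree, then combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-in trees: simple threshold rules on different features.
trees = [
    lambda s: "attack" if s["f1"] > 10 else "normal",
    lambda s: "attack" if s["f2"] > 0.5 else "normal",
    lambda s: "attack" if s["f3"] == 0 else "normal",
]

sample = {"f1": 30, "f2": 0.2, "f3": 0}
print(forest_predict(trees, sample))  # attack (two of the three trees agree)
```

An odd number of trees avoids ties in the binary case.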
1.7 Evaluation Criteria
A machine learning algorithm should be evaluated (Brink, 2017). The evaluation criteria include accuracy, the percentage of correct predictions on the test data. A true positive is a case where the prediction is YES and the actual result is also YES. A true negative is a case where the prediction is NO and the actual result is also NO. A t-test is applied when the test statistic follows a t-distribution under the null hypothesis, which is the case when the statistic would be normally distributed if a scaling term in it were known. G-mean is the geometric mean of sensitivity and specificity.
1.7.2 Accuracy
Accuracy indicates the percentage of correct predictions made by the machine learning algorithm. It is obtained by classifying the test instances of the dataset and checking the results against their true classes. The dataset is divided into subgroups, the selected subgroups are tested, and the one with the most accurate results is identified. The equation is as follows: $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP and TN are the true positives and true negatives, and FP and FN are the false positives and false negatives.
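The four counts and the accuracy formula can be checked on a small example. The sketch assumes binary labels in which 1 marks the positive class; the label lists are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive and a != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 1 = attack, 0 = normal.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```

The parentheses in `accuracy` matter: without them, ordinary operator precedence would divide only TN by TP.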
1.7.3 Precision
Precision shows how accurate the positive predictions are. It is obtained by dividing the true positives by the sum of the true positives and false positives. The equation is as follows: $\text{precision} = \frac{TP}{TP + FP}$.
1.7.4 Recall
Recall is the proportion of actual positive instances that the classifier identifies correctly. It is obtained by dividing the true positives by the sum of the true positives and false negatives. The equation is: $\text{recall} = \frac{TP}{TP + FN}$.
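Precision and recall, together with F1 (their harmonic mean), can be computed directly from the confusion-matrix counts. The counts below are hypothetical, and returning 0.0 for an empty denominator is a convention assumed for this sketch.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the classifier finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts from a confusion matrix.
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))     # 0.75
print(recall(tp, fn))        # 0.6
print(f1_score(tp, fp, fn))  # approximately 0.667
```

Because F1 ignores true negatives, it is less affected by a large majority of normal traffic than accuracy is.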
1.7.5 G-means
G-means is also the name of a clustering technique that attempts to determine the number of clusters automatically by repeatedly testing the data. It is an extension of k-means: it runs k-means while incrementing k in a deterministic manner until a statistical test accepts the hypothesis that the data assigned to every k-means center are Gaussian. As an evaluation criterion, G-mean is the geometric mean of sensitivity and specificity. The equation is as follows: $\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$.
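Section 1.7 describes G-means as a mean of sensitivity and specificity. The sketch below follows that reading and assumes the mean is geometric, which is the usual form of the G-mean evaluation measure; the counts are hypothetical.

```python
import math


def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts for a reasonable classifier on imbalanced data.
print(g_mean(tp=8, tn=90, fp=10, fn=2))

# A classifier that always predicts the negative class detects no positives.
print(g_mean(tp=0, tn=100, fp=0, fn=10))  # 0.0
```

The second case has an accuracy of about 0.91 but a G-mean of zero, which is why the measure is useful on imbalanced datasets.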
References
Almomani, I., Al-Kasasbeh, B., & Al-Akhras, M. (2016). WSN-DS: A dataset for intrusion detection systems in wireless sensor networks. Journal of Sensors, 2016.
Bhavitha, B. K., Rodrigues, A. P., & Chiplunkar, N. N. (2017, March). Comparative study of machine learning techniques in sentimental analysis. In 2017 International conference on inventive communication and computational technologies (ICICCT) (pp. 216-221). IEEE.
Brink, H., Richards, J. W., Fetherolf, M., & Cronin, B. (2017). Real-world machine learning (p. 330). Shelter Island, NY: Manning.
Liu, Y., Bi, J. W., & Fan, Z. P. (2017). Multiclass sentiment classification: The experimental comparisons of feature selection and machine learning algorithms. Expert Systems with Applications, 80, 323-339.
Malheiro, R., Panda, R., Gomes, P., & Paiva, R. (2016). Bi-modal music emotion recognition: Novel lyrical features and dataset. 9th International Workshop on Music and Machine Learning–MML’2016–in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases–ECML/PKDD 2016, October 2016.
Nakajima, S., & Bui, H. N. (2016, December). Dataset coverage for testing machine learning computer programs. In 2016 23rd Asia-Pacific Software Engineering Conference (APSEC) (pp. 297-304). IEEE.
Putchala, M. K. (2017). Deep learning approach for intrusion detection system (ids) in the internet of things (IoT) network using gated recurrent neural networks (GRU).
Wolterink, J. M., Leiner, T., Viergever, M. A., & Išgum, I. (2017). Generative adversarial networks for noise reduction in low-dose CT. IEEE transactions on medical imaging, 36(12), 2536