Internet of Things
Computer Sciences and Information Technology
1. CSE-CIC-IDS2018 on AWS
This dataset is the result of a joint initiative between the Canadian Institute for Cybersecurity (CIC) and the Communications Security Establishment (CSE), and the project builds on the idea of using profiles to generate cybersecurity datasets in a systematic way (UNB, 2018). Each profile gives a detailed description of an intrusion type together with an abstract distribution model of the protocols, applications, and entities at the lower network level. The dataset covers seven attack scenarios: network infiltration from within, brute force, web attacks, botnet, DDoS, Heartbleed, and DoS (UNB, 2018). Another important aspect of this dataset is the attacking infrastructure, which comprises 50 machines, while the victim organization consists of five departments and includes 420 personal computers and 30 servers (UNB, 2018). Moreover, UNB (2018) adds that the dataset records the network traffic as well as the log files of every machine on the victim side. It also includes eighty network traffic features extracted from the captured traffic with CICFlowMeter-V3 (UNB, 2018). Together, these components make up the Cyber Defense Dataset (CSE-CIC-IDS2018).
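As a minimal sketch of working with the dataset's structure, the snippet below separates the CICFlowMeter flow features from the class label. The column names and rows shown here are a small illustrative stand-in, not real records; the actual daily CSV files hold roughly eighty flow features per record.

```python
# Hypothetical sketch: splitting flow features from the class label of a
# CSE-CIC-IDS2018-style CSV. The rows and feature columns below are a tiny
# in-memory stand-in; the real files contain ~80 CICFlowMeter features.
import io
import pandas as pd

sample_csv = io.StringIO(
    "Flow Duration,Tot Fwd Pkts,Label\n"
    "11758,5,Benign\n"
    "230,2,DDoS\n"
    "98214,41,Benign\n"
)
df = pd.read_csv(sample_csv)

X = df.drop(columns=["Label"])  # network-flow features
y = df["Label"]                 # attack / benign class of each flow

print(X.shape)
print(sorted(y.unique()))
```

In practice, `pd.read_csv` would point at one of the dataset's per-day capture files instead of the in-memory sample.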
2. Random Forest Classifier
One of the major roles of machine learning is classification: determining which class or group a given observation belongs to. Classifying observations accurately matters for many reasons, including business applications such as predicting whether a user is likely to purchase a particular product (Korkmaz, 2018). Korkmaz adds that data science provides a collection of numerous classification algorithms, such as the support vector machine, decision trees, logistic regression, and the random forest classifier, which ranks at the top of the classifier hierarchy.
A random forest classifier is a tree-based model that builds numerous decision trees and combines their outputs to improve the model's ability to generalize (Korkmaz, 2018). Combining trees in this way is an ensemble method: weak learners, here the individual trees, are aggregated to produce a strong learner. Korkmaz (2018) adds that the technique can be applied both to classification problems on categorical data and to regression problems on continuous data.
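The ensemble idea above can be sketched with scikit-learn's random forest implementation. This is an illustrative example on synthetic data, not the classifier configuration used with the dataset described earlier.

```python
# Illustrative sketch: a random forest of many decision trees whose
# combined vote classifies held-out synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators controls how many individual trees the ensemble combines.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

print(len(forest.estimators_))   # number of individual trees in the forest
print(forest.score(X_te, y_te))  # accuracy of the combined vote
```

Each tree in `forest.estimators_` is a weak learner on its own; the forest's prediction is the majority vote across all of them.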
3. Support Vector Machine Classifier
A support vector machine (SVM) classifier is a tool used for both regression and classification (Cristianini, 2004). The SVM seeks a hyperplane in a multidimensional space that separates record instances into distinct classes. Hyperplanes act as decision boundaries that help classify record instances, and a hyperplane's dimensionality depends on the number of features (Cristianini, 2004). This means that in two dimensions the hyperplane appears as a line, while in three dimensions it takes the form of a two-dimensional plane. The position and orientation of a hyperplane are determined by the support vectors, which are the data points lying closest to the hyperplane (Cristianini, 2004). Through these support vectors, Cristianini (2004) adds, the hyperplane's margin can be established. The support vector machine classifier thus maximizes the margin between the hyperplane and these data points.
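A minimal sketch of these ideas, using scikit-learn's SVM on two-dimensional synthetic data so the hyperplane is a line: the fitted model exposes the support vectors that fix the margin's position and orientation. This is an assumption-laden toy example, not the paper's experimental setup.

```python
# Illustrative sketch: a linear SVM on 2-D synthetic data. In 2-D the
# separating hyperplane is a line, and support_vectors_ holds the points
# closest to it that determine the margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.coef_.shape)            # one weight vector -> a line in 2-D
print(len(clf.support_vectors_))  # the few points that define the margin
```

Only the support vectors influence the decision boundary; moving any other training point (without crossing the margin) leaves the hyperplane unchanged.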
4. The Decision tree
According to Bhukya and Ramachandram (2010), a decision tree is a predictive model that begins with a single node and branches into possible outcomes. Each outcome leads to additional nodes, which in turn branch into further possibilities, giving the model its tree-like shape. Each internal node is constructed to test a given feature and branches according to the test's outcome, while every leaf node represents a class (Bhukya & Ramachandram, 2010).
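The node-and-leaf structure described above can be made concrete by printing the rules of a small fitted tree. This is an illustrative sketch on the standard iris dataset, not the approach of the cited paper.

```python
# Illustrative sketch: a depth-2 decision tree whose printed rules show the
# root node testing a feature, inner nodes testing further features, and
# each leaf assigning a class.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" level is one branch; "class:" lines are the leaf nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))
```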
5. XGBoost
XGBoost is a machine learning algorithm that builds decision-tree-based predictive models within a gradient boosting framework (Morde, 2019). When handling prediction problems that involve unstructured data, artificial neural networks usually outperform other frameworks and algorithms. For small-to-medium structured data, however, Morde (2019) notes that XGBoost is considered the best tool currently available.
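The gradient boosting framework that XGBoost implements can be sketched with scikit-learn's `GradientBoostingClassifier`, used here so the example runs without the separate `xgboost` package; `xgboost.XGBClassifier` exposes the same `fit`/`predict` interface. This is a generic sketch of the technique, not Morde's benchmark.

```python
# Illustrative sketch of gradient boosting over decision trees: each new
# tree is fit to the errors (loss gradients) left by the trees before it.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of boosting rounds; learning_rate scales each
# tree's correction of the ensemble built so far.
booster = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=0
)
booster.fit(X_tr, y_tr)

print(booster.score(X_te, y_te))  # accuracy on held-out data
```

Unlike a random forest, which grows its trees independently and averages them, boosting grows trees sequentially, each one correcting the residual errors of the ensemble so far.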
References
Bhukya, D. P., & Ramachandram, S. (2010). Decision Tree Induction: An Approach for Data Classification Using AVL-Tree. International Journal of Computer and Electrical Engineering, 660–665. doi: 10.7763/ijcee.2010.v2.208
Cristianini, N. (2004). Support Vector Machine (SVM, Maximal Margin Classifier). Dictionary of Bioinformatics and Computational Biology. doi: 10.1002/9780471650126.dob0717.pub2
Korkmaz, S. A. (2018). Recognition of the İmages with Random Forest Classifier Based on the LPP and LBP Methods. Sakarya University Journal of Science, 1–1. doi: 10.16984/saufenbilder.349567
Morde, V. (2019, April 8). XGBoost Algorithm: Long May She Reign! Retrieved from https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
UNB. (2018). CSE-CIC-IDS2018 on AWS. Retrieved from https://www.unb.ca/cic/datasets/ids-2018.html