Task
Using any suitable open-source research dataset of your choosing (e.g., from Google Dataset Search or UCI Machine Learning Repository), carry out the following tasks:
1. Build and train classification and/or regression models from the dataset in any suitable programming environment of your choosing (e.g., MATLAB) using three machine learning techniques of your choice.
2. Justify the rationale behind the choice of your dataset, machine learning techniques, and programming environment.
3. Compare and contrast the performance of the three machine learning techniques in terms of prediction or validation accuracy, training time, prediction speed, R-squared values, MSE values, and transparency (as may be applicable).
4. Analyse the error matrices, the ROCs (and AUCs) for all three methods (as may be applicable).
5. Comment on how the hyperparameters (if any) are tuned or optimized to enhance the built/trained models.
6. Submit a report showing the work carried out.
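The quantities named in tasks 3 and 4 (accuracy, training time, prediction speed, R-squared, MSE, error matrix, AUC) can be computed in a few lines in most environments. The following is a minimal Python sketch using only the standard library; it is an illustration, not a required method. All function names (`confusion_counts`, `roc_auc`, `timed`, and so on) and the toy labels and scores are assumptions made for this example, and the AUC is computed with the pairwise-ranking definition rather than by tracing the ROC curve.

```python
import time


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive example is scored
    above a random negative example (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mse(y_true, y_pred):
    """Mean squared error for a regression model."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Toy data (made up for illustration only).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6]

print("TP, FP, FN, TN:", confusion_counts(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
print("MSE:", mse([3, 5, 7], [2, 5, 9]))
print("R^2:", r_squared([3, 5, 7], [2, 5, 9]))
```

Wrapping a model's training call and its prediction call in `timed` gives the training time and prediction speed for each of the three techniques, so that all of them are measured the same way on the same data split.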
A sample dataset from Google Dataset Search for Credit Card Scoring with Targets is also provided on Canvas.
Note that for any dataset you choose, you must demonstrate a clear interpretation of the dataset, show how it fits the problem you are trying to solve, and acknowledge the source via a detailed citation.
As stated above, you may use any dataset of your choice, preferably an open-source dataset that is freely available from an online repository. Should you decide to “stray” by using a dataset from your workplace or from your own experiments, please be certain that this does not infringe upon any data privacy and protection policies. In other words, this dataset must not be restricted in any way; that is, it can be viewed, edited, and modified by anybody.
For datasets without labels, classes, or categories, you can generate suitable labels, classes, or categories using appropriate conventional methods. For example, in a credit card scoring dataset, a 24-year-old male who rents, has a large unpaid credit amount on his car, and has little money in his checking and savings accounts may be considered to have a “high risk” of defaulting on any additional credit.
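One way to generate such labels is a simple rule that counts risk indicators per record. The sketch below is hypothetical: the `Applicant` fields, the `risk_label` function, the monetary thresholds, and the "three or more flags" cut-off are all assumptions chosen for illustration, not values taken from any dataset or scoring standard. Any real labelling rule should be justified in the report.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    """Hypothetical record layout for a credit scoring dataset."""
    age: int
    housing: str             # e.g. "own", "rent", or "free"
    credit_amount: float     # outstanding credit
    checking_balance: float
    savings_balance: float


def risk_label(a, credit_threshold=5000.0, balance_threshold=500.0):
    """Label an applicant by counting risk flags.

    The thresholds and the cut-off of three flags are illustrative
    assumptions only.
    """
    flags = [
        a.age < 25,
        a.housing == "rent",
        a.credit_amount > credit_threshold,
        a.checking_balance + a.savings_balance < balance_threshold,
    ]
    return "high risk" if sum(flags) >= 3 else "low risk"


# The applicant described in the text: young, renting, large unpaid
# credit, little money in checking and savings accounts.
young_renter = Applicant(24, "rent", 8000.0, 100.0, 50.0)
homeowner = Applicant(45, "own", 1000.0, 3000.0, 10000.0)

print(risk_label(young_renter))
print(risk_label(homeowner))
```

A rule of this kind is transparent and easy to document, which helps when demonstrating a clear interpretation of the dataset.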
Guidelines
Your submission should be 2,000 words in length (+/- 10%).
As stated in the task, you can use any suitable open-source research dataset of your choice, though here is a dataset (source: Kaggle (Leonardo Ferreira, 2018)) that you may wish to use.
Again, you may use any dataset of your choice, preferably an open-source dataset that is freely available from an online repository such as the ones mentioned above. Should you decide to “stray” by using a dataset from your workplace or from your own experiments, please be certain that this does not infringe upon any data privacy and protection policies. In other words, this dataset must not be restricted in any way; that is, it can be viewed, edited, and modified by anybody.
Please make sure that you correctly cite all secondary sources you use, and include a reference list. The reference list will not be included in your final word count.
Hint
Ensure that your submission fulfils the marking criteria detailed below.
Please note that for this assignment, you have the freedom to use any programming or development environment of your choosing. Students who have little or no background in programming are strongly recommended to use the nearly “plug-and-play” approach available via built-in applications in the MATLAB environment, as carried out in the demonstration videos (Week 7).
Please refer to this published paper to gain an analytical and methodical understanding of how machine learning techniques or algorithms (classification methods ONLY) can be evaluated and compared for a given machine learning problem or task:
• A. O. Sangodoyin, M. O. Akinsolu, P. Pillai and V. Grout, “Detection and Classification of DDoS Flooding Attacks on Software-Defined Networks: A Case Study for the Application of Machine Learning,” in IEEE Access, vol. 9, pp. 122495-122508, 2021, doi: 10.1109/ACCESS.2021.3109490.
Please refer to this published paper to gain an analytical and methodical understanding of how linear regression can be applied to a given machine learning problem or task:
• M. O. Akinsolu, A. O. Sangodoyin, and U. E. Uyoata, “Behavioral Study of Software-Defined Network Parameters Using Exploratory Data Analysis and Regression-Based Sensitivity Analysis,” in Mathematics, vol. 10, no. 14, Art. no. 2536, 2022, doi: 10.3390/math10142536.
Note that you will need to include as much information as you can in your submission to show sufficiently that you have carried out your work independently. Consequently, the onus is on you to decide independently what to include to fulfil the marking criteria (see below). For example, it is for you to decide whether to present your code as full scripts, “excerpts”, “narratives”, clear screenshots, or in any other form.
Grading:
Description | Marks
Dataset (including citation of source and acknowledgment) | 2.5
Definition and Justification of Problem According to the Dataset: Classification Problem and/or Regression Problem | 2.5
Data Pre-processing and Feature Extraction: Identification of Predictors, Categories and Targets, Handling Noisy Data and Missing Data, Others | 10.0
Rationale Informing the Selection or Choice of the Three Machine Learning Techniques or Methods to Build and Train Models to Address the Problem | 10.0
Model Performance Assessment Using the 1st Machine Learning Technique or Method | 10.0
Model Performance Assessment Using the 2nd Machine Learning Technique or Method | 10.0
Model Performance Assessment Using the 3rd Machine Learning Technique or Method | 10.0
Comparisons between all Machine Learning Techniques or Methods | 10.0
Recommendations and Conclusions | 5.0
References | 2.5
Organisation of Report | 2.5
TOTAL | 75.0