Predicting Heart Disease Using Machine Learning
Introduction
Machine learning is a way of extracting implicit, previously unknown, and potentially valuable information from data sets [1]. It is a vast and diverse field that is increasingly being applied in practice. It spans supervised, unsupervised, and ensemble learning classifiers, which can be used to make predictions on a dataset and to measure how accurate those predictions are. One application of machine learning is predicting heart disease from an individual’s medical history. Heart disease has become very prevalent in the modern world [1]. It is therefore essential that individuals at risk of cardiovascular disease are diagnosed early, so that they can take precautions in time to prevent its adverse effects. To this end, machine learning algorithms applied to attributes related to heart disease allow large, complex medical datasets to be analyzed, which in turn helps healthcare professionals predict heart disease.
The National Heart, Lung, and Blood Institute defines heart disease as a catch-all phrase for various conditions that affect the structure and function of the heart. Coronary heart disease, the leading cause of death in the United States, is a type of heart disease that develops when the arteries of the heart cannot deliver enough oxygen-rich blood to the heart [2]. The World Health Organization reports that cardiovascular diseases are the number one cause of death globally, taking an estimated 17.9 million lives each year [2]. This research uses various Python-based machine learning and data science libraries to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes. To this end, the research follows a step-by-step approach comprising:
● Problem definition
● Data
● Assessment
● Features
● Modeling
● Experimentation
1. Problem Definition
The major problem in managing heart disease is the detection stage [4]. The instruments currently available for predicting heart disease are either very costly or not efficient enough at estimating an individual’s chance of developing the disease. Early detection of cardiac disease could reduce mortality rates and the complications associated with the illness. However, it is not possible to monitor every patient accurately on a daily basis, and round-the-clock consultation by a health professional is impractical given the patience, expertise, and time required. Because extensive data are now available, machine learning algorithms can be used to analyze the hidden patterns in them. It is from these hidden patterns within a patient’s medical information that a diagnosis of heart disease can be made.
Stated in one sentence, the problem to be explored is:
Given clinical parameters about a patient, can we predict whether or not they have heart disease?
2. Data
Data collection was the first step. A dataset was collected for the heart disease prediction system and then split into training data and testing data. The training data are used to fit the prediction model, and the testing data are used to evaluate it.
The original data come from the Cleveland database in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/heart+Disease. A similar version of the data is also available on Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci. Records of about 300 patients from Cleveland, Ohio, USA were available. The independent variables (or features) are a patient’s medical attributes, while the dependent variable (or label) is whether or not they have heart disease.
3. Assessment
At the start of the project, the study set a rough objective for the work. The assessment metric adopted was:
If the model can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, the project will be pursued.
Because the work is experimental, this assessment metric may change over time.
4. Features
Selecting features, or attributes, is a necessary step in building the prediction system, because a well-chosen set of features increases the system’s efficiency. The features considered for assessing whether a patient has heart disease include the patient’s age, sex, chest pain type, resting blood pressure, serum cholesterol, and fasting blood sugar, among others. A correlation matrix is used to select the features for the model.
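The correlation-based selection described above could be sketched as follows. This is only an illustration under stated assumptions: the label column is assumed to be named target, and the handful of rows are made-up stand-ins rather than the study's data.

```python
import pandas as pd

# Illustrative stand-in rows (NOT the study's data); "target" is an assumed label name.
df = pd.DataFrame({
    "age":     [63, 37, 41, 56, 57, 57, 56, 44],
    "thalach": [150, 187, 172, 178, 163, 148, 153, 173],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0],
    "target":  [1, 1, 1, 1, 0, 0, 0, 0],
})

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = df.corr()

# Rank the features by the absolute strength of their correlation with the label.
ranking = (
    corr_matrix["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Features that correlate strongly with each other but weakly with the label are the usual candidates for removal.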
Data dictionary
1. age – age in years
2. sex – (1 = male; 0 = female)
3. cp – chest pain type
● 0: Typical angina: chest pain related to decreased blood supply to the heart
● 1: Atypical angina: chest pain not related to the heart
● 2: Non-anginal pain: typically esophageal spasms (non-heart related)
● 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps – resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern
5. chol – serum cholesterol in mg/dl
● serum = LDL + HDL + .2 * triglycerides
● above 200 is cause for concern
6. fbs – (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
● ‘>126’ mg/dL signals diabetes
7. restecg – resting electrocardiographic results
● 0: Nothing to note
● 1: ST-T Wave abnormality
★ can range from mild symptoms to severe problems
★ signals non-normal heart beat
● 2: Possible or definite left ventricular hypertrophy
★ Enlarged heart’s main pumping chamber
8. thalach – maximum heart rate achieved
9. exang – exercise induced angina (1 = yes; 0 = no)
10. oldpeak – ST depression induced by exercise relative to rest; looks at stress of the heart during exercise (an unhealthy heart will stress more)
11. slope – the slope of the peak exercise ST segment
● 0: Upsloping: better heart rate with exercise (uncommon)
● 1: Flat Sloping: minimal change (typical healthy heart)
● 2: Downsloping: signs of unhealthy heart
12. ca – number of major vessels (0-3) colored by fluoroscopy
● colored vessel means the doctor can see the blood passing through
● the more blood movement the better (no clots)
13. thal – thallium stress result
● 1,3: normal
● 6: fixed defect: used to be defect but ok now
● 7: reversible defect: no proper blood movement when exercising
5. Preparing the Tools
The study used pandas, Matplotlib and NumPy for data analysis and manipulation.
Load Data
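A minimal sketch of the loading step is given here, assuming the data are stored as a CSV file whose columns follow the data dictionary plus a label column named target. The helper load_heart_data and the three sample rows are hypothetical and exist only so the example runs without the real file.

```python
import io

import pandas as pd

# Stand-in for the real file: three illustrative rows in the assumed column layout.
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
"""


def load_heart_data(source):
    """Read the heart disease table from a file path or a file-like object."""
    return pd.read_csv(source)


df = load_heart_data(io.StringIO(SAMPLE_CSV))
print(df.shape)         # (rows, columns)
print(df.isna().sum())  # missing values per column
```

With the real data, the same helper would be called with the path to the downloaded CSV file.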
6. Data Exploration
The goal at this point was to find out more about the data and become a subject matter expert on the dataset being used. The questions considered here included:
1. What question(s) is the study trying to solve?
2. What kind of data does the study have and how does the study treat different types?
3. What’s missing from the data and how does one deal with it?
4. Where are the outliers and why should the study care about them?
5. How can the study add, change or remove features to get more out of the data?
Heart Disease Frequency Per Chest Pain Type
Heart Disease Frequency according to Sex
Age vs. Max Heart Rate for Heart Disease
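The frequency comparisons named above could be produced with cross-tabulations along the following lines. This is a sketch only: the rows are illustrative stand-ins, and the label column name target is an assumption.

```python
import pandas as pd

# Illustrative stand-in rows (NOT the study's data).
df = pd.DataFrame({
    "sex":    [1, 1, 0, 1, 0, 1, 0, 1],
    "cp":     [3, 2, 1, 0, 0, 0, 2, 1],
    "target": [1, 1, 1, 0, 0, 0, 1, 1],
})

# How balanced are the two classes?
class_counts = df["target"].value_counts()

# Heart disease frequency broken down by sex and by chest pain type.
by_sex = pd.crosstab(df["target"], df["sex"])
by_cp = pd.crosstab(df["cp"], df["target"])

print(class_counts)
print(by_sex)
print(by_cp)
```

Calling .plot(kind="bar") on either cross-tabulation would give a bar chart of the same counts.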
7. Modeling
For modeling, the data were separated into a feature matrix X and a label vector y, each of which was then split into a training set and a test set. Thereafter, the process of building a machine learning model was started. Three different machine learning models were used:
Logistic Regression
K-Nearest Neighbors (KNN) Classifier
Random Forest Classifier
The logistic regression model had an accuracy of 0.8852, KNN 0.6885, and random forest 0.8361. The graph below compares the accuracy values of the three models.
Figure 1: Accuracy values of the three models
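The baseline comparison could be sketched as below. Synthetic data from make_classification stand in for the patient records, so the printed scores will not match those reported above. The 80/20 split, the random seeds, and the helper name fit_and_score are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 300 rows and 13 features, roughly the shape of the real table.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training set and return its test-set accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
print(model_scores)
```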
With the three models fitted, a baseline for the modeling process was established. The next step was to improve on it. First, the KNN model was tuned by hand: it was refitted over a list of values for n_neighbors, and the train and test scores were recorded for each value. The maximum KNN score on the test data was 75.41%. The graph below shows the outcome of the tuning.
Figure 2: Hyperparameter tuning
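Hand tuning of KNN could look like the loop below. The range of 1 to 20 neighbors is an assumption, and synthetic data are used in place of the real dataset, so the best score will differ from the one reported.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the patient records.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

neighbors = range(1, 21)  # candidate values for n_neighbors (assumed range)
train_scores, test_scores = [], []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

# Pick the value of k with the highest test accuracy.
best_k = neighbors[test_scores.index(max(test_scores))]
print(f"Best n_neighbors: {best_k}, test accuracy: {max(test_scores):.4f}")
```

Plotting train_scores and test_scores against neighbors gives the kind of tuning curve referred to in Figure 2.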
Hyperparameter tuning with RandomizedSearchCV was then carried out on the logistic regression and random forest classifier models. The resulting score was 0.8689.
The same process of hyperparameter tuning was repeated, this time with GridSearchCV. The resulting score was 0.8852.
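The two search strategies could be sketched for logistic regression as follows. The search space (20 values of C on a log scale), the 5-fold cross-validation, and n_iter=10 are assumptions, and synthetic data replace the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    train_test_split,
)

# Synthetic stand-in for the patient records.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed search space: regularization strength C on a log scale.
log_reg_grid = {"C": np.logspace(-4, 4, 20)}

# Randomized search samples n_iter candidates from the grid.
rs = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=log_reg_grid,
    cv=5,
    n_iter=10,
    random_state=42,
)
rs.fit(X_train, y_train)

# Grid search tries every candidate in the grid.
gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=log_reg_grid, cv=5)
gs.fit(X_train, y_train)

print("Randomized:", rs.best_params_, rs.score(X_test, y_test))
print("Grid:      ", gs.best_params_, gs.score(X_test, y_test))
```

Randomized search is cheaper because it evaluates only a sample of the candidates, while grid search is exhaustive over the listed values.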
The tuned classifier was then evaluated beyond accuracy. To do this, predictions were made first. The graph below shows the ROC curve, together with the AUC metric calculated from it.
Figure 3: The ROC curve
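The points of a ROC curve and the AUC could be computed as below. This is a sketch on synthetic data with an untuned logistic regression, so the values are not the study's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient records.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each test row.
y_probs = clf.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
auc = roc_auc_score(y_test, y_probs)
print(f"AUC: {auc:.3f}")
```

Plotting tpr against fpr gives the ROC curve; an AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.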
From the same data, the confusion matrix was developed as shown below.
Figure 4: The confusion matrix
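How a confusion matrix is read can be shown on a small hand-made example. The ten labels below are invented for illustration and are not the study's predictions.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 1 = heart disease, 0 = no heart disease.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

In this toy case there is one false positive (a healthy patient flagged as ill) and one false negative (an ill patient missed).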
With these results in hand, the accuracy, precision, recall, and f1-score of the model were calculated using cross-validation with cross_val_score(). Alongside a classification report, this gives cross-validated values for precision, recall, and f1-score. The graph below shows the output of the cross-validation analysis.
Figure 5: Cross validation output
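Cross-validated metrics could be collected with cross_val_score() as sketched here. The 5 folds and the untuned logistic regression are assumptions, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the patient records.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

clf = LogisticRegression(max_iter=1000)

# Mean score over 5 folds for each metric.
cv_metrics = {
    metric: cross_val_score(clf, X, y, cv=5, scoring=metric).mean()
    for metric in ["accuracy", "precision", "recall", "f1"]
}

for name, value in cv_metrics.items():
    print(f"{name}: {value:.3f}")
```

Averaging over folds makes each metric less dependent on one particular train/test split.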
Feature importance asks which features contributed most to the outcomes of the model and how they contributed. It is computed differently for each machine learning model; one way to find the appropriate method is to search for “(MODEL NAME) feature importance”. For the logistic regression model, the importance of each feature is shown below.
Figure 6: Feature importance
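For logistic regression, one common choice is to read feature importance from the fitted coefficients. The sketch below assumes that approach and uses synthetic data with placeholder feature names, so the values say nothing about the real attributes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the patient records.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Placeholder names; the real columns would be age, sex, cp, and so on.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per feature; the sign gives the direction of the effect.
importance = dict(zip(feature_names, clf.coef_[0]))

# Sort by absolute size so the most influential features come first.
for name, coef in sorted(importance.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {coef:+.3f}")
```

Coefficients are only comparable across features when the features are on similar scales, so standardizing the inputs first is often advisable.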
8. Conclusion
Heart disease remains a major killer worldwide, and applying promising technology such as machine learning to its early prediction could have a profound impact on society. An early diagnosis helps high-risk patients decide on appropriate lifestyle changes and so reduces complications, which would be a great milestone in the field of medicine. Proper technological support in this regard can be highly beneficial to both the medical community and patients.
In this research, various Python-based machine learning and data science libraries were used to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes. The attributes expected to lead to heart disease were available in the dataset, which contains numerous features fundamental to the system’s assessment. It was evident that a subset of features needs to be selected for the model to be properly evaluated and for greater accuracy to be attained. Some features in the dataset had nearly equivalent correlations and were therefore eliminated. If all the attributes in the dataset are taken into account, efficiency decreases considerably. Finally, assessment metrics such as the confusion matrix, accuracy, precision, recall, and f1-score were used to evaluate how well the model predicted the disease.
REFERENCES
[1] J. Soni, U. Ansari, D. Sharma and S. Soni, “Predictive data mining for medical diagnosis: an overview of heart disease prediction”, International Journal of Computer Applications, vol. 17, no. 8, pp. 43-48, 2011.
[2] C. S. Dangare and S. S. Apte, “Improved study of heart disease prediction systems using data mining classification techniques”, International Journal of Computer Applications, vol. 47, no. 10, pp. 44-48, 2012.
[3] A. Priya, “Predicting heart disease using hybrid machine learning model”, Turkish Journal of Computer and Mathematics Education, vol. 12, no. 13, 2021.
[4] R. Jain, H. Jindal, S. Agrawal, R. Khera and P. Nagrath, “Heart Disease Prediction Using Various Algorithms of Machine Learning”, IOP Conf. Series: Materials Science and Engineering, 2021. Available: doi:10.1088/1757-899X/1022/1/012072.