Echocardiograms and Survival Prediction After A Heart Attack

16 minute read

png

Background:

This dataset is from the UCI database describing characteristics on an echocardiogram after someone sustained a heart attack or MI and their survival after one year. Echocardiograms, or more simply an ultrasound of the heart, have many different measurements and the purpose of this project was to see if certain measurements or characteristics were predictive of survival at 1 year post-MI.

Dataset URL:

https://archive.ics.uci.edu/ml/datasets/echocardiogram

Source Cited:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Importing Dataset and Cleaning

#Loading our Echocardiogram Dataset into a Pandas Dataframe.
df = pd.read_csv("echocardiogram.csv")

#Viewing our Dataframe
df

	survival	alive	age	pericardialeffusion	fractionalshortening	epss	lvdd	wallmotion-score	wallmotion-index	mult	name	group	aliveat1
0	11.0	0.0	71.0	0.0	0.260	9.000	4.600	14.0	1.000	1.000	name	1	0.0
1	19.0	0.0	72.0	0.0	0.380	6.000	4.100	14.0	1.700	0.588	name	1	0.0
2	16.0	0.0	55.0	0.0	0.260	4.000	3.420	14.0	1.000	1.000	name	1	0.0
3	57.0	0.0	60.0	0.0	0.253	12.062	4.603	16.0	1.450	0.788	name	1	0.0
4	19.0	1.0	57.0	0.0	0.160	22.000	5.750	18.0	2.250	0.571	name	1	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
128	7.5	1.0	64.0	0.0	0.240	12.900	4.720	12.0	1.000	0.857	name	NaN	NaN
129	41.0	0.0	64.0	0.0	0.280	5.400	5.470	11.0	1.100	0.714	name	NaN	NaN
130	36.0	0.0	69.0	0.0	0.200	7.000	5.050	14.5	1.210	0.857	name	NaN	NaN
131	22.0	0.0	57.0	0.0	0.140	16.100	4.360	15.0	1.360	0.786	name	NaN	NaN
132	20.0	0.0	62.0	0.0	0.150	0.000	4.510	15.5	1.409	0.786	name	NaN	NaN

133 rows × 13 columns

There are 133 data points and 13 variables including the target variable.

Description of the Variables in the Dataframe

Survival (in Months): Numerical

Alive: Categorical (0 or 1)

Age in Years When Heart Attack Occurred: Numerical

Pericardial Effusion: Categorical (0 or 1)

Fractional Shortening Measurement: Numerical (Measurement of Contractility of Heart, higher the better)

EPSS (E Point Septal Separation, the lower the better): Numerical

Left Ventricular Diastolic Dimension (LVDD): Numerical

Wall Motion Score (Score Of How Walls of Heart Move): Numerical (Integers)

Wall Motion Index (Score Divided By Number of Segments Seen - Usually 12-13): Numerical

Mult, Name, and Group: All are case identifiers and not pertinent to the Analysis

Aliveat1: Categorical (Whether or Not Person Was Alive At One Year – Target)

The case identifiers were dropped from the dataset. The survival and alive variables together created the values for the target variable so these are redundant. Likewise, the wall motion index was derived from wall motion score so the wall motion score was also removed.

#Data Cleaning

#Dropping the mult, name, and group variables
df.drop(['mult', 'name', 'group'], axis=1, inplace = True)

#Since Alive At 1 Year is Our Target Variable and was derived form survival and alive, I will drop the survival and alive
#columns
df.drop(['survival', 'alive'], axis = 1, inplace = True)

#Since wallmotion index is also derived from the wall-motion score, I'm also going to drop the wallmotion-score column
df.drop(['wallmotion-score'], axis = 1, inplace = True)

#Viewing Dataframe Post Cleaning
df

	age	pericardialeffusion	fractionalshortening	epss	lvdd	wallmotion-index	aliveat1
0	71.0	0.0	0.260	9.000	4.600	1.000	0.0
1	72.0	0.0	0.380	6.000	4.100	1.700	0.0
2	55.0	0.0	0.260	4.000	3.420	1.000	0.0
3	60.0	0.0	0.253	12.062	4.603	1.450	0.0
4	57.0	0.0	0.160	22.000	5.750	2.250	0.0
...	...	...	...	...	...	...	...
128	64.0	0.0	0.240	12.900	4.720	1.000	NaN
129	64.0	0.0	0.280	5.400	5.470	1.100	NaN
130	69.0	0.0	0.200	7.000	5.050	1.210	NaN
131	57.0	0.0	0.140	16.100	4.360	1.360	NaN
132	62.0	0.0	0.150	0.000	4.510	1.409	NaN

133 rows × 7 columns

There were null values in this dataset and were removed for analysis.

#Dropping null values
df.dropna(inplace=True)
df

	age	pericardialeffusion	fractionalshortening	epss	lvdd	wallmotion-index	aliveat1
0	71.0	0.0	0.260	9.000	4.600	1.00	0.0
1	72.0	0.0	0.380	6.000	4.100	1.70	0.0
2	55.0	0.0	0.260	4.000	3.420	1.00	0.0
3	60.0	0.0	0.253	12.062	4.603	1.45	0.0
4	57.0	0.0	0.160	22.000	5.750	2.25	0.0
...	...	...	...	...	...	...	...
105	63.0	0.0	0.300	6.900	3.520	1.51	1.0
106	59.0	0.0	0.170	14.300	5.490	1.50	0.0
107	57.0	0.0	0.228	9.700	4.290	1.00	0.0
109	78.0	0.0	0.230	40.000	6.230	1.40	1.0
110	62.0	0.0	0.260	7.600	4.420	1.00	1.0

62 rows × 7 columns

The pertinent categorical variables were encoded in Python as categorical variables for further analysis and fitting to the machine learning algorithms. Columns were also renamed for simplicity.

Data Exploration and Analysis

A descriptive analysis was done on both the numerical and categorical variables.

print('Description of Data')
df.describe()

Description of Data

	age	fractionalshortening	epss	lvdd	wallmotion-index
count	62.000000	62.000000	62.000000	62.000000	62.000000
mean	64.419355	0.218452	12.307387	4.817129	1.406403
std	8.639498	0.106001	7.305048	0.774996	0.445460
min	46.000000	0.010000	0.000000	3.420000	1.000000
25%	58.250000	0.150000	8.125000	4.290000	1.000000
50%	62.000000	0.218500	11.000000	4.601500	1.321500
75%	70.000000	0.267500	15.900000	5.422500	1.625000
max	86.000000	0.610000	40.000000	6.730000	3.000000

The youngest person in the analysis was 46 years old which is relatively young to have a myocardial infarction. The oldest was 86 years old and the average age was 64 years old.

The left ventricular diastolic dimension mean was 4.8 which is within normal range. The largest was 6.73 cm which is large. The minimum was 3.42 though there is less of a concern with the lower LVDD as there is with elevated ones.

The higher the level of fractional shortening the better. The minimum was very low at 0.01 and the largest was 0.61. Mean values are around 0.21.

#Checking Variable Distributions

#Setting Figure Size
plt.rcParams['figure.figsize']= (20,10)

#Initiating Our Subplots
fig, axes = plt.subplots(nrows = 2, ncols = 3, constrained_layout = True)

#Identifying Numerical Features of Interest (Age, Fractional Shortening, EPSS, LVDD, Wall-Motion Score, and Wall-Motion
#Index)
num_features = ['age', 'fractionalshortening', 'epss', 'lvdd', 'wallmotion-index']
xaxes = num_features
yaxes = ['Counts', 'Counts', 'Counts', 'Counts', 'Counts']

#Histogram Creation
axes = axes.ravel()
for idx, ax in enumerate(axes):
    ax.hist(df[num_features[idx]], bins = 100)
    ax.set_xlabel(xaxes[idx], fontsize = 20)
    ax.set_ylabel(yaxes[idx], fontsize = 20)
    ax.tick_params(axis='both', labelsize = 15)
plt.show()

png

Age at heart attack appears to be normally distributed. The variables of fractional shortening, epss, lvdd, and wallmotionindex appear to be positively skewed.

png

The majority of people in the study group did not have a pericardial effusion and were alive after their heart attack. Those with pericardial effusions could potentially be higher cardiac risk so the fact that many did not have a pericardial effusion is a more favorable finding.

The majority of people in the study were alive after their heart attack.

The numerical variables were analyzed for high levels of correlation (absolute values of >0.95) and plotted using a correlation plot with the seaborn package.

png

Based on the above correlation plot, EPSS and LVDD appear to be positively correlated as do EPSS and wallmotion-index. LVDD and wallmotion index appear to be positively correlated as well. Age and fractional shortening appear to be slightly negatively correlated.

#Checking for Highly Correlated Features For Validation
corr_matrix = df.corr()
corr_matrix

	age	fractionalshortening	epss	lvdd	wallmotion-index
age	1.000000	-0.116029	0.079404	0.199105	0.086844
fractionalshortening	-0.116029	1.000000	-0.399324	-0.369920	-0.327916
epss	0.079404	-0.399324	1.000000	0.569668	0.442790
lvdd	0.199105	-0.369920	0.569668	1.000000	0.339077
wallmotion-index	0.086844	-0.327916	0.442790	0.339077	1.000000

Given these variables, none are significantly highly correlated that could prove problems in future data analysis (correlation absolute value of 0.95 or above).

Scatterplots were completed using our predictor variables.

png

There appeared to be a positive correlation between EPSS and LVDD as well as a negative correlation between LVDD and Fractional Shortening.

Next step was to plot our two categorical variables as stacked bar charts on top of each other to look how the presence or absence of a pericardial effusion correlated with survival.

#Plotting stacked bar charts for our categorical variables and survival
%matplotlib inline
plt.rcParams['figure.figsize'] = (6,5)

#Subplots
fig, axes = plt.subplots()

#Feeding Data Into Visualizer Based on Percardial Effusion Presence
pe_survived = df.replace({'aliveat1': {1: 'Alive', 0: 'Deceased'}})[df['aliveat1']==1]['pericardialeffusion'].value_counts()
pe_deceased = df.replace({'aliveat1': {1: 'Alive', 0: 'Deceased'}})[df['aliveat1']==0]['pericardialeffusion'].value_counts()
pe_deceased = pe_deceased.reindex(index=pe_survived.index)

#Making Bar Plot
p1 = plt.bar(pe_survived.index, pe_survived.values)
p2 = plt.bar(pe_deceased.index, pe_deceased.values, bottom = pe_survived.values)
plt.xticks([0,1], ['Absent', 'Present'])
plt.title('Pericardial Effusion Presence and Survival')
plt.ylabel('Counts')
plt.tick_params(axis = 'both')
plt.legend((p1[0], p2[0]), ('Alive', 'Deceased'))

plt.show()

png

Looking at this stacked bar chart, for those who did not have a pericardial effusion, approximately 30% were alive and 70% were deceased. For those who did have a pericardial effusion, approximately 50% survived and 50% died. I suspect that there may be other variables influencing this. Pericardial effusions are an uncommon entity and usually are higher risk so it is counter-intuitive that those without a pericardial effusion would be dead at 1 year.

Dimensionality Reduction and Fitting of Machine Learning Algorithms

For eventual fitting of our machine learning algorithms, pandas was used to get dummy variables for the categorical values in the dataframe.

#Dimensionality Reduction

#Getting Dummy Variables for Our Pericardial Effusion Variable and our Alive At 1 Variable
df = pd.get_dummies(df, drop_first=True)
df

	age	fractionalshortening	epss	lvdd	wallmotion-index	pericardialeffusion_1.0	aliveat1_1.0
0	71	0.260	9.000	4.600	1.00	0	0
1	72	0.380	6.000	4.100	1.70	0	0
2	55	0.260	4.000	3.420	1.00	0	0
3	60	0.253	12.062	4.603	1.45	0	0
4	57	0.160	22.000	5.750	2.25	0	0
...	...	...	...	...	...	...	...
105	63	0.300	6.900	3.520	1.51	0	1
106	59	0.170	14.300	5.490	1.50	0	0
107	57	0.228	9.700	4.290	1.00	0	0
109	78	0.230	40.000	6.230	1.40	0	1
110	62	0.260	7.600	4.420	1.00	0	1

62 rows × 7 columns

#Using Feature Selection to Eliminate Variables That Are Not Useful

#Setting our Target: Alive at 1 year
target = df['aliveat1_1.0']

#Setting all features
features = df[['age', 'pericardialeffusion_1.0','fractionalshortening', 'epss', 'lvdd', 'wallmotion-index']]

#Loading Libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile

#Standardizing Our Numerical Features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

#Select features with 75th Percentile
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features_standardized, target)

#Show Results
print("Original Number of Features:", features.shape[1])
print("Reduced Number of Features:", features_kbest.shape[1])

#Getting the names of the columns that were kept
fvalue_selector.get_support()

Original Number of Features: 6
Reduced Number of Features: 4
array([ True, False,  True,  True, False,  True])

Based on this Boolean, it kept the variables Age, Fractional Shortening, EPSS, and Wall-Motion Index using 75th Percentile and LVDD and Pericardial Effusion should be dropped.

For fitting of our machine learning algorithms, performance will be compared between the models with all variables included and those using feature-selected variables.

Logistic Regression Feature Selection and Model Fitting

The first model will be using logistic regression and to use RFECV to recursively eliminate features of all variables including numerical and categoricals using negative mean squared error as the scoring metric to determine which features should be kept.

#Importing Our Packages
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm

#Setting Our Regression and using logistic regression since this is a binary predictor
regression = lm.LogisticRegression(random_state = 1)

#Setting Our Selector for Stratified K Fold Cross validation of 10 and Using Neg Mean Squared Error
selector = RFECV(estimator=regression, step=1, cv=StratifiedKFold(10), scoring='neg_mean_squared_error')
selector.fit(features_standardized,target)
print("Optimal Number of Features: %d" % selector.n_features_)

#Visualizing which features were kept
selector.get_support()

Optimal Number of Features: 1
array([False, False, False, False, False,  True])

Using negative mean squared error as our scoring metric with logistic regression, it appears that with this feature selection, only wall-motion index should be included.

Likewise, using accuracy as our scoring metric with logistic regression, it appears that with this round of feature selection, only wall motion index should be included. It does not appear that in this case that using negative mean squared error or accuracy would change the variables that were selected.

Next we’ll run Stratified K Fold CV using all features with logistic regression

#Using Stratified K Fold Cross Validation, We Will Run a Logistic Regression Model Using All Features

#Importing our Packages
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

#Creating Standardizer
standardizer = StandardScaler()

#Creating Logistic Regression Object
logit = LogisticRegression()

#Creating K Fold Cross Validation
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state = 1)

#Doing Training/Test Split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3, random_state = 1)

#Fitting Standardizer
standardizer.fit(features_train)

#Applying to both training and test sets
features_train_std = standardizer.transform(features_train)
features_test_std = standardizer.transform(features_test)

#Creating Pipeline
pipeline = make_pipeline(standardizer, logit)

#Do K Fold Cross-Validation
cv_results = cross_val_score(pipeline, features, target, cv = kf, scoring = "accuracy", n_jobs = -1)
cv_results

#Evaluating our Metrics of Our Logistic Regression Classifier Using ALl Features

from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC


#Confusion Matrix Visualizer
classes = ['Not Survived', 'Survived']
cm = ConfusionMatrix(logit, classes = classes, percent = False)

#Fitting the passed model
cm.fit(features_train, target_train)

#Create confusion matrix
cm.score(features_test, target_test)

#Change fontsize of labels in figure
for label in cm.ax.texts:
    label.set_size(20)

#Checking model performance?
cm.poof()

#Getting Precision, Recall, and F1 Score and ROC Curve and Setting Size of Figure/Font Size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15,7)
plt.rcParams['font.size'] = 20

#Instantiate visualizer
visualizer1 = ClassificationReport(logit, classes = classes)

#Fit training data to visualizer
visualizer1.fit(features_train, target_train)

#Evaluating model on the test data
visualizer1.score(features_test, target_test)
g = visualizer1.poof()

#ROC and AUC: Instantiating the Visualizer
visualizer2 = ROCAUC(logit)
visualizer2.fit(features_train, target_train)
visualizer2.score(features_test, target_test)
g = visualizer2.poof()

png

Here, the algorithm was better at predicting death than survival. The ROC of both classes was 0.6 and the F1 scores were 0.462 for predicting survival and 0.720 for predicting death. Next the algorithm was run using the feature selected variables.

png

Using our feature selected variables, the models performance improved to ROC of 0.83 for both classes. Further the F1 score at predicting death was 0.8 and 0.615 at predicting survival. The model still seemed to predict death more often than survival.

Random Forest Feature Selection and Model Fitting

Using the same methodology as above, I then performed feature selection using a Random Forest Classifier with the plan to evaluate the models using all variables and then the feature selected variables.

#Using Random Forest Classifier as our Model with Negative Mean Squared Error as Scoring Metric

#Importing Package
from sklearn.ensemble import RandomForestClassifier

#Setting Up Our Random Forest Classifier to Call in our RFECV Function
rfc = RandomForestClassifier(n_jobs=-1, random_state = 1)

#Setting Our Selector for Stratified K Fold Cross validation of 10 and Using Neg Mean Squared Error
selector = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(10), scoring='neg_mean_squared_error')
selector.fit(features_standardized,target)

print("Optimal Number of Features: %d" % selector.n_features_)

selector.get_support()

Optimal Number of Features: 6
array([ True,  True,  True,  True,  True,  True])

Using either accuracy or negative mean squared error as scoring, the model recommended keeping all variables.

#Using Stratified K Fold Cross Validation, We Will Run a Random Forest Classification Model Using All Features

#Creating Standardizer
standardizer = StandardScaler()

#Creating Random Forest Object
rfc = RandomForestClassifier(random_state = 111, n_jobs = -1, class_weight = "balanced")

#Creating K Fold Cross Validation
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state = 1)

#Doing Training/Test Split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3, random_state = 1)

#Fitting Standardizer
standardizer.fit(features_train)

#Applying to both training and test sets
features_train_std = standardizer.transform(features_train)
features_test_std = standardizer.transform(features_test)

#Creating Pipeline
pipeline = make_pipeline(standardizer, rfc)

#Do K Fold Cross-Validation
cv_results = cross_val_score(pipeline, features, target, cv = kf, scoring = "accuracy", n_jobs = -1)

#Confusion Matrix Visualizer
classes = ['Not Survived', 'Survived']
cm = ConfusionMatrix(rfc, classes = classes, percent = False)

#Fitting the passed model
cm.fit(features_train, target_train)

#Create confusion matrix
cm.score(features_test, target_test)

#Change fontsize of labels in figure
for label in cm.ax.texts:
    label.set_size(20)

#Checking model performance
cm.poof()

#Getting Precision, Recall, and F1 Score and ROC Curve and Setting size of figure and font size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15,7)
plt.rcParams['font.size'] = 20

#Instantiate visualizer
visualizer1 = ClassificationReport(rfc, classes = classes)

#Fit training data to visualizer
visualizer1.fit(features_train, target_train)

#Evaluating model on the test data
visualizer1.score(features_test, target_test)
g = visualizer1.poof()

#ROC and AUC and Instantiating the Visualizer
visualizer2 = ROCAUC(rfc)
visualizer2.fit(features_train, target_train)
visualizer2.score(features_test, target_test)
g = visualizer2.poof()

png

The Random Forest model was better at predicting death than survival like the logistic regression model. However, the ROC values were worse at 0.65 for both classes. The F1 score for predicting death was 0.714 while it was 0.2 (very low) for predicting survival. Since there was not much of a difference between which scoring metric used, this model was only run once using all variables.

SVM Feature Selection

Finally, using the same methodology as above, I then performed feature selection using SVM with the plan to evaluate the models using all variables and then the feature selected variables.

#Trying RFECV with SVM (Support Vector Machine) to Recursively Eliminate Features of All Variables Including Numerical and
#Categorical Variables using accuracy as my scoring metric

#Importing Our Packages
from sklearn.svm import SVC

#Setting Up SVM model
clf = SVC(kernel='linear', random_state = 1)

#Setting Our Selector for Stratified K Fold Cross validation of 10 and Using Accuracy
selector = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='accuracy')
selector.fit(features_standardized,target)

print("Optimal Number of Features: %d" % selector.n_features_)

#Visualizing which features were kept
selector.get_support()

Optimal Number of Features: 3
array([ True,  True, False, False, False,  True])

Given the results of this Boolean, it recommended keeping age, pericardial effusion, and wall motion index for our feature selected variables whether using accuracy or negative mean squared error as a metric.

#Running SVM model using all variables with accuracy as scoring metric.

#Setting Our SVM Model
clf = SVC(kernel='linear')

#Creating K Fold Cross Validation
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state = 1)

#Doing Training/Test Split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3, random_state = 1)

#Fitting Standardizer
standardizer.fit(features_train)

#Applying to both training and test sets
features_train_std = standardizer.transform(features_train)
features_test_std = standardizer.transform(features_test)

#Creating Pipeline
pipeline = make_pipeline(standardizer, clf)

#Do K Fold Cross-Validation
cv_results = cross_val_score(pipeline, features, target, cv = kf, scoring = "accuracy", n_jobs = -1)


#Confusion Matrix Visualizer
classes = ['Not Survived', 'Survived']
cm = ConfusionMatrix(clf, classes = classes, percent = False)

#Fitting the passed model
cm.fit(features_train, target_train)

#Create confusion matrix
cm.score(features_test, target_test)

#Change fontsize of labels in figure
for label in cm.ax.texts:
    label.set_size(20)

#Checking model performance
cm.poof()

#Getting Precision, Recall, and F1 Score and ROC Curve and Setting size of figure and font size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15,7)
plt.rcParams['font.size'] = 20

#Instantiate visualizer
visualizer1 = ClassificationReport(clf, classes = classes)

#Fit training data to visualizer
visualizer1.fit(features_train, target_train)

#Evaluating model on the test data
visualizer1.score(features_test, target_test)
g = visualizer1.poof()

#ROC and AUC: Instantiating the Visualizer
visualizer1 = ROCAUC(clf, micro = False, macro = False, per_class = False)
visualizer1.fit(features_train, target_train)
visualizer1.score(features_test, target_test)
g = visualizer1.poof()

png

Using this method, the F1 score was 0.640 for predicting death and 0.308 for predicting survival. The ROC was 0.59 for the binary decision. This is similar to the Random Forest Algorithm results. Then the feature selected variables were used in the model.

png

Using the feature selected variables, the ROC improved to 0.76 and the F1 score was 0.815 for predicting death and 0.545 for predicting survival.

Overall, the best performing model overall appeared to the logistic regression model.

One limitation of this dataset is the small sample size after null values were removed. Another limitation of the dataset is that the deceased proportion of individuals was nearly two times those who survived in the dataset which will skew the results. The models tend to be better here at predicting death vs. survival. This is likely explained by the fact that our dataset had 2 times the amount of people deceased vs. survived so there was a smaller amount of data to help predict the survived categories.

Three different methods were used including Logistic Regression, Random Forest Classification, and Support Vector Machine Classification. The best performing model was Logistic Regression followed closely by Support Vector Machines. The Random Forest models performed the worst. Feature selection improved the results of all of the models based on their Precision, Recall, and F1 scores.

The feature selection eliminated most variables. With more variables and data points, I suspect that this would not be the case as it would be unlikely for just one value to accurately predict death. More likely is that there is a complex interaction between various measurements that could impact survival.

To view more specifics on the coding and project, please refer to the GitHub repository.

Brandon May

Echocardiograms and Survival Prediction After A Heart Attack

Background:

Importing Dataset and Cleaning

Description of the Variables in the Dataframe

Data Exploration and Analysis

Dimensionality Reduction and Fitting of Machine Learning Algorithms

Logistic Regression Feature Selection and Model Fitting

Random Forest Feature Selection and Model Fitting

SVM Feature Selection

You may also enjoy

Lung Cancer Survival Prediction After Surgical Treatment

Heart Failure Survival Prediction Using Machine Learning

Detection of Breast Cancer In Biopsy Specimens Using Machine Learning

The Story About Airline Safety - Part 3 - Blog Post