Logistic Regression: Breast Cancer Prediction
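What is Logistic Regression?
Logistic Regression is a supervised Machine Learning algorithm used for classification: it estimates the probability that an observation belongs to a class and assigns the class accordingly. It handles two kinds of problems: binary classification (two classes) and multi-class classification (more than two classes).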
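As a minimal sketch of the idea (not code from this article's pipeline): in the binary case, logistic regression passes a weighted sum of the features through the logistic (sigmoid) function to obtain a probability, then applies a threshold. The weights, the bias and the 0.5 threshold below are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights and bias, chosen only for illustration
weights = [0.8, -0.4]
bias = -0.1
print(predict_proba([2.0, 1.0], weights, bias))  # probability of class 1
print(predict_label([2.0, 1.0], weights, bias))  # class after thresholding
```

Training a model amounts to finding the weights and bias that make these probabilities agree with the labelled examples as closely as possible.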
Examples of Logistic Regression:
- Spam email detection: To classify emails as spam or non-spam.
- Health care: To classify a tumour as either benign or malignant.
- Credit card transactions: To predict whether a transaction is fraudulent or not.
- Banking: To predict whether a customer will default on a loan or not.
The Logistic Regression algorithm is applied here to the Breast Cancer Wisconsin (Diagnostic) dataset, which originates from the University of Wisconsin and can be downloaded from either of the following links:
- https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- https://www.kaggle.com/uciml/breast-cancer-wisconsin-data?select=data.csv
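As an alternative to downloading the CSV, scikit-learn bundles a copy of this dataset. The sketch below loads it with sklearn.datasets.load_breast_cancer. Note that this copy has no id column and encodes the target the other way round (0 = malignant, 1 = benign) compared with the encoding used later in this article; the variable names X_sk and y_sk are this sketch's own.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()          # Bunch object with data, target and metadata
X_sk, y_sk = data.data, data.target  # NumPy arrays

print(X_sk.shape)                    # (569, 30): 569 observations, 30 features
print(data.target_names.tolist())    # ['malignant', 'benign'], i.e. 0 = malignant, 1 = benign
print(int((y_sk == 0).sum()), int((y_sk == 1).sum()))  # malignant and benign counts
```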
The attributes/variables/features of the dataset are as follows:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3–32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter² / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
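j) fractal dimension (“coastline approximation” - 1)
For each of these ten features, the mean, the standard error and the “worst” (largest) value are recorded, which gives the 30 feature columns with the suffixes _mean, _se and _worst.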
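In the CSV used below, each of these ten features appears three times, with the suffixes _mean, _se and _worst (see the column listing in the Exploratory Data Analysis step), which gives 30 feature columns. A small sketch that rebuilds the column names from that naming pattern:

```python
# The ten base features, spelled as they appear in the CSV column names
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes used in the CSV: mean, standard error, and "worst" value
suffixes = ["mean", "se", "worst"]

feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))   # 10 features x 3 statistics = 30
print(feature_columns[:3])    # first three column names
```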
The following steps build, train, test and assess the performance of a Logistic Regression classifier.
1) Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
Note: NumPy is a library for scientific computing, pandas is used for loading and manipulating tabular data, scikit-learn is an open-source Machine Learning library for Python that provides the algorithms and utilities used below, and Matplotlib and Seaborn are used for data visualization and plots.
2) Reading the Dataset
df = pd.read_csv('data.csv')  # read the .csv dataset
df.head(5)  # show the first five rows
3) Exploratory Data Analysis
df.columns  # displays the columns/variables/features in the dataset
Index(['id','diagnosis','radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean','compactness_mean','concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se','area_se','smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se','symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst','area_worst','smoothness_worst','compactness_worst', 'concavity_worst', 'concave points_worst','symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],dtype='object')
df.info()  # information about the columns, such as their data types and non-null counts
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id 569 non-null int64
diagnosis 569 non-null object
radius_mean 569 non-null float64
texture_mean 569 non-null float64
perimeter_mean 569 non-null float64
area_mean 569 non-null float64
smoothness_mean 569 non-null float64
compactness_mean 569 non-null float64
concavity_mean 569 non-null float64
concave points_mean 569 non-null float64
symmetry_mean 569 non-null float64
fractal_dimension_mean 569 non-null float64
radius_se 569 non-null float64
texture_se 569 non-null float64
perimeter_se 569 non-null float64
area_se 569 non-null float64
smoothness_se 569 non-null float64
compactness_se 569 non-null float64
concavity_se 569 non-null float64
concave points_se 569 non-null float64
symmetry_se 569 non-null float64
fractal_dimension_se 569 non-null float64
radius_worst 569 non-null float64
texture_worst 569 non-null float64
perimeter_worst 569 non-null float64
area_worst 569 non-null float64
smoothness_worst 569 non-null float64
compactness_worst 569 non-null float64
concavity_worst 569 non-null float64
concave points_worst 569 non-null float64
symmetry_worst 569 non-null float64
fractal_dimension_worst 569 non-null float64
Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Note: The dataset has 33 columns, including the column Unnamed: 32. That column contains only null values and can be removed.
Deleting the column Unnamed: 32
df.drop(columns=["Unnamed: 32"], inplace=True)  # drop the all-null column
df.head(3)
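The empty column was spotted above by reading the output of df.info(). A programmatic check could look like the following sketch, which uses a small toy DataFrame with made-up values rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [10.0, 12.5, 14.0],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is null
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop every all-null column in one step
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

This avoids hard-coding the column name, which helps if a differently exported copy of the file names the stray column differently.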
The target variable "diagnosis" is categorical (M: Malignant, B: Benign). It is encoded as M: 1, B: 0.
df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)  # M -> 1, B -> 0
df = df.set_index('id')  # use the id column as the row index
df.head(3)
Note: After dropping the Unnamed: 32 column and making the id column the index, the dataset has 31 columns/variables/features (including the target variable).
print(len(df.columns))  # prints the number of columns
31
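Number of Benign and Malignant observations
benign, malignant = df['diagnosis'].value_counts()  # counts come back in descending order, so benign is first
print("Number of Benign patients", benign)  # prints the number of benign cases
print("Number of Malignant patients", malignant)  # prints the number of malignant cases
plt.figure(figsize=(8, 4))
sns.countplot(x='diagnosis', data=df)  # bar chart of the class counts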
Number of Benign patients 357
Number of Malignant patients 212
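The counts above show a moderately imbalanced dataset. As a quick worked calculation, the class proportions also give the accuracy of a trivial baseline that always predicts the majority class (benign), which is a useful reference point when judging the model's accuracy:

```python
benign_count = 357      # from the output above
malignant_count = 212   # from the output above
total = benign_count + malignant_count

benign_share = benign_count / total
malignant_share = malignant_count / total

print(f"Total observations: {total}")
print(f"Benign share: {benign_share:.1%}")
print(f"Malignant share: {malignant_share:.1%}")

# A classifier that always answers "benign" would be right this often
print(f"Majority-class baseline accuracy: {benign_share:.1%}")
```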
4) Separating the Target Variable and Feature Variables into y and X
y = df['diagnosis'].values  # y contains the target variable
X = df.drop('diagnosis', axis=1).values  # X contains the feature variables
5) Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(455, 30)
(114, 30)
(455,)
(114,)
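Note: X_train and y_train are the training data; X_test and y_test are the test data. train_test_split() splits the data into training and test sets: test_size = 0.2 holds out 20% of the observations (114 of 569) for testing, and random_state fixes the shuffle so that the split is reproducible.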
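The call above splits the rows purely at random. Because the classes are imbalanced (357 benign vs. 212 malignant), one optional variation, not used in this article, is to pass stratify=y so that both sets keep roughly the same class proportions. A sketch with synthetic labels that mimic the class counts (X_demo and y_demo are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 357 benign (0) and 212 malignant (1)
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # one dummy feature per row

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

print(X_tr.shape, X_te.shape)                   # same 455 / 114 split sizes as above
print(y_demo.mean(), y_tr.mean(), y_te.mean())  # malignant share in each set
```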
6) Data Normalization/Feature Scaling
sc = StandardScaler()
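X_train_sc = sc.fit_transform(X_train)  # learn the mean and standard deviation from the training data, then scale it
X_test_sc = sc.transform(X_test)  # scale the test data with the training statistics (no refitting)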
print(X_train_sc)
[[-1.34450029 0.5346355 -1.32325205 ... -1.34334388 -0.84057964
0.48739774]
[ 3.77500369 1.58209417 3.89649166 ... 2.25906721 -0.39639035
-0.51916346]
[-0.12789731 -0.68625859 -0.17338951 ... -0.40950938 -0.11981967
-0.31830386]
...
[-0.8067862 -1.4370514 -0.81092486 ... -0.40251322 -0.24218124
-0.18251146]
[-0.92324831 -0.84828378 -0.88563924 ... -0.52175203 -0.49696151
1.38928563]
[-0.44603771 -0.06097825 -0.41313236 ... -0.32418671 -1.26800706
-0.65439007]]
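Note: The ranges of the features may vary widely, so the data is rescaled to put all features on a comparable scale. Standardization is the scaling technique used here: each feature is centered on its mean and scaled to unit standard deviation, so the values are not confined to the range 0 to 1. The scaler should be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into the preprocessing.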
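To make the transformation concrete, the following sketch reproduces standardization by hand with NumPy on a tiny made-up array: the mean and standard deviation are computed from the training rows only and then reused for the test rows. The arrays are illustrative and are not taken from the dataset.

```python
import numpy as np

# Tiny made-up data: 4 training rows and 2 test rows, 2 features each
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[2.5, 250.0], [5.0, 500.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

train_sc = (train - mean) / std
test_sc = (test - mean) / std   # the test data reuses the training statistics

print(train_sc.mean(axis=0))  # approximately 0 for every feature
print(train_sc.std(axis=0))   # approximately 1 for every feature
print(test_sc)                # standardized values are not limited to a fixed range
```

Both features end up on the same scale even though their raw values differ by a factor of 100.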
7) Building Logistic Regression
model = LogisticRegression(C=0.3)  # C is the inverse of the regularization strength
model.fit(X_train_sc, y_train)
LogisticRegression(C=0.3, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,warm_start=False)
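For context on the C parameter: with the L2 penalty shown in the output above, scikit-learn's documentation describes binary logistic regression as minimizing a cost of roughly the following form, where the labels are taken as y_i in {-1, +1}, w is the weight vector and b the intercept (the symbol names are this sketch's choice). C multiplies the data-fit term, so a smaller value such as C = 0.3 means stronger regularization.

```latex
\min_{w,\,b} \; \frac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
```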
8) Predicting
y_pred_lr = model.predict(X_test_sc)  # predicting on test data
accuracy_lr = accuracy_score(y_test, y_pred_lr)  # accuracy between actual test values and predicted values
print("Accuracy on Test Data:", accuracy_lr)  # prints the accuracy score
Accuracy on Test Data: 0.9824561403508771
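Accuracy is simply the fraction of test observations that were predicted correctly. With 114 test observations, the score printed above corresponds to 112 correct predictions and 2 errors, as this small check shows:

```python
n_test = 114      # size of the test set, from the shapes printed earlier
n_correct = 112   # number of correct predictions implied by the reported score

accuracy = n_correct / n_test
print(accuracy)             # agrees with the reported score up to floating-point rounding
print(n_test - n_correct)   # number of misclassified observations
```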
9) Confusion matrix
lr_cm = confusion_matrix(y_test, y_pred_lr)  # rows: actual classes, columns: predicted classes
lr_cm = pd.DataFrame(lr_cm, columns=['Benign', 'Malignant'], index=['Benign', 'Malignant'])
lr_cm
Note: The confusion matrix summarizes the performance of the classifier by counting the correct and incorrect predictions for each class. Rows correspond to the actual classes and columns to the predicted classes.
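To show what the matrix counts, here is a minimal pure-Python sketch that tallies the four cells for a handful of made-up labels (1 = malignant, 0 = benign). The labels are hypothetical and are not the article's test results.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix [[TN, FP], [FN, TP]] for labels 0 and 1.

    Rows are actual classes and columns are predicted classes, the same
    layout that sklearn.metrics.confusion_matrix uses.
    """
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels, for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

for row in confusion_counts(y_true_demo, y_pred_demo):
    print(row)
```

The diagonal cells hold the correct predictions; the off-diagonal cells hold the false positives (benign predicted as malignant) and false negatives (malignant predicted as benign).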
Classification report
print(classification_report(y_test, y_pred_lr))  # print the classification report
Note: The classification report displays the precision, recall and F1 score of the model for each class.
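These metrics follow directly from the confusion-matrix cells. The sketch below computes them for the malignant class from hypothetical counts (not the article's results), using the standard definitions:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score for the positive class."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the malignant class, for illustration only
tp, fp, fn = 40, 1, 2

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In a medical setting, recall for the malignant class is often the figure to watch, since a false negative means a missed cancer.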
The Logistic Regression model distinguished malignant from benign tumours with an accuracy of about 98.2% on the test data.