Logistic Regression — Breast Cancer Prediction

6 min read · Mar 3, 2021

What is Logistic Regression?

Logistic Regression is a supervised Machine Learning algorithm used for classification: it predicts which class an observation belongs to. It comes in two forms: binary classification (two classes) and multi-class classification (more than two classes).
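
In the binary case, the model computes a weighted sum of the features and passes it through the sigmoid function to obtain a probability, which is then thresholded (usually at 0.5) to pick a class. The sketch below illustrates this for a single observation. The weights, bias, and feature values are made up for illustration and are not taken from the model fitted later in this article.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and feature values (illustration only)
weights = [1.2, -0.4]
bias = -0.3
features = [0.8, 0.5]

p = predict_proba(features, weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(f"P(class 1) = {p:.3f}, predicted label = {label}")
```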

Example applications of Logistic Regression:

Spam email detection: To classify emails into spam or non-spam.

Health care: To classify a tumour as either Benign or Malignant.

Credit card transactions: To predict whether a transaction is fraudulent or not.

Banking: To predict whether a customer will default on a loan or not.

In this article, Logistic Regression is applied to the Breast Cancer Wisconsin (Diagnostic) dataset, collected at the University of Wisconsin Hospitals, which can be downloaded from the following links:

The attributes/variables/features of the dataset are as below:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3–32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter² / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
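j) fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” (largest) value of each of these ten features are recorded for every image, which gives the 30 feature columns.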
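
For readers who cannot obtain the CSV file, scikit-learn bundles a copy of the Wisconsin diagnostic data. The sketch below loads it and checks its shape and class counts. Note that this is an alternative source, not the file used in this article, and that its label encoding is reversed (0 = malignant, 1 = benign).

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_builtin, y_builtin = data.data, data.target

print(X_builtin.shape)           # (569, 30)
print(data.target_names)         # class names, in label order
print(data.feature_names[:3])    # first few feature names

# Caution: in this copy 0 = malignant and 1 = benign,
# the reverse of the M:1, B:0 encoding used in this article.
n_malignant = int((y_builtin == 0).sum())
n_benign = int((y_builtin == 1).sum())
print(n_benign, n_malignant)
```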

The following steps cover building, training, testing, and assessing the performance of a Logistic Regression classifier.

1) Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

Note: NumPy is a library for scientific computing, pandas provides the DataFrame used to load and clean the data, scikit-learn is an open-source Python library that contains Machine Learning algorithms, and Matplotlib and Seaborn are used for data visualization and plots.

2) Reading the Dataset

df = pd.read_csv('data.csv')  # read the .csv dataset
df.head(5)  # show the first five rows

3) Exploratory Data Analysis

df.columns  # displays the columns/variables/features in the dataset

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

df.info()  # information about the columns, such as data types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Note: The dataset has 33 columns. The last one, Unnamed: 32, contains only null values and can be removed.

Deleting the Unnamed: 32 column

df.drop(columns=['Unnamed: 32'], inplace=True)  # drop the all-null column
df.head(3)
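
When the name of the empty column is not known in advance, entirely empty columns can also be detected and dropped generically. The sketch below shows one way to do this on a small made-up frame. The values and the `toy` name are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop every column that is entirely NaN
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```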

The target variable “diagnosis” is categorical (M: Malignant, B: Benign). It is encoded as M: 1, B: 0.

df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)  # M -> 1, B -> 0
df = df.set_index('id')  # use the id column as the index label
df.head(3)

Note: After dropping the Unnamed: 32 column and making the id column the index label, the dataset has 31 columns (30 features plus the target variable).

print(len(df.columns))  # prints the number of columns

Number of Benign and Malignant observations

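benign, malignant = df['diagnosis'].value_counts()  # counts are sorted in descending order, so Benign (the majority) comes first
print("Number of Benign patients", benign)  # prints the number of Benign cases
print("Number of Malignant patients", malignant)  # prints the number of Malignant cases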

plt.figure(figsize=(8, 4))
sns.countplot(x=df['diagnosis'])  # bar chart of the class counts

Number of Benign patients 357
Number of Malignant patients 212
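
These counts show that the classes are moderately imbalanced, which matters when reading an accuracy score later. As a rough point of comparison, the arithmetic below works out the class shares and the accuracy of a trivial classifier that always predicts Benign. It uses only the counts printed above.

```python
# Class counts reported above
n_benign = 357
n_malignant = 212
n_total = n_benign + n_malignant

share_benign = n_benign / n_total
share_malignant = n_malignant / n_total

# A classifier that always answers "benign" is right for every benign case
baseline_accuracy = share_benign

print(f"Benign: {share_benign:.1%}, Malignant: {share_malignant:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any useful model should score well above this baseline.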

4) Separating Target and Feature Variables into y and X

y = df['diagnosis'].values  # y contains the target variable
X = df.drop('diagnosis', axis=1).values  # X contains the feature variables

5) Splitting data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(455, 30)
(114, 30)
(455,)
(114,)

Note: X_train and y_train form the training data; X_test and y_test form the test data. train_test_split() splits the data into these two sets; with test_size = 0.2, 20% of the rows (114 of 569) are held out for testing.
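
With imbalanced classes, a plain random split can leave the test set with a somewhat different class mix from the full data. One optional refinement, not used in this article, is the `stratify` argument of train_test_split(), which keeps the class proportions nearly equal in both sets. The sketch below demonstrates it on synthetic labels that only mimic the class counts above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the article's class counts:
# 357 benign (0) and 212 malignant (1)
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

print("Malignant share overall:", round(y_demo.mean(), 3))
print("Malignant share in train:", round(y_tr.mean(), 3))
print("Malignant share in test:", round(y_te.mean(), 3))
```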

6) Data Normalization/Feature Scaling

sc = StandardScaler()
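X_train_sc = sc.fit_transform(X_train)  # learn the mean and standard deviation from the training data
X_test_sc = sc.transform(X_test)  # apply the training statistics to the test data; do not refit on it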
print(X_train_sc)

[[-1.34450029  0.5346355  -1.32325205 ... -1.34334388 -0.84057964
   0.48739774]
 [ 3.77500369  1.58209417  3.89649166 ...  2.25906721 -0.39639035
  -0.51916346]
 [-0.12789731 -0.68625859 -0.17338951 ... -0.40950938 -0.11981967
  -0.31830386]
 ...
 [-0.8067862  -1.4370514  -0.81092486 ... -0.40251322 -0.24218124
  -0.18251146]
 [-0.92324831 -0.84828378 -0.88563924 ... -0.52175203 -0.49696151
   1.38928563]
 [-0.44603771 -0.06097825 -0.41313236 ... -0.32418671 -1.26800706
  -0.65439007]]

Note: The ranges of the features may vary widely, so the data is transformed to put all features on a comparable scale. Standardization, the technique used here, centers each feature around a mean of zero with unit standard deviation. (Min-max normalization, by contrast, rescales each feature to the range 0 to 1.)
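
To show what StandardScaler computes, the sketch below standardizes made-up data with two features on very different scales and compares the result with the manual formula z = (x - mean) / std. The scaler is fitted on the training portion only and then reused on the test portion, so both are scaled with the same statistics. The arrays and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: two features on very different scales
X_tr_demo = rng.normal(loc=[15.0, 700.0], scale=[3.0, 300.0], size=(200, 2))
X_te_demo = rng.normal(loc=[15.0, 700.0], scale=[3.0, 300.0], size=(50, 2))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_demo)  # learn mean/std on the training set
X_te_scaled = scaler.transform(X_te_demo)      # reuse them on the test set

# Equivalent manual computation with the training statistics
manual = (X_te_demo - X_tr_demo.mean(axis=0)) / X_tr_demo.std(axis=0)

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print(np.allclose(X_te_scaled, manual))
```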

7) Building the Logistic Regression Model

model = LogisticRegression(C = 0.3)
model.fit(X_train_sc, y_train)

LogisticRegression(C=0.3, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
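
The parameter C is the inverse of the regularization strength: smaller values of C penalize large coefficients more heavily. The sketch below illustrates the effect by fitting the model with three values of C and comparing the size of the coefficient vector. It is an assumption-laden illustration: it uses scikit-learn's bundled copy of the data rather than data.csv, skips the train/test split, and raises max_iter so the solver has room to converge.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_all_sc = StandardScaler().fit_transform(X_all)  # scaling only; no split in this sketch

coef_norms = {}
for C in (0.01, 0.3, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X_all_sc, y_all)
    # Euclidean length of the fitted coefficient vector
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: ||w|| = {coef_norms[C]:.2f}")
```

Stronger regularization (small C) shrinks the coefficients, which usually makes the model less prone to overfitting at the cost of some flexibility.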

8) Predicting

y_pred_lr = model.predict(X_test_sc)  # predicting on test data
accuracy_lr = accuracy_score(y_test, y_pred_lr)  # accuracy between actual test values and predicted values
print("Accuracy on Test Data:", accuracy_lr)  # prints the accuracy score

Accuracy on Test Data: 0.9824561403508771
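
Besides hard class labels, a fitted LogisticRegression can return class probabilities through predict_proba(). The sketch below, on synthetic data generated only to illustrate the API, shows that predict() matches thresholding the positive-class probability at 0.5, and that a lower threshold flags more cases as positive. In a medical setting a lower threshold may be preferred to reduce missed malignant cases, but the value 0.3 here is an arbitrary example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to illustrate the API
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(C=0.3).fit(X_syn, y_syn)

proba = clf.predict_proba(X_syn)  # columns follow clf.classes_, here [0, 1]
p_positive = proba[:, 1]          # P(class 1) for each row

# predict() is equivalent to thresholding P(class 1) at 0.5
manual_pred = (p_positive > 0.5).astype(int)
print(np.array_equal(manual_pred, clf.predict(X_syn)))

# A lower threshold flags more cases as positive
flagged_at_03 = int((p_positive > 0.3).sum())
flagged_at_05 = int((p_positive > 0.5).sum())
print(flagged_at_03, flagged_at_05)
```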

9) Confusion matrix

lr_cm = confusion_matrix(y_test, y_pred_lr)  # rows: actual class, columns: predicted class
lr_cm = pd.DataFrame(lr_cm, columns=['Benign', 'Malignant'], index=['Benign', 'Malignant'])
lr_cm

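Note: The confusion matrix summarizes the classifier's performance. Rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions and the off-diagonal cells hold the errors.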
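
For a binary problem, the four cells of the confusion matrix are the true negatives, false positives, false negatives, and true positives. The sketch below shows how to read them off with ravel() on a small set of made-up labels (0 = benign, 1 = malignant). These labels are not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = benign, 1 = malignant
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)  # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()           # flatten row by row

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```

In a cancer screening context the false negatives (malignant tumours predicted as benign) are usually the most costly errors.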

10) Classification report

print(classification_report(y_test, y_pred_lr)) # print the classification report

Note: The classification report displays the precision, recall, and F1 score for each class.
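
These three metrics follow directly from the confusion-matrix counts. The arithmetic below works them out for the positive (Malignant) class using hypothetical counts, chosen only to make the formulas concrete. They are not the results of the model above.

```python
# Hypothetical counts for the malignant (positive) class
tp, fp, fn = 40, 2, 5

precision = tp / (tp + fp)  # of the tumours predicted malignant, how many really are
recall = tp / (tp + fn)     # of the truly malignant tumours, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```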

The Logistic Regression model distinguished Malignant from Benign tumours with an accuracy of 98.25% on the test data.