Machine Learning Models Comparative Analysis

by datatabloid_difmmk

This article data science blogthon.

The term “machine learning” was coined by IBM’s Arthur Samuel. Machine learning is part of artificial intelligence. Machine learning is pro earnings.

  1. supervised learning
  2. unsupervised learning
  3. semi-supervised learning
  4. reinforcement learning.

Machine learning uses mathematical and statistical methods such as OCR, spam detection, etc.

Machine learning application

sauce: fs.ac.in/blog/

In this post, we will use the Book-My-Show dataset and apply three machine learning models to analyze which model is a good fit for this dataset.

affirmation of the problem

Book-My-Show has identified an issue that needs resolution. By allowing and displaying advertisements on its website, Book-My-Show raises concerns about users’ privacy and the information they access. The displayed advertisements may contain links intended to trick certain users into installing malicious programs, locking their computers, or exposing the user’s privacy. , can be devastating. Book-My-Show wants to analyze certain URLs to determine if they are vulnerable to phishing and fix the problem.

To tackle the Book-My-Show problem, the model works best with Logistic Regression, KNeighbors, and XGB.

About Book-My-Show datasets

11k samples correspond to 11k URLs in the input dataset. Each sample has a different description of the URL with different values. If the URL falls within the -1 (suspicious), 0 (phishing), or 1 (legitimate) range, the URL may be a legitimate or phishing link.

Implementation

The practical ones are implemented in Kaggle. You can access the datasets and notebooks from the following links. The book-my-show notebook link is my kaggle.com account.

Dataset link: https://www.kaggle.com/datasets/shibumohapatra/book-my-show

Notebook link: https://www.kaggle.com/code/shibumohapatra/logical-regression-kneighbors-xgb

Module and library information

  1. Numpy: linear algebra

  2. Pandas: data processing like pd.read_csv
  3. Matplotlib: plotting graphs
  4. Seaborn: Visualization of random distributions and heatmaps
  5. xgboost: a library that imports the xgbclassifier module
  6. Sklearn: Machine learning tools and statistical modeling. Sklearn has many modules used for machine learning and they are mentioned below –
    • Sklearn.model_selection: Split the data into random subsets of training and test sets.
    • Cross_val_score: Evaluate model performance by handling result set distribution problems. Evaluate the data and return a score.
    • KFold: Split the data into K folds to avoid overfitting.
    • GridSearchCV: cross-validation to extract optimal parameter values ​​for making predictions
    • StratifiedKFold: for stratified sampling (split into subgroups by gender, race, etc.)
    • Sklearn.linear_model: Import a machine learning model
    • Sklearn.neighbors: imports the KNeighbors Classifier module
    • Sklearn.metrics: for determining predictions and accuracy scores, confusion matrices, ROC curves, AUC
import numpy as np
import pandas as ad
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier,cv
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Exploratory Data Analysis (EDA)

You should explore your data using heat maps and histograms. The number of samples in the data and unique elements in all features are determined. I need to check if any of the columns have null values.

df_data = df_data.drop('index',1)
data set
df_data.nunique()
data set
#check for NULL value in the dataset

print("The null values in the dataset are:",df_data.isnull().sum().sum())
data set
# NULL value check
df_data.info()
data set
# Duplicate f_data.T
print("The duplicate values in dataset is:",df1.duplicated().sum())

Plot Histograms for Data Exploration

df_data.hist(bins=50, figsize=(20,15))
plt.show()
Distribution of data

The distribution of the data is shown in the histogram above. -1, 0, and 1 are values.

pd.value_counts(df_data['Result']).plot.bar()
Distribution of legitimate (1) and phishing (0) URLs

The distribution of legitimate (1) and phishing (0) URLs is shown in the chart above.

Correlating Features and Feature Selection

I need to find out if there are any correlated features in the data and remove those features that may be correlated with the threshold.

#correlation map
f,ax=plt.subplots(figsize=(18,18))
sns.heatmap(df_data.corr(),annot=
data set

The above heatmap output ‘popUpWindow’ and ‘Favicon’ have a strong correlation of 0.9. It suggests data redundancy. So we need to remove one of the columns.

There is a correlation between ‘popUpWindow’ and ‘Favicon’ in the heatmap output. There are 9. It suggests that there are multiple data sources. I need to remove one of the columns. I need to find a highly correlated independent variable and drop the column. The threshold was set at 0.75. Columns with correlation values ​​greater than 0.75 are removed.

cor_matrix = df_data.corr().abs()
upper_tri=cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(np.bool))
# threshold greater than 0.75
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.75)]
print(to_drop)
#drop the columns which are highly correlated
df_data.drop(to_drop,axis=1,inplace=True)
df_data.shape
X=df_data.drop(columns="Result")
X
Y=df_data['Result']
Y= pd.DataFrame(Y)
Y.head()

Here we split the data into a training set and a test set.

# split train - test to 70-30
train_X,test_X,train_Y,test_Y = train_test_split(X,Y,test_size=0.3, random_state=9)
print(train_X.shape)
print(train_Y.shape)
print(test_X.shape)
print(test_Y.shape)

classification model

With a clear understanding of your data, you can build a classification model. Classification models were first developed to identify malicious URLs.

#model build for different binary classification and show confusion matrix
def build_model(model_name,train_X, train_Y, test_X, test_Y):
   if model_name == 'LogisticRegression':
       model=LogisticRegression()
   elif model_name =='KNeighborsClassifier':
       model = KNeighborsClassifier(n_neighbors=4)
   elif model_name == 'XGBClassifier':
       model = XGBClassifier(objective="binary:logistic",eval_metric="auc")
   else:
       print('not a valid model name')
   model=model.fit(train_X,train_Y)
   pred_prob=model.predict_proba(test_X)
   fpr, tpr, thresh = roc_curve(test_Y, pred_prob[:,1], pos_label=1)
   model_predict= model.predict(test_X)
   acc=accuracy_score(model_predict,test_Y)
   print("Accuracy: ",acc)
   # Classification report
   print("Classification Report: ")
   print(classification_report(model_predict,test_Y))
   #print("Confusion Matrix for", model_name)
   con =  confusion_matrix(model_predict,test_Y)
   sns.heatmap(con,annot=True, fmt=".2f")
   plt.suptitle('Confusion Matrix for '+model_name, x=0.44, y=1.0, ha="center", fontsize=25)
   plt.xlabel('Predict Values', fontsize =25)
   plt.ylabel('Test Values', fontsize =25)
   plt.show()
   return model, acc, fpr, tpr, thresh

After building all the models, use heatmaps to see the confusion matrix of each model and its performance.

Model 1 – Logistic Regression

Logistic regression is a supervised method. Logistic regression is used to calculate or predict events. There are two conditions, yes or no. A classic example is if a person has a cold. Patients are either infectious or not. That’s the result. There are three types of logistic regression and one type. A binomial expression has two outcomes, 0 or 1, and either wins or loses. 2. A polynomial has more than one result. Categories of the target variable, such as test scores, are ordered. Scores are 0, 1, 2, and so on.

logistic regression

sauce: logistic regression

# Model 1 - LogisticRegression
lg_model,acc1, fpr1, tpr1, thresh1 = build_model('LogisticRegression',train_X, train_Y, test_X, test_Y.values.ravel())
Accuracy and Confusion Matrix for Logistic Regression

sauce: logistic regression

Model 2 – KNeighbors classifier

This is a supervised method typically used for classification and regression problems. Used for resampling. New data points predict classes or continuous values. KNeighbors are lazy machine learning models.

KNeighbors classifier

Source: KNeighbors

# Model 2 - KNeighborsClassifier
knn_model,acc2, fpr2, tpr2, thresh2 = build_model('KNeighborsClassifier',train_X, train_Y, test_X, test_Y.values.ravel())
Confusion matrix for KNeighbors Classifier

Source: KNeighbors

Model 3 – XGB classifier

It is a type of machine learning that uses distributed decision trees in ascending order. XGB offers parallel tree boosting. Boosting techniques are used in XGB to build powerful classifiers.

# Model 3 - XGBClassifier
xgb_model, acc3, fpr3, tpr3, thresh3 = build_model('XGBClassifier',train_X, train_Y, test_X, test_Y.values.ravel())
Confusion Matrix for XGBClassifier | Machine Learning Model

The confusion matrix above shows that the XGB classification has the highest accuracy.

To demonstrate the diagnostic power of this classifier, we need to plot the ROC curve.

# roc curve for tpr = fpr
random_probs = [0 for i in range(len(test_Y))]
p_fpr, p_tpr, _ = roc_curve(test_Y, random_probs, pos_label=1)

Plot ROC curve

plt.style.use('seaborn')
# plot roc curves
plt.plot(fpr1, tpr1, linestyle="--",color="orange", label="Logistic Regression")
plt.plot(fpr2, tpr2, linestyle="--",color="green", label="KNN")
plt.plot(fpr3, tpr3, linestyle="--",color="red", label="XGBClassifier")
plt.plot(p_fpr, p_tpr, linestyle="--", color="blue")
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')
plt.legend(loc="best")
plt.savefig('ROC',dpi=300)
plt.show()
ROC curve | machine learning model

Shown in the ROC plot is the XGBClassifier. Higher (TPR) true positive rates compared to other models.

K-fold cross-validation should be used to verify the accuracy of the plotted data. We use GridSearchCV to find the best parameters for different models and the StratifiedKFold cross-validation technique to find the accuracy of the data.

liblinear and newton-CG solvers and l1, l2 penalties for logistic regression model evaluation

For logistic regression, 91% 4 StratifiedKFold Accuracy in folds.

import warnings
warnings.filterwarnings("ignore")
# Create the parameter grid based on the results of random search
param_grid = {
   'solver':['liblinear','newton-cg'],
   'C': [0.01,0.1,1,10,100],
   'penalty': ["l1","l2"]
}
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = LogisticRegression() , param_grid = param_grid,
cv = StratifiedKFold(4), n_jobs = -1, verbose = 1, scoring = 'accuracy' )
grid_search.fit(train_X,train_Y.values.ravel())
print('Best Parameter:')
print('F1 Score:', grid_search.best_score_)
print('Best Hyperparameters:', grid_search.best_params_)
print('Model object with best parameters:')
print(grid_search.best_estimator_)
machine learning model

GridSearchCV method for KNeighborsClassifier model evaluation

For the KNeighbors classifier, 95% 3 Accuracy in GridSearchCV fold.

grid_params = {
   'n_neighbors':[3,4,11,19],
   'weights':['uniform','distance'],
   'metric':['euclidean','manhattan']
}
gs= GridSearchCV(
KNeighborsClassifier(),
grid_params,
verbose=1,
cv=3,
n_jobs=-1
)
gs_results = gs.fit(train_X,train_Y.values.ravel())
print('Best Parameter:')
print('F1 Score:', gs_results.best_score_)
print('Best Hyperparameters:', gs_results.best_params_)
print('Model object with best parameters:')
print(gs_results.best_estimator_)
machine learning model

KFold cross-validation method for XGBClassifier model

For the XGB Classifier, 96% 10x more accurate.

xgb_cv = XGBClassifier(n_estimators=100,objective="binary:logistic",eval_metric="auc")
scores = cross_val_score(xgb_cv, train_X, train_Y.values.ravel(), cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())
machine learning model

model comparison

The final output of the model will have maximum accuracy on the validation dataset with the selected attributes. The output of the code below shows that XGBoost is the winner and a good model for Ads URL analysis.

results=pd.DataFrame({'Model':['LogisticRegression','KNeighbors','XGB'],
                    'Accuracy Score':[acc1,acc2,acc3]})
result_df=results.sort_values(by='Accuracy Score', ascending=False)
result_df=result_df.set_index('Model')
result_df
machine learning model

Comparison table

Conclusion

The implementation above helps identify a suitable machine learning model for the ad URL. The underlying structure of the data was analyzed with exploratory data analysis. You’ve used histograms and heat maps to understand the features of your dataset. We collected accuracy metrics for each model and created a comparison table showing the overall accuracy of each model. The XGBoost model is the winner of the three machine learning models with 96% accuracy. KNeighbors earned his 94% score. Logistic regression was 90%.

Important points –

  • The KNeighbors model was slower than the other two because it only outputs labels.
  • Logistic regression was not flexible enough to capture more complex relationships.
  • The other two models were slower than the XGB model.
  • XGB has the ability to handle both continuous and categorical data.
  • The above implementation makes the XGBoost model suitable for the Book-My-Show problem.

We hope the above implementation will help you in choosing a machine learning model.

Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.

You may also like

Leave a Comment

About Us

We’re a provider of Data IT News and we focus to provide best Data IT News and Tutorials for all its users, we are free and provide tutorials for free. We promise to tell you what’s new in the parts of modern life Data professional and we will share lessons to improve knowledge in data science and data analysis field.