Description of the three models
- Binary Logistic regression
Binary logistic regression is an extension of simple linear regression in which the dependent variable is dichotomous (binary). When the dependent variable is binary, simple linear regression is not appropriate, so logistic regression is used instead. Like simple linear regression, logistic regression is a statistical technique used to model the association between predictor variables and an outcome variable, where the outcome is binary: for instance, sex (male vs. female), a yes vs. no response, or a high vs. low score. Another consideration before conducting a binary logistic regression is that the model can include more than two independent variables. The sample size should also be adequate to support the accuracy of the model. This is a supervised machine learning technique, so the data can be partitioned into training and testing sets: the training data is used to fit the model, and the testing data is used to estimate the model's accuracy (Ozdemir, 2011).
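The train/test workflow described above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data, since the report's email dataset is not available here; the feature counts and split ratio are assumptions.

```python
# Sketch of binary logistic regression with a train/test partition.
# Synthetic data stands in for the binary (malware / not malware) outcome.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical stand-in dataset: 500 samples, 4 features, binary labels.
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Partition into training and testing data, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the model on the training data only.
model = LogisticRegression().fit(X_train, y_train)

# Model accuracy is then estimated on the held-out testing data.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
```

The key point is that accuracy is always reported on data the model never saw during fitting.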
Lasso logistic regression
LASSO stands for Least Absolute Shrinkage and Selection Operator; it too is an extension of linear regression. It addresses a problem that can occur in ordinary regression: some variables contribute little or nothing to the model, yet remain in it. LASSO handles this by performing both shrinkage and variable selection. The penalty strength is controlled by a tuning parameter lambda: the larger the value of lambda, the more coefficients are set to zero. The coefficients of variables that contribute little are forced to exactly zero, and only the most significant variables are kept in the final model. Like binary logistic regression, this is a supervised machine learning model in which the data is partitioned into training and testing sets; the model is fitted on the training data and its accuracy is estimated on the testing data (Lu et al., 2011).
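The shrinkage effect of lambda can be demonstrated directly. A minimal sketch with scikit-learn, on synthetic data with deliberately uninformative features: note that scikit-learn parameterises the penalty as C = 1/lambda, so a smaller C corresponds to a larger lambda and more coefficients shrunk to exactly zero.

```python
# L1-penalised (LASSO-style) logistic regression: stronger penalty
# (larger lambda, smaller C) forces more coefficients to exactly zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with many uninformative features (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

weak_penalty = LogisticRegression(penalty="l1", solver="liblinear",
                                  C=10.0).fit(X, y)    # small lambda
strong_penalty = LogisticRegression(penalty="l1", solver="liblinear",
                                    C=0.05).fit(X, y)  # large lambda

# Count coefficients driven to exactly zero under each penalty.
zeros_weak = int(np.sum(weak_penalty.coef_ == 0))
zeros_strong = int(np.sum(strong_penalty.coef_ == 0))
```

With the stronger penalty, most of the noise features drop out of the model entirely, which is the variable-selection behaviour described above.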
Random Forest
Random forest is a supervised machine learning technique that can be used for both classification and regression, though it is more commonly applied to classification problems. The 'forest' is an ensemble of decision trees, usually trained with the bagging method, which combines many learning models to increase the accuracy and stability of the final model. Random forest adds extra randomness while growing the trees: when splitting a node, it searches for the best feature among a random subset of the features instead of considering all features. This generally results in a better model. Many trees are built using random thresholds for each feature rather than searching for the single best split, and the final prediction is made by taking the average (for regression) or the mode (for classification) of the outputs of the individual trees (Cootes et al., 2012). The greater the number of trees, the higher the computational cost, and, generally, the more accurate the model and its output will be.
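The ensemble behaviour described above can be sketched as follows, again on synthetic data (dataset size, tree count, and split are illustrative assumptions). Each tree is trained on a bootstrap sample (bagging), each split considers only a random subset of the features, and the final prediction is the majority vote across trees.

```python
# Sketch of a random forest classifier with bagging and random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

forest = RandomForestClassifier(
    n_estimators=150,     # number of trees; more trees cost more but stabilise the vote
    max_features="sqrt",  # random subset of features considered at each split
    oob_score=True,       # out-of-bag accuracy estimated from the bagging samples
    random_state=1,
).fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, forest.predict(X_test))
oob_accuracy = forest.oob_score_  # OOB estimate, analogous to the report's OOB error
```

The out-of-bag (OOB) estimate is a by-product of bagging: each tree is evaluated on the samples left out of its bootstrap draw, giving an internal accuracy estimate without a separate validation set.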
- Reporting the models
From the binary logistic regression model, we obtained the following equation:
isMalware = -0.549 - 5.261e-04 (totalEmailSizeBytes) + 1.304e+02 (hasExeYes) + 5.169 (hasURLYes) - 1.19 (urlCountYes)
These coefficients are on the log-odds scale; an odds ratio is obtained by exponentiating a coefficient, so a negative coefficient corresponds to an odds ratio below 1. The negative coefficient for total email size in bytes (-5.261e-04) means that an increase in email size reduces the odds of the email being malware. The coefficient for hasExe (130.4) indicates that emails containing an executable have much higher odds of being malware. Emails that contain a URL also have higher odds of being malware (coefficient = 5.169), as do emails with the unknown attribute (coefficient = 4.544). The negative coefficient for urlCount (-1.19) means that emails with a higher URL count have lower odds of being malware.
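The conversion from log-odds coefficients to odds ratios is a one-line exponentiation. A minimal sketch using the coefficients from the fitted equation above:

```python
# Converting the reported log-odds coefficients to odds ratios via exp(beta).
# A negative coefficient gives an odds ratio below 1 (reduced odds of malware);
# a positive coefficient gives an odds ratio above 1 (increased odds).
import math

# Coefficients taken from the fitted equation in the report.
coefficients = {
    "totalEmailSizeBytes": -5.261e-04,
    "hasExeYes": 1.304e+02,
    "hasURLYes": 5.169,
    "urlCountYes": -1.19,
}

odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
```

Note that exp(130.4) is an astronomically large odds ratio; a coefficient of that magnitude is often a sign of quasi-complete separation (the hasExe indicator may perfectly predict one class in the training data), which is worth checking before interpreting it literally.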
The results from the random forest showed that the confusion matrix had class errors of 3.9 % and 15.3 % for the absence and presence of malware, respectively. The OOB error estimate was 9.88 %.
The optimization shows that the first lambda (lambda = 0) had an accuracy of 89.5 % and a Kappa of 79.11 %. The second lambda (lambda = 0.25) had an accuracy of 87.3 % and a Kappa of 74.86 %. The third, fourth, and fifth lambdas each had an accuracy of 52.5 % and a Kappa of 0 %.
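An accuracy of 52.5 % together with a Kappa of 0 % is the signature of a degenerate model that always predicts the majority class: its accuracy equals the majority-class frequency, while Kappa, which corrects for chance agreement, falls to zero. A minimal illustration on synthetic labels (the 21/40 split is an assumption chosen to reproduce the 52.5 % figure):

```python
# Why 52.5 % accuracy can coincide with a Kappa of 0:
# a model that always predicts the majority class scores the
# majority-class frequency on accuracy but 0 on chance-corrected Kappa.
from sklearn.metrics import cohen_kappa_score

y_true = [0] * 21 + [1] * 19   # 52.5 % of the 40 labels are class 0
y_pred = [0] * 40              # degenerate model: always predict class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa_score(y_true, y_pred)
```

This suggests that at the larger lambda values the LASSO penalty has shrunk every coefficient to zero, leaving only the intercept.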
- ii) Confusion matrix
Binary Logistic Regression
The following results were obtained after tuning the binary logistic regression by using the top five most important variables.
The confusion matrix shows that 34 samples were correctly predicted not to contain malware, five malware samples were wrongly predicted not to contain malware, and 37 samples were correctly predicted to contain malware. The accuracy of the optimized model was 93.42 %, the sensitivity was 100 %, and the specificity was 88.1 %.
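These metrics follow directly from the confusion-matrix counts. A minimal check, assuming (as the caret package does by default) that the "no malware" class is treated as the positive class; the report does not state this explicitly:

```python
# Recomputing accuracy, sensitivity, and specificity from the reported counts.
# Positive class assumed to be "not malware" (caret's default: first factor level).
tp = 34  # not-malware emails correctly predicted as not malware
fn = 0   # not-malware emails wrongly predicted as malware
fp = 5   # malware emails wrongly predicted as not malware
tn = 37  # malware emails correctly predicted as malware

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (34 + 37) / 76, about 0.9342
sensitivity = tp / (tp + fn)                # 34 / 34 = 1.0
specificity = tn / (tn + fp)                # 37 / 42, about 0.881
```

The recomputed values match the reported 93.42 %, 100 %, and 88.1 %, confirming the counts and metrics are internally consistent.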
Random forest
The following output was obtained after tuning the parameters of the random forest model with 150 trees and by improving the accuracy of the model at a 95 % confidence level.
The optimized random forest produced the same confusion matrix as the optimized binary logistic regression: 34 samples were correctly predicted not to contain malware, five malware samples were wrongly predicted not to contain malware, and 37 samples were correctly predicted to contain malware. The accuracy of the optimized model was 93.42 %, the sensitivity was 100 %, and the specificity was 88.1 %.
Lasso Logistic regression
The following output was obtained after tuning the parameters of the lasso regression with a lambda sequence of 0.001 and 10-fold cross-validation (k = 10).
The confusion matrix above shows that, of the 34 emails that were not malware, 32 were accurately predicted and two were wrongly predicted. Of the 42 emails that were malware, 37 were correctly predicted and five were improperly predicted. The accuracy of the model was 90.79 %, the sensitivity was 94.12 %, and the specificity was 88.1 %.
ii)
Binary logistic regression
The confusion matrix above shows that, of the 42,568 emails that were not malware, 42,517 were accurately predicted and 51 were wrongly predicted. Of the 7,432 emails that were malware, 6,392 were correctly predicted and 1,040 were improperly predicted. The accuracy of the model was 97.82 %, the sensitivity was 99.88 %, and the specificity was 86.1 %.
Random forest
The confusion matrix above shows that, of the 42,568 emails that were not malware, 42,354 were accurately predicted and 214 were wrongly predicted. Of the 7,432 emails that were malware, 6,410 were correctly predicted and 1,022 were improperly predicted. The accuracy of the model was 97.53 %, the sensitivity was 99.50 %, and the specificity was 86.25 %.
Lasso Regression
The confusion matrix above shows that, of the 42,568 emails that were not malware, 40,710 were accurately predicted and 1,858 were wrongly predicted. Of the 7,432 emails that were malware, 6,401 were correctly predicted and 1,031 were improperly predicted. The accuracy of the model was 94.22 %, the sensitivity was 95.64 %, and the specificity was 86.13 %.
Conclusion
Binary logistic regression was the best of the three models: it had the highest accuracy when applied to the real-world data, at about 97.82 %, while random forest had an accuracy of 97.53 % and lasso regression had an accuracy of 94.22 %. It also had the highest sensitivity, at 99.88 %. Thus, it is the chosen model.
References
Cootes, T. F., Ionita, M. C., Lindner, C., & Sauer, P. (2012, October). Robust and accurate shape model fitting using random forest regression voting. In European Conference on Computer Vision (pp. 278-291). Springer, Berlin, Heidelberg.
Lu, Y., Zhou, Y., Qu, W., Deng, M., & Zhang, C. (2011). A Lasso regression model for the construction of microRNA-target regulatory networks. Bioinformatics, 27(17), 2406-2413.
Ozdemir, A. (2011). Using a binary logistic regression method and GIS for evaluating and mapping the groundwater spring potential in the Sultan Mountains (Aksehir, Turkey). Journal of Hydrology, 405(1-2), 123-136.