DATA MINING BASED ENHANCED FEATURE SELECTION TECHNIQUE FOR THE PERFORMANCE PREDICTION OF SKIN DISEASE
Abstract
Skin disease is a common problem worldwide, and data mining techniques are increasingly used to diagnose infections and predict skin disease. A technology-based data mining system can support the right decision and lead to precise, cost-effective treatment. The UCI skin disease dataset contains 34 attributes, but not all of them are equally important for prediction. This study therefore analyzes only the essential attributes, because they give the best accuracy in skin disease prediction. To select them, we propose a novel hybrid technique built from three feature selection methods: Chi-Square, Information Gain, and Principal Component Analysis (PCA). The hybrid technique combines their results and selects a better data subset for the skin disease problem. Six base learners, namely Gaussian Naive Bayesian (GNB), K-Nearest Neighbour (KNN), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Multilayer Perceptron (MLP), are used to measure prediction performance. The ensemble techniques Bagging, Boosting, and Stacking are applied to the base learners to improve the proposed model. The proposed hybrid feature selection technique produces a reduced data subset, which is smaller than the input dataset, and the base learners' performance is calculated on it. The base learners' parameters are essential to the accuracy of skin disease prediction. The output of the hybrid feature selection technique and the base learners' results are passed to the ensemble techniques (Boosting, Bagging, and Stacking) to improve prediction performance. The final results are compared with every base learner and show an improvement over other existing skin disease prediction methods.
Key Words: Skin disease problem, NB, KNN, DT, SVM, RF, MLP, Ensemble techniques.
I. Introduction
Skin diseases are a widespread problem worldwide, and their prediction calls for a quick response and identification at an early stage of the disease. Erythematous-squamous diseases (ESDs) are a common group of skin diseases. They comprise six classes: C1, Psoriasis; C2, Seborrheic Dermatitis; C3, Lichen Planus; C4, Pityriasis Rosea; C5, Chronic Dermatitis; and C6, Pityriasis Rubra. Diagnosis is complicated because the six classes share nearly identical clinical characteristics with very few differences. A biopsy, however, helps with skin disease treatment. In the past decade, skin disease prediction has been developed to support decisions in different fields, since medical data are readily available through the Internet.
Machine learning (ML) and deep learning (DL) algorithms, applied to input data sets drawn from patient histories, have therefore brought significant improvements in the prediction of several diseases. Data mining applications built on medical decision support systems benefit patients who cannot afford high-cost diagnostic tests.
Feature selection methods help to remove unwanted data set attributes, since not every attribute of a disease data set plays an essential role in the prediction outcome. A new hybrid feature selection technique can draw not only on general feature selection methods but also on those defined within Machine Learning (ML) approaches, as shown in (Verma, Pal, & Kumar, 2020). Data mining assists doctors in reaching a better diagnosis. Early skin disease prediction relied on the decision tree approach, and DT enabled several studies to concentrate on the six primary skin diseases. Researchers used statistical methods such as Decision Tree (DT), Random Forest (RF), Chi-square Automatic Interaction Detector (CHAID), and several tree-based classification methods to develop better predictive models (Magesh & Swarnalatha, 2020). The input data set is taken from the UCI Machine Learning Repository to estimate the accuracy of skin disease prediction. Prediction from skin disease images should likewise be accurate. Applying a Deep Convolutional Neural Network (CNN) changes the quality of computer-aided systems, as shown in (Das, Naik, & Behera, 2020). Computer-aided prediction is a recent development, so some issues remain in attaining the best outcome. Currently, Machine Learning (ML) algorithms and data mining techniques are used in layers such as the Convolution Layer, Activation Layer, Pooling Layer, Fully Connected Layer, and Soft-Max Classifier. Images from the DermNet database are used to validate the architecture and to gather the clinical evidence needed to attain the best outcome. Convolutional Neural Network (CNN) algorithms have been used to classify images of common skin diseases and obtained enhanced performance (Shanthi, Sabeenian, & Anand, 2020).
Several classification methods, including ANN, KNN, and SVM, have been used to diagnose skin disease and developed to achieve the best accuracy, as shown in (Kadampur & Al Riyaee, 2020). Models were developed using popular soft computing techniques, namely Artificial Neural Network, Support Vector Machine, deep learning, or a combination of these techniques. These approaches were applied to the multi-class skin disease data set, and comparative inferences were drawn from metrics such as RMSE, Kappa statistic, accuracy, sensitivity, and F-score. The Extreme Learning Machine (ELM) and SVM have also been compared for the identification of erythemato-squamous skin diseases through various experiments. The results of these experiments show that ELM performs better than SVM. In that study, the classifier's performance depended on how the training and testing data were changed.
Ensemble methods combine several machine learning classifiers to improve on the results obtained by a single classifier. There are multiple types of ensemble methods; Bagging, Adaptive Boosting, Gradient Tree Boosting, Stacking, and Bucket of Models are popular ones. Ensemble methods allow better predictions than a single model, and several articles use them to improve the accuracy of various base classifiers. By combining a new hybrid feature selection technique with ensemble techniques, skin disease can be identified accurately at an early stage. The purpose of this paper is to improve the accuracy of skin disease prediction.
Paper Organization
Section II reviews the literature, Section III states the skin disease prediction problem, Section IV presents and explains the proposed methodology, Section V gives the results and discussion, and Section VI concludes the paper.
II. Related works
(A. K. Sinha & Namdev, 2020) proposed a rough set method for feature selection and pattern recognition across various skin diseases, using knowledge-based and data-intensive computer solutions. (da Fonseca et al., 2020) proposed a thermogram-based imaging technique, with images taken at air temperatures ranging from 24 to 30 °C. The minimum and maximum infrared skin temperature (IST) were used to find the stress target conditions, with uncertainties considered in the mathematical modeling, and the attribute-based analysis was classified using a data mining method. (A. Sinha, Sahoo, Rautaray, & Pandey, 2020) addressed early-stage breast cancer prediction, which increases the chances of successful treatment thanks to enhanced diagnostic methods such as ultrasound, ductogram, diagnostic mammogram, and MRI scans. They predicted the prognosis of breast cancer, namely the survival rate of women, through data mining classification methods such as NB, KNN, SVM, and DT. (Yadav & Pal) proposed bagging-ensemble-based thyroid cancer prediction on a thyroid disease dataset, using classification methods such as random forest, classification and regression tree (CART), and decision tree to enhance the prediction. (Zhang, 2020) proposed improved ECG signal processing algorithms and applications to eliminate the discomfort and inconvenience of traditional ECG configurations, i.e., the 12-lead ECG placement methods. (Verma, Pal, & Kumar, 2019) proposed machine learning algorithms to separate the classes of skin disease using ensemble techniques with a hybrid feature selection method. Six data mining classification approaches were combined into an ensemble approach using Bagging, AdaBoost, and Gradient Boosting classifiers to predict skin disease.
(Verma & Pal, 2019) proposed a Stacking ensemble technique for measuring skin disease prediction performance. The enhanced method finds the optimal feature subset for erythemato-squamous disease using correlation and heat map feature selection techniques. (Rajagopal, Murugan, Kottursamy, & Raju, 2019) proposed K-means classification and Bayesian prediction based data mining techniques to determine lymphatic filariasis grade levels and to predict the curable lymphatic filariasis rate for control measures.
(Alonso et al., 2019) studied how drugs for skin diseases can reach the site of action at effective concentration levels, covering the design of molecules with optimized properties for skin penetration and the assays needed for their characterization. (Patrick et al., 2019) proposed a drug candidate prediction in which the candidates were grouped among the genes reported in psoriatic lesional skin from a large-scale RNA sequencing cohort. Although the algorithm cannot determine clinical efficacy, it identifies drugs for repurposing to immune-mediated cutaneous diseases. (Foulkes et al., 2019) used RNA sequencing to profile the mRNA and small RNA transcriptome in blood and in lesional and nonlesional skin, and the SOMAscan procedure to analyze the serum proteome. The treatment response involved genes and pathways associated with TNF signaling, psoriasis pathology, and the major histocompatibility complex region. (Erster et al., 2019) studied lumpy skin disease virus (LSDV) isolates from various geographic areas and developed assays to distinguish virulent Israeli viruses from the Neethling vaccine virus (NVV) in endemic areas where NVV-based vaccines are used. (Chen et al., 2020) proposed a medical AI framework for skin disease based on data width evolution and self-learning. The skin disease medical service meets the requirements of real-time operation, extendibility, and individualization, and adopts an auto-classification system to improve the accuracy of skin disease classification. (Nakai et al., 2019) examined the feasibility of the skin blotting technique for prognosis prediction of Category I PUs, which is necessary to provide successful intensive care for PUs with impaired healing. The study covered long-term-care and general hospitals and examined the applicability of DESIGN-R and thermography. (Jha, Pan, Elahi, & Patel, 2019) noted that healthcare data analysis is a challenging and crucial research issue for the development of robust disease diagnosis and prediction systems. Healthcare datasets related to thyroid, cancer, skin disease, heart disease, hepatitis, lymphography, audiology, diabetes, surgery, arrhythmia, post survival, liver, and tumor have been used in the performance assessment of the classification methods. (van Waateringe et al., 2019) showed that the measurement of skin autofluorescence can predict the four-year risk of incident type 2 diabetes, cardiovascular disease (CVD), and mortality in the general population. (Dehkordi & Sajedi, 2019) noted that disease prediction through data mining techniques has developed in recent years, including generalization, characterization, classification, clustering, association mining, pattern matching, data visualization, and meta-rule-guided mining. The knowledge discovered by data mining approaches can be applied in various sectors, such as the healthcare industry.
(Ardestani & Mokhtari, 2020) applied maximum entropy (MaxEnt) ecological niche modeling to new Lumpy Skin Disease (LSD) outbreaks, using grid maps of grazing and zero-grazing areas to explore the environmental influences on LSD at a resolution of 1 km. (Correia et al., 2020) proposed a deep neural network (DNN) to assess systemic sclerosis (SSc) skin. In SSc, collagen deposition and remodeling of the dermis are measured in clinical trials with the modified Rodnan skin score (mRSS).
III. Problem Statement
Skin disease diagnosis often fails to give an accurate prediction at an early stage of the disease. A prediction method should identify the disease in all cases, yet in practice it can lead to misidentification.
IV. Proposed Methodology
The schematic diagram of the proposed skin disease prediction method is shown in Figure 1.
Figure 1 Schematic diagram of the proposed methodology
The hybrid feature selection method processes the skin disease input dataset, which consists of 34 attributes and is taken from the UCI Machine Learning Repository. Three feature selection techniques, namely Chi-square, Information Gain, and Principal Component Analysis, are applied and then combined into a hybrid feature selection technique that chooses 10 attributes, giving a new reduced data subset of the skin disease input data set. Six base learner classifiers, namely Naïve Bayesian (NB), K-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Multilayer Perceptron (MLP), are applied to predict skin disease. Three ensemble methods, Bagging, Boosting, and Stacking, take the results of the Machine Learning Classifiers (MLC) to improve on the base learners' performance.
4.1 Data set analysis
The input data set for skin disease prediction is collected. It has 12 clinical features and 22 histopathological features, which are listed below.
The range of features is defined as
Family history (f11) = [
Other attributes = [
In clinical attributes, age is a nominal data set attribute.
Clinical features
f1: Erythema
f2: Scaling
f3: Definite borders
f4: Itching
f5: Koebner phenomenon
f6: Polygonal papules
f7: Follicular papules
f8: Oral mucosal
f9: Knee and elbow
f10: Scalp involvement
f11: Family history
f34: Age
Histopathological features
f12: Melanin incontinence
f13: Eosinophils in the infiltrate
f14: PNL infiltrate
f15: Fibrosis of the papillary dermis
f16: Exocytosis
f17: Acanthosis
f18: Hyperkeratosis
f19: Parakeratosis
f20: Clubbing of the rete ridges involvement
f21: Elongation of the rete ridges
f22: Thinning of the suprapapillary epidermis
f23: Spongiform pustule
f24: Munro microabscess
f25: Focal hypergranulosis
f26: Disappearance of the granular layer
f27: Vacuolization and damage of basal layer
f28: Spongiosis
f29: Saw-tooth appearance of rete ridges
f30: Follicular horn plug
f31: Perifollicular parakeratosis
f32: Inflammatory mononuclear infiltrate
f33: Band-like infiltrate
4.2 Pre-processing
The pre-processing stage covers data cleaning and data transformation: the acquired dataset is cleaned, unwanted noise in the skin disease data is removed, and the result is transformed into a fresh data set.
4.3 Feature selection technique
Feature selection is essential and also saves time. To predict the outcome, the attributes of the dataset are ranked according to their importance. Feature selection is used for the following reasons:
- To reduce the training time of the classifiers
- To reduce the complexity of the developed model
- To reduce over-fitting
- To improve the prediction results by removing unwanted attributes from the data set.
The feature selection methods are as follows:
Chi-square feature selection
The Chi-square method is the first feature selection method; it measures the relationship between each input attribute and the target attribute. It is a statistical test of the independence of two events: here, it tests whether the occurrence of a particular attribute value and the occurrence of a particular class are independent. The Chi-square value is calculated as
$\chi_c^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$ (1)
Where,
c is the degree of freedom, $O_i$ is the observed value, and $E_i$ is the expected value.
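As a minimal sketch of Equation (1), the statistic for one categorical attribute against the class label could be computed as follows. The function name `chi_square` and the toy data are illustrative assumptions, not taken from the study's implementation.

```python
from collections import Counter

def chi_square(feature_values, class_labels):
    """Pearson chi-square statistic between one categorical feature and the class."""
    n = len(feature_values)
    observed = Counter(zip(feature_values, class_labels))  # O: joint counts
    feature_totals = Counter(feature_values)
    class_totals = Counter(class_labels)
    statistic = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in class_totals.items():
            expected = f_count * c_count / n  # E: count expected under independence
            statistic += (observed[(f, c)] - expected) ** 2 / expected
    return statistic

# Hypothetical toy data: a feature graded 0/1 and two disease classes.
feature = [0, 0, 1, 1, 1, 0, 1, 0]
labels = ["C1", "C1", "C3", "C3", "C3", "C1", "C1", "C3"]
print(chi_square(feature, labels))
```

A larger value indicates a stronger dependence between the attribute and the class, which is why the attribute would be ranked higher.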
Information gain
The importance of a feature is the mean improvement it gives over the splits of the individual trees. It specifies how much the score (the 'impurity' in decision tree notation) improves when the tree is split on that attribute.
The common impurities are Entropy and Gini Impurity. The improvement in Gini Impurity is referred to as Gini importance, while the improvement in Entropy is referred to as Information Gain.
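Gini impurity = $1 - \sum_{i=1}^{k} p_i^2$ (2)
Entropy = $-\sum_{i=1}^{k} p_i \log_2 p_i$ (3)
where $p_i$ is the proportion of samples belonging to class $i$ at a node and $k$ is the number of classes.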
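The two impurity measures, and the Information Gain obtained by splitting on one attribute, can be sketched as below. This assumes the standard definitions (class proportions at a node, base-2 logarithm); the helper names are hypothetical.

```python
import math
from collections import Counter

def class_probabilities(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    # 1 minus the sum of squared class proportions
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    # negative sum of p * log2(p) over the classes present
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy labels: two classes, evenly split.
toy_labels = ["C1", "C1", "C2", "C2"]
print(gini_impurity(toy_labels), entropy(toy_labels))
```

An attribute that separates the classes perfectly recovers the full entropy as gain, while an uninformative attribute gives a gain of zero.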
Principal Component Analysis (PCA)
Principal Component Analysis performs Dimensionality Reduction (DR) by keeping the features that carry the most information in the data set. Essential features are chosen according to the variance they explain. The feature with the highest variance is referred to as the first Principal Component, the feature with the second-highest variance as the second Principal Component, and so on. The Principal Components are uncorrelated with each other.
Hybrid feature selection technique
A new hybrid technique is used to select the essential features. Chi-square, Information Gain, and Principal Component Analysis can each select essential features, but none is complete on its own because they perform differently on different types of data sets. These approaches are therefore combined into a new hybrid feature selection technique that works better under all three conditions. The algorithm for the combined feature selection technique is as follows:
Step 1: Normalize the Chi-square test values by finding the largest Chi-square value and dividing the remaining values by it.
Step 2: All three feature selection techniques now give values in the range 0-1. Sort the values of each technique in ascending order.
Step 3: Combine the values using a merge sort.
Step 4: Choose the specified number of features n from the merged values.
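One possible reading of Steps 1 to 4 is sketched below. Several details are assumptions rather than the authors' implementation: the scores are sorted largest first so that the strongest features are taken in Step 4, a feature is credited to the method that ranks it highest, and the PCA score per attribute is a hypothetical value already scaled to the range 0-1.

```python
import heapq

def normalise_by_max(scores):
    # Step 1: divide every Chi-square value by the largest one.
    largest = max(scores.values())
    return {name: value / largest for name, value in scores.items()}

def hybrid_select(chi2, info_gain, pca, n):
    ranked_lists = []
    for method, scores in (("Chi-square", normalise_by_max(chi2)),
                           ("Information gain", info_gain),
                           ("PCA", pca)):
        # Step 2: sort each method's scores (largest first in this sketch).
        ranked = sorted(((score, name, method) for name, score in scores.items()),
                        reverse=True)
        ranked_lists.append(ranked)
    # Step 3: merge the three sorted lists into one sorted stream.
    merged = heapq.merge(*ranked_lists, reverse=True)
    # Step 4: keep the first n distinct features and the method that ranked them.
    selected = {}
    for score, name, method in merged:
        if len(selected) == n:
            break
        if name not in selected:
            selected[name] = method
    return selected

# Hypothetical scores for three attributes.
chi2 = {"f27": 40.0, "f5": 10.0, "f33": 30.0}
info_gain = {"f27": 0.2, "f5": 0.9, "f33": 0.1}
pca = {"f27": 0.3, "f5": 0.1, "f33": 0.5}
print(hybrid_select(chi2, info_gain, pca, 2))
```

Recording the originating method alongside each feature mirrors the "Taken from" column used when the selected attributes are reported.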
4.4 Machine Learning Classifier
Six classifiers are used for skin disease prediction. They are chosen to allow both homogeneous and heterogeneous combinations, because different types of ensemble method are used.
- Naïve Bayesian classifier
Gaussian Naive Bayesian handles the continuous values of each attribute by assuming that they follow a Gaussian distribution, also referred to as the Normal Distribution. The Gaussian distribution is a bell-shaped curve that is symmetric about the mean of the feature values, and the likelihood is calculated as
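$p(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$ (4)
where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ within class $y$.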
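Equation (4) translates directly into code. The sketch below assumes the usual Gaussian density with a per-class mean and variance; the function name is illustrative.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """p(x_i | y) for a feature assumed to be Gaussian within class y."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

# The density peaks at the class mean and is symmetric around it.
print(gaussian_likelihood(0.0, 0.0, 1.0))
```

In a Naive Bayes classifier these per-feature likelihoods are multiplied together with the class prior, and the class with the largest product is predicted.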
- K-Nearest Neighbor Classifier
The KNN classifier can be used for both classification and regression, but it is usually applied to classification problems. KNN is a non-parametric algorithm: it makes no assumption about the underlying data distribution, i.e., the model structure is determined from the attributes of the data set. KNN is therefore useful for prediction when the data do not obey theoretical mathematical assumptions. KNN needs no training phase, so it is referred to as a lazy learning algorithm; all the training data are used in the testing phase.
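The lazy-learning behaviour described above can be illustrated with a short sketch: nothing is fitted in advance, and each query is answered by scanning the stored training points. Euclidean distance and majority voting are assumed here, and the data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Distance from the query to every stored training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature data with two well-separated classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
classes = ["C1", "C1", "C1", "C3", "C3", "C3"]
print(knn_predict(points, classes, (0.5, 0.5)))
```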
- Decision tree classifier
The Decision Tree classifier is one of the most widely accepted methods for classification and prediction. DT is a tree-based structure in which every interior node describes a test on a feature, every branch describes a test result, and every terminal node holds a class label. Decision trees can generate understandable rules quickly. A decision tree is a value-based method that is accessible and helpful because of its flowchart structure. The flow chart of DT is shown in Figure 2.
Figure 2 Flow chart of Decision Tree
- Support vector machine
Support Vector Machines are used for classification and regression analysis. SVM computes the hyperplane that maximizes the margin between two classes. The vectors that define the hyperplane are referred to as support vectors. Under favorable conditions, SVM can build a hyperplane that completely separates the data into two non-overlapping classes. In many cases, however, this is not possible, so SVM finds the hyperplane that maximizes the margin and minimizes the classification errors.
- Random forest classifier
RF is a supervised learning algorithm that can be used for both classification and regression, but it is primarily used for classification problems. A forest is a collection of trees, and a forest with a large number of trees is referred to as a strong forest. Like a decision tree, the RF algorithm builds decision trees on the dataset. It obtains a prediction from each tree and then selects the best outcome by voting. It is an ensemble technique and is better than a single Decision Tree because it reduces over-fitting by averaging the results.
- Multilayer Perceptron
A Multilayer Perceptron (MLP) is a neural network classifier. In this classifier, the input data are transformed by a learned non-linear mapping. The transformation projects the input data into a space where they become linearly separable. The layer that transforms the input data is referred to as a hidden layer. Here the Multilayer Perceptron classifier uses only a single hidden layer; with more layers it is treated as a general Artificial Neural Network, although multiple hidden layers can also benefit classification.
4.5 Ensemble techniques
Ensemble methods merge different base learners into a single classifier for prediction. They split into two types: 1. methods that join multiple classifiers of the same type and 2. methods that join classifiers of different types. Both types are used here.
- Bagging
The Bagging method is used to reduce the variance of a base learner. The aim is to create subsets of the data by random selection, so that the training set changes. Every subset is used to train its respective base learner, for each of the six base learners. The predictions of the several learners are then averaged, which is more reliable than the prediction of any one base learner.
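A minimal sketch of bagging follows, under stated assumptions: subsets are bootstrap samples drawn with replacement, class predictions are combined by majority vote (the classification counterpart of averaging), and a toy 1-nearest-neighbour learner stands in for the six base learners. All names and data are hypothetical.

```python
import random
from collections import Counter

def fit_nearest_neighbour(rows, labels):
    # Toy base learner: returns a predictor that copies the label of the closest row.
    def predict(query):
        best = min(range(len(rows)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(rows[i], query)))
        return labels[best]
    return predict

def bagging_predict(train_rows, train_labels, query, fit, n_models=10, seed=0):
    rng = random.Random(seed)
    n = len(train_rows)
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn at random with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([train_rows[i] for i in idx], [train_labels[i] for i in idx])
        votes.append(model(query))
    # Combine the individual predictions by majority vote.
    return Counter(votes).most_common(1)[0][0]

rows = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["C1"] * 4 + ["C3"] * 4
print(bagging_predict(rows, labels, (0.5, 0.5), fit_nearest_neighbour))
```

Because each model sees a slightly different training set, individual mistakes tend to cancel out in the vote, which is the variance reduction the method aims for.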
- Boosting
The Boosting method builds a sequence of predictors. The learners are trained consecutively: the early learners are fitted first, and the errors they make on the data are then calculated. At every step, the next tree is fitted to a random sample, and the aim is to improve on the accuracy of the existing tree. When an input is misclassified by one hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. In this way, the procedure turns base learners with weak learning capability into a model with the best performance outcome.
- Stacking
The Stacking method merges multiple base learners of several types using a meta-classifier. The six base learners NB, KNN, DT, SVM, RF, and MLP are trained on the complete training data set, and the meta-classifier is then fitted on the outputs ("meta-features") of every base learner. The meta-classifier is trained on the probabilities of the predicted class labels.
V. Results and Discussion
The hybrid feature selection technique is used to find the best features for skin disease prediction. The proposed methodology depends on three feature selection techniques (Chi-square, Information Gain, Principal Component Analysis). First, the hybrid feature selection method is applied to the attributes of the skin disease data set to select the essential features. Ten essential features are selected in this way; the chosen attributes are shown in Table 1. Python code is used to implement the base learners, the hybrid feature selection approach, and the ensemble methods, and to calculate the metrics. The accuracy of the six base learners is computed with the following formula:
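Accuracy = $\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$ (5)
It can also be represented by another formula,
Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$ (6)
Where TP (true positive) counts the cases in which both the observed and the predicted outcome are positive, TN (true negative) the cases in which both are negative, FP (false positive) the cases in which the observation is negative but the prediction is positive, and FN (false negative) the cases in which the observation is positive but the prediction is negative.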
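As an illustration, accuracy can be read off a confusion matrix by dividing the correctly classified cases on the diagonal by the total number of cases. The sketch assumes rows are observed classes and columns are predicted classes; the matrix below is hypothetical.

```python
def accuracy_from_confusion(matrix):
    # Diagonal entries are the correctly classified cases.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical binary case: TP = 50, FN = 10, FP = 5, TN = 35.
example = [[50, 10],
           [5, 35]]
print(accuracy_from_confusion(example))
```

The same function applies unchanged to a six-class confusion matrix, since only the diagonal and the grand total are used.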
The attributes selected by the hybrid feature selection technique are shown in Table 1.
Attributes | Taken from |
f5: koebner phenomenon | Information gain |
f10: scalp involvement | Principal Component Analysis |
f14: PNL infiltrate | Principal Component Analysis |
f15: fibrosis of the papillary dermis | Information gain |
f20: clubbing of the rete ridges involvement | Principal Component Analysis |
f21: elongation of the rete ridges | Information gain |
f22: thinning of the suprapapillary epidermis | Principal Component Analysis |
f27: vacuolization and damage of basal layer | Chi-square |
f31: perifollicular parakeratosis | Chi-square |
f33: band-like infiltrate | Chi-square |
Table 1 Selection of attributes by a hybrid feature selection technique
The mean value, standard deviation, and accuracy for each base learner are shown in Table 2.
Base learner | Mean value (%) | Standard deviation (%) | Accuracy (%) |
NB | 85.95 | 5.27 | 89.10 |
KNN | 94.20 | 3.75 | 94.55 |
DT | 95.50 | 2.05 | 91.84 |
SVM | 96.60 | 2.50 | 97.25 |
RF | 94.87 | 4.25 | 91.86 |
MLP | 96.25 | 2.35 | 95.92 |
Table 2 Mean value, standard deviation, and accuracy of each base learner.
RMSE, KSE, and AUC are calculated for the hybrid feature selection approach and the ensemble techniques. These terms are defined as:
Root mean square error:
RMSE measures the difference between the values predicted by a base learner and the actual observed values. A good base learner has a low RMSE on both the training and the testing data; otherwise the base learner does not fit well. The RMSE on the training and testing data is calculated as,
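RMSE = $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(P_i - O_i)^2}$ (7)
where $P_i$ is the predicted value, $O_i$ is the observed value, and $n$ is the number of samples.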
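A short sketch of the RMSE calculation, assuming the standard definition (square root of the mean squared difference between predicted and observed values); the function name is illustrative.

```python
import math

def rmse(predicted, observed):
    n = len(observed)
    squared_errors = ((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(sum(squared_errors) / n)

# Identical predictions give zero error; larger deviations raise the value.
print(rmse([1, 2, 3], [1, 2, 3]))
```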
Kappa statistic error:
The KSE metric compares the evaluated (observed) accuracy with the expected accuracy. The value of KSE always lies between −1 and 1. If the calculated KSE is close to one, the classifier's predictions agree with the observations far better than expected by chance. KSE is evaluated for each single base learner as well as for the ensemble methods, using the following formula
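KSE = $\frac{p_o - p_e}{1 - p_e}$ (8)
where $p_o$ is the evaluated (observed) accuracy and $p_e$ is the expected accuracy.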
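The Kappa statistic can be sketched from a confusion matrix as below. The standard Cohen's kappa definition is assumed: observed accuracy comes from the diagonal, and expected accuracy from the products of the row and column totals. The matrix is hypothetical.

```python
def cohen_kappa(matrix):
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed accuracy: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Expected accuracy: chance agreement from row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical 2x2 matrix: observed accuracy 0.7, expected accuracy 0.5.
example = [[20, 5],
           [10, 15]]
print(cohen_kappa(example))
```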
The area under receiver operating characteristics (AUC):
With the help of TP, FP, FN, and TN, we can calculate the True Positive Rate (TPR) and the True Negative Rate (TNR). The average of TPR and TNR is called the area under receiver operating characteristics. These terms are calculated using the following formulas:
True Positive Rate (TPR) = $\frac{TP}{TP + FN}$ (9)
True Negative Rate (TNR) = $\frac{TN}{TN + FP}$ (10)
AUC = $\frac{1}{2}(TPR + TNR)$ (11)
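Equations (9) to (11) can be sketched for a single set of counts as follows. Note that averaging TPR and TNR corresponds to the area under the ROC curve of a classifier evaluated at one operating point; the counts are hypothetical.

```python
def rates_and_auc(tp, fp, fn, tn):
    tpr = tp / (tp + fn)      # Equation (9): share of positives found
    tnr = tn / (tn + fp)      # Equation (10): share of negatives found
    auc = (tpr + tnr) / 2     # Equation (11): average of the two rates
    return tpr, tnr, auc

# Hypothetical counts: 50 positive and 50 negative cases.
tpr, tnr, auc = rates_and_auc(tp=40, fp=5, fn=10, tn=45)
print(tpr, tnr, auc)
```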
RMSE, KSE, and AUC are evaluated for the base learners on the reduced data subset produced by the hybrid feature selection technique. The values of RMSE, KSE, and AUC for each base learner are given in Table 3.
Base learners | RMSE | KSE | AUC |
NB | 0.0672 | 0.9840 | 0.980 |
KNN | 0.0695 | 0.9732 | 0.975 |
DT | 0.0790 | 0.9786 | 0.974 |
SVM | 0.0695 | 0.9981 | 0.978 |
RF | 0.0670 | 0.9795 | 0.970 |
MLP | 0.0672 | 0.9343 | 0.925 |
Table 3 Base learner’s metric values
The three ensemble methods, namely Bagging, Boosting, and Stacking, improve the base learners' outcome; the resulting accuracies for each base learner are shown in Table 4.
Ensemble method | NB | KNN | DT | SVM | RF | MLP |
Bagging | 94.50 | 93.21 | 92.25 | 94.69 | 93.55 | 93.60 |
Boosting | 96.10 | 95.55 | 96.58 | 96.80 | 97.69 | 95.93 |
Stacking | 97.65 | 98.75 | 96.55 | 99.65 | 98.59 | 97.95 |
Table 4 Accuracy (%) of several ensemble methods
The six base learners are combined using the ensemble methods Bagging, Boosting, and Stacking. The results obtained on the skin disease data subset by combining the hybrid feature selection technique with the three ensemble methods are tabulated below.
The ensemble methods are evaluated from the confusion matrix in terms of precision, recall, and F1-score, and the respective equations are as follows,
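Precision = $\frac{TP}{TP + FP}$ (12)
Recall = $\frac{TP}{TP + FN}$ (13)
F1-Score = $2\left(\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\right)$ (14)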
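Equations (12) to (14) can be applied per class to a multi-class confusion matrix. The sketch below uses the Bagging confusion matrix as listed in Table 5 and assumes that rows are observed classes and columns are predicted classes; the function name is illustrative.

```python
def per_class_scores(matrix):
    size = len(matrix)
    scores = []
    for k in range(size):
        tp = matrix[k][k]
        fp = sum(matrix[r][k] for r in range(size)) - tp  # predicted k, observed other
        fn = sum(matrix[k]) - tp                          # observed k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores

# Bagging confusion matrix as listed in Table 5.
bagging = [
    [24, 0, 0, 0, 0, 0],
    [0, 10, 0, 0, 0, 0],
    [0, 0, 11, 0, 0, 0],
    [0, 2, 0, 12, 0, 0],
    [0, 0, 0, 0, 11, 0],
    [1, 0, 0, 0, 0, 3],
]
for label, (p, r, f) in enumerate(per_class_scores(bagging), start=1):
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

Under this row and column convention, the rounded per-class values agree with the precision, recall, and F1 entries listed for Bagging in Table 5.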
The highest base learner accuracy, 97.25%, is obtained with the Support Vector Machine, and the second-highest, 95.92%, with the Multi-Layer Perceptron. SVM is therefore the best-performing base learner. Combining the ensemble methods Bagging, Boosting, and Stacking with the Machine Learning Classifiers (MLC) improves the predicted outcome. The accuracy, recall, precision, confusion matrix, F1-score, and support values for the three ensemble methods are shown in Table 5. The accuracies of the ensemble methods are 95.92%, 97.70%, and 99.67%, an improvement over the accuracy of the base learners. The accuracy of the base learners is shown in Figure 3, and the accuracy of the ensemble methods in Figure 4.
Figure 3 Accuracy of base learners
Table 5 Result of ensemble method values
Method | Accuracy | Confusion Matrix | Class | Precision | Recall | F1-Score | Support
Bagging | 95.92% | [24 0 0 0 0 0] | 1 | 0.96 | 1.00 | 0.98 | 24
| | [0 10 0 0 0 0] | 2 | 0.83 | 1.00 | 0.91 | 10
| | [0 0 11 0 0 0] | 3 | 1.00 | 1.00 | 1.00 | 11
| | [0 2 0 12 0 0] | 4 | 1.00 | 0.86 | 0.92 | 14
| | [0 0 0 0 11 0] | 5 | 1.00 | 1.00 | 1.00 | 11
| | [1 0 0 0 0 3] | 6 | 1.00 | 0.75 | 0.86 | 4
| | | avg/total | 0.99 | 0.99 | 0.99 | 74
Boosting | 97.70% | [24 0 0 0 0 0] | 1 | 0.96 | 1.00 | 0.98 | 24
| | [0 0 9 0 0 1] | 2 | 0.00 | 0.00 | 0.00 | 10
| | [0 0 11 0 0 0] | 3 | 0.25 | 1.00 | 0.40 | 11
| | [0 0 14 0 0 0] | 4 | 0.00 | 0.00 | 0.00 | 14
| | [0 0 10 0 0 0] | 5 | 0.00 | 0.00 | 0.00 | 11
| | | avg/total | 0.39 | 0.53 | 0.43 | 74
Stacking | 99.67% | [23 1 0 0 0 0] | 1 | 1.00 | 0.96 | 0.98 | 24
| | [0 9 0 0 0 1] | 2 | 0.75 | 0.90 | 0.82 | 10
| | [0 0 11 0 0 0] | 3 | 1.00 | 1.00 | 1.00 | 24
| | [0 2 0 12 0 0] | 4 | 1.00 | 0.86 | 0.92 | 14
| | [0 0 0 0 11 0] | 5 | 1.00 | 1.00 | 1.00 | 11
| | [0 0 0 0 0 4] | 6 | 0.80 | 1.00 | 0.89 | 4
| | | avg/total | 0.96 | 0.95 | 0.95 | 74
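To make the link between a confusion matrix and the per-class values concrete, the sketch below applies the precision, recall, and F1-score definitions of Eqs. (12) to (14) to the Bagging confusion matrix reported in Table 5. It assumes that rows are true classes and columns are predicted classes, which is consistent with the support column equalling the row sums. The function name `per_class_metrics` is a hypothetical helper introduced for this example.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for each class.

    cm is a square confusion matrix given as a list of rows, where rows are
    true classes and columns are predicted classes (an assumption here).
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true class k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1, tp + fn))
    return results


# Bagging confusion matrix as reported in Table 5.
bagging_cm = [
    [24, 0, 0, 0, 0, 0],
    [0, 10, 0, 0, 0, 0],
    [0, 0, 11, 0, 0, 0],
    [0, 2, 0, 12, 0, 0],
    [0, 0, 0, 0, 11, 0],
    [1, 0, 0, 0, 0, 3],
]

metrics = per_class_metrics(bagging_cm)
for label, (p, r, f1, support) in enumerate(metrics, start=1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} "
          f"f1={f1:.2f} support={support}")
```

Rounded to two decimals, the printed values agree with the Bagging rows of Table 5, for example precision 0.83 and recall 1.00 for class 2, and recall 0.75 for class 6.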
Figure 4 Accuracy of ensemble methods
- Conclusion
Machine Learning (ML) algorithms can extract useful knowledge from the data stored by healthcare units and thereby help construct decision support systems. In this work, a hybrid feature selection method was built to improve the prediction of skin disease. The method combines Chi-square, Information Gain, and Principal Component Analysis to choose the ten essential features that form the input data subset. Six base learners, namely the Naïve Bayesian, K-Nearest Neighbour, Decision Tree, Support Vector Machine, Random Forest, and Multi-Layer Perceptron classifiers, were evaluated on this skin disease data subset. With the hybrid feature selection technique, SVM gives an accuracy of 97.25%. Several metrics were calculated to assess the base classifiers' performance: the mean value, standard deviation, accuracy, the error measures RMSE and KSE, and AUC. The ensemble methods reach accuracies of 95.92% (Bagging), 97.70% (Boosting), and 99.67% (Stacking), so Boosting and Stacking improve on the best base learner. The highest accuracy, 99.67%, is obtained by Stacking on the data subset produced by the hybrid feature selection approach. The hybrid feature selection method is therefore strongly recommended for the prediction of skin disease.
References
- Alonso, C., Carrer, V., Espinosa, S., Zanuy, M., Córdoba, M., Vidal, B., . . . Pont, M. (2019). Prediction of the skin permeability of topical drugs using in silico and in vitro models. European Journal of Pharmaceutical Sciences, 136, 104945.
- Ardestani, E. G., & Mokhtari, A. (2020). Modeling the Lumpy skin disease risk probability in central Zagros Mountains of Iran. Preventive Veterinary Medicine, 104887.
- Chen, M., Zhou, P., Wu, D., Hu, L., Hassan, M. M., & Alamri, A. (2020). AI-Skin: Skin disease recognition based on self-learning and wide data collection through a closed-loop framework. Information Fusion, 54, 1-9.
- Correia, C., Mawe, S., Lofgren, S., Marangoni, R. G., Lee, J., Saber, R., . . . Hoffmann, A. (2020). High-throughput quantitative histology in systemic sclerosis skin disease using computer vision. Arthritis research & therapy, 22(1), 1-11.
- da Fonseca, F. N., Abe, J. M., de Alencar Nääs, I., da Silva Cordeiro, A. F., do Amaral, F. V., & Ungaro, H. C. (2020). Automatic prediction of stress in piglets (Sus Scrofa) using infrared skin temperature. Computers and Electronics in Agriculture, 168, 105148.
- Das, H., Naik, B., & Behera, H. (2020). Medical disease analysis using neuro-fuzzy with feature extraction model for classification. Informatics in Medicine Unlocked, 18, 100288.
- Dehkordi, S. K., & Sajedi, H. (2019). Prediction of disease based on prescription using data mining methods. Health and Technology, 9(1), 37-44.
- Erster, O., Rubinstein, M. G., Menasherow, S., Ivanova, E., Venter, E., Šekler, M., . . . Stram, Y. (2019). Importance of the lumpy skin disease virus (LSDV) LSDV126 gene in differential diagnosis and epidemiology and its possible involvement in attenuation. Archives of virology, 164(9), 2285-2295.
- Foulkes, A. C., Watson, D. S., Carr, D. F., Kenny, J. G., Slidel, T., Parslew, R., . . . Griffiths, C. E. (2019). A framework for multi-omic prediction of treatment response to biologic therapy for psoriasis. Journal of Investigative Dermatology, 139(1), 100-107.
- Jha, S. K., Pan, Z., Elahi, E., & Patel, N. (2019). A comprehensive search for expert classification methods in disease diagnosis and prediction. Expert Systems, 36(1), e12343.
- Kadampur, M. A., & Al Riyaee, S. (2020). Skin cancer detection: Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images. Informatics in Medicine Unlocked, 18, 100282.
- Magesh, G., & Swarnalatha, P. (2020). Optimal feature selection through a cluster-based DT learning (CDTL) in heart disease prediction. Evolutionary Intelligence, 1-11.
- Nakai, A., Minematsu, T., Tamai, N., Sugama, J., Urai, T., & Sanada, H. (2019). Prediction of healing in Category I pressure ulcers by skin blotting with plasminogen activator inhibitor 1, interleukin-1α, vascular endothelial growth factor C, and heat shock protein 90α: A pilot study. Journal of tissue viability, 28(2), 87-93.
- Patrick, M. T., Raja, K., Miller, K., Sotzen, J., Gudjonsson, J. E., Elder, J. T., & Tsoi, L. C. (2019). Drug Repurposing Prediction for Immune-Mediated Cutaneous Diseases using a Word-Embedding–Based Machine Learning Approach. Journal of Investigative Dermatology, 139(3), 683-691.
- Rajagopal, R. D., Murugan, S., Kottursamy, K., & Raju, V. (2019). Cluster based effective prediction approach for improving the curable rate of lymphatic filariasis affected patients. Cluster Computing, 22(1), 197-205.
- Shanthi, T., Sabeenian, R., & Anand, R. (2020). Automatic diagnosis of skin diseases using convolution neural network. Microprocessors and Microsystems, 103074.
- Sinha, A., Sahoo, B., Rautaray, S. S., & Pandey, M. (2020). Predictive Model Prototype for the Diagnosis of Breast Cancer Using Big Data Technology Advances in Data and Information Sciences (pp. 455-464): Springer.
- Sinha, A. K., & Namdev, N. (2020). Feature selection and pattern recognition for different types of skin disease in human body using the rough set method. Network Modeling Analysis in Health Informatics and Bioinformatics, 9, 1-11.
- van Waateringe, R. P., Fokkens, B. T., Slagter, S. N., van der Klauw, M. M., van Vliet-Ostaptchouk, J. V., Graaff, R., . . . Wolffenbuttel, B. H. (2019). Skin autofluorescence predicts incident type 2 diabetes, cardiovascular disease and mortality in the general population. Diabetologia, 62(2), 269-280.
- Verma, A. K., & Pal, S. (2019). Prediction of Skin Disease with Three Different Feature Selection Techniques Using Stacking Ensemble Method. Applied Biochemistry and Biotechnology, 1-20.
- Verma, A. K., Pal, S., & Kumar, S. (2019). Comparison of skin disease prediction by feature selection using ensemble data mining techniques. Informatics in Medicine Unlocked, 16, 100202.
- Verma, A. K., Pal, S., & Kumar, S. (2020). Prediction of skin disease using ensemble data mining techniques and feature selection method—a comparative study. Applied Biochemistry and Biotechnology, 190(2), 341-359.
- Yadav, D. C., & Pal, S. Prediction of thyroid disease using decision tree ensemble method. Human-Intelligent Systems Integration, 1-7.
- Zhang, Q. (2020). Artificial Intelligence-Enabled ECG Big Data Mining for Pervasive Heart Health Monitoring Biomedical Signal Processing (pp. 273-290): Springer.