Using C4.5 to Predict Diabetes in Pima Indian Women

C4.5 is an algorithm for generating decision trees, developed by Ross Quinlan in 1993 as an extension of his earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.
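
To make the split criterion concrete, here is a minimal sketch of the gain ratio C4.5 uses to choose attributes, written with my own helper names (entropy, gain_ratio); it is an illustration, not the actual J48 implementation:

# Entropy of a class label vector, in bits.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# Gain ratio of splitting labels y by a discrete attribute x:
# information gain normalized by the entropy of the split itself.
gain_ratio <- function(x, y) {
  w <- table(x) / length(x)                  # branch weights
  h_children <- sapply(split(y, x), entropy) # entropy within each branch
  gain <- entropy(y) - sum(w * h_children)   # information gain
  gain / entropy(x)
}

# Example: score a candidate binary split on a continuous attribute.
# gain_ratio(Pima.tr$glu > 123, Pima.tr$type)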

Today, I will use the C4.5 algorithm to predict diabetes in Pima Indian women.

Meet the dataset


A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We use the 532 complete records that remain after dropping the (mainly missing) data on serum insulin; in the MASS package these are split into a 200-row training set (Pima.tr) and a 332-row test set (Pima.te).

Load the dataset

library(MASS)           # provides the Pima.tr / Pima.te datasets
data.train <- Pima.tr   # 200-row training set
data.test <- Pima.te    # 332-row test set
str(data.train)
#> 'data.frame':  200 obs. of  8 variables:
#>  $ npreg: int  5 7 5 0 0 5 3 1 3 2 ...
#>  $ glu  : int  86 195 77 165 107 97 83 193 142 128 ...
#>  $ bp   : int  68 70 82 76 60 76 58 50 80 78 ...
#>  $ skin : int  28 33 41 43 25 27 31 16 15 37 ...
#>  $ bmi  : num  30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
#>  $ ped  : num  0.364 0.163 0.156 0.259 0.133 ...
#>  $ age  : int  24 55 35 26 23 52 25 24 63 31 ...
#>  $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...

The training data contain seven numeric variables (npreg, glu, bp, skin, bmi, ped, age) and one categorical variable (type):

  • npreg: number of pregnancies.
  • glu: plasma glucose concentration in an oral glucose tolerance test.
  • bp: diastolic blood pressure (mm Hg).
  • skin: triceps skin fold thickness (mm).
  • bmi: body mass index (weight in kg/(height in m)^2).
  • ped: diabetes pedigree function.
  • age: age in years.
  • type: Yes or No, for diabetic according to WHO criteria.
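
Before fitting anything, it is worth a quick look at the class balance (a small addition of mine, not in the original workflow), since later accuracy figures are easier to judge against the baseline rate:

table(data.train$type)               # counts of No vs Yes
prop.table(table(data.train$type))   # the same, as proportions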

Fit the model

We now use the caret package to fit a model to the training set. The method "J48" calls Weka's J48 implementation of C4.5 (via the RWeka package, which must be installed) and generates pruned or unpruned C4.5 decision trees.

library(caret)
fit.c45 <- train(type ~ ., data = data.train, method = "J48")
fit.c45
#>  C4.5-like Trees 
#>  
#>  200 samples
#>    7 predictors
#>    2 classes: 'No', 'Yes' 
#>  
#>  No pre-processing
#>  Resampling: Bootstrapped (25 reps) 
#>  
#>  Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
#>  
#>  Resampling results
#>  
#>    Accuracy   Kappa      Accuracy SD  Kappa SD 
#>    0.6824363  0.2718139  0.06675092   0.1395038
#>  
#>  Tuning parameter 'C' was held constant at a value of 0.25

The resampling results show that the accuracy of this model is not very high: about 0.68.
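
One hedged way to push on this is to tune the pruning confidence C ourselves rather than accepting the default grid. The sketch below assumes 10-fold cross-validation; depending on your caret version, the grid may also need an M (minimum instances per leaf) column:

# Explicit tuning of J48's pruning confidence (a sketch; the exact
# tuning parameters exposed depend on your caret version).
ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(C = c(0.10, 0.25, 0.50))
fit.tuned <- train(type ~ ., data = data.train, method = "J48",
                   trControl = ctrl, tuneGrid = grid)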

Visualize result

plot(fit.c45$finalModel)
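
If no graphics device is handy, printing the final model gives the same tree as text (output omitted; it spells out the splits read off below):

print(fit.c45$finalModel)   # text rendering of the fitted tree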

A great strength of decision trees is that the results are easy to interpret. We can extract some useful rules from this tree (transcribed as a small R function after the list):

  • If a woman has glu less than or equal to 123, she probably does not have diabetes.
  • If she has glu greater than 123 and bmi less than or equal to 28.6, she probably does not have diabetes, unless she also has bp less than or equal to 80 and ped greater than 0.162, in which case she is at high risk of diabetes.
  • If she has glu greater than 123, bmi greater than 28.6, and ped greater than 0.344, she probably has diabetes.
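
To make the rules fully explicit, here is a hand transcription into a small R function. It mirrors the branches listed above rather than being generated from the fitted model, and it returns NA for any combination the rules do not cover:

# Hand-transcribed rules from the plotted tree (not extracted
# programmatically from the model).
predict_by_rules <- function(glu, bmi, bp, ped) {
  if (glu <= 123) return("No")
  if (bmi <= 28.6) {
    if (bp <= 80 && ped > 0.162) return("Yes") else return("No")
  }
  if (ped > 0.344) return("Yes")
  NA  # branch not covered by the rules above
}

predict_by_rules(glu = 150, bmi = 35, bp = 70, ped = 0.5)   # "Yes"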

Evaluation

These rules are interesting, but we must consider how well the model generalizes beyond the training data. It reached only about 0.68 accuracy on resamples of the training set, so we now take a step further: confusionMatrix calculates a cross-tabulation of the observed classes in the test set against the classes predicted by the model, with the associated statistics.

confusionMatrix(predict(fit.c45, newdata = data.test), data.test$type)
#>  Confusion Matrix and Statistics
#>  
#>            Reference
#>  Prediction  No Yes
#>         No  193  58
#>         Yes  30  51
#>                                           
#>                 Accuracy : 0.7349         
#>                   95% CI : (0.684, 0.7816)
#>      No Information Rate : 0.6717         
#>      P-Value [Acc > NIR] : 0.007503       
#>                                           
#>                    Kappa : 0.3568         
#>   Mcnemar's Test P-Value : 0.003999       
#>                                           
#>              Sensitivity : 0.8655         
#>              Specificity : 0.4679         
#>           Pos Pred Value : 0.7689         
#>           Neg Pred Value : 0.6296         
#>               Prevalence : 0.6717         
#>           Detection Rate : 0.5813         
#>     Detection Prevalence : 0.7560         
#>        Balanced Accuracy : 0.6667         
#>                                           
#>         'Positive' Class : No
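
As a sanity check on these numbers, we can recompute the headline statistics by hand from the table above (remember that the 'positive' class here is No):

# Counts from the confusion matrix above.
tp <- 193; fn <- 30   # actual No:  predicted No / predicted Yes
fp <- 58;  tn <- 51   # actual Yes: predicted No / predicted Yes
n  <- tp + fn + fp + tn   # 332 test cases

(tp + tn) / n             # accuracy    = 0.7349
tp / (tp + fn)            # sensitivity = 0.8655
tn / (fp + tn)            # specificity = 0.4679
(tp + fn) / n             # prevalence = no-information rate = 0.6717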

The overall accuracy is 0.73, which is not very high. However, considering how easy it is to interpret a C4.5 model and extract useful rules from it, we can understand why C4.5 is still among the top 10 data mining algorithms.

References

  • Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
  • Wu, X., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.
  • Venables, W. N., and Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer. Source of the MASS package and the Pima datasets.
