(One of my summaries of the Python Leiden meetup in Leiden, NL).
Precision-recall (PR) versus Receiver Operating Characteristic (ROC) curves: which one should you use if your data is imbalanced?
Imbalanced data: for instance when you’re investigating rare diseases. “Rare” means few people have them. So most of your data will be about healthy people: a huge imbalance in the data.
Sensitivity versus specificity: sensitivity means you find most of the sick people, specificity means you rarely mark healthy people as sick (so: few false positives). Sensitivity/specificity looks a bit like precision/recall.
Sensitivity: true positive rate.
Specificity: true negative rate (which is 1 minus the false positive rate).
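To make those two definitions concrete, here is a minimal sketch (mine, not from the talk) that computes both rates from a scikit-learn confusion matrix on made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = sick, 0 = healthy (made up)
y_pred = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]  # the classifier's verdicts

# For binary labels, ravel() returns the counts in this fixed order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # fraction of sick people actually found
specificity = tn / (tn + fp)  # fraction of healthy people correctly cleared
print(sensitivity, specificity)  # 0.67 and 0.86 for these made-up numbers
```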
If you classify, you can classify directly into healthy/sick, but you can also use a probabilistic classifier that returns the probability (a percentage) that someone should be classified as sick. You can then tweak which threshold you want to use: how sensitive and/or specific do you want to be?
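A quick sketch of that threshold tweaking, assuming a scikit-learn classifier with predict_proba(); the synthetic dataset and the 0.3 threshold are placeholders of mine:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced dataset: roughly 95% healthy, 5% sick.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
model = LogisticRegression().fit(X, y)

probabilities = model.predict_proba(X)[:, 1]  # chance of being "sick"
# Lowering the threshold below the default 0.5 makes the classifier more
# sensitive (fewer sick people missed) but less specific (more false alarms).
threshold = 0.3
predictions = (probabilities >= threshold).astype(int)
```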
PR and ROC curves (curve = graph showing the sensitivity/specificity relation on two axes) are two ways of measuring/visualising that relation. He showed some data: when the data is imbalanced, the PR curve is much better at evaluating your model. He compared balanced and imbalanced data with ROC and the curve hardly changed, so ROC barely shows you the effect of the imbalance; PR does.
He used scikit-learn for his data evaluations and demos.
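I didn’t copy his actual code, but with scikit-learn the comparison presumably looked something like this sketch (the 99%/1% synthetic data is my own choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic data: about 99% healthy, 1% sick.
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, scores)
fpr, tpr, _ = roc_curve(y_test, scores)
# Plot recall against precision for the PR curve and fpr against tpr for
# the ROC curve (with matplotlib, for instance) to see the difference.
```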