Python Leiden meetup: PR vs ROC curves, which to use - Sultan K. Imangaliyev

Tags: python, pun

(One of my summaries of the Python Leiden meetup in Leiden, NL).

Precision-recall (PR) versus Receiver Operating Characteristics (ROC) curves: which one to use if data is imbalanced?

Imbalanced data: for instance when you’re investigating rare diseases. “Rare” means few people have them. So if you have data, most of the data will be of healthy people, there’s a huge imbalance in the data.

Sensitivity versus specificity: sensitive means you find most of the sick people, specificity means you want as few false negatives and false positives as possible. Sensitivity/specificity looks a bit like precision/recall.

  • Sensitivity: true positive rate.

  • Specificity: false positive rate

If you classify, you can classify immediately into healthy/sick, but you can also use a probabilistic classifier which returns a chance (percentage) that someone can be classified as sick. You can then tweak which threshold you want to use: how sensitive and/or specific do you want to be?

PR and ROC curves (curve = graph showing the sensitivity/specificity relation on two axis) are two ways of measuring/visualising the sensitivity/specificity relation. He showed some data: if the data is imbalanced, PR is much better at evaluating your model. He compared balanced and imbalanced data with ROC and there was hardly a change in the curve.

He used scikit-learn for his data evaluations and demos.

 
vanrees.org logo

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.

Weblog feeds

Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):