Large scale search for social sciences - Wouter van Atteveldt

Tags: python, django

(Summary of a talk at a Dutch Python meetup). Full title: large scale search and text analysis with Python, Elastic, Celery and a bit of R.

Wouter teaches political science at the university and uses a lot of text analysis. He showed an example of text analysis of the Gaza-Israel conflict, comparing USA media talking about Israel with Chinese media talking about Israel. You saw big differences! The USA media talk more about rocket attacks by Hamas on Israel. Chinese media talk more about the results on the ground: invasion, deaths, etc.

Text analysis is very important for social sciences. There’s a flood of digital information, online and archived. Major newspapers in the Netherlands have been digitized by the royal library, for instance. Lots of text to work with. You see the same with Facebook and the like: they can extract lots of info from the texts people type in!

Facebook once did an experiment on positive/negative tweaks to timelines. Total bullshit from a scientific viewpoint. So we cannot leave social research to the likes of Facebook. So... there is a need for a good tool, especially for education. They built it in Python.

Why Python? Open source, platform independent, relatively easy to learn. Large community. Lots of tools like Django and numpy, with large communities around them.

He also uses R. Also open source. Mostly a question of “go where the users are” as most social scientists are used to statistical languages like R.

What they built is AmCAT, the Amsterdam Content Analysis Toolkit. He demoed it with textual searches in old newspaper articles. Nice graphs, for instance, showing how often words occur in the various years.

AmCAT is written in Python and Django. Fully open source. Postgres, Celery. The search happens with Elasticsearch. The articles are all in Elasticsearch, but for safety’s sake they’re also keeping them in Postgres. Perhaps not needed, but...
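To give an idea of what such a search could look like: a minimal sketch of an Elasticsearch request body that counts matching articles per year, the kind of data behind the graphs in the demo. The field names `text` and `date` and the helper `yearly_hits_query` are my own assumptions, not AmCAT's actual schema. `calendar_interval` is the parameter name in recent Elasticsearch versions; older ones used `interval`.

```python
import json


def yearly_hits_query(term, text_field="text", date_field="date"):
    """Build a request body: match `term`, bucket the hits per year.

    Field names are hypothetical; adjust them to the real index mapping.
    """
    return {
        "size": 0,  # we only want the counts, not the articles themselves
        "query": {"match": {text_field: term}},
        "aggs": {
            "per_year": {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": "year",
                }
            }
        },
    }


if __name__ == "__main__":
    # This JSON is what you would POST to <index>/_search.
    print(json.dumps(yearly_hits_query("verkiezingen"), indent=2))
```

The function only builds the request body as a plain dictionary, so it runs without an Elasticsearch server; sending it is left to whatever client you use.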

They use Django management commands for a reasonably friendly command line interface. Both for maintenance commands and for queries.

Queries can be quite elaborate and take quite some time, longer than the regular web timeout. For that, they use Celery to offload the long-running tasks.
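The idea, roughly: the web request only starts the job and returns an id, and the browser asks for the result later. The sketch below shows that pattern with a plain thread pool from the standard library standing in for Celery, so it runs without a broker. It is not how AmCAT is implemented; with Celery, a message queue and separate worker processes take the place of the pool. All function names and the fake query are made up for the illustration.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the Celery workers: a small pool of background threads.
_executor = ThreadPoolExecutor(max_workers=2)
_jobs = {}  # job id -> Future


def slow_query(term):
    """Hypothetical long-running query; the sleep fakes the heavy work."""
    time.sleep(0.2)
    return {"term": term, "hits": len(term)}  # fake result, not real data


def submit_query(term):
    """Start the query in the background and return a job id right away."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _executor.submit(slow_query, term)
    return job_id


def job_status(job_id):
    """What a status view could return while the browser polls."""
    future = _jobs[job_id]
    if not future.done():
        return {"state": "PENDING"}
    return {"state": "SUCCESS", "result": future.result()}


if __name__ == "__main__":
    job = submit_query("israel")
    print(job_status(job))  # most likely still pending at this point
    time.sleep(0.5)
    print(job_status(job))
```

The important bit is that `submit_query` returns immediately, so the web request never has to wait for the query itself.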

Text analysis: for that you need NLP, Natural Language Processing. There are many different good tools like CoreNLP for English, Alpino for Dutch, ParZu for German. Only... they’re hard to use for social scientists: hard to install and operate.

What they built themselves is NLPipe, a simple (Flask) website for interacting with the NLP libraries, working as a job manager. You get the results out as CSV or JSON.
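To make the "CSV or JSON" part a bit more concrete, here is a small sketch of turning parser output into both formats with just the standard library. The token records and their column names are invented for the example; I don't know what NLPipe's real output looks like.

```python
import csv
import io
import json

# Made-up parser output: one dict per token.
TOKENS = [
    {"sentence": 1, "token": "Hamas", "lemma": "Hamas", "pos": "NNP"},
    {"sentence": 1, "token": "fired", "lemma": "fire", "pos": "VBD"},
    {"sentence": 1, "token": "rockets", "lemma": "rocket", "pos": "NNS"},
]


def to_json(tokens):
    """Serialize the token records as a JSON array."""
    return json.dumps(tokens)


def to_csv(tokens):
    """Serialize the token records as CSV; assumes at least one token."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(tokens[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(tokens)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(TOKENS))
    print(to_json(TOKENS))
```

CSV is the handier of the two if the next step is R, as a flat table like this loads straight into a data frame.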


About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
