Using Django for a scientific document analysis - Wouter van Atteveldt

Tags: django, pun

(Talk at the April 2012 Dutch Django meeting. Very nicely presented talk: Wouter had a nice, friendly style and a clear voice. He made small jokes and got people laughing. Wouter, you’re a good presenter!)

The full title of the talk was “AmCAT: Using Django for a scientific document analysis website. Tastypie, unit tests, R, open platforms and open questions”. Wow, that’s the title of the month :-)

Wouter studied political communications and at his university they treat that in quite a quantitative manner: taking hundreds of newspaper articles, mapping the positive and negative comments in them, and predicting the election results from that.

AmCAT (“Amsterdam content analysis toolkit”) is a system for document management and analysis aimed at the social sciences and humanities. There’s a lot of scraping going on: web pages, articles, Twitter feeds and newspapers. The substantive work is keyword analysis and linguistic processing. And also manual annotation (which means 10 Euro/hour students that do the hard work…).

Social scientists tend to scream at command line interfaces, so he turned it all into a website for interacting with the data and browsing/querying it.

The design choices behind the scenes look a bit different from normal Django projects. Default Django is a web site backed by a database; AmCAT is a database with a web front end. Some consequences:

Data should be accessible from the outside.
ORM should be usable without web site code.
DB should have the final word in authentication/authorization (though he rues this decision as it is very hard; he will probably capitulate and just use Django’s authentication/authorization functionality).
All data and methods should be exposed through a web API.

They aim at an open access publication cycle where they have a web data source, export the data with REST or whatever, generate a PDF out of it. The PDF includes hyperlinks back to the original data. This works quite well and it is slowly turning into a standard way to deal with data in academia.
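A minimal sketch of that cycle, to make the "links back to the original data" idea concrete. Everything specific in it is an assumption for illustration: the base URL, the record fields (`id`, `headline`, `sentiment`) and the sample records are made up, and it renders plain text lines instead of a PDF. The talk did not show AmCAT's actual export format.

```python
import json

# Hypothetical base URL; the real AmCAT URL scheme was not given in the talk.
BASE_URL = "https://amcat.example.org/api/articles"


def render_report(export_json):
    """Turn a JSON export into report lines that each link back to their source record."""
    records = json.loads(export_json)
    lines = []
    for record in records:
        # Every line carries a link to the record it was generated from.
        link = f"{BASE_URL}/{record['id']}/"
        lines.append(
            f"{record['headline']} (sentiment {record['sentiment']:+.1f}) <{link}>"
        )
    return "\n".join(lines)


# Made-up sample data standing in for what a REST export might return.
export = json.dumps([
    {"id": 1, "headline": "Coalition talks stall", "sentiment": -0.5},
    {"id": 2, "headline": "Budget deal reached", "sentiment": 0.8},
])
report = render_report(export)
print(report)
```

The point is only that every line in the generated publication keeps a stable pointer to the record it came from, so a reader can go from the paper back to the data.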

(Personal note. For some background on REST versus SOAP see http://reinout.vanrees.org/weblog/2010/11/11/soap-is-dead-long-live-rest.html )

For their REST API they use tastypie, but he felt it didn’t fit nicely with what he wanted. He still had to write a lot of code instead of letting tastypie generate it. Perhaps he used it in the wrong way?

People suggested Django REST framework as a nice alternative. It generates a great HTML-browsable API. According to one comment, Django REST framework is very flat: it leans on Django itself a lot, so it does many things right out of the box yet stays customizable when you need that. It probably performs better than tastypie too, as tastypie does a lot itself instead of using Django functionality, which often means a performance hit.

He talked a bit about testing. Some suggestions from attendees regarding UI testing:

Sikuli
Selenium
I think someone mentioned http://pypi.python.org/pypi/zope.testbrowser/ which is apparently more generic than what the “zope” part of the name suggests.

Fun thing: when running the tests, they report missing license notices and some pylint errors as test errors. Wow.
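How they wired this up wasn't shown, so here is only a rough sketch of what a "missing license notice is a test failure" check could look like with the standard library. The marker string, the ten-line header window and the `source_root` location are all assumptions for illustration, not AmCAT's actual convention.

```python
import unittest
from itertools import islice
from pathlib import Path

# Hypothetical marker; the talk did not say what AmCAT's license notice looks like.
LICENSE_MARKER = "License:"


def files_missing_license(root, marker=LICENSE_MARKER, header_lines=10):
    """Return the .py files under root whose first header_lines lines lack the marker."""
    missing = []
    for path in sorted(Path(root).rglob("*.py")):
        with path.open(encoding="utf-8") as handle:
            # Only look at the top of the file, where a notice normally lives.
            header = list(islice(handle, header_lines))
        if not any(marker in line for line in header):
            missing.append(path)
    return missing


class LicenseNoticeTest(unittest.TestCase):
    # Hypothetical location; point this at the real source tree.
    source_root = "."

    def test_every_module_has_a_license_notice(self):
        missing = files_missing_license(self.source_root)
        self.assertEqual(missing, [], f"missing license notice: {missing}")
```

Because it is an ordinary test case, the regular test runner picks it up and a forgotten notice breaks the build just like a failing unit test would.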