Large scale search for social sciences - Wouter van Atteveldt
Tags: python, django
(Summary of a talk at a Dutch Python meetup.) Full title: large scale search
and text analysis with Python, Elastic, Celery and a bit of R.
Wouter teaches political science at the university and uses a lot of text
analysis. He showed an example of text analysis of the Gaza-Israel conflict,
comparing USA media talking about Israel and Chinese media talking about
Israel. You saw big differences! The USA media talks more about rocket attacks
by Hamas on Israel. Chinese media talks more about the results on the ground:
invasion, deaths, etc.
Text analysis is very important for social sciences. There’s a flood of
digital information, online and archived. Major newspapers in the Netherlands
have been digitized by the Royal Library, for instance. Lots of text to work
with. You see the same with Facebook and the like: they can extract lots of
info from the texts people type in!
Facebook once did an experiment with positive/negative tweaks to
timelines. Total bullshit from a scientific viewpoint. So we cannot leave
social research to the likes of Facebook. So… there is a need for a good
tool, especially for education. They built it in Python.
Why Python? Open source, platform independent, relatively easy to learn. Large
community. Lots of tools like Django and NumPy, each with a large community
around it.
He also uses R. Also open source. Mostly a question of “go where the users
are” as most social scientists are used to statistical languages like R.
What they built is https://amcat.nl, the Amsterdam Content Analysis Toolkit. He
demoed it with textual searches in old newspaper articles. Nice graphs, for
instance, with occurrence of words in the various years.
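The idea behind such a graph is counting how often a search term occurs per publication year. A minimal sketch in plain Python, not AmCAT's actual implementation (which searches with elasticsearch): the sample articles and the `term_counts_per_year` helper are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample data: (publication year, article text) pairs.
ARTICLES = [
    (1990, "The election dominated the news. Election results came late."),
    (1990, "Weather report: rain."),
    (1991, "A quiet year, no election."),
]


def term_counts_per_year(articles, term):
    """Count case-insensitive whole-word occurrences of `term` per year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in articles:
        counts[year] += len(pattern.findall(text))
    # Sort by year so the result is ready for plotting.
    return dict(sorted(counts.items()))


print(term_counts_per_year(ARTICLES, "election"))
```

The resulting year-to-count mapping is what you would feed to a plotting library to get the kind of graph shown in the demo.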
AmCAT is written in Python and Django. Fully open source. Postgres,
Celery. The search happens with elasticsearch. The articles are all in
elasticsearch, but for safety’s sake they’re also keeping them in
Postgres. Perhaps not needed, but…
They use Django management commands for a reasonably friendly command line
interface. Both for maintenance commands and for queries.
Queries can be quite elaborate and take quite some time. Longer than the
regular web timeout. So they use Celery to offload those long-running tasks.
Text analysis: for that you need NLP, Natural Language Processing. There are
many different good tools like CoreNLP for English, Alpino for Dutch, ParZu
for German. The only problem: they’re hard for social scientists to use, as
they are hard to install and operate.
What they built themselves is NLPipe, a simple (Flask) website for
interacting with the NLP libraries, working as a job manager. You get the
results out as csv or json.
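To give an idea of what "results as csv or json" can look like: a small sketch that serializes per-token parser output both ways with the standard library. The token fields (`id`, `word`, `pos`) and the helper names are assumptions for illustration, not NLPipe's actual output format.

```python
import csv
import io
import json

# Hypothetical parser output: one dict per token.
TOKENS = [
    {"id": 1, "word": "The", "pos": "DT"},
    {"id": 2, "word": "cat", "pos": "NN"},
]


def to_json(rows):
    """Serialize the token rows as a JSON array of objects."""
    return json.dumps(rows)


def to_csv(rows):
    """Serialize the token rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "word", "pos"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(TOKENS))
print(to_json(TOKENS))
```

CSV is handy for loading straight into R or a spreadsheet, JSON for further processing in Python.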