aiGrunn: state-of-the-art transformer pipelines in spaCy - Daniël de Kok & Madeeswaran Kannan

Tags: aigrunn, ai, python

(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).

The company they work for is called “explosion”, so what can go wrong? :-)

SpaCy (https://spacy.io/) is a library for natural language processing. You give it text documents and you get them back with annotations.

spaCy mostly works with a pipeline. You always start with a tokenizer, followed by multiple optional steps, and at the end you get the annotated document.
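A minimal sketch of that pipeline shape (not code from the talk). It assumes spaCy 3 is installed and uses `spacy.blank("en")`, which gives you a pipeline with only the tokenizer. The built-in `sentencizer` is picked purely because it is an optional step that needs no trained model.

```python
import spacy

# A blank English pipeline: only the tokenizer, no optional components yet.
nlp = spacy.blank("en")
print(nlp.pipe_names)  # the tokenizer is implicit and is not listed here

# Add one optional step: a rule-based sentence splitter (no trained model needed).
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# Calling the pipeline runs tokenizer -> components and returns the annotated Doc.
doc = nlp("SpaCy annotates text. You get a Doc back.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

A trained pipeline loaded with `spacy.load(...)` works the same way, only with more components in the list.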

A tokenizer splits up the text. The period at the end of a sentence doesn’t belong to the last word, for instance: it is a separate item. “Twitter’s” likewise becomes “Twitter” and “’s”. What comes out of the tokenization process is a Doc, which behaves like a list of tokens. doc[9] can be 's.
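To see the splitting for yourself, here is a small sketch with a made-up example sentence (so the index differs from the doc[9] on the slide). It only needs the rule-based English tokenizer, no model download.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("Twitter's board accepted the offer.")

tokens = [token.text for token in doc]
print(tokens)

# A Doc behaves like a sequence of tokens, so indexing works.
print(doc[1].text)   # the possessive 's, split off from "Twitter"
print(doc[-1].text)  # the sentence-final period, a token of its own
```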

A useful step: lemmatisation. The token accepted is annotated with the lemma accept. Likewise, directors has the lemma director. This makes later searching easier.
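A toy sketch of lemma annotation. Note the assumptions: the two-entry lookup table is invented for this example, and the lookup mode of the lemmatizer is used only because it runs without a trained model. The pipelines shown in the talk use much richer data.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")

# Toy lookup table, made up for this example.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"accepted": "accept", "directors": "director"})

# Lemmatizer in "lookup" mode: maps a token's text straight to a lemma.
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lemmatizer.initialize(lookups=lookups)

doc = nlp("The directors accepted the offer.")
lemmas = {token.text: token.lemma_ for token in doc}
print(lemmas["directors"], lemmas["accepted"])
```

Words that are not in the table simply keep their own text as lemma.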

Span classification is used for entity recognition. A token Musk is recognised as a “person”. The tokens 25 and April in combination can be a “date”. The recognised entities end up as doc.ents[number].
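A sketch of what ends up in doc.ents. As a stand-in for a trained entity recogniser it uses spaCy's rule-based `entity_ruler` with two hand-written patterns. The patterns and the sentence are my assumptions, chosen to mirror the examples above.

```python
import spacy

nlp = spacy.blank("en")

# Hand-written patterns standing in for a trained entity recogniser.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Musk"},
    # Two tokens in combination: a number followed by "april" (any casing).
    {"label": "DATE", "pattern": [{"IS_DIGIT": True}, {"LOWER": "april"}]},
])

doc = nlp("Musk made the offer on 25 April.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
print(doc.ents[0].text)
```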

You can do document classification. Categories like “newswire” or “love letter” with an attached estimation (“80% chance this is a newswire”).
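In spaCy the category scores live in doc.cats, a dictionary of label to score. In this sketch the scores are set by hand to mimic what a trained text categoriser would write there. Nothing is actually being classified here.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Shares rose sharply after the takeover was announced.")

# Assumption: scores set by hand, standing in for a trained text categoriser.
doc.cats = {"newswire": 0.8, "love letter": 0.2}

# Pick the label with the highest score.
best_label, best_score = max(doc.cats.items(), key=lambda item: item[1])
summary = f"{best_score:.0%} chance this is a {best_label}"
print(summary)
```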

Some of the transformers work with AI. Several kinds of pre-trained data are available. What they themselves use is the Groningen Meaning Bank (GMB), developed by the University of Groningen: more than 10k English texts, mostly newspaper texts from the public domain. You can also look at https://github.com/explosion/curated-transformers .

spaCy has its own plugins to provide annotations, but you can also plug in your own. It is configured through an INI-style config file (config.cfg). A project can be seen as a sort of “makefile” for running everything: assets (= remote sources you want to have downloaded), training data, what has to be run, the config, etc.
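Plugging in your own step can look roughly like this. The component name `long_token_counter` and the extension attribute `n_long_tokens` are made up for illustration; the `@Language.component` decorator and `Doc.set_extension` are spaCy's regular mechanisms for this.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute to store our own annotation on the Doc.
Doc.set_extension("n_long_tokens", default=0, force=True)


@Language.component("long_token_counter")
def long_token_counter(doc):
    """Toy component: count the tokens longer than six characters."""
    doc._.n_long_tokens = sum(1 for token in doc if len(token.text) > 6)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("long_token_counter")

doc = nlp("Transformers annotate documents.")
print(doc._.n_long_tokens)

# The pipeline's config knows about the new component, too.
configured_pipeline = nlp.config["nlp"]["pipeline"]
print(configured_pipeline)
```

`nlp.config.to_str()` prints the whole configuration in the same INI-style format as the config file.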

They showed a demo of how the whole system works. Looked nice and useful. You can play with the demo yourself: https://github.com/explosion/aiGrunn-2023

Compared to an LLM like ChatGPT, targeted NLP at the moment often performs much better at classification.

 

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
