(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
The company the speakers work for is called “Explosion”, so what can go wrong? :-)
spaCy (https://spacy.io/) is a library for natural language processing. You give it text documents and you get them back with annotations.
spaCy mostly works with a pipeline. You always start with a tokenizer, followed by multiple optional steps, and at the end you get the annotated document.
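To make the pipeline idea concrete, here is a small sketch of my own (not from the talk and not spaCy's actual implementation). The names ToyDoc, whitespace_tokenizer, count_tokens, lowercase_forms and run_pipeline are all made up for illustration. The only point is the shape: one mandatory tokenizer up front, then optional steps that each add annotations to the same document.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for an annotated document."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def whitespace_tokenizer(text):
    # Mandatory first step: turn raw text into a document with tokens.
    return ToyDoc(text=text, tokens=text.split())


def count_tokens(doc):
    # Optional step: add one annotation and hand the document on.
    doc.annotations["n_tokens"] = len(doc.tokens)
    return doc


def lowercase_forms(doc):
    # Another optional step, independent of the previous one.
    doc.annotations["lower"] = [token.lower() for token in doc.tokens]
    return doc


def run_pipeline(text, tokenizer, components):
    # The tokenizer always runs first, then every component in order.
    doc = tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(
    "Explosion makes spaCy",
    whitespace_tokenizer,
    [count_tokens, lowercase_forms],
)
print(doc.tokens)
print(doc.annotations)
```

Leaving out or reordering entries in the component list changes which annotations end up on the document, which is the part that makes the steps "optional".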
A tokenizer splits up the text. The period at the end of a sentence doesn’t belong to the last word, for instance: it is a separate item. “Twitter’s” likewise becomes “Twitter” and “’s”. What comes out of the tokenization process is a Doc, which behaves like a list of tokens: doc[9] can be 's.
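The kind of splitting described above can be imitated with a toy regular expression. This is my own sketch, assuming a very naive rule set; spaCy's real tokenizer uses language-specific rules and exceptions and is far more careful. The names TOKEN_RE and toy_tokenize are hypothetical.

```python
import re

# Toy pattern, tried left to right at each position:
#   1. a possessive 's (straight or curly apostrophe),
#   2. a run of word characters,
#   3. any single non-space symbol, such as a period.
TOKEN_RE = re.compile(r"['’]s\b|\w+|[^\w\s]")


def toy_tokenize(text):
    # Return the tokens as a plain list, so they can be indexed like a Doc.
    return TOKEN_RE.findall(text)


tokens = toy_tokenize("Twitter's directors accepted the offer.")
print(tokens)
print(tokens[1])   # the possessive is its own token
print(tokens[-1])  # so is the final period
```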
A useful step: lemmatisation. The token accepted is annotated with the lemma accept. This makes later searching easier. Similarly, directors has the lemma director.
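Why lemmas make searching easier is easiest to see with a tiny lookup table. This is a sketch under the assumption that a hand-written dictionary is good enough. A real lemmatiser uses rules or trained models. LEMMA_TABLE, toy_lemma and find_by_lemma are hypothetical names.

```python
# Hypothetical lookup table: surface form -> lemma.
LEMMA_TABLE = {
    "accepted": "accept",
    "directors": "director",
}


def toy_lemma(token):
    # Fall back to the lowercased token when the table has no entry.
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def find_by_lemma(query, tokens):
    # Compare lemmas instead of surface forms, so searching for
    # "accept" also finds "accepted".
    wanted = toy_lemma(query)
    return [i for i, token in enumerate(tokens) if toy_lemma(token) == wanted]


tokens = ["The", "directors", "accepted", "it"]
print(find_by_lemma("accept", tokens))
print(find_by_lemma("director", tokens))
```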
Span classification is entity recognition. A token Musk is recognised as a “person”. The tokens 25 and april in combination can be a “date”. The recognised entities end up as doc.ents[number].
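An entity is basically a labelled span of tokens. Here is a rule-based sketch of my own to show that data shape. The trained recogniser shown in the talk works differently, and ToySpan, PEOPLE, MONTHS and toy_entities are names I made up.

```python
from typing import NamedTuple


class ToySpan(NamedTuple):
    start: int  # index of the first token in the span
    end: int    # index one past the last token
    label: str


# Hypothetical word lists standing in for a trained model.
PEOPLE = {"Musk"}
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def toy_entities(tokens):
    ents = []
    i = 0
    while i < len(tokens):
        next_is_month = i + 1 < len(tokens) and tokens[i + 1].lower() in MONTHS
        if tokens[i] in PEOPLE:
            # A single token can be an entity on its own.
            ents.append(ToySpan(i, i + 1, "PERSON"))
            i += 1
        elif tokens[i].isdigit() and next_is_month:
            # Two tokens in combination form one DATE span.
            ents.append(ToySpan(i, i + 2, "DATE"))
            i += 2
        else:
            i += 1
    return ents


tokens = ["Musk", "bought", "Twitter", "on", "25", "april", "."]
ents = toy_entities(tokens)
for ent in ents:
    print(ent.label, tokens[ent.start:ent.end])
```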
You can also do document classification: categories like “newswire” or “love letter”, each with an attached estimate (“80% chance this is a newswire”).
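The result of such a classification is a score per category. A toy version, assuming nothing more than keyword counting pushed through a softmax, could look like this. It is my own sketch and not how a trained text classifier works. KEYWORDS and toy_categories are hypothetical.

```python
import math

# Hypothetical keyword lists per category.
KEYWORDS = {
    "newswire": {"reported", "shares", "announced"},
    "love letter": {"love", "dearest", "miss"},
}


def toy_categories(tokens):
    # Count keyword hits per category ...
    counts = {
        label: sum(token.lower() in words for token in tokens)
        for label, words in KEYWORDS.items()
    }
    # ... and turn the counts into scores that sum to 1 (softmax).
    total = sum(math.exp(count) for count in counts.values())
    return {label: math.exp(count) / total for label, count in counts.items()}


cats = toy_categories("The company announced that shares rose".split())
best = max(cats, key=cats.get)
print(best, round(cats[best], 2))
```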
Some of the transformers work with AI. Several kinds of pre-trained data are available. What they themselves use is the Groningen Meaning Bank (GMB), developed by the University of Groningen: more than 10k English texts, mostly newspaper texts from the public domain. You can also look at https://github.com/explosion/curated-transformers .
spaCy has its own plugins to provide annotations, but you can also plug in your own. It is configured through an .ini-style file. A project can be seen as a sort of “makefile” for running everything: assets (= remote sources you want to have downloaded), training data, what has to be run, the config, etc.
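To give an idea of what an .ini-style configuration with a plugged-in component might look like, here is a sketch that reads one with Python's standard configparser. spaCy has its own config loader, so this only illustrates the file format. The section names, keys and the my_component/another_component names are my assumptions, not copied from the talk.

```python
import configparser
import json

# Hypothetical config text: sections in brackets, JSON-like values.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["my_component", "another_component"]

[components.my_component]
factory = "my_component"
threshold = 0.5
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)


def section_values(section):
    # Values are written JSON-style, so decode them into Python objects.
    return {key: json.loads(value) for key, value in parser[section].items()}


nlp_settings = section_values("nlp")
component_settings = section_values("components.my_component")
print(nlp_settings["pipeline"])
print(component_settings["threshold"])
```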
They showed a demo of how the whole system works. Looked nice and useful. You can play with the demo yourself: https://github.com/explosion/aiGrunn-2023
Compared to an LLM like ChatGPT, targeted NLP at the moment often performs much better at classification.
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.