IceNLPy 🐍
By Hinrik Hafsteinsson
🐍🐍🐍
I never really told the world about a fun project I did a while back.
So here it is:
What is it?
IceNLPy is a Python wrapper for the Java-based IceNLP toolkit for Icelandic. Although originally made for fun, the package is fully functional and can be used in real-world projects.
>>> from icenlpy import icetagger, iceparser, tokenizer
>>> text = "Hann er mjög virtur málfræðingur að norðan. Hvað segirðu um það?"
>>> tokens = tokenizer.split_into_sentences(text)
>>> tokens
['Hann er mjög virtur málfræðingur að norðan .', 'Hvað segirðu um það ?']
>>> tagged = icetagger.tag_text(tokens)
>>> tagged
['Hann fpken er sfg3en mjög aa virtur lkensf málfræðingur nken að aa norðan aa . .\n',
'Hvað fshen segirðu sfg2en um ao það fpheo ? ?\n']
>>> parsed = iceparser.parse_text(tagged)
>>> parsed
[[[NP Hann fpken ] [VPb er sfg3en ] [NP [AP [AdvP mjög aa ] virtur lkensf ] málfræðingur nken ]
[AdvP að aa norðan aa ] . .], [[NP Hvað fshen ] [VP segirðu sfg2en ] [PP um ao [NP það fpheo ] ] ? ?]]
Further information is available on the package's GitHub page.
Background
Last December (2023) I needed to use some of the tools in the IceNLP library for a project I was working on.
IceNLP is a language processing library for Icelandic written in Java which, for a long time, was the state-of-the-art in general-purpose NLP for Icelandic. Even though it's a bit outdated now, it serves some purpose as a baseline for Icelandic NLP and is still applicable in various contexts.
At the time I was using the GreynirEngine package extensively in a project, and the thought occurred to me that it would be nice to be able to use and integrate IceNLP in a Python environment in a similar way.
I decided to write a Python wrapper for the IceNLP library over the Christmas break, which I call IceNLPy. The wrapper is available on GitHub and PyPI, and is installable via pip.
Why?
As the “problem” of IceNLP being a Java library is near-trivial, the wrapper is barely more than a for-fun project. However, it is a reasonably good example of how to wrap a Java library in Python with a thin subprocess layer and custom Python data types. I also take any chance I get to work with Python outside the confines of individual scripts (which NLP work tends to involve a lot of).
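To give a rough idea of what a thin subprocess layer with custom data types can look like, here is a minimal sketch. It is not the actual IceNLPy code: the names `TaggedToken`, `build_command`, `run_java_tool` and `parse_tagged_line` are made up for illustration, and the jar path and main class in the usage note are placeholders, not real IceNLP entry points. The parser also assumes the tagged output alternates strictly between word and tag, as in the example at the top of this post.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TaggedToken:
    """One word form paired with its tag string (hypothetical data type)."""
    form: str
    tag: str


def build_command(jar_path: str, main_class: str,
                  extra_args: Sequence[str] = ()) -> List[str]:
    """Assemble a `java -cp <jar> <main class> [args...]` command line.

    The jar path and main class are supplied by the caller; nothing here
    is specific to IceNLP.
    """
    return ["java", "-cp", jar_path, main_class, *extra_args]


def run_java_tool(command: Sequence[str], text: str,
                  timeout: float = 60.0) -> str:
    """Send `text` to the process on stdin and return what it writes to stdout.

    Raises subprocess.CalledProcessError if the process exits with a
    non-zero status, and subprocess.TimeoutExpired if it runs too long.
    """
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
        timeout=timeout,
    )
    return result.stdout


def parse_tagged_line(line: str) -> List[TaggedToken]:
    """Turn 'word tag word tag ...' into a list of TaggedToken objects.

    Assumes whitespace-separated, strictly alternating word/tag fields.
    """
    fields = line.split()
    if len(fields) % 2 != 0:
        raise ValueError(f"expected word/tag pairs, got {len(fields)} fields")
    return [TaggedToken(form, tag)
            for form, tag in zip(fields[0::2], fields[1::2])]
```

With a working Java installation, the pieces would be combined roughly as `parse_tagged_line(run_java_tool(build_command("path/to/tool.jar", "example.Main"), text))`. Keeping the command building, the process call and the output parsing in separate functions means the parsing can be tested without Java being installed at all.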
Other
The first released iteration of the package includes the tokenizer, the tagger (IceTagger) and the parser (IceParser). I might end up working on it in the future to include the rest of the IceNLP tools and optimize the package, but it’s not a priority. We’ll see. 🤷‍♂️