IceNLPy 🐍

By Hinrik Hafsteinsson

🐍🐍🐍

I never really told the world about a fun project I did a while back.

So here it is:

What is it?

IceNLPy is a Python wrapper for the Java-based IceNLP toolkit for Icelandic. Although originally made for fun, the package is fully functional and can be used in real-world projects.

>>> from icenlpy import icetagger, iceparser, tokenizer

>>> text = "Hann er mjög virtur málfræðingur að norðan. Hvað segirðu um það?"
>>> tokens = tokenizer.split_into_sentences(text)
>>> tokens

['Hann er mjög virtur málfræðingur að norðan .', 'Hvað segirðu um það ?']

>>> tagged = icetagger.tag_text(tokens)
>>> tagged

['Hann fpken er sfg3en mjög aa virtur lkensf málfræðingur nken að aa norðan aa . .\n', 
'Hvað fshen segirðu sfg2en um ao það fpheo ? ?\n']

>>> parsed = iceparser.parse_text(tagged)
>>> parsed

[[[NP Hann fpken ] [VPb er sfg3en ] [NP [AP [AdvP mjög aa ] virtur lkensf ] málfræðingur nken ]
[AdvP að aa norðan aa ] . .], [[NP Hvað fshen ] [VP segirðu sfg2en ] [PP um ao [NP það fpheo ] ] ? ?]]

Further information is available on the package's GitHub page.

Background

Last December (2023) I needed to use some of the tools in the IceNLP library for a project I was working on.

IceNLP is a language processing library for Icelandic, written in Java, which for a long time was the state of the art in general-purpose NLP for Icelandic. Even though it's a bit outdated now, it still serves as a baseline for Icelandic NLP and remains applicable in various contexts.

At the time I was using the GreynirEngine package extensively in a project, and the thought occurred to me that it would be nice to be able to use and integrate IceNLP in a Python environment in a similar way.

I decided to write a Python wrapper for the IceNLP library over the Christmas break, which I call IceNLPy. The wrapper is available on GitHub and PyPI and can be installed with pip.

Why?

As the “problem” of IceNLP being a Java library is near-trivial, the wrapper is barely more than a for-fun project. However, it is a reasonably good example of how to wrap a Java library in Python with a thin subprocess layer and custom Python data types. I also take any chance I get to work with Python outside the confines of individual scripts (which NLP work tends to involve a lot of).
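To give a rough idea of what "a thin subprocess layer and custom Python data types" can look like, here is a minimal sketch. It is not the actual IceNLPy source: the names `run_tool`, `parse_tagged_line` and `TaggedToken` are made up for illustration, and the Java command line is left as a placeholder. A small Python echo process stands in for the Java tool so the snippet runs without a JVM. The only thing taken from the real package is the alternating word/tag output format shown in the example at the top.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    """One word together with its tag (hypothetical type, for illustration)."""
    word: str
    tag: str


def run_tool(command: list[str], sentences: list[str]) -> list[str]:
    """Send one sentence per line to an external process; return its output lines."""
    result = subprocess.run(
        command,
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return result.stdout.splitlines()


def parse_tagged_line(line: str) -> list[TaggedToken]:
    """Turn 'word tag word tag ...' into a list of TaggedToken objects."""
    fields = line.split()
    if len(fields) % 2 != 0:
        raise ValueError(f"expected word/tag pairs, got: {line!r}")
    return [TaggedToken(w, t) for w, t in zip(fields[0::2], fields[1::2])]


# A real wrapper would call Java here, along the lines of
#   ["java", "-cp", "<path-to-jar>", "<main-class>"]
# (placeholders, not the real IceNLP invocation). This stand-in just echoes
# its input back, byte for byte, so the sketch is runnable anywhere.
echo_command = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

if __name__ == "__main__":
    lines = run_tool(echo_command, ["Hann fpken er sfg3en . ."])
    for token in parse_tagged_line(lines[0]):
        print(token.word, token.tag)
```

The appeal of this approach is that the Java side stays untouched: all the Python code has to know is what to write to standard input and how to read what comes back.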

Other

The first released iteration of the package includes the tokenizer, the tagger (IceTagger) and the parser (IceParser). I might end up working on it in the future to include the rest of the IceNLP tools and optimize the package, but it’s not a priority. We’ll see. 🤷‍♂️