I recently needed a Greek stemmer in C# for a personal project. To my amazement, I could not find anything on the net (that, or my Google skills are getting rusty). As I rarely give anything back to the community (shame alarm), I decided to share my findings with the world.
Disclaimer: I've actually barely written any code; I tried to "borrow" bits from here and there. First, I came across a wonderfully simple description of a Greek stemmer in Georgios Ntais's dissertation. Looking for actual code, I found a Java implementation included in Lucene 4.7. Although a .NET port exists, its current version (3.0.3) does not include the Greek stemmer. As it was around 2am and I was reluctant to write any code whatsoever, I downloaded the free edition of a Java to C# converter (shame alarm x2). The free edition has a 1000-line limit, and GreekStemmer.java from Lucene was 767 lines: hurrahhh!
I was now really close to getting what I wanted. The only thing missing was that GreekStemmer requires all letters to be lower case, with diacritics removed and the small final sigma (ς, used at the end of a word) replaced with the normal small sigma (σ). So I copied the logic found in GreekLowerCaseFilter.java (a no-brainer here) and typed a few lines myself (great success!). As a last small bonus, I also "borrowed" the stop words from the Lucene distribution; I'm pretty sure they will come in handy when analyzing text.
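For the curious, here is a minimal sketch of what that preprocessing step amounts to. It is not the code from the zip and not Lucene's character table: it assumes Unicode decomposition is good enough to strip the accents, and the GreekNormalizer class and its Normalize method are names I made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (not part of Lucene or the zip file): prepares a word
// for the stemmer by lower-casing it, stripping diacritics and replacing the
// final sigma with the normal small sigma.
public static class GreekNormalizer
{
    public static string Normalize(string word)
    {
        if (string.IsNullOrEmpty(word))
        {
            return string.Empty;
        }

        // Decompose accented letters into base letter + combining marks,
        // e.g. 'ά' becomes 'α' followed by a combining acute accent.
        string decomposed = word.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks (tonos, dialytika).
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            char lower = char.ToLowerInvariant(c);

            // Replace final sigma with normal small sigma.
            result.Append(lower == 'ς' ? 'σ' : lower);
        }

        return result.ToString();
    }
}
```

With this sketch, GreekNormalizer.Normalize("Άνθρωπος") gives "ανθρωποσ", which is the shape of input the stemmer expects. The copied GreekLowerCaseFilter logic does the same job with an explicit per-character mapping, which is probably the safer choice if you want to match Lucene's behaviour exactly.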
I've bundled all 3 files (GreekStemmer.cs, the lower-casing logic and the stop words) into a zip file for your download pleasure. See you in 4 years with another blog post...
Enjoy!
Themos