Reputation: 353
I have an awk script that finds word frequencies.
{$0 = tolower($0)} {gsub(/[[:punct:]]/, "")} {for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}
I work with Turkish text. Turkish words mostly appear with suffixes.
A sample of results from this script:
kadınlar 1
kadınlara 1
kadınlarımızın 1
kadınlarına 1
kadınlarının 1
Here the root is “kadın” ("woman" in English).
So, “kadınlar” is “women”. “Kadınlara” is “to women” and so on.
Can awk extract the root “kadın” from these 5 words? Do we need to check a dictionary for this?
Expected output:
These 5 words with the same root (kadın),
kadınlar 1
kadınlara 1
kadınlarımızın 1
kadınlarına 1
kadınlarının 1
should be listed as such:
kadın 5
Upvotes: 1
Views: 104
Reputation: 16797
Rather than writing an awk script, it is probably simpler to use an existing stemmer.
The snowballstemmer package for Python includes a Turkish stemmer.
I don't know Python, but it's easy enough to write something that uses it:
$ pip install snowballstemmer
Defaulting to user installation because normal site-packages is not writeable
Collecting snowballstemmer
Downloading snowballstemmer-2.2.0-py2.py3-none-any.whl (93 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93.0/93.0 KB 2.6 MB/s eta 0:00:00
Installing collected packages: snowballstemmer
Successfully installed snowballstemmer-2.2.0
$ cat >input <<'EOD'
kadınlar
kadınlara
kadınlarımızın
kadınlarına
kadınlarının
EOD
$ cat >tstem <<'EOD'
#!/usr/bin/python3
import snowballstemmer
stemmer = snowballstemmer.stemmer('turkish')
for word in open('input','r').read().splitlines():
    print(word,"->",stemmer.stemWord(word))
EOD
$ chmod +x tstem
$ ./tstem
kadınlar -> kadın
kadınlara -> kadın
kadınlarımızın -> kadın
kadınlarına -> kadın
kadınlarının -> kadın
$
The most popular Turkish stemmer on GitHub seems to be Turkish Stemmer for Python:
$ pip install TurkishStemmer
Defaulting to user installation because normal site-packages is not writeable
Collecting TurkishStemmer
Downloading TurkishStemmer-1.3-py3-none-any.whl (20 kB)
Installing collected packages: TurkishStemmer
Successfully installed TurkishStemmer-1.3
$ cat >tstem2 <<'EOD'
#!/usr/bin/python3
from TurkishStemmer import TurkishStemmer
stemmer = TurkishStemmer()
for word in open('input','r').read().splitlines():
    print(word,"->",stemmer.stem(word))
EOD
$ chmod +x tstem2
$ ./tstem2
kadınlar -> kat
kadınlara -> kadın
kadınlarımızın -> kadın
kadınlarına -> kadın
kadınlarının -> kadın
$
This gets one wrong: “kadınlar” comes out as “kat” instead of “kadın”. (But perhaps it gets some words right that snowballstemmer gets wrong?)
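One way to check whether one stemmer gets right what the other gets wrong is to run both over the same word list and print only the words where they disagree. The following is a minimal sketch; the function name `disagreements` is made up for illustration, and the two stemmers are passed in as plain callables so nothing here depends on either library's internals.

```python
def disagreements(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for each word
    on which the two stemming functions give different results."""
    result = []
    for word in words:
        a = stem_a(word)
        b = stem_b(word)
        if a != b:
            result.append((word, a, b))
    return result
```

With the two packages above installed, passing `snowballstemmer.stemmer('turkish').stemWord` as `stem_a` and `TurkishStemmer().stem` as `stem_b` should list the words worth checking by hand.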
A complete example that reads text on standard input and prints one stem per word:
$ cat >tstem3 <<'EOD'
#!/usr/bin/python3
import sys
import snowballstemmer
stemmer = snowballstemmer.stemmer('turkish')
for line in sys.stdin:
    for word in line.split():
        print(stemmer.stemWord(word))
EOD
$ chmod +x tstem3
$ <original-input.txt tr '[:upper:]' '[:lower:]' |
    tr -s '[:punct:]' ' ' |
    ./tstem3 |
    sort |
    uniq -c
5 kadın
$
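The same pipeline can also be done in a single Python script. This is a sketch under a few assumptions: the names `turkish_lower` and `count_stems` are invented here, and the stemmer is passed in as a callable. The explicit lowercasing step is there because Turkish pairs I with ı and İ with i, whereas Python's `str.lower()` maps I to i; some `tr` implementations (GNU tr, as far as I know) also work byte by byte and may leave non-ASCII capitals such as Ç or İ untouched.

```python
import re
from collections import Counter

# Turkish casing: dotless I lowercases to ı, dotted İ lowercases to i.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})

def turkish_lower(text):
    """Lowercase text using the Turkish rules for I/ı and İ/i."""
    return text.translate(_TR_LOWER).lower()

def count_stems(lines, stem):
    """Count how often each stem occurs in an iterable of text lines.

    `stem` is any function mapping a word to its stem.
    """
    counts = Counter()
    for line in lines:
        # Runs of letters and digits are words; punctuation
        # (including underscores and apostrophes) separates them.
        for word in re.findall(r"[^\W_]+", turkish_lower(line)):
            counts[stem(word)] += 1
    return counts
```

Calling `count_stems(sys.stdin, snowballstemmer.stemmer('turkish').stemWord)` and printing the resulting items should give the same kind of stem/count listing as the shell pipeline.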
Upvotes: 3