zeynel

Reputation: 353

Extracting the root of words with awk

I have an awk script that finds word frequencies.

{$0 = tolower($0)}  {gsub(/[[:punct:]]/, "")} {for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]} 

I work with Turkish text. Turkish words mostly appear with suffixes.

A sample of results from this script:

kadınlar       1
kadınlara      1
kadınlarımızın 1
kadınlarına    1
kadınlarının   1

Here the root is “kadın” ("woman" in English).

So “kadınlar” is “women”, “kadınlara” is “to women”, and so on.

Can awk extract the root “kadın” from these 5 words? Do we need to check a dictionary for this?

Expected output:

These 5 words with the same root (kadın),

kadınlar       1
kadınlara      1
kadınlarımızın 1
kadınlarına    1
kadınlarının   1

should be listed as such:

kadın 5

Upvotes: 1

Views: 104

Answers (1)

jhnc

Reputation: 16797

Rather than writing an awk script, it is probably simpler to use an existing tool.

snowballstemmer appears to be available for Python.

I don't know Python, but it's easy enough to write something that uses it:

$ pip install snowballstemmer
Defaulting to user installation because normal site-packages is not writeable
Collecting snowballstemmer
  Downloading snowballstemmer-2.2.0-py2.py3-none-any.whl (93 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93.0/93.0 KB 2.6 MB/s eta 0:00:00
Installing collected packages: snowballstemmer
Successfully installed snowballstemmer-2.2.0
$ cat >input <<'EOD'
kadınlar
kadınlara
kadınlarımızın
kadınlarına
kadınlarının
EOD
$ cat >tstem <<'EOD'
#!/usr/bin/python3

import snowballstemmer
stemmer = snowballstemmer.stemmer('turkish')

for word in open('input','r').read().splitlines():
    print(word,"->",stemmer.stemWord(word))

EOD
$ chmod +x tstem
$ ./tstem
kadınlar -> kadın
kadınlara -> kadın
kadınlarımızın -> kadın
kadınlarına -> kadın
kadınlarının -> kadın
$

The most popular stemmer on GitHub seems to be Turkish Stemmer for Python:

$ pip install TurkishStemmer
Defaulting to user installation because normal site-packages is not writeable
Collecting TurkishStemmer
  Downloading TurkishStemmer-1.3-py3-none-any.whl (20 kB)
Installing collected packages: TurkishStemmer
Successfully installed TurkishStemmer-1.3
$ cat >tstem2 <<'EOD'
#!/usr/bin/python3

from TurkishStemmer import TurkishStemmer
stemmer = TurkishStemmer()

for word in open('input','r').read().splitlines():
    print(word,"->",stemmer.stem(word))

EOD
$ chmod +x tstem2
$ ./tstem2
kadınlar -> kat
kadınlara -> kadın
kadınlarımızın -> kadın
kadınlarına -> kadın
kadınlarının -> kadın
$

This gets one wrong: it stems "kadınlar" to "kat" rather than "kadın". (But perhaps it gets some right that snowballstemmer gets wrong?)
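
One way to check which stemmer suits a given corpus would be to run both over the same word list and print only the words where they differ. The following is a minimal sketch, not a tested tool: `stem_disagreements` is a hypothetical helper name, and the two lookup tables are stand-ins that simply replay the outputs shown in the transcripts above, so the block runs without either package installed.

```python
def stem_disagreements(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for each word stemmed differently."""
    rows = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            rows.append((word, a, b))
    return rows


# Stand-in stemmers (assumption): lookup tables copied from the two
# transcripts above, used here instead of the real libraries.
words = ["kadınlar", "kadınlara", "kadınlarımızın", "kadınlarına", "kadınlarının"]
snowball_out = {w: "kadın" for w in words}
turkish_stemmer_out = dict(snowball_out, kadınlar="kat")

for word, a, b in stem_disagreements(words, snowball_out.get, turkish_stemmer_out.get):
    print(word, "->", a, "vs", b)
```

With both packages installed, the stand-ins could presumably be replaced by `snowballstemmer.stemmer('turkish').stemWord` and `TurkishStemmer().stem`, the two calls used in the scripts above.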


A sample complete implementation:

$ cat >tstem3 <<'EOD'
#!/usr/bin/python3

import sys
import snowballstemmer
stemmer = snowballstemmer.stemmer('turkish')

for line in sys.stdin:
    for word in line.split():
        print(stemmer.stemWord(word))
EOD
$ chmod +x tstem3
$ <original-input.txt tr '[:upper:]' '[:lower:]' |
  tr -s '[:punct:]' ' ' |
  ./tstem3 |
  sort |
  uniq -c
      5 kadın
$
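
If you would rather keep everything in one script, the lowercasing, punctuation handling and counting can also be done in Python. This is a rough sketch under a few assumptions: `turkish_lower`, `count_stems` and `format_counts` are hypothetical names, the stemmer is passed in as a plain function, and punctuation is treated as a word separator (as in the `tr` pipeline above, so a word with an apostrophe is split in two). The explicit handling of I and İ is there because generic lowercasing maps "I" to "i" rather than the Turkish "ı", and byte-oriented tools may not lowercase non-ASCII letters at all.

```python
import re
from collections import Counter


def turkish_lower(text):
    """Lowercase text, mapping I -> ı and İ -> i before the generic rules apply."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def count_stems(lines, stem):
    """Count stems over an iterable of lines; `stem` maps one word to its stem."""
    counts = Counter()
    for line in lines:
        # \w+ matches runs of Unicode letters, digits and underscores,
        # so punctuation and whitespace both act as separators.
        for word in re.findall(r"\w+", turkish_lower(line)):
            counts[stem(word)] += 1
    return counts


def format_counts(counts):
    """Render counts as 'stem count' lines, sorted by stem."""
    return [f"{stem} {n}" for stem, n in sorted(counts.items())]
```

With snowballstemmer installed, something like `print("\n".join(format_counts(count_stems(sys.stdin, snowballstemmer.stemmer('turkish').stemWord))))` (after importing `sys` and `snowballstemmer`) should print lines in the `kadın 5` form the question asks for.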

Upvotes: 3
