Harvey

Reputation: 572

Using RegEx to find and print plural words in Turkish

I am fairly new to Python. In my code, I read a text file as input and put each line of the file into a list as an element.

I'm trying to use a regex to find and print plural words. In Turkish, plural words end in the suffix '-ler' or '-lar'.

My code is as follows:

import re

f = open('C:/Users/ENE/Desktop/CSE & Kodlar/nlp/utf8textfile.txt', encoding='utf-8-sig', errors='ignore')


with f as file:
    list = file.readlines()
list = [x.strip() for x in list]

print(list)

total = 0
for i in list:
    total += len(i)
ave_size = float(total) / float(len(list))
print("Average word length = " + str(ave_size))

#p = re.compile('.*l[ae]r.*')

for element in list:
    m = re.findall(".*l[ae]r.*", element)
    if m:
        print(m)

which gives an output of

list = ['Aliler geldiler', 'Selam olsun sana', 'Merhabalar', 'Java kitabı nerede']

for loop: ['Aliler geldiler'] ['Merhabalar']

I am trying to print word by word, like ['Aliler'], ['geldiler'] and ['Merhabalar']. How can I do this?

Upvotes: 2

Views: 436

Answers (3)

Wiktor Stribiżew

Reputation: 627082

You may just find all words ending in lar or ler using a \w*l[ea]r\b regex:

results = re.findall(r'\w*l[ea]r\b', s)

See the regex demo. In Python 3.x, the \b word boundary is Unicode-aware by default; in Python 2.x, I'd recommend adding the re.U flag.

Here, s can be the whole line, or even the whole document.

Details

  • \w* - 0+ letters, digits and _ (in Python 3.x, it will match all Unicode letters, digits or _, you may use [^\W\d_]* to only match letters)
  • l - an l letter
  • [ea] - e or a
  • r - an r letter
  • \b - a word boundary (note the r'..' notation used to avoid double escaping \b to make the engine parse it as a word boundary).
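As a minimal sketch, here is the pattern above applied to the sample lines from the question. The `lines` list is copied from the question's output, and the `find_plurals` helper name is only illustrative:

```python
import re

# Pattern from the answer: a run of word characters ending in "lar" or "ler"
# immediately before a word boundary.
PLURAL_RE = re.compile(r'\w*l[ea]r\b')

def find_plurals(text):
    """Return every word in text that ends in 'lar' or 'ler'."""
    return PLURAL_RE.findall(text)

# Sample lines copied from the question's output.
lines = ['Aliler geldiler', 'Selam olsun sana', 'Merhabalar', 'Java kitabı nerede']

for line in lines:
    matches = find_plurals(line)
    if matches:
        print(matches)
```

On Python 3 this should print ['Aliler', 'geldiler'] and then ['Merhabalar'], which is the word-by-word output the question asks for.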

Upvotes: 1

Andreas

Reputation: 159135

.* matches everything (except line terminators).

This means that .*l[ae]r.* will match the entire input line if it contains lar or ler, and will otherwise match nothing.

You want to match words, not entire lines.

Since the word must end with l[ae]r, you need to ensure that the r is the end of the word. That can be done using \b (word boundary).

Since the word must end with l[ae]r, that suffix has to be preceded by 1 or more (+) word characters, i.e. \w.

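Now, in Python 2, \w only matches ASCII word characters by default, so there you need to enable Unicode mode (the re.U flag) for it to match all letters (e.g. ñ and ı). In Python 3, str patterns are Unicode-aware by default, so no flag is needed. Also note that \w matches digits (0-9) and underscore (_), but that's generally ok.

So, your regex should be (with the re.U flag added on Python 2):

r"\w+l[ae]r\b"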

See regex101.com for demo.
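To make the difference concrete, here is a small sketch comparing the question's pattern with the word-level one on a sample line from the question. The variable names are illustrative, and Python 3 is assumed:

```python
import re

line = 'Aliler geldiler'

# .* is greedy and spans the whole line, so the match is the entire line.
whole_line = re.findall(r'.*l[ae]r.*', line)

# \w+ stays within a single word and \b pins the final r to the end of a word.
words = re.findall(r'\w+l[ae]r\b', line)

print(whole_line)  # ['Aliler geldiler']
print(words)       # ['Aliler', 'geldiler']
```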

Upvotes: 0

CodeCupboard

Reputation: 1585

You can achieve what you want with the following:

import re

example = "example words Aliler Merhabalar"

words = example.split()

for word in words:
    # Print the word if it ends in either plural suffix
    if re.search(r"ler$", word):
        print(word)
    elif re.search(r"lar$", word):
        print(word)

This will output:

Aliler
Merhabalar
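The two searches differ only in one vowel, so they could be folded into a single character class. This is just a sketch of that variation; the `PLURAL_SUFFIX` and `plural_words` names are hypothetical:

```python
import re

# One pattern covers both suffixes: "l", then "e" or "a", then "r" at the end.
PLURAL_SUFFIX = re.compile(r"l[ea]r$")

def plural_words(text):
    """Return the whitespace-separated words that end in 'ler' or 'lar'."""
    return [word for word in text.split() if PLURAL_SUFFIX.search(word)]

example = "example words Aliler Merhabalar"
for word in plural_words(example):
    print(word)
```

Because the text is split on whitespace, a word followed directly by punctuation (for example "Aliler,") would not match, since $ anchors at the end of the whole token.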

Upvotes: 1
