Reputation: 572
I am fairly new to Python. In my code, I read a text file as input and put each line of that file into a list as an element.
I'm trying to use a regex to find and print plural words. In Turkish, plural words end with the suffix '-ler' or '-lar'.
My code is as follows:
import re
f = open('C:/Users/ENE/Desktop/CSE & Kodlar/nlp/utf8textfile.txt', encoding='utf-8-sig', errors='ignore')
with f as file:
    list = file.readlines()
    list = [x.strip() for x in list]
print(list)
total = 0
for i in list:
    total += len(i)
ave_size = float(total) / float(len(list))
print("Average word length = " + str(ave_size))
#p = re.compile('.*l[ae]r.*')
for element in list:
    m = re.findall(".*l[ae]r.*", element)
    if m:
        print(m)
which gives an output of
list = ['Aliler geldiler', 'Selam olsun sana', 'Merhabalar', 'Java kitabı nerede']
for loop: ['Aliler geldiler'] ['Merhabalar']
I am trying to print word by word, like ['Aliler'], ['geldiler'] and ['Merhabalar']. How can I do this?
Upvotes: 2
Views: 436
Reputation: 627082
You can find all words ending in lar or ler using the \w*l[ea]r\b regex:
results = re.findall(r'\w*l[ea]r\b', s)
See the regex demo. In Python 3.x, the \b word boundary is Unicode-aware by default; in Python 2.x, I'd recommend adding the re.U flag.
Here, s can be the whole line, or even the whole document.
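As a minimal sketch of both usages, the pattern can be applied line by line or to the whole text at once. The sample lines are the ones shown in the question's output; the names lines, plural_re, per_line and all_words are made up for illustration.

```python
import re

# Sample lines taken from the output shown in the question
lines = ['Aliler geldiler', 'Selam olsun sana', 'Merhabalar', 'Java kitabı nerede']

# Words ending in "lar" or "ler"
plural_re = re.compile(r'\w*l[ea]r\b')

# Per line: each line yields a (possibly empty) list of matching words
per_line = [plural_re.findall(line) for line in lines]
print(per_line)   # [['Aliler', 'geldiler'], [], ['Merhabalar'], []]

# Whole document: join the lines and search once
all_words = plural_re.findall('\n'.join(lines))
print(all_words)  # ['Aliler', 'geldiler', 'Merhabalar']
```

Searching the joined text gives a single flat list of words, which is probably closest to the word-by-word output asked for.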
Details
\w* - 0+ letters, digits and _ (in Python 3.x, it will match all Unicode letters, digits or _; you may use [^\W\d_]* to only match letters)
l - an l letter
[ea] - e or a
r - an r letter
\b - a word boundary (note the r'..' notation used to avoid double escaping \b to make the engine parse it as a word boundary).
Upvotes: 1
Reputation: 159135
.* matches everything (except line terminators).
This means that .*l[ae]r.* will match the entire input line if it contains lar or ler, and will otherwise match nothing.
You want to match words, not entire lines.
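A short sketch of that behaviour, using two of the lines from the question (the variable names are only for illustration):

```python
import re

# The greedy .* on both sides swallows the whole line around "ler"
whole_line = re.findall(r'.*l[ae]r.*', 'Aliler geldiler')
print(whole_line)  # ['Aliler geldiler']

# A line with no "lar" or "ler" gives no match at all
no_match = re.findall(r'.*l[ae]r.*', 'Selam olsun sana')
print(no_match)    # []
```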
Since the word must end with l[ae]r, you need to ensure that the r is the end of the word. That can be done using \b (word boundary).
The l[ae]r ending also has to be preceded by 1 or more (+) word characters, i.e. \w.
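Now, in some regex engines (and in Python 2 without the re.UNICODE flag) \w only matches ASCII letters (a-z, A-Z), so Unicode mode has to be enabled for it to match all letters (e.g. ñ and ı). In Python 3, \w is already Unicode-aware for str patterns. Also note that \w matches digits (0-9) and underscore (_), but that's generally ok.
So, your regex should be (in Python 2, also pass the re.UNICODE flag):
r"\w+l[ae]r\b"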
See regex101.com for demo.
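A minimal sketch of how this pattern behaves in Python 3, assuming str (not bytes) input. The sample text is invented for illustration, and re.ASCII is used only to show what an ASCII-only \w would do.

```python
import re

pattern = re.compile(r'\w+l[ae]r\b')

# Invented sample: two plurals, a bare "lar", and a word with "lar" in the middle
text = 'çocuklar ağaçlar lar salarım'
matches = pattern.findall(text)
print(matches)     # ['çocuklar', 'ağaçlar']

# \w+ needs at least one character before the suffix, unlike \w*
bare = re.findall(r'\w*l[ae]r\b', 'lar')
print(bare)        # ['lar']

# With ASCII-only matching, "ç" is not a word character, so nothing matches
ascii_only = re.findall(r'\w+l[ae]r\b', 'ağaçlar', flags=re.ASCII)
print(ascii_only)  # []
```

The word salarım is skipped because the character after lar is a letter, so \b does not match there.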
Upvotes: 0
Reputation: 1585
You can achieve what you want with the following:
import re
example = "example words Aliler Merhabalar"
words = example.split()
for word in words:
    if re.search(r"ler$", word):
        print(word)
    elif re.search(r"lar$", word):
        print(word)
This will output:
Aliler
Merhabalar
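The same split-and-check idea can also be sketched without a regex, using str.endswith with a tuple of suffixes. Stripping punctuation is an added assumption here, since a word like "Merhabalar!" would otherwise fail the end-of-string check; the sample sentence is a variation on the one above.

```python
import string

# Variation on the example above, with punctuation added for illustration
example = "example words Aliler, Merhabalar!"

# Remove leading and trailing punctuation from each whitespace-separated word
words = [w.strip(string.punctuation) for w in example.split()]

# str.endswith accepts a tuple and returns True if any suffix matches
plurals = [w for w in words if w.endswith(('ler', 'lar'))]
print(plurals)  # ['Aliler', 'Merhabalar']
```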
Upvotes: 1