Reputation: 41
I have a list like the one below that needs to be split into prefix/root/suffix.
Input
form
jalan
ba-jalan
pem-porut#an
daun #kulu
daun#kulu
tarik-napas
tarik#napas
n-cium #bow
arau/araw
imbaw//nimbaw
dengo | nengo
dodop=am
{di} dalam
di {dalam}
I have done it with the regex below in Python:
import sys
# Redirect everything printed below into final.txt
sys.stdout = open('final.txt', 'w')
import re
with open('split.txt') as f:
    new_split = [item.strip() for item in f.readlines()]
for word in new_split:
    m = re.match(r"(?:\{[^-#={}/|]+\})?(?:([^-#={}/|]+)-)?([^-#={}/|]+)(?:/[^-#={}/|]+)?(?:[#=]([^-#={}/|]+))?", word)
    if m:
        print("\t".join([str(item) for item in m.groups()]))
    else:
        print("(no match: %s)" % word)
The final output looks like this:
None jalan None
ba jalan None
pem porut an
None daun kulu
None daun kulu
tarik napas None
None tarik napas
n cium bow
None arau None
None imbaw None
None dengo None
None dodop am
None dalam None
None di None
As you can see, for the word dalam near the bottom of the output file there is an extra space before dalam, and some other words also have an extra space before the string. How can I remove those extra spaces from final.txt? Can I do it in the same script above, or should I do it in a separate script? Thanks.
Upvotes: 1
Views: 1263
Reputation: 1623
Add lstrip() to each string to remove leading whitespace.
str(item).lstrip()
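As a small illustration of what lstrip() does here: the two sample values below are assumptions, written to look like what the regex in the question captures as the root for "{di} dalam" and "di {dalam}". Note that lstrip() only trims the left side, so a value with a trailing space would need strip() instead.

```python
# Sample values (assumed): shaped like the root group captured for
# "{di} dalam" (leading space) and "di {dalam}" (trailing space).
leading = " dalam"
trailing = "di "

print(repr(leading.lstrip()))   # leading space is removed
print(repr(trailing.lstrip()))  # trailing space is untouched by lstrip()
print(repr(trailing.strip()))   # strip() trims both sides
```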
Code:
import re
with open('split.txt') as w:
    new_split = [item.strip() for item in w.readlines()]
for word in new_split:
    m = re.match(r"(?:\{[^-#={}/|]+\})?(?:([^-#={}/|]+)-)?([^-#={}/|]+)(?:/[^-#={}/|]+)?(?:[#=]([^-#={}/|]+))?", word)
    if m:
        print("\t".join([str(item).lstrip() for item in m.groups()]))
    else:
        print("(no match: %s)" % word)
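A possible variant, offered only as a sketch and not as part of the answer above: it keeps the same regex, but uses strip() so that trailing spaces (as in "di {dalam}" or "daun #kulu") are trimmed too, and it writes to final.txt directly from the same script. The names PATTERN, split_word, format_row and write_rows are hypothetical helpers introduced for this example.

```python
import re

# Same pattern as in the question, split over two lines for readability.
PATTERN = re.compile(
    r"(?:\{[^-#={}/|]+\})?(?:([^-#={}/|]+)-)?([^-#={}/|]+)"
    r"(?:/[^-#={}/|]+)?(?:[#=]([^-#={}/|]+))?"
)


def split_word(word):
    """Return (prefix, root, suffix) with whitespace trimmed, or None if no match."""
    m = PATTERN.match(word)
    if m is None:
        return None
    # Unmatched optional groups are None; only strip real strings.
    return tuple(g.strip() if g is not None else None for g in m.groups())


def format_row(word):
    """Build one tab-separated output line, mirroring the original print calls."""
    parts = split_word(word)
    if parts is None:
        return "(no match: %s)" % word
    return "\t".join(str(p) for p in parts)


def write_rows(src_path, dst_path):
    """Read words from src_path and write one formatted row per word to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(format_row(line.strip()) + "\n")
```

Calling write_rows('split.txt', 'final.txt') would then do the splitting and the whitespace cleanup in one pass, so no separate script is needed.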
Upvotes: 1