Reputation: 942
I have a file of Unicode text that follows a pattern like this:
a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥
Here '१' and '२' are Unicode (Devanagari) digits, so they do not respond to my numeric patterns. There is a space between '॥' and '२'.
The file has no newlines or other breaks. I want a newline after every alternate '॥', so that the text looks like this:
a unicode string1 । b unicode string2 ॥ १ ॥
c unicode string3 । d unicode string4 ॥ २ ॥
I tried a few regexes but could not get it to work with my limited knowledge of regex. A sample of my code, which puts a newline after every '॥', is below.
import csv
txt_file = "/path/to/file/file_name.txt"
csv_file = "mycsv.csv"
regex = "॥"
with open(txt_file,'r+') as fr, open('vc','r+') as fw:
    for line in fr:
        fw.write(line.replace(regex, "॥\n"))
It gives a result like this:
a unicode string1 । b unicode string2 ॥
१ ॥
c unicode string3 । d unicode string4 ॥
२ ॥
Upvotes: 1
Views: 92
Reputation: 1402
This happens because the code finds every instance of '॥' and puts a newline after each one. You may want to rewrite your loop so that it replaces a more specific string.
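# Note: this literal matches only the marker numbered '१';
# every other number needs its own replacement.
regex = '॥ १ ॥'
with open("newTextFile.txt", "r", encoding="utf-8") as txt_file:
    rawFileString = txt_file.read()
rawFileString = rawFileString.replace(regex, '॥ १ ॥\n')
print(rawFileString)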
From here the string contains the new lines, and you can write it to a new file.
Note: this will work because there is a pattern in your text file. If you have something more complicated you may need to do several replacements or other modifications to the text to retrieve the result you want.
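When the numbers differ from one marker to the next, one literal replacement per number gets unwieldy. Below is a minimal sketch of a single replacement, assuming each number is a run of Devanagari digits (० to ९) between two '॥' marks. The name add_verse_breaks is hypothetical and not part of the original answer.

```python
import re

# Assumption: a marker is '॥', a space, one or more Devanagari digits,
# a space and '॥', followed by the space that should become a newline.
VERSE_MARKER = re.compile(r'(॥ [०-९]+ ॥) ')


def add_verse_breaks(text):
    """Replace the space after every '॥ <number> ॥' marker with a newline."""
    # '\1' puts the captured marker back; the trailing space is dropped.
    return VERSE_MARKER.sub(r'\1\n', text)


sample = ("a unicode string1 । b unicode string2 ॥ १ ॥ "
          "c unicode string3 । d unicode string4 ॥ २ ॥")
print(add_verse_breaks(sample))
```

The returned string can then be written out with open(path, 'w', encoding='utf-8'). A marker at the very end of the text has no following space, so it is left untouched.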
Edit: Although this method can get messy, you can avoid very complicated regex by finding the index of each delimiter and slicing out the substring that ends at every second one.
Given how your file appears to be patterned, this may work for you:
delimiter = '॥'
with open("newTextFile.txt", "r", encoding="utf-8") as txt_file:
    rawFileString = txt_file.read()
startOfText = 0
# Locate the first pair of delimiters.
instance1 = rawFileString.find(delimiter)
instance2 = rawFileString.find(delimiter, instance1 + 1)
# Loop until no complete pair of delimiters is left, instead of a fixed count.
while instance1 != -1 and instance2 != -1:
    # Slice up to and including the second delimiter of the pair.
    substring = rawFileString[startOfText:instance2 + len(delimiter)]
    print(substring)
    # Skip the delimiter and the space that follows it.
    startOfText = instance2 + len(delimiter) + 1
    instance1 = rawFileString.find(delimiter, startOfText)
    instance2 = rawFileString.find(delimiter, instance1 + 1)
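The same idea of pairing up delimiters can also be written by splitting the text and counting the pieces. This is only a sketch under the assumption that every second '॥' should end a line; the function name break_after_alternate is hypothetical.

```python
def break_after_alternate(text, delimiter='॥'):
    """Insert a newline after every second occurrence of delimiter."""
    pieces = text.split(delimiter)
    out = []
    # Every piece except the last one was followed by a delimiter.
    for count, piece in enumerate(pieces[:-1], start=1):
        out.append(piece + delimiter)
        if count % 2 == 0:
            out.append('\n')
    # Keep whatever text came after the final delimiter.
    out.append(pieces[-1])
    # Trim the stray spaces at line edges and drop empty lines.
    lines = ''.join(out).split('\n')
    return '\n'.join(line.strip() for line in lines if line.strip())


sample = ("a unicode string1 । b unicode string2 ॥ १ ॥ "
          "c unicode string3 । d unicode string4 ॥ २ ॥")
print(break_after_alternate(sample))
```

This version does not depend on what sits between the two marks, so it should also cope with multi-digit numbers, but it will pair the marks wrongly if a '॥' is missing somewhere in the file.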
Upvotes: 1
Reputation: 2269
Another way to solve this relies on the fact that a "॥ " followed by an alphabetic character always marks the place where a new line should be inserted.
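import re
s = 'unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥'
# Split on the space between '॥' and a following letter. The lookarounds
# keep both the '॥' and the letter in the output.
# Note: [a-z] only matches the ASCII placeholders used in this sample.
occurrences = re.split(r'(?<=॥) (?=[a-z])', s)
for item in occurrences:
    print(item.strip())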
Upvotes: 1
Reputation: 4067
Welcome to the confusing world of regex...
I suggest using the re library, which can easily handle what you want to do. For example:
import re
text = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"
pattern = '(॥ .{1} ॥ )'
new = re.sub(pattern,
             lambda m: m.groups()[0][:-1] + '\n',
             text)
print(new)
>> a unicode string1 । b unicode string2 ॥ १ ॥
c unicode string3 । d unicode string4 ॥ २ ॥
A bit of explanation:
'pattern' is a regular expression defining the '॥ [any character] ॥' pattern you want to place a newline after. The '.{1}' means 'any single character', and I've left a space after the second ॥ so that the space is part of the match and doesn't hang around at the start of the next line. The whole pattern is placed in brackets, identifying it as a single regex 'group'. Each match is then replaced by that group (m.groups()[0]), after trimming off the trailing space ([:-1]) and adding a newline character (+ '\n').
There might be a simpler way of doing this that doesn't involve using groups... but this works!
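One possible variant without groups or a lambda is a fixed-width lookbehind: match only the space that follows '॥ [any character] ॥' and replace that space with a newline. This is a sketch rather than the answer's own method, and like '.{1}' above it assumes the number between the marks is a single character.

```python
import re

text = ("a unicode string1 । b unicode string2 ॥ १ ॥ "
        "c unicode string3 । d unicode string4 ॥ २ ॥")

# (?<=॥ . ॥) asserts that the five characters before the space are
# '॥', space, any one character, space, '॥'. Only the space is replaced.
new = re.sub(r'(?<=॥ . ॥) ', '\n', text)
print(new)
```

Python's re module requires lookbehinds to have a fixed width, so multi-character numbers would need a capturing group instead.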
Upvotes: 2