Reputation: 7805
I have files containing words written in different natural languages. I would like to process them with Python so that all files have the same structure. Currently the files contain lines like
1. word1
24. word2
- word3
word4
** word5
The goal is to have all of them written like
** word
I already have some code that reads from one file, fr, and writes to a new one, fw, like this
import re

for line in fr:
    # keep only the lines that already start with "** "
    match = re.search(r'^\*\* .*', line)
    if match:
        fw.write(line)
I have two questions.
First question: how do I write a regexp that matches a line not starting with an alphabetic character and removes everything before the first alphabetic character?
I have tried this
fw.write(re.sub(r'(^([^a-zA-Z].*)([a-zA-Z])*.*)', "** \1", line))
but it doesn't work.
Second question: how do I check whether a string starts with an alphabetic character? I have tried
print line[0].isalpha()
It returns ?. Do I need to convert the string to unicode first?
Thank you.
Upvotes: 1
Views: 201
Reputation: 564
Try matching any of the possible line prefixes, then collect the rest of the line as your word of interest.
pat = re.compile(r'^(\d+\. |- |\*\* )?(?P<word>.*)')
The first group defines the possible prefixes (you might want to fix it up for one or more spaces instead of a literal space). The second, named, group gets the word.
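As a rough sketch of how this pattern could be used to normalize lines, assuming Python 3 and a hypothetical helper named normalize (neither the helper nor the sample list comes from the original post):

```python
import re

# Same pattern as in the answer: an optional known prefix, then the rest of the line.
pat = re.compile(r'^(\d+\. |- |\*\* )?(?P<word>.*)')

def normalize(line):
    """Return the line rewritten as '** word' (hypothetical helper)."""
    # Strip the line ending first, because '.' does not match a newline.
    word = pat.match(line.rstrip('\r\n')).group('word')
    return '** ' + word

# Sample lines taken from the question.
samples = ['1. word1', '24. word2', '- word3', 'word4', '** word5']
normalized = [normalize(s) for s in samples]
```

Because every part of the pattern is optional, the match always succeeds, so lines with a prefix that is not listed simply pass through with their prefix intact.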
Upvotes: 0
Reputation: 91430
The Unicode property for a letter is \pL. Put this in place of [a-zA-Z] and use it as:
^\PL*(\pL+)
That means zero or more non-letters followed by one or more letters, captured in group 1.
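Python's built-in re module does not understand \pL (the third-party regex module does). A possible approximation with the standard library, assuming Python 3 where str patterns are Unicode-aware, is the class [^\W\d_], meaning "a word character that is not a digit or underscore". The helper name strip_to_first_letter below is hypothetical, and the class is only roughly equivalent to \pL, since it may also accept a few numeric symbols that are not decimal digits.

```python
import re

# [\W\d_] approximates "not a letter": non-word characters, digits, underscore.
LEADING_NON_LETTERS = re.compile(r'^[\W\d_]*')

def strip_to_first_letter(line):
    """Drop everything before the first letter and prepend '** ' (hypothetical helper)."""
    return '** ' + LEADING_NON_LETTERS.sub('', line, count=1)
```

For example, strip_to_first_letter('24. word2') gives '** word2', and a line that already starts with a letter only gains the '** ' prefix.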
Upvotes: 2
Reputation: 50200
Import the codecs module and open the file with
fp = codecs.open(filename, encoding='utf-8')
If your file has a mix of languages, UTF-8 is the encoding most likely to be right. If not, figure out which encoding you should be using. Decoding on read gives you unicode strings, so your regular expressions have a chance of working correctly.
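A minimal sketch of how decoding on read ties in with the isalpha() check from the question. It uses io.open, which accepts an encoding argument in both Python 2 and Python 3. The helper names read_lines and starts_with_letter are hypothetical, and UTF-8 is only an assumed default.

```python
import io

def read_lines(path, encoding='utf-8'):
    """Read a text file as decoded (unicode) lines without line endings."""
    with io.open(path, encoding=encoding) as fp:
        return [line.rstrip('\r\n') for line in fp]

def starts_with_letter(line):
    """True if the first character is a letter in any script."""
    # On a decoded string, isalpha() uses Unicode character data,
    # so non-ASCII letters count as letters too.
    return bool(line) and line[0].isalpha()
```

With undecoded byte strings in Python 2, line[0] is a single byte of a multi-byte character, which is why the check can fail for non-ASCII text.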
Upvotes: 0