yart

Reputation: 7805

How to parse a file with international words in Python

I have files containing words written in different languages. I would like to parse them with Python so that all files share the same structure. Currently the files contain lines like

1. word1
24. word2
- word3
word4
** word5

The goal is to rewrite all of them as

** word

I already have some code that reads from one file, fr, and writes to a new one, fw:

    import re

    for line in fr:
        # Copy through only the lines that already start with "** "
        match = re.search(r'^\*\* .*', line)
        if match:
            fw.write(line)

I have two questions.

First question: how do I write a regexp that finds lines that do not start with an alphabetic character and removes everything before the first alphabetic character?

I have tried this

fw.write(re.sub(r'(^([^a-zA-Z].*)([a-zA-Z])*.*)', "** \1", line))

but it doesn't work.

Second question: how do I check whether a string starts with an alphabetic character? I have tried

print line[0].isalpha()

It returns ?. Do I need to convert it to unicode first?

Thank you.

Upvotes: 1

Views: 201

Answers (3)

Pierce

Reputation: 564

Try matching any of the possible line prefixes, then collect the rest of the line as your word of interest.

pat = re.compile(r'^(\d+\. |- |\*\* )?(?P<word>.*)')

The first group defines the possible prefixes (you might want to fix it up for one or more spaces instead of a literal space). The second, named, group gets the word.
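
As a rough sketch of how this pattern could be applied to the sample lines from the question: the version below swaps the literal space for \s+, as suggested above, and wraps the match in a helper named normalize, which is a hypothetical name and not part of the original answer.

```python
import re

# Optional prefix: "1. ", "24. ", "- " or "** ", each followed by one or more spaces.
# \s+ replaces the literal space so that extra spaces are tolerated.
pat = re.compile(r'^(\d+\.\s+|-\s+|\*\*\s+)?(?P<word>.*)')

def normalize(line):
    """Rewrite one line as '** word' (hypothetical helper)."""
    # The pattern always matches, because the prefix is optional
    # and the word group may be empty.
    word = pat.match(line.rstrip('\n')).group('word')
    return '** ' + word

samples = ['1. word1', '24. word2', '- word3', 'word4', '** word5']
result = [normalize(s) for s in samples]
```

In the loop from the question, this would presumably become fw.write(normalize(line) + '\n') for each line read from fr.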

Upvotes: 0

Toto

Reputation: 91430

The Unicode property for a letter is \pL. Use it in place of [a-zA-Z].

Use it as:

^\PL*(\pL+)

That means zero or more non-letters followed by one or more letters, captured in group 1.
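
Note that Python's built-in re module does not understand \pL or \PL. A minimal sketch of the same idea using only the standard library, assuming Python 3 (where str patterns are Unicode-aware by default), substitutes the class [^\W\d_] for "letter" and [\W\d_] for "non-letter". This substitution is an approximation, not the answer's exact syntax, and first_word is a hypothetical helper name. On Python 2 the same pattern would likely need unicode strings and the re.UNICODE flag.

```python
import re

# Stand-ins for the Unicode properties used above (an approximation):
#   \PL (non-letter) -> [\W\d_]
#   \pL (letter)     -> [^\W\d_]  (word character that is not a digit or underscore)
pat = re.compile(r'^[\W\d_]*([^\W\d_]+)')

def first_word(line):
    """Return the first run of letters in the line, or None (hypothetical helper)."""
    m = pat.match(line)
    return m.group(1) if m else None

samples = ['1. слово', '24. słowo', '- Wort', 'mot', '** 単語']
words = [first_word(s) for s in samples]
normalized = ['** ' + w for w in words]
```

Because the two classes are exact complements, the leading part consumes everything up to the first letter and the captured group starts exactly there.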

Upvotes: 2

alexis

Reputation: 50200

Import the codecs module and open the file with

fp = codecs.open(filename, encoding='utf-8')

If your file mixes languages, UTF-8 is the encoding most likely to be right. If it is not, find out which encoding the file actually uses. Reading the file this way gives you unicode strings, so your REs will have a hope of working correctly.
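
A small sketch of the effect on the isalpha() check from the question, assuming a UTF-8 file. It uses io.open, which accepts the same encoding argument and is available on both Python 2.6+ and Python 3, in place of codecs.open; the temporary file name words.txt is made up for the example.

```python
import io
import os
import tempfile

# Write a small UTF-8 sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'words.txt')
with io.open(path, 'w', encoding='utf-8') as fw:
    fw.write(u'слово\n1. mot\n')

# Reading with an explicit encoding yields text (unicode) strings, not bytes,
# so line[0] is a whole character and isalpha() works on non-ASCII letters.
with io.open(path, encoding='utf-8') as fr:
    flags = [line[0].isalpha() for line in fr]
```

Here the first line starts with a Cyrillic letter and the second with a digit, so the check distinguishes them as intended.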

Upvotes: 0
