angelica

Reputation: 45

Working with pieces of text in Python

Data.txt contains words in both upper and lower case.

I need to lower-case all of them except for the upper-case characters that appear in braces. Each brace group immediately follows a word, with no space before the opening brace, and the word itself may end in either a lower-case or an upper-case letter. For example:

CAT{TT} Dog{DD} Horse{AA}
Snail{LL} RAT{TT}
ANT{AA}

These should be transformed into:

cat{TT} dog{DD} horse{AA}
snail{LL} rat{TT}
ant{AA}

As a first step, I lower-cased every line and stored the result in lcChar (code below). I then tried to find the lower-cased characters within braces so that I could upper-case them again.

Being a Python newbie, I got stuck on the code below, which gives only the very first item in braces. I also assume I need another loop to upper-case all the items that appear in braces. Could anyone help me understand the best way to handle this kind of problem?

import re
f = open(r'C:\Python27\MyScripts\Data.txt')
for line in f:
    lcChar = (line.lower())

patFinder1 = re.compile('{[a-z]+}')
findPat1=re.findall(patFinder1, lcChar)

Upvotes: 0

Views: 68

Answers (2)

cdarke

Reputation: 44344

re.sub and re.subn allow the second parameter to be a function. A Match Object is passed into that function and whatever the function returns is used for the substitution.

This is my take on it:

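import re

def manip(m):
    # Called once per match; the return value replaces the matched text.
    return m.group(1).lower()

data = ['CAT{TT} Dog{DD} Horse{AA}',
        'Snail{LL} RAT{TT}',
        'ANT{AA}']

for line in data:
    # Match runs of upper-case letters that are not enclosed in braces.
    new_line = re.sub(r'((?:[^{]|^)[A-Z]+(?:[^}]|$))', manip, line)
    print(new_line)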

Produces:

cat{TT} dog{DD} horse{AA}
snail{LL} rat{TT}
ant{AA}

I could have used a lambda instead, but that's arguably less clear.
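
For comparison, here is a minimal sketch of the same callback technique with a different pattern. The pattern and the name lower_outside_braces are illustrative assumptions, not part of the answer above. Each match is either a complete brace group or a run of text outside braces, and only the latter is lower-cased, so the result does not depend on how many characters a brace group holds.

```python
import re

# Assumed pattern: a complete {...} group, or a run of text outside braces.
TOKEN = re.compile(r'\{[^}]*\}|[^{]+')

def lower_outside_braces(line):
    def manip(m):
        text = m.group(0)
        # Brace groups are returned untouched; everything else is lower-cased.
        return text if text.startswith('{') else text.lower()
    return TOKEN.sub(manip, line)

data = ['CAT{TT} Dog{DD} Horse{AA}',
        'Snail{LL} RAT{TT}',
        'ANT{AA}']

for line in data:
    print(lower_outside_braces(line))
```

Keeping the decision inside the callback means the regex itself stays simple, at the cost of also matching text that ends up unchanged.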

Upvotes: 2

Cyrbil

Reputation: 6478

A straightforward way of doing it:

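import re

# Everything up to and including an opening brace, without crossing a closing brace.
regex = re.compile(r'([^}]*?{)')
str_ = '''CAT{TT} Dog{DD} Horse{AA}
Snail{LL} RAT{TT}
ANT{AA}'''

# The lambda receives each match object and returns its lowercased text.
new_str = regex.sub(lambda match: match.group(1).lower(), str_)
assert new_str == '''cat{TT} dog{DD} horse{AA}
snail{LL} rat{TT}
ant{AA}'''

print(new_str)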

Explanation:

The regex matches only the text that needs to be lowercased: everything up to and including each opening brace, never crossing a closing brace.

Each match is then replaced with its lowercased version.

Edit: more optimized version that uses sub to do the replacement.
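
The plan described in the question (lower-case the whole line, then upper-case the brace groups again) can also be finished with re.sub and a callback. This is a minimal sketch, assuming the brace groups contain letters only and were entirely upper-case in the input; the name fix_case is hypothetical.

```python
import re

# Assumed: brace groups hold letters only, e.g. {tt} after lower-casing.
BRACED = re.compile(r'\{[a-z]+\}')

def fix_case(line):
    lowered = line.lower()
    # Upper-case each brace group again; braces are unaffected by upper().
    return BRACED.sub(lambda m: m.group(0).upper(), lowered)

lines = ['CAT{TT} Dog{DD} Horse{AA}',
         'Snail{LL} RAT{TT}',
         'ANT{AA}']

result = [fix_case(line) for line in lines]
for line in result:
    print(line)
```

When reading from a file, the same function can be applied to each line inside the for loop, so no second loop over the brace groups is needed.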

Upvotes: 1
