Reputation: 181
I have a file that contains several lines of strings written as:
[(W)40(indo)25(ws )20(XP)111(, )20(with )20(the )20(fragment )20(enlar)18(ged )20(for )20(clarity )20(on )20(Fig. )] TJ
I need only the text inside the parentheses. I tried the following code:
import re
readstream = open ("E:\\New folder\\output5.txt","r").read()
stringExtract = re.findall('\[(.*?)\]', readstream, re.DOTALL)
string = re.compile ('\(.*?\)')
stringExtract2 = string.findall (str(stringExtract))
But some strings are missing from the output; for example, for the line above, the word (with) is not found. The order of the strings also differs from the file: for (enlar) and (ged ) above, the second one, (ged ), appears before (enlar), like this: ( ged other strings ..... enlar). How can I fix these problems?
Upvotes: 5
Views: 14249
Reputation: 5877
findall looks like your friend here. Don't you just want:
re.findall(r'\(.*?\)',readstream)
returns:
['(W)',
'(indo)',
'(ws )',
'(XP)',
'(, )',
'(with )',
'(the )',
'(fragment )',
'(enlar)',
'(ged )',
'(for )',
'(clarity )',
'(on )',
'(Fig. )']
Edit:
As @vikramis showed, to remove the parens, use:
re.findall(r'\((.*?)\)', readstream)
Also, note that it is common (but not requested here) to trim trailing whitespace with something like:
re.findall(r'\((.*?) *\)', readstream)
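To make the difference between these three patterns concrete, here is a small sketch run against the sample line from the question. The variable names (`line`, `with_parens`, and so on) are only for illustration.

```python
import re

# Sample line copied from the question.
line = ("[(W)40(indo)25(ws )20(XP)111(, )20(with )20(the )20(fragment )"
        "20(enlar)18(ged )20(for )20(clarity )20(on )20(Fig. )] TJ")

with_parens = re.findall(r'\(.*?\)', line)        # keeps the parentheses
without_parens = re.findall(r'\((.*?)\)', line)   # capture group drops them
trimmed = re.findall(r'\((.*?) *\)', line)        # also drops trailing spaces

print(with_parens[:3])     # ['(W)', '(indo)', '(ws )']
print(without_parens[:3])  # ['W', 'indo', 'ws ']
print(trimmed[:3])         # ['W', 'indo', 'ws']
```

Note that the trimmed variant also removes the spaces that separate words, so it is probably not what you want if the fragments will be joined back into text.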
Upvotes: 6
Reputation: 120598
Without regexp:
[p.split(')')[0] for p in s.split('(') if ')' in p]
Output:
['W', 'indo', 'ws ', 'XP', ', ', 'with ', 'the ', 'fragment ', 'enlar', 'ged ', 'for ', 'clarity ', 'on ', 'Fig. ']
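The one-liner assumes `s` already holds the text. A self-contained sketch, with a hypothetical helper name `extract_parens` and a shortened sample line, might look like this:

```python
def extract_parens(s):
    """Return the text between each '(' and the next ')'."""
    # Split on '(' so each piece starts just after an opening parenthesis,
    # then keep only what comes before the first ')' in that piece.
    return [p.split(')')[0] for p in s.split('(') if ')' in p]

s = "[(W)40(indo)25(ws )20(XP)111(, )] TJ"
print(extract_parens(s))  # ['W', 'indo', 'ws ', 'XP', ', ']
```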
Upvotes: 7
Reputation: 113950
Your first problem is this line:
stringExtract = re.findall('\[(.*?)\]', readstream, re.DOTALL)
I have no idea why you are doing this, and I'm pretty sure you don't want to.
Try this instead:
readstream = "[(W)40(indo)25(ws )20(XP)111(, )20(with )20(the )20(fragment )20(enlar)18(ged )20(for )20(clarity )20(on )20(Fig. )] TJ"
stringExtract = re.findall(r'\(([^)]+)\)', readstream, re.DOTALL)
This says: find every run of one or more characters inside parentheses that contains no closing parenthesis.
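As a rough illustration of how the negated character class behaves compared with the lazy `.*?` form, consider this made-up test string (not from the question). Because `[^)]` matches newlines by itself and the pattern contains no `.`, the `re.DOTALL` flag makes no difference to it.

```python
import re

# Made-up input: a normal group, an empty group, and a group spanning a newline.
text = "(a)1()2(b\nc)"

negated = re.findall(r'\(([^)]+)\)', text)
lazy = re.findall(r'\((.*?)\)', text)

print(negated)  # ['a', 'b\nc']  -> skips the empty (), crosses the newline
print(lazy)     # ['a', '']      -> keeps the empty (), stops at the newline
```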
Upvotes: 0
Reputation: 1822
Try this:
import re
readstream = open ("E:\\New folder\\output5.txt","r").read()
stringExtract2 = re.findall(r'\(([^()]+)\)', readstream)
For example, with the sample line from the question:
readstream = r'[(W)40(indo)25(ws )20(XP)111(, )20(with )20(the )20(fragment )20(enlar)18(ged )20(for )20(clarity )20(on )20(Fig. )]'
the result is:
['W', 'indo', 'ws ', 'XP', ', ', 'with ', 'the ', 'fragment ', 'enlar', 'ged ', 'for ', 'clarity ', 'on ', 'Fig. ']
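Since the file holds several such lines, one possible way to keep the fragments grouped and in file order is to apply the same pattern line by line. This is only a sketch: `extract_per_line` is a hypothetical helper, and `io.StringIO` stands in for the opened file so the example runs on its own.

```python
import io
import re

PAREN_TEXT = re.compile(r'\(([^()]+)\)')

def extract_per_line(lines):
    """Yield one list of parenthesized fragments per input line."""
    for line in lines:
        yield PAREN_TEXT.findall(line)

# Stand-in for open(path, "r"); any iterable of lines works.
sample = io.StringIO("[(W)40(indo)25(ws )20(XP)] TJ\n[(enlar)18(ged )] TJ\n")

for fragments in extract_per_line(sample):
    # Joining the fragments rebuilds the text of that line.
    print(fragments, '->', repr(''.join(fragments)))
```

Iterating over the file object directly avoids reading everything into one string first, and findall returns each line's matches from left to right.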
Upvotes: 3