Reputation: 23
I have data in the following form:
<a> <b> <c> <This is a string>
<World Bank> <provides> <loans for> <"a Country's Welfare">
<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">
I want to split each line above into its <...>-delimited fields. That is, I want the result to be:
['<a>', '<b>', '<c>', '<This is a string>']
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']
I tried splitting on spaces and on "> ", but neither works. Is there another way in Python to split the lines as described above? The file is 1 TB, so I cannot do this manually.
Upvotes: 2
Views: 532
Reputation: 1121864
You want to split on the whitespace between the > and < characters. For that you need a regular expression split with look-behind and look-ahead assertions:
import re
re.split(r'(?<=>)\s+(?=<)', line)
This splits on any whitespace (\s+) that is preceded by a > and followed by a < character.
The (?<=...) expression is a look-behind assertion; it matches a position in the input text rather than any characters, namely any position that is immediately preceded by the pattern inside the assertion. In the above it matches anywhere there is a > character just before the current position.
The (?=...) expression works just like the look-behind assertion, but instead looks for matching characters after the current location. It is known as a look-ahead assertion. (?=<) means it'll match any location that is followed by the < character.
Together these form two anchors, and the \s+ in between will only match whitespace that sits between a > and a <, but not those two characters themselves. The split breaks up the input string by removing the matched text, and only the whitespace is matched, leaving the > and < characters attached to the text being split.
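To see why the assertions matter, it may help to compare three ways of splitting one of the sample lines. This is only an illustrative sketch; the variable names `line`, `naive`, `consuming` and `lookaround` are made up for the example.

```python
import re

line = '<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">'

# Plain str.split on "> " drops the ">" and also cuts inside "=> Enjoy".
naive = line.split('> ')

# Matching the brackets themselves consumes them, so they vanish from the pieces.
consuming = re.split(r'>\s+<', line)

# Look-behind/look-ahead only assert the brackets, so they stay attached.
lookaround = re.split(r'(?<=>)\s+(?=<)', line)

print(naive)       # 5 pieces, the last tag is broken in two
print(consuming)   # 4 pieces, but inner "<" and ">" are missing
print(lookaround)  # 4 pieces, each with its own "<" and ">"
```

Only the last variant keeps every tag whole, because the assertions check for the brackets without making them part of the matched separator.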
Demo:
>>> re.split(r'(?<=>)\s+(?=<)', '<a> <b> <c> <This is a string>')
['<a>', '<b>', '<c>', '<This is a string>']
>>> re.split(r'(?<=>)\s+(?=<)', '''<World Bank> <provides> <loans for> <"a Country's Welfare">''')
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
>>> re.split(r'(?<=>)\s+(?=<)', '<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">')
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']
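Since the file in question is very large, one possible way to apply the split is to compile the pattern once and process the input one line at a time, so that the whole file never has to sit in memory. The sketch below is an assumption about how that could look, not part of the original answer; `TAG_GAP` and `split_lines` are hypothetical names, and `io.StringIO` merely stands in for a real file.

```python
import io
import re

# Compile once; the same pattern object is reused for every line.
TAG_GAP = re.compile(r'(?<=>)\s+(?=<)')

def split_lines(lines):
    """Yield the list of tags for each non-empty line of an iterable of strings."""
    for line in lines:
        line = line.strip()
        if line:
            yield TAG_GAP.split(line)

# io.StringIO stands in for a real file here; an open file object is also
# iterated one line at a time, so the same function works on it unchanged.
sample = io.StringIO(
    '<a> <b> <c> <This is a string>\n'
    '<World Bank> <provides> <loans for> <"a Country\'s Welfare">\n'
)
for tags in split_lines(sample):
    print(tags)
```

Because `split_lines` is a generator, each line is split only when the caller asks for the next result.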
Upvotes: 7
Reputation: 60070
Here's a sort of "build your own parser" approach, which just goes through the file character-by-character, and doesn't use any fancy regex features:
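def tag_yielder(line):
    """Yield each <...> tag in line, scanning one character at a time."""
    in_tag = False      # True while inside a <...> tag
    escape = False      # True if the previous character was '=', as in "=>"
    current_tag = ''
    for char in line:
        if in_tag:
            current_tag += char
            # A '>' closes the tag unless it directly follows '=' (the "=>" arrow).
            if char == '>' and not escape:
                yield current_tag
                current_tag = ''
                in_tag = False
            escape = (char == '=')
        elif char == '<':
            # Start a new tag; anything between tags is skipped.
            current_tag = '<'
            in_tag = True

# Iterating over the file object reads one line at a time.
with open('tag_text.txt') as f:
    for line in f:
        print(list(tag_yielder(line.strip())))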
Output:
['<a>', '<b>', '<c>', '<This is a string>']
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']
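One consequence of treating "=>" as an escaped arrow is worth keeping in mind: a tag whose text happens to end in "=" is not closed by the following ">". The sketch below illustrates this with a made-up input that does not come from the question; the scanner is repeated in compact form only so that the snippet runs on its own.

```python
import re

def tag_yielder(line):
    # Same character-by-character scan as above, repeated so this runs standalone.
    in_tag = escape = False
    current_tag = ''
    for char in line:
        if in_tag:
            current_tag += char
            if char == '>' and not escape:
                yield current_tag
                current_tag = ''
                in_tag = False
            escape = (char == '=')
        elif char == '<':
            current_tag = '<'
            in_tag = True

# Hypothetical input (not from the question): the first tag ends in "=>".
tricky = '<x=> <y>'

print(list(tag_yielder(tricky)))                 # one piece: ['<x=> <y>']
print(re.split(r'(?<=>)\s+(?=<)', tricky))       # two pieces: ['<x=>', '<y>']
```

Whether that difference matters depends on whether such tags can occur in the data.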
Upvotes: 0