Isha Khosla

Reputation: 23

Splitting a line in Python

I have data in the following form:

<a> <b> <c> <This is a string>
<World Bank> <provides> <loans for> <"a Country's Welfare">
<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">

I want to split each line above into its <...> fields. That is, I want the result to be:

['<a>', '<b>', '<c>', '<This is a string>']
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']

I tried splitting on spaces and on "> ", but neither works. Is there another way in Python to split the lines as described above? My file is 1 TB, so I cannot do this manually.

Upvotes: 2

Views: 532

Answers (2)

Martijn Pieters

Reputation: 1121864

You want to split on the whitespace between the > and < characters. For that you need a regular expression split with look-behind and look-ahead assertions:

import re

re.split(r'(?<=>)\s+(?=<)', line)

This splits on any whitespace (\s+) that is preceded by a > and followed by a < character.

The (?<=...) expression is a look-behind assertion; it matches a location in the input text, namely anywhere the pattern inside the assertion precedes the location. In the above it matches anywhere there is a > character just before the current location.

The (?=...) expression works just like the look-behind assertion, but instead looks for matching characters after the current location. It is known as a look-ahead assertion. (?=<) matches any location that is followed by the < character.

Together these form two anchors, and the \s+ in between will only match whitespace that sits between a > and a <, but not those two characters themselves. The split breaks up the input string by removing the matched text, and only the whitespace is matched, leaving the > and < characters attached to the text being split.
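To see why the assertions matter, here is a small comparison sketch against a plain pattern that consumes the brackets. The variable names are only for illustration:

```python
import re

line = '<a> <b> <c> <This is a string>'

# Without assertions, the > and < are part of the match,
# so the split removes them along with the whitespace.
plain = re.split(r'>\s+<', line)

# With look-behind and look-ahead, only the whitespace is matched
# and removed; the brackets stay attached to each field.
lookaround = re.split(r'(?<=>)\s+(?=<)', line)

print(plain)
print(lookaround)
```

The first result loses the inner brackets, while the second keeps every field intact.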

Demo:

>>> re.split(r'(?<=>)\s+(?=<)', '<a> <b> <c> <This is a string>')
['<a>', '<b>', '<c>', '<This is a string>']
>>> re.split(r'(?<=>)\s+(?=<)', '''<World Bank> <provides> <loans for> <"a Country's Welfare">''')
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
>>> re.split(r'(?<=>)\s+(?=<)', '<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">')
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']
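For a file as large as the one in the question, you would probably want to compile the pattern once and process the file one line at a time rather than reading it all into memory. A minimal sketch, assuming one record per line; TAG_GAP and split_lines are names made up for this illustration, and io.StringIO stands in for the real file:

```python
import io
import re

# Compile once; the same pattern is reused for every line.
TAG_GAP = re.compile(r'(?<=>)\s+(?=<)')

def split_lines(lines):
    """Yield the list of <...> fields for each non-empty line of an iterable."""
    for line in lines:
        line = line.strip()
        if line:
            yield TAG_GAP.split(line)

# Stand-in for a real file object (assumption: one record per line).
sample = io.StringIO(
    '<a> <b> <c> <This is a string>\n'
    '<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">\n'
)

results = list(split_lines(sample))
for fields in results:
    print(fields)
```

With real data you would pass an open file object instead of the StringIO and consume the generator lazily instead of building a list, so that only one line is held in memory at a time.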

Upvotes: 7

Marius

Reputation: 60070

Here's a sort of "build your own parser" approach, which just goes through each line character by character and doesn't use any fancy regex features. A > that immediately follows an = is treated as part of the text rather than as the end of a tag:

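def tag_yielder(line):
    in_tag = False    # True while we are between a '<' and its closing '>'
    escape = False    # True when the previous character was '='
    current_tag = ''
    for char in line:
        if in_tag:
            current_tag += char
            # A '>' closes the tag unless it directly follows '=' (as in '=>')
            if char == '>' and not escape:
                yield current_tag
                current_tag = ''
                in_tag = False
            escape = (char == '=')
        elif char == '<':
            # Start a new tag; characters outside tags are ignored
            current_tag = '<'
            in_tag = True

# Iterating over the file object reads one line at a time
with open('tag_text.txt') as f:
    for line in f:
        print(list(tag_yielder(line.strip())))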

Output:

['<a>', '<b>', '<c>', '<This is a string>']
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']

Upvotes: 0
