user112947

Reputation: 473

Regex to get text between multiple newlines in Python

I am trying to split a text on the parts that fall between \n\n and \n, in that order. Take this string, for example:

\n\nMy take on fruits.\n\nHealthy Fruits\nAn apple is a fruit and it\'s very good.\n\nPears are good as well. Bananas are very good too and healthy.\n\nSour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\nGrapefruits are even more sour, if you can believe it.

My desired output is:

[('Healthy Fruits', "An apple is a fruit and it's very good.", 'Pears are good as well. Bananas are very good too and healthy.'), ('Sour Fruits', 'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]

I want to parse it this way because anything between \n\n and \n is a title and the rest is the text under that title (so "Healthy Fruits" and "Sour Fruits" are the titles). I am not sure whether this is the best way to grab the titles and their text.

Upvotes: 1

Views: 63

Answers (2)

dawg

Reputation: 103754

Given:

txt='''\
\n\nMy take on fruits.\n\nHealthy Fruits\nAn apple is a fruit and it\'s very good.\n\nPears are good as well. Bananas are very good too and healthy.\n\nSour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\nGrapefruits are even more sour, if you can believe it.'''

desired=[('Healthy Fruits',   "An apple is a fruit and it's very good.", 'Pears are good as well. Bananas are very good too and healthy.'),  ('Sour Fruits',   'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]

You can use the regex:

r'\n\n([\s\S]*?)(?=(?:\n\n.*\n[^\n])|\Z)'


Python demo:

>>> import re
>>> sp=[tuple(re.split('\n+',l)) for l in re.findall(r'\n\n([\s\S]*?)(?=(?:\n\n.*\n[^\n])|\Z)',txt) if '\n' in l]

>>> sp
[('Healthy Fruits', "An apple is a fruit and it's very good.", 'Pears are good as well. Bananas are very good too and healthy.'), ('Sour Fruits', 'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]

>>> sp==desired
True
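
To see what each part of the pattern does, here is the same regex written with re.VERBOSE and wrapped in a function. This is only a sketch: the names SECTION_RE and parse_sections and the short sample string are made up for illustration, and the non-capturing group around the first alternative is dropped because it does not change the match.

```python
import re

# Same pattern as above; re.VERBOSE ignores the whitespace and the # comments.
SECTION_RE = re.compile(r"""
    \n\n                # a blank line opens a block
    ([\s\S]*?)          # lazily capture everything, newlines included
    (?=                 # stop just before either
        \n\n.*\n[^\n]   #   a blank line, a title line, then a non-empty line
        | \Z            #   or the end of the string
    )
""", re.VERBOSE)


def parse_sections(text):
    """Return one (title, paragraph, ...) tuple per titled block."""
    return [
        tuple(re.split(r"\n+", block))
        for block in SECTION_RE.findall(text)
        if "\n" in block  # skip blocks with no title line, e.g. the intro
    ]


# Hypothetical sample in the same layout as the question's string.
sample = "\n\nIntro.\n\nTitle A\nFirst.\n\nSecond.\n\nTitle B\nThird."
print(parse_sections(sample))
# [('Title A', 'First.', 'Second.'), ('Title B', 'Third.')]
```

The lookahead is what keeps a plain paragraph such as "Second." inside the current block: a new block only starts where a blank line is followed by a line that is itself followed directly by more text.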

Upvotes: 1

Nir Elbaz

Reputation: 616

This is not a regex, but it works:

text="\n\nMy take on fruits.\n\nHealthy Fruits\nAn apple is a fruit and it\'s very good. Bananas are very good too and healthy.\n\nSour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\nGrapefruits are even more sour, if you can believe it."
NewList=[]
# Chunks are separated by blank lines
Newtext=text.split("\n\n")
for line in Newtext:
    # A chunk with a single newline in it is "title\ntext"
    if "\n" in line:
        NewList.extend(line.split('\n'))

# Append the trailing untitled chunk to the last entry; NewList stays a flat list
NewList[-1]=NewList[-1]+Newtext[-1]
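
If you need the grouped tuples from the question rather than a flat list, the same split-based idea could be extended roughly as below. This is a sketch, not a tested general solution: group_sections is a made-up name, and it assumes that every title sits on its own line directly above its first paragraph and that anything before the first title can be dropped.

```python
def group_sections(text):
    """Group blank-line-separated chunks under the most recent titled chunk."""
    sections = []
    for chunk in text.strip("\n").split("\n\n"):
        if "\n" in chunk:
            # "Title\nfirst paragraph" starts a new section
            sections.append(chunk.split("\n"))
        elif sections:
            # a plain paragraph belongs to the section opened before it
            sections[-1].append(chunk)
        # paragraphs before the first title (the intro) are skipped
    return [tuple(section) for section in sections]


# The string from the question.
text = (
    "\n\nMy take on fruits.\n\nHealthy Fruits\n"
    "An apple is a fruit and it's very good.\n\n"
    "Pears are good as well. Bananas are very good too and healthy.\n\n"
    "Sour Fruits\n"
    "Oranges are on the sour side and contains a lot of vitamin C.\n\n"
    "Grapefruits are even more sour, if you can believe it."
)
print(group_sections(text))
```

Each tuple starts with the title, followed by every paragraph up to the next titled chunk.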

Upvotes: 1
