anfwkdrn

Reputation: 297

Python regex: get each numbered title and the paragraphs up to the next number

I have a string like below.

10. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2


text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2

text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2

11. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2


text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2

12. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2

13. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2


text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2


text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2

I want to split it into chunks, each made up of a numbered title and the content that follows it, and put the chunks in a list.

result = [10. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2, 11. Title text\ntext1text2text1text2text1text2text1text2text1text2text1text2text1 text2text1 text2text1 text2text1 text2text1text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2, ........]

I tried the following, but it does not work and I am not sure how to proceed. Any help is appreciated.

la_text = []
num = 1
for a in range(3):
    sepa = re.findall(r"\d*(.*)\d*", text)[num]
    la_text.append(sepa)
    num += 1

print(la_text)

Upvotes: 2

Views: 54

Answers (4)

bobble bubble

Reputation: 18525

If you don't need to separate the titles from the paragraphs, another idea is to use re.split:

re.split(r"\n\s*(?=\d+\.)", test_str)

See this demo at regex101 or a Python demo at tio.run

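  • \n\s* matches a newline followed by any amount of whitespace (including further newlines); this is the text the split removes
  • (?=\d+\.) is a lookahead that only allows the split if one or more digits and a period come next, without consuming them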
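
As a rough illustration of the split above, here is a minimal sketch on a shortened sample. The contents of test_str are made up for the example and only stand in for the string from the question. Note that a paragraph line that itself begins with digits and a period would also trigger a split.

```python
import re

# Hypothetical sample standing in for the string from the question.
test_str = (
    "10. Title text\n"
    "first paragraph\n"
    "\n"
    "second paragraph\n"
    "\n"
    "11. Title text\n"
    "only paragraph\n"
    "\n"
    "12. Title text\n"
    "last paragraph"
)

# Split at a newline (plus any following whitespace) that precedes "<digits>."
chunks = re.split(r"\n\s*(?=\d+\.)", test_str)

for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts at its number and keeps the blank lines between its paragraphs, while the whitespace before the next number is dropped.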

Upvotes: 2

Shahab Rahnama

Reputation: 1012

In case you need a dictionary with numbers as keys:

import re
m = re.findall(r'^(\d+)\.(.*)((?:\n(?!\d+\.).*)*)', s, re.M)

{element[0]: [element[1], element[2]] for element in m}

Output:

{'10': [' Title text',
  '\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n'],
 '11': [' Title text',
  '\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n'],
 '12': [' Title text',
  '\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n'],
 '13': [' Title text',
  '\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n']}
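
If the leading space in the titles and the surrounding newlines in the bodies get in the way, a variation along these lines might help. It is only a sketch: the sample string, the integer keys and the "title"/"body" field names are assumptions made for the example, and the period after the number is escaped so that only a literal dot matches.

```python
import re

# Hypothetical sample standing in for the string from the question.
s = (
    "10. Title text\n"
    "first paragraph\n"
    "\n"
    "11. Title text\n"
    "only paragraph\n"
)

pattern = re.compile(r"^(\d+)\.(.*)((?:\n(?!\d+\.).*)*)", re.M)

# Map each section number to its stripped title and body.
sections = {
    int(m.group(1)): {"title": m.group(2).strip(), "body": m.group(3).strip()}
    for m in pattern.finditer(s)
}
print(sections)
```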

Upvotes: 0

The fourth bird

Reputation: 163457

To get the title and the paragraphs separately, you can use a negative lookahead; the only flag needed is re.M (multiline).

With two capture groups, re.findall returns a list of tuples, each holding the title line and the paragraphs that follow it:

^(\d+\..*)((?:\n(?!\d+\.).*)*)

See a regex demo.
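
A minimal sketch of how the two-group pattern could be used with re.findall. The sample string is invented for the example, and the strip() call is an optional extra that removes the newlines around each body.

```python
import re

# Hypothetical sample standing in for the string from the question.
s = (
    "10. Title text\n"
    "first paragraph\n"
    "\n"
    "second paragraph\n"
    "\n"
    "11. Title text\n"
    "only paragraph"
)

pattern = r"^(\d+\..*)((?:\n(?!\d+\.).*)*)"

# Each match is a (title, body) tuple; strip() drops the surrounding newlines.
pairs = [(title, body.strip()) for title, body in re.findall(pattern, s, re.M)]

for title, body in pairs:
    print(repr(title), repr(body))
```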

To get them together as a single match:

^\d+\..*(?:\n(?!\d+\.).*)*

See another regex demo.

import re

pattern = r"^\d+\..*(?:\n(?!\d+\.).*)*"

s = "...."

print(re.findall(pattern, s, re.M))

Output

[
'10. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n',
'11. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n',
'12. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n',
'13. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2'
]

Upvotes: 3

Andrej Kesely

Reputation: 195528

If s contains the string from your question, you can do:

import re

pat = re.compile(r"^(\d+\.\s+.*?)(?=\n^\d+\.|\Z)", flags=re.M | re.S)

print(pat.findall(s))

Prints:

[
    "10. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
    "11. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
    "12. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
    "13. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
]

Upvotes: 2
