Reputation: 81
I have some text in the following format:
\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n
par2 par2 par2
What I want to do is to join them into paragraphs so that the end result would be:
1. par1 par1 par1 par1 par1 par1 \n
2. par2 par2 par2 \n
I have tried various string manipulations such as str.split(), str.strip() and others, and I have searched the internet for solutions, but nothing seems to work.
Is there an easy way to do this programmatically? The text is very long, so doing it by hand is out of the question.
Upvotes: 2
Views: 1008
Reputation: 788
Here is a slightly different approach using str.replace and re.sub.
import re
# assuming d is the string you wanted to parse
d = """
\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n
par2 par2 par2
"""
d = d.replace("\r", "").replace("\n", "")
d = re.sub(r'([0-9]+\.\s)\s*',r'\n\1', d).strip()
print(d)
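One caveat: deleting the line breaks outright works for the sample because every line ends with a space. If the real text does not always have that trailing space, words on neighbouring lines get glued together. A minimal sketch of a variant that collapses whitespace to a single space instead; the function name join_paragraphs and the \b guard are my own additions, not part of the answer above:

```python
import re

def join_paragraphs(text):
    """Put each "N." marker and its text on one line (hypothetical helper)."""
    # Collapse every run of whitespace (spaces, \r, \n) into a single space,
    # so words on adjacent lines stay separated.
    flat = re.sub(r"\s+", " ", text).strip()
    # Start a new line before each "N. " marker. The \b keeps a word that
    # merely ends in a digit and a dot (such as "par1.") from being split.
    return re.sub(r"\s*\b(\d+\.\s)", r"\n\1", flat).strip()

# Sample with no trailing spaces before the line breaks.
sample = "1.\r\npar1 par1 par1\r\n\r\npar1 par1 par1\r\n\r\n2.\r\n\r\npar2 par2 par2"
print(join_paragraphs(sample))
```

Like the original, this still treats any standalone number followed by a dot and whitespace as the start of a new paragraph.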
Upvotes: 1
Reputation: 106553
Assuming your input text is stored in variable s, you can use the following generator expression with regex:
import re
print('\n'.join(re.sub(r'\s+', ' ', ''.join(t)).strip() for t in re.findall(r'^(\d+\.)(.*?)(?=^\d+\.|\Z)', s, flags=re.MULTILINE | re.DOTALL)))
This outputs:
1. par1 par1 par1 par1 par1 par1
2. par2 par2 par2
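For readability, the same idea can be unpacked into a small function. This is only a sketch: the name join_numbered_paragraphs and the up-front line-ending normalisation are my assumptions, and the regex is the one from the one-liner above.

```python
import re

def join_numbered_paragraphs(text):
    """Collapse each numbered block ("1.", "2.", ...) onto a single line."""
    # Assumption: normalise \r\n and bare \r to \n first, so that ^ (which
    # only anchors after \n in MULTILINE mode) sees every line start.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Each match is a (marker, body) pair: a line-initial "N." plus everything
    # up to the next line-initial "N." or the end of the text.
    blocks = re.findall(
        r"^(\d+\.)(.*?)(?=^\d+\.|\Z)", text, flags=re.MULTILINE | re.DOTALL
    )
    # Squeeze every run of whitespace, newlines included, into one space.
    return "\n".join(
        re.sub(r"\s+", " ", marker + body).strip() for marker, body in blocks
    )

sample = "\r\n1. \r\npar1 par1 par1 \r\n\r\npar1 par1 par1 \r\n\r\n2. \r\n\r\npar2 par2 par2"
print(join_numbered_paragraphs(sample))
```

Because the marker must sit at the start of a line, a number followed by a dot in the middle of a line does not start a new paragraph, but one at the start of a wrapped line would.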
Upvotes: 2
Reputation: 3447
I've used regex to find out all the words in the string and rejoined them based on the type of element in list. Hope this helps.
import re
line1 = '''\r\n
1. \r\n
par1 par1 par1 \r\n
\r\n
par1 par1 par1 \r\n
\r\n
2. \r\n
\r\n
par2 par2 par2'''
line2 = re.findall(r"[\w']+", line1)
op = ""
def isInt(item):
    try:
        int(item)
        return True
    except ValueError:
        return False
for item in line2:
    if isInt(item):
        op += "\n" + item + ". "
    else:
        op += item + " "
print(op)
Output:
1. par1 par1 par1 par1 par1 par1
2. par2 par2 par2
Be wary of the extra \n in front of 1.
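If the leading \n and the trailing spaces are a problem, one option is to collect the words per paragraph and join them at the end. A rough sketch, with rejoin_words as a made-up name; it uses str.isdecimal() in place of the int() check, which I assume is close enough for plain numbering, and it drops any words that appear before the first number:

```python
import re

def rejoin_words(text):
    """Rebuild numbered paragraphs from word tokens, with no leading newline."""
    paragraphs = []  # one list of tokens per numbered paragraph
    for token in re.findall(r"[\w']+", text):
        if token.isdecimal():
            # A bare number starts a new paragraph, as in the loop above.
            paragraphs.append([token + "."])
        elif paragraphs:
            # Ordinary word: append it to the paragraph currently being built.
            paragraphs[-1].append(token)
    return "\n".join(" ".join(tokens) for tokens in paragraphs)

sample = "\r\n1. \r\npar1 par1 par1 \r\n\r\npar1 par1 par1 \r\n\r\n2. \r\n\r\npar2 par2 par2"
print(rejoin_words(sample))
```

As with the original loop, punctuation is discarded and any standalone number inside a paragraph is mistaken for a new paragraph marker, so this only suits text shaped like the sample.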
Upvotes: 0