Python Regex - concatenating multiple lines based on a criteria

Question

I have a text file and I want to remove the newline between any two adjacent lines that both consist only of capital-letter words. So if one line is ABCD and the next line is AB, the result should be ABCD AB. I can do it by looping over the text line by line, but I would like a more elegant way, preferably with regex. Here is a text example:

ABCD  
AB
abcd ABB
cd
AB
ABC
ABCD
ab

and I want to get this:

ABCD AB
abcd ABB
cd
AB ABC ABCD
ab

I've written the following, but it only works for two capital lines in a row, not more.

r = re.compile(r'(\n)([A-Z ]+)(\n)([A-Z ]+)(\n)')
text = r.sub(r'\1\2 \4\5',text)

Assume there are no other complexities than this (the text is clean already as the example is). I am a newbie struggling to learn regex! Thanks.

zx81 · Accepted Answer

See this demo:

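Search: (?m)([A-Z ]+)[\r\n]+(?=[A-Z ]+$)

Replace: \1 (followed by a space)

Note that we are inserting a space where you used to have a newline. The lookahead checks that the next line is all capitals without consuming it, so that line can itself start the next match. This is what lets three or more capital lines in a row be joined.

result = re.sub(r"(?m)([A-Z ]+)[\r\n]+(?=[A-Z ]+$)", r"\1 ", subject)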
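To check the idea against the sample from the question, here is a minimal sketch. It assumes the line break is consumed with [\r\n]+ and that the input uses plain \n line endings, with the trailing spaces after the first ABCD removed. The names join_capital_lines and join_strict are only illustrative. The second function is not part of the answer: it is a stricter variant, offered as an assumption, that anchors the first group with ^ so that a mixed-case line ending in capitals (such as "abcd ABB") is never joined to a capital line below it.

```python
import re

# Sample from the question, assuming plain "\n" line endings.
text = "ABCD\nAB\nabcd ABB\ncd\nAB\nABC\nABCD\nab"

# (?m)           make ^ and $ match at every line boundary
# ([A-Z ]+)      a run of capitals/spaces right before a line break
# [\r\n]+        the line break to drop
# (?=[A-Z ]+$)   only if the whole next line is capitals/spaces;
#                the lookahead consumes nothing, so the next line
#                can begin the following match
PATTERN = re.compile(r"(?m)([A-Z ]+)[\r\n]+(?=[A-Z ]+$)")

# Hypothetical stricter variant: the first group must start its line.
STRICT = re.compile(r"(?m)^([A-Z ]+)[\r\n]+(?=[A-Z ]+$)")


def join_capital_lines(s):
    """Replace the line break between capital-only lines with a space."""
    return PATTERN.sub(r"\1 ", s)


def join_strict(s):
    """Same idea, but the upper line must be capital-only from its start."""
    return STRICT.sub(r"\1 ", s)


result = join_capital_lines(text)
print(result)
```

On the sample both functions should give the output requested in the question. They differ only on input like "abcd ABB" directly followed by "AB": the unanchored pattern joins those two lines, while the anchored one leaves them alone.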
