BlackHawk

Reputation: 861

Splitting string on several delimiters without considering new line

I have a string representing conversation turns as follows:

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."

In plain text, it looks as follows.

person alpha:
How are you today?

person beta:
I'm fine, thank you.

person alpha:
What's up?

person beta:
Not much, just hanging around.

Now, I would like to split the string on person alpha and person beta, so that the resulting list looks as follows:

["person alpha:\nHow are you today?", "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", "person beta:\nNot much, just hanging around."]

I have tried the following approach:

import re
res = re.split('person alpha |person beta |\*|\n', s)

But the result is as follows:

['person alpha:', 'How are you today?', '', 'person beta:', "I'm fine, thank you.", '', 'person alpha:', "What's up?", '', 'person beta:', 'Not much, just hanging around.']

What is wrong with my regex?

Upvotes: 1

Views: 180

Answers (3)

Jamiu S.

Reputation: 5721

re.findall() is an appropriate approach here. Use the re.DOTALL flag so that . also matches newlines. In the regex, (alpha|beta) is a group that matches either "alpha" or "beta", .*? is a non-greedy pattern that matches any characters, and (?=\n\nperson|$) is a positive lookahead which asserts that the match is immediately followed either by two newline characters and the string person, or by the end of the string.

import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."

match = re.findall(r"(person (alpha|beta):\n.*?(?=\n\nperson|$))", s, re.DOTALL)
result = list(map(lambda x: x[0], match ))
# or
# result = [x[0] for x in match]
print(result)

Output:

['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']

To match "person alpha:" and "Human:", use the regex below:

match = re.findall(r"((person alpha|Human):\n.*?(?=\n\n(person alpha|Human)|$))", s, re.DOTALL)

To match both prefixes (person or Human) with either name, use the regex below:

match = re.findall(r"((person|Human) (alpha|beta):\n.*?(?=\n\n(person|Human)|$))", s, re.DOTALL)
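Because the patterns above contain capture groups, re.findall returns tuples, which is why the extra x[0] step is needed. As a minimal sketch (a variation, not part of the original answer), non-capturing groups (?:...) let re.findall return the full matches as plain strings:

```python
import re

s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

# (?:...) groups without capturing, so findall yields whole matches as strings
# instead of tuples of groups.
pattern = r"person (?:alpha|beta):\n.*?(?=\n\nperson|$)"
result = re.findall(pattern, s, re.DOTALL)
print(result)
```

This assumes, as in the answer above, that every turn is separated from the next by a blank line.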

Upvotes: 1

Karl Knechtel

Reputation: 61498

'person alpha |person beta |\*|\n'

There are multiple things wrong with this attempt. First off: the pattern passed to re.split is supposed to match the delimiters, not the items. That is, parts of the text that will not appear inside any of the items in the result list, but instead are in between:

>>> re.split('delimiter', 'foodelimiterbardelimiterbaz')
['foo', 'bar', 'baz']

Aside from that, person alpha and person beta in the text are consistently followed by a colon, not a space, so those alternatives never match. \* would match a literal asterisk (since it's escaped), but there's no apparent reason to look for that. \n is a literal newline in this regex; as it happens, Python's regex engine will accept a newline in the regex string and treat it as if it were a backslash-n escape sequence - but it's important to understand in general that there are two layers of escaping going on here.

Anyway, the point is: this regex matches against a single newline in the input, and also matches some other possibilities that never come up. Then, re.split returns a list of things that are between those matches - i.e., the lines of the input.
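The claims above can be checked directly. This is only a small sketch: the variable names are made up for illustration, and the pattern is written as a raw string so that \* and \n reach the regex engine unchanged.

```python
import re

s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

# Python-level escaping: '\n' is one newline character,
# while r'\n' is two characters (a backslash and an n).
print(len('\n'), len(r'\n'))

# The regex engine matches a newline in both cases.
newline_literal = re.findall('\n', s)
newline_escape = re.findall(r'\n', s)

# The alternatives ending in a space never occur in s ...
space_hits = re.findall(r'person alpha |person beta ', s)

# ... so the attempted pattern behaves like a plain split on newlines.
original = re.split(r'person alpha |person beta |\*|\n', s)
print(original == s.split('\n'))
```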


Now, I would like to split the string on person alpha and person beta, so that the resulting list looks as follows

Generally, we say "split on X" to mean that X is the delimiter. Since person alpha and person beta are both things that should appear at the beginnings of results, they are not delimiters.

Instead, the delimiter we are looking for is the word boundary before those phrases.

When we look for that delimiter, we want to make sure that it is followed by the person identifier (so that we know that it's the delimiter), but the regex needs to not match that identifier. To address this, we use positive lookahead.

We want: a word boundary (\b), with a positive lookahead ((?=...)) for person , followed by one of the person names, followed by a colon. To simplify, I'll assume that the person name can be anything after the word person, and shouldn't be restricted to alpha and beta.

So the lookahead should match person.*:, meaning the entire lookahead clause is (?=person.*:). The entire regex is \b(?=person.*:), and we use a raw string for this, so that the backslash is understood literally by Python and passed literally to the regex engine (which will do its own interpretation of the \b sequence, instead of Python's).

Putting it together:

>>> re.split(r'\b(?=person.*:)', s)
['', 'person alpha:\nHow are you today?\n\n', "person beta:\nI'm fine, thank you.\n\n", "person alpha:\nWhat's up?\n\n", 'person beta:\nNot much, just hanging around.']

Notice that that left an empty string at the beginning of the output list. That's because the delimiter that we're looking for is at the beginning of the input. re.split gives us whatever's before, between and after the delimiters. Before the first delimiter, in our case, is an empty string.
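The same effect shows up with an ordinary delimiter. A tiny sketch on a made-up input, where the delimiter sits at the very start and the very end:

```python
import re

# Hypothetical toy input, not from the question.
pieces = re.split(',', ',a,b,')
print(pieces)  # ['', 'a', 'b', '']
```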

To avoid this, one simple approach is to recast the problem. Instead of searching for the points between the dialog items, we'll search for the dialog items themselves (it doesn't matter that there isn't any text in between them).

Each item looks like person , a name, :, whatever text, and two newlines - as a regex, person.*?:.*?\n\n. Because the regex will now actually match text rather than just looking ahead, it's important to use reluctant quantifiers - the ?s in that regex.

Then, we use that regex with re.findall. It needs to use a raw string again, and we also need to use the re.DOTALL option for the regex, to tell the regex engine that . should be able to match a newline. (Otherwise, the regex would fail, because the second .*? won't match the single newlines within each dialog item before the double newline is reached.)

Putting it together:

>>> re.findall(r'person.*?:.*?\n\n', s, flags=re.DOTALL)
['person alpha:\nHow are you today?\n\n', "person beta:\nI'm fine, thank you.\n\n", "person alpha:\nWhat's up?\n\n"]

There are many other ways to write the regex, depending on how the requirements are interpreted. For example, rather than matching a word boundary (\b) with the re.split approach, we could look for a beginning-of-line anchor (^). In fact, we don't need anything besides the lookahead pattern, as long as we don't mind splitting the text anywhere that it says person someone: (even if it isn't at the beginning of a line (^), isn't at the beginning of a word (\b), or whatever else). With the re.findall approach, on the other hand, we could exclude the \n\n from the matches by checking for it with lookahead.
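Two of those variations, sketched under the assumption that items are separated by a blank line and that no item contains a blank line of its own. Note that the re.findall call shown earlier returns only three items, because the last item is not followed by \n\n; checking for \n\n or the end of the string (\Z) with lookahead picks it up as well.

```python
import re

s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

# findall variant: stop each item just before a blank line or at the end of
# the string, without consuming the separator.
items = re.findall(r'person.*?:.*?(?=\n\n|\Z)', s, flags=re.DOTALL)

# split variant: anchor on the beginning of a line (^ with re.MULTILINE)
# instead of a word boundary, then drop the leading empty string and the
# trailing newlines. Splitting on an empty match needs Python 3.7 or later.
parts = [
    part.rstrip()
    for part in re.split(r'^(?=person.*:)', s, flags=re.MULTILINE)
    if part
]

print(items == parts)
```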

But if the items are always separated by \n\n, and it isn't really necessary to verify that they start with a person label, we could just split the text on that literal sequence. That doesn't even require regex:

>>> s.split('\n\n')
['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']

Upvotes: 3

The fourth bird

Reputation: 163217

Your pattern only ever matches a newline: in the example data, alpha and beta are followed by a colon, not a space, so you are effectively splitting on newlines, which yields those results.

You could re.split the string using a lookahead (?=...), which asserts instead of matching, then remove the empty strings and strip the results.

import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
pattern = r"(?=^person (?:alpha|beta):)"
res = [v.rstrip() for v in re.split(pattern, s, flags=re.M) if v]

print(res)

Output

['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']



Using re.findall, you can match all lines with at least a single character, asserting that the next line does not start with the person pattern:

import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
pattern = r"^person (?:alpha|beta):\n(?:(?!person (?:alpha|beta):).+(?=\n|$))*"
print(re.findall(pattern, s, re.M))

Output

['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']


Upvotes: 2
