David542

Reputation: 110093

Match regex with \\n in it

I have the following string:

>>> repr(s)
"    NBCUniversal\\n63  VOLGAFILM, INC               VOLGAFILMINC\\n64  Video Service Corp  

I want to match the string immediately before each \\n, that is, everything back to the preceding whitespace character. The output should be:

['NBCUniversal', 'VOLGAFILMINC']

Here is what I have so far:

re.findall(r'[^s].+\\n\d{1,2}', s)

What would be the correct regex for this?

Upvotes: 0

Views: 105

Answers (3)

abarnert

Reputation: 365627

If you want to fix your existing code instead of replacing it, you're on the right track; you've just got a few minor problems.

Let's start with your pattern:

>>> re.findall(r'[^s].+\\n\d{1,2}', s)
['    NBCUniversal\\n63  VOLGAFILM, INC               VOLGAFILMINC\\n64']

The first problem is that .+ will match everything that it can, all the way up to the very last \\n\d{1,2}, rather than just to the next \\n\d{1,2}. To fix that, add a ? to make it non-greedy:

>>> re.findall(r'[^s].+?\\n\d{1,2}', s)
['    NBCUniversal\\n63', '  VOLGAFILM, INC               VOLGAFILMINC\\n64']

Notice that we now have two strings, as we should. The problem is, those strings don't just have whatever matched the .+?, they have whatever matched the entire pattern. To fix that, wrap the part you want to capture in () to make it a capturing group:

>>> re.findall(r'[^s](.+?)\\n\d{1,2}', s)
['   NBCUniversal', ' VOLGAFILM, INC               VOLGAFILMINC']

That's nicer, but it still has a bunch of extra stuff on the left end. Why? Well, you're capturing everything after [^s]. That means any character except the letter s. You almost certainly meant [\s], meaning any character in the whitespace class. (Note that \s is already the whitespace class, so wrapping it in brackets as [\s] is unnecessary.) That's better, but it's still only going to match one space, not all the spaces. And it will match the earliest space it can that still leaves .+? something to match, not the latest. So if you want to suck up all the excess spaces, you need to repeat it:

>>> re.findall(r'\s+(.+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILM, INC               VOLGAFILMINC']
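To see the difference between these character classes in isolation, here is a minimal sketch on a made-up sample string (the name `sample` and its contents are assumptions for illustration, not part of the question):

```python
import re

# Hypothetical sample, chosen to contain both spaces and the letter 's'.
sample = "  sass x"

# [^s] is a negated class: any single character except the letter 's'.
not_s = re.findall(r'[^s]', sample)    # [' ', ' ', 'a', ' ', 'x']

# \s is the whitespace class; it matches one whitespace character at a time.
one_space = re.findall(r'\s', sample)  # [' ', ' ', ' ']

# \s+ repeats the class, so each run of whitespace becomes a single match.
runs = re.findall(r'\s+', sample)      # ['  ', ' ']

print(not_s, one_space, runs)
```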

Getting closer… but the .+? matches anything, including the space between VOLGAFILM and VOLGAFILMINC, and again, the \s+ is going to match the first run of spaces it can, leaving the .+? to match everything after that.

You could fiddle with the prefix, but there's an easier solution. If you don't want spaces in your capture group, just capture a run of nonspaces instead of a run of anything, using \S:

>>> re.findall(r'\s+(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']

And notice that once you've done that, the \s+ isn't really doing anything anymore, so let's just drop it:

>>> re.findall(r'(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']

I've obviously made some assumptions above that are correct for your sample input, but may not be correct for real data. For example, if you had a string like Weyland-Yutani\\n…, I'm assuming you want Weyland-Yutani, not just Yutani. If you have a different rule, like only letters, just change the part in parentheses to whatever fits that rule, like (\w+?) or ([A-Za-z]+?).
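To make that concrete, here is a small sketch of how the choice of class inside the parentheses changes the result. The input string is hypothetical, made up to include a hyphenated name; only the three patterns come from the discussion above.

```python
import re

# Hypothetical input (not from the question): a hyphenated name followed by
# a literal backslash + 'n' and a one- or two-digit number.
s = "  Weyland-Yutani\\n7  Tyrell Corp"

# Any run of non-whitespace: keeps the hyphenated name whole.
nonspace = re.findall(r'(\S+?)\\n\d{1,2}', s)       # ['Weyland-Yutani']

# Word characters only: the hyphen breaks the run, so only the tail survives.
word = re.findall(r'(\w+?)\\n\d{1,2}', s)           # ['Yutani']

# ASCII letters only: same result here, but digits and underscores are excluded too.
letters = re.findall(r'([A-Za-z]+?)\\n\d{1,2}', s)  # ['Yutani']

print(nonspace, word, letters)
```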

Upvotes: 1

Casimir et Hippolyte

Reputation: 89547

EDIT: Sorry, I didn't read your question carefully.

If you want to find all groups of letters immediately before a literal \n, re.findall is appropriate. You can obtain the result you want with:

>>> import re
>>> s = "    NBCUniversal\\n63  VOLGAFILM, INC               VOLGAFILMINC\\n64  Video Service Corp  "
>>> re.findall(r'(?i)[a-z]+(?=\\n)', s)
['NBCUniversal', 'VOLGAFILMINC']
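The (?=\\n) part is a lookahead: it checks that a literal backslash + n follows, without making it part of the match. As a rough comparison (a sketch, not the only way to write it), the same result can be had with a capturing group, or with the re.IGNORECASE flag in place of the inline (?i):

```python
import re

s = "    NBCUniversal\\n63  VOLGAFILM, INC               VOLGAFILMINC\\n64  Video Service Corp  "

# Lookahead: the literal \n is checked but not included in the match.
with_lookahead = re.findall(r'(?i)[a-z]+(?=\\n)', s)

# Capturing group: the literal \n is consumed, but findall returns only the group.
with_group = re.findall(r'(?i)([a-z]+)\\n', s)

# Same as the first pattern, with the flag passed as an argument instead of inline.
with_flag = re.findall(r'[a-z]+(?=\\n)', s, flags=re.IGNORECASE)

print(with_lookahead)  # ['NBCUniversal', 'VOLGAFILMINC']
print(with_group == with_lookahead, with_flag == with_lookahead)
```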

OLD ANSWER:

re.findall is not the appropriate method since you only need one result (that is a pair of strings). Here the re.search method is more appropriate:

>>> import re
>>> s = "    NBCUniversal\\n63  VOLGAFILM, INC               VOLGAFILMINC\\n64  Video Service Corp  "
>>> res = re.search(r'(?i)^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)
>>> res.groups()
('NBCUniversal', 'VOLGAFILM')

Note: I have assumed that there are no other characters between the first word and the literal \n, but if that isn't the case, you can add [^a-z\\]* before the \\n in the pattern.
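A minimal sketch of that variation, using a made-up string in which punctuation and digits sit between the first word and the literal \n (the string and the variable names are assumptions for illustration):

```python
import re

# Hypothetical input: ", 12" separates the first word from the literal backslash + 'n'.
s = "  Foo, 12\\n63  Bar"

# Pattern without the extra class: the word must be directly followed by \n,
# so nothing matches here.
strict = re.search(r'(?i)^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)

# With [^a-z\\]* added before the \\n, non-letter characters are skipped.
relaxed = re.search(r'(?i)^[^a-z\\]*([a-z]+)[^a-z\\]*\\n[^a-z\\]*([a-z]+)', s)

print(strict)            # None
print(relaxed.groups())  # ('Foo', 'Bar')
```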

Upvotes: 1

cdhowie

Reputation: 168988

Assuming that the input actually has the sequence \n (backslash followed by letter 'n') and not a newline, this will work:

>>> re.findall(r'(\S+)\\n', s)
['NBCUniversal', 'VOLGAFILMINC']

If the string actually contains newlines then replace \\n with \n in the regular expression.
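The two cases can be checked side by side. This is a small sketch with shortened, made-up strings (the names `literal` and `newline` are illustrative only):

```python
import re

# Two characters per separator: a backslash followed by the letter 'n'.
literal = "NBCUniversal\\n63  VOLGAFILMINC\\n64"

# One character per separator: a real newline.
newline = "NBCUniversal\n63  VOLGAFILMINC\n64"

# \\n in a raw pattern matches a literal backslash followed by 'n'.
from_literal = re.findall(r'(\S+)\\n', literal)

# \n in a raw pattern matches an actual newline character.
from_newline = re.findall(r'(\S+)\n', newline)

# Using the wrong pattern for the input finds nothing.
mismatch = re.findall(r'(\S+)\\n', newline)

print(from_literal)  # ['NBCUniversal', 'VOLGAFILMINC']
print(from_newline)  # ['NBCUniversal', 'VOLGAFILMINC']
print(mismatch)      # []
```

Printing repr(s), as in the question, is a quick way to tell which case applies: a doubled backslash in the repr means the string holds a literal backslash.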

Upvotes: 0
