Reputation: 110093
I have the following string:
>>> repr(s)
" NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp
I want to match the string just before each \\n, going back to the preceding whitespace character. The output should be:
['NBCUniversal', 'VOLGAFILMINC']
Here is what I have so far:
re.findall(r'[^s].+\\n\d{1,2}', s)
What would be the correct regex for this?
Upvotes: 0
Views: 105
Reputation: 365627
If you want to fix your existing code instead of replacing it, you're on the right track; you've just got a few minor problems.
Let's start with your pattern:
>>> re.findall(r'[^s].+\\n\d{1,2}', s)
[' NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64']
The first problem is that .+ will match everything that it can, all the way up to the very last \\n\d{1,2}, rather than just to the next \\n\d{1,2}. To fix that, add a ? to make it non-greedy:
>>> re.findall(r'[^s].+?\\n\d{1,2}', s)
[' NBCUniversal\\n63', ' VOLGAFILM, INC VOLGAFILMINC\\n64']
Notice that we now have two strings, as we should. The problem is, those strings don't just have whatever matched the .+?, they have whatever matched the entire pattern. To fix that, wrap the part you want to capture in () to make it a capturing group:
>>> re.findall(r'[^s](.+?)\\n\d{1,2}', s)
[' NBCUniversal', ' VOLGAFILM, INC VOLGAFILMINC']
That's nicer, but it still has a bunch of extra stuff on the left end. Why? Well, you're capturing everything after [^s]. That means any character except the letter s. You almost certainly meant [\s], meaning any character in the whitespace class. (Note that \s is already the whitespace class, so the brackets in [\s] are unnecessary.) That's better, but that's still only going to match one space, not all the spaces. And it will match the earliest space it can that still leaves .+? something to match, not the latest. So if you want to suck up all the excess spaces, you need to repeat it:
>>> re.findall(r'\s+(.+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILM, INC VOLGAFILMINC']
Getting closer… but the .+? matches anything, including the space between VOLGAFILM and VOLGAFILMINC, and again, the \s+ is going to match the first run of spaces it can, leaving the .+? to match everything after that.
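To see that concretely, here is a small sketch. It assumes s is the sample string from the question with single spaces and a literal backslash followed by n (not a newline); your real data may differ.

```python
import re

# Assumed sample: literal backslash + 'n', single spaces between words.
s = r" NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "

pattern = r'\s+(.+?)\\n\d{1,2}'

# The lazy .+? only stops early on the right; it cannot move the start of the
# match to the right, so the capture begins right after the first run of spaces.
lazy_any = re.findall(pattern, s)
print(lazy_any)  # ['NBCUniversal', 'VOLGAFILM, INC VOLGAFILMINC']

# Look at what \s+ actually consumed in the second match.
second = list(re.finditer(pattern, s))[1]
consumed_by_spaces = s[second.start():second.start(1)]
print(repr(consumed_by_spaces))  # ' '
```

The engine always prefers the leftmost possible match, which is why making .+? lazy doesn't help on the left side.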
You could fiddle with the prefix, but there's an easier solution. If you don't want spaces in your capture group, just capture a run of nonspaces instead of a run of anything, using \S:
>>> re.findall(r'\s+(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
And notice that once you've done that, the \s+ isn't really doing anything anymore, so let's just drop it:
>>> re.findall(r'(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
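If you end up using this in a script rather than at the prompt, something like the following would do. It's only a sketch: the names NAME_BEFORE_MARKER and names_before_marker are made up for illustration, and the sample assumes a literal backslash followed by n, as in the question.

```python
import re

# Hypothetical name; the pattern is the final one from the walkthrough above.
NAME_BEFORE_MARKER = re.compile(r'(\S+?)\\n\d{1,2}')


def names_before_marker(text):
    # Return every run of non-whitespace that sits directly before a literal
    # backslash-n followed by one or two digits.
    return NAME_BEFORE_MARKER.findall(text)


# Assumed sample, written as a raw string so \n stays backslash + 'n'.
s = r" NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "

names = names_before_marker(s)
print(names)  # ['NBCUniversal', 'VOLGAFILMINC']

# Text without the marker simply yields an empty list.
nothing = names_before_marker("no markers here")
print(nothing)  # []
```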
I've obviously made some assumptions above that are correct for your sample input, but may not be correct for real data. For example, if you had a string like Weyland-Yutani\\n…, I'm assuming you want Weyland-Yutani, not just Yutani. If you have a different rule, like only letters, just change the part in parentheses to whatever fits that rule, like (\w+?) or ([A-Za-z]+?).
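Here's a quick comparison of those three capture rules. The sample string, and the 65 after the marker, are invented for illustration.

```python
import re

# Hypothetical sample with a hyphenated name before a literal backslash + 'n'.
sample = r" Weyland-Yutani\n65 "

# Non-whitespace: the hyphen is part of the capture.
non_space = re.findall(r'(\S+?)\\n\d{1,2}', sample)
print(non_space)  # ['Weyland-Yutani']

# Word characters: the hyphen is not \w, so only the last piece survives.
word_chars = re.findall(r'(\w+?)\\n\d{1,2}', sample)
print(word_chars)  # ['Yutani']

# ASCII letters only: same result here, but digits and underscores in a name
# would also be excluded.
letters = re.findall(r'([A-Za-z]+?)\\n\d{1,2}', sample)
print(letters)  # ['Yutani']
```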
Upvotes: 1
Reputation: 89547
EDIT: Sorry, I hadn't read your question carefully.
If you want to find all groups of letters immediately before a literal \n, re.findall is appropriate. You can obtain the result you want with:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> re.findall(r'(?i)[a-z]+(?=\\n)', s)
['NBCUniversal', 'VOLGAFILMINC']
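In case the (?=...) part is unfamiliar: it is a lookahead, which checks that the literal \n follows without making it part of the match. A rough sketch comparing it with a plain capturing group, using the same sample string:

```python
import re

s = r" NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "

# Lookahead: the backslash-n must follow, but is not included in the match.
with_lookahead = re.findall(r'(?i)[a-z]+(?=\\n)', s)
print(with_lookahead)  # ['NBCUniversal', 'VOLGAFILMINC']

# Capturing group: the backslash-n is consumed, and findall returns only the group.
with_group = re.findall(r'(?i)([a-z]+)\\n', s)
print(with_group)  # ['NBCUniversal', 'VOLGAFILMINC']

# The difference shows up in the full match text (group 0).
first_lookahead = re.search(r'(?i)[a-z]+(?=\\n)', s).group(0)
first_group = re.search(r'(?i)([a-z]+)\\n', s).group(0)
print(repr(first_lookahead))  # 'NBCUniversal'
print(repr(first_group))      # 'NBCUniversal\\n'
```

Both give the same list with re.findall; the lookahead mostly matters if you use the whole match or its span.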
OLD ANSWER:
re.findall is not the appropriate method since you only need one result (that is, a pair of strings). Here the re.search method is more appropriate:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> res = re.search(r'(?i)^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)
>>> res.groups()
('NBCUniversal', 'VOLGAFILM')
Note: I have assumed that there are no other characters between the first word and the literal \n, but if that isn't the case, you can add [^a-z\\]* before the \\n in the pattern.
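A sketch of the same search with the flag passed as an argument (re.IGNORECASE) instead of inline, plus the more tolerant variant from the note. The second sample string and the variable names are invented for illustration; re.search returns None when nothing matches, so check before calling groups().

```python
import re

s = r" NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "

# Same pattern as above, with the case-insensitive flag passed separately.
strict = re.compile(r'^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', re.IGNORECASE)
res = strict.search(s)
pair = res.groups() if res is not None else None
print(pair)  # ('NBCUniversal', 'VOLGAFILM')

# Variant from the note: allow non-letter characters between the first word
# and the literal backslash-n.
tolerant = re.compile(r'^[^a-z\\]*([a-z]+)[^a-z\\]*\\n[^a-z\\]*([a-z]+)', re.IGNORECASE)
res2 = tolerant.search(r"Acme 12\n63 Bolt")  # hypothetical input
pair2 = res2.groups() if res2 is not None else None
print(pair2)  # ('Acme', 'Bolt')

# No literal backslash-n in the input: search returns None.
missing = strict.search("no match here")
print(missing)  # None
```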
Upvotes: 1
Reputation: 168988
Assuming that the input actually has the sequence \n (backslash followed by letter 'n') and not a newline, this will work:
>>> re.findall(r'(\S+)\\n', s)
['NBCUniversal', 'VOLGAFILMINC']
If the string actually contains newlines then replace \\n with \n in the regular expression.
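If you're not sure which case you have, a quick check along these lines may help. The two shortened sample strings are made up: one holds a literal backslash plus n, the other real newline characters.

```python
import re

# Raw string: backslash followed by the letter 'n' (what repr shows as \\n).
literal = r"NBCUniversal\n63 VOLGAFILMINC\n64"
# Normal string: \n is a real newline character.
newline = "NBCUniversal\n63 VOLGAFILMINC\n64"

# r'\\n' in the pattern matches a literal backslash + 'n'.
literal_hits = re.findall(r'(\S+)\\n', literal)
print(literal_hits)  # ['NBCUniversal', 'VOLGAFILMINC']

# r'\n' in the pattern matches a newline character.
newline_hits = re.findall(r'(\S+)\n', newline)
print(newline_hits)  # ['NBCUniversal', 'VOLGAFILMINC']

# Using the wrong one finds nothing.
wrong_way = re.findall(r'(\S+)\\n', newline)
print(wrong_way)  # []
```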
Upvotes: 0