lilshadowy

Reputation: 23

Extracting words outside of double quotes with Regex Python

I have this sentence: "int open(const char *" pathname ", int " flags );

I am trying to find a regex that extracts the words outside the double quotes, in this example "pathname" and "flags". The regex I wrote only catches the word "flags" and not the word "pathname". Here is what I have:

 reg2 = r"""(\".*\" (.*) )+\);"""
 pattern2 = re.compile(reg2)

 inner = m.group(1)
 m2 = pattern2.search(inner)
 EntityI = m2.group(2)
 print EntityI

Note: m.group(1) is: "int open(const char *" pathname ", int " flags );

Thanks for the help!

Edit: Just to clarify some more, another possible case could be:

"int open(const char *" pathname ", int " flags ", mode_t " mode );

And I would want to extract the words: "pathname", "flags", and "mode".

Upvotes: 1

Views: 903

Answers (2)

wp78de

Reputation: 18950

This is a perfect case for the trash-can approach: forget everything that is not in capture group 1.

".*?"|(\w+)

Explanation: the pattern chooses between two alternatives separated by |

  • ".*?" matches a quoted string from its opening quote to its closing quote. The . matches any character (except a newline) and the * quantifier allows any number of repetitions. The ? makes the star lazy, so it matches as few characters as possible and stops at the next quote instead of greedily running on to the last quote in the string.
  • (\w+): the parentheses define a capture group that captures one or more (+) word characters. \w is a shorthand character class that, for ASCII text, stands for [a-zA-Z0-9_]. Because the quoted alternative is tried first, quoted text is consumed without filling group 1, so only words outside the quotes end up captured.

Sample code:

import re

# Quoted segments match the first alternative and leave group 1 empty;
# bare words match the second alternative and are captured in group 1.
regex = r'".*?"|(\w+)'
test_str = '"int open(const char *" pathname ", int " flags );'

for match in re.finditer(regex, test_str):
    if match.group(1):  # skip matches that were quoted segments
        print("Found at {start}-{end}: {group}".format(
            start=match.start(1), end=match.end(1), group=match.group(1)))

Output:

Found at 24-32: pathname
Found at 42-47: flags
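
If only the words are needed and not their positions, the same pattern can be wrapped in a small helper. This is a minimal sketch: the function name words_outside_quotes is made up for illustration, and it relies on re.findall returning an empty string for matches where group 1 did not participate (Python 3 behavior). It also covers the three-parameter case from the question's edit.

```python
import re

# Same pattern as in the answer: quoted text first, bare words in group 1.
PATTERN = re.compile(r'".*?"|(\w+)')

def words_outside_quotes(text):
    # Hypothetical helper. With a single capture group, findall returns
    # the contents of group 1 for every match; matches that consumed a
    # quoted segment contribute an empty string, which is filtered out.
    return [word for word in PATTERN.findall(text) if word]

two_params = '"int open(const char *" pathname ", int " flags );'
three_params = '"int open(const char *" pathname ", int " flags ", mode_t " mode );'

print(words_outside_quotes(two_params))    # ['pathname', 'flags']
print(words_outside_quotes(three_params))  # ['pathname', 'flags', 'mode']
```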

Upvotes: 2

Calum You

Reputation: 15072

Here's one way: replace everything inside quotes with a placeholder and then split the resulting string on that placeholder. You'll probably want to do more processing, since the ); is also outside the quotes.

import re
my_string = '"int open(const char *" pathname ", int " flags );'
re.sub('".*?"', '_', my_string).split('_')[1:]
## [' pathname ', ' flags );']
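
One caveat with the _ placeholder is that a parameter name containing an underscore would itself be split. A possible variation, sketched here under the assumption that only word characters are wanted, splits directly on the quoted segments with re.split and then pulls the words out of each remaining piece. The function name split_outside_quotes and the path_name example are made up for illustration.

```python
import re

def split_outside_quotes(text):
    # Hypothetical helper. Split on the quoted segments themselves, so no
    # placeholder character is needed.
    pieces = re.split(r'".*?"', text)
    words = []
    for piece in pieces:
        # Keep only runs of word characters, which drops spaces and ');'.
        words.extend(re.findall(r'\w+', piece))
    return words

my_string = '"int open(const char *" pathname ", int " flags );'
print(split_outside_quotes(my_string))                    # ['pathname', 'flags']
print(split_outside_quotes('"int f(int " path_name );'))  # ['path_name']
```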

Upvotes: 0
