TomSelleck

Reputation: 6968

Using regex to extract information from string

I am trying to write a regex in Python to extract some information from a string.

Given:

"Only in Api_git/Api/folder A: new.txt"

I would like to print:

Folder Path: Api_git/Api/folder A
Filename: new.txt

After having a look at some examples on the re manual page, I'm still a bit stuck.

This is what I've tried so far

m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")

print m.group('folder_path')
print m.group('filename')

Can anybody point me in the right direction?

Upvotes: 1

Views: 219

Answers (3)

Robᵩ

Reputation: 168616

Your pattern: (Only in ?P<folder_path>\w+:?P<filename>\w+) has a few flaws in it.

The ?P construct is only valid as the first bit inside a parenthesized expression, so we need this.

(Only in (?P<folder_path>\w+):(?P<filename>\w+))

The \w character class matches only letters, digits, and underscores. It won't match /, ., or a space, for example. We need to use a different character class that more closely aligns with requirements. In fact, we can just use ., which matches nearly any character (everything except a newline, by default):

(Only in (?P<folder_path>.+):(?P<filename>.+))
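To see the difference between \w+ and .+ in isolation, here is a minimal sketch that applies just the first group to the example string. The variable names are only for illustration.

```python
import re

text = "Only in Api_git/Api/folder A: new.txt"

# \w+ stops at the first character that is not a letter, digit, or underscore.
word_match = re.match(r"Only in (?P<folder_path>\w+)", text)
print(word_match.group("folder_path"))  # Api_git

# .+ is greedy: with nothing after it in the pattern, it runs to the end of the line.
dot_match = re.match(r"Only in (?P<folder_path>.+)", text)
print(dot_match.group("folder_path"))  # Api_git/Api/folder A: new.txt
```

This is why the pattern needs something after the first group (the colon) to tell .+ where the folder path ends.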

The colon has a space after it in your example text. We need to match it:

(Only in (?P<folder_path>.+): (?P<filename>.+))

The outermost parentheses are not needed. They aren't wrong, just not needed:

Only in (?P<folder_path>.+): (?P<filename>.+)

It is often convenient to provide the regular expression separate from the call to the regular expression engine. This is easily accomplished by creating a new variable, for example:

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt") 

The above is purely for the convenience of the programmer: it neither saves nor squanders time or memory space. There is, however, a technique that can save some of the time involved in regular expressions: compiling.

Consider this code segment:

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...

For each iteration of the loop, the regular expression engine must interpret the regular expression and apply it to the line variable. The re module allows us to separate the interpretation from the application; we can interpret once but apply several times:

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = re.match(regex, line)
    ...
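A compiled pattern object also has its own match method, so the call inside the loop can be written as regex.match(line). A minimal sketch, assuming a hypothetical list of lines in place of input_file:

```python
import re

regex = re.compile(r"Only in (?P<folder_path>.+): (?P<filename>.+)")

# Hypothetical sample lines standing in for input_file.
lines = [
    "Only in Api_git/Api/folder A: new.txt",
    "Files a.txt and b.txt differ",
]

results = []
for line in lines:
    m = regex.match(line)  # same result as re.match(regex, line)
    if m:  # lines that do not match give None
        results.append((m.group("folder_path"), m.group("filename")))

print(results)  # [('Api_git/Api/folder A', 'new.txt')]
```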

Now, your original program should look like this:

import re
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))

However, I'm a fan of using comments to explain regular expressions. My version, including some general cleanup, looks like this:

import re
regex = re.compile(r'''(?x)                # Verbose
            Only\ in\             # Literal match
            (?P<folder_path>.+)   # match longest sequence of anything, and put in 'folder_path'
            :\                    # Literal match
            (?P<filename>.+)      # match longest sequence of anything and put in 'filename'
            ''')

with open('diff.out') as input_file:
    for line in input_file:
        m = re.match(regex, line)
        if m:  # skip lines that do not match
            print(m.group('folder_path'))
            print(m.group('filename'))

Upvotes: 1

f.rodrigues

Reputation: 3587

It really depends on how constrained the input is. If this is the only form the input takes, this will do the trick:

^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$
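In a pattern like this, the dot before txt needs a backslash if it is meant literally, because a bare . matches any character. A small sketch of the difference, using a made-up line whose "filename" has no dot at all:

```python
import re

unescaped = re.compile(r"^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*.txt)$")
escaped = re.compile(r"^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$")

# Hypothetical input: "newXtxt" is not a .txt filename.
line = "Only in Api_git/Api/folder A: newXtxt"

unescaped_hit = bool(unescaped.match(line))
escaped_hit = bool(escaped.match(line))
print(unescaped_hit)  # True: the bare dot matched the "X"
print(escaped_hit)    # False: \. only matches a literal dot
```

Note that these character classes are deliberately narrow: a folder or file name containing digits, for instance, would not match either version.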

Upvotes: 0

Braj

Reputation: 46841

Use capturing groups and read the matches from groups 1 and 2.

^Only in ([^:]*): (.*)$

sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"

print(re.findall(p, test_str))
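For reference, here is a sketch of what findall returns for this pattern, plus one behavioural difference between [^:]* and a greedy .+ for the folder part. The second input line is invented purely to show where each pattern splits.

```python
import re

p = re.compile(r"^Only in ([^:]*): (.*)$")

# With two unnamed groups, findall returns a list of (group 1, group 2) tuples.
found = re.findall(p, "Only in Api_git/Api/folder A: new.txt")
print(found)  # [('Api_git/Api/folder A', 'new.txt')]

# Hypothetical line containing a second colon.
tricky = "Only in dir: a: b.txt"
first_colon = p.match(tricky).groups()
last_colon = re.match(r"Only in (.+): (.+)", tricky).groups()
print(first_colon)  # ('dir', 'a: b.txt')   [^:]* splits at the first colon
print(last_colon)   # ('dir: a', 'b.txt')   greedy .+ splits at the last ": "
```

Which split is right depends on whether colons can appear in the folder path or in the filename.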

If you want to print in the format below, try substitution:

Folder Path: Api_git/Api/folder A 
Filename: new.txt

sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
# Python's re.sub refers to groups as \1 and \2 (not $1 and $2)
subst = r"Folder Path: \1\nFilename: \2"

result = re.sub(p, subst, test_str)
print(result)
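If substitution feels indirect, another option is to match once and format the groups yourself. This is only a sketch: it swaps in named groups (folder_path and filename, borrowed from the question) so that groupdict() can feed str.format.

```python
import re

# Same pattern as above, but with named groups (an assumption for this sketch).
p = re.compile(r"^Only in (?P<folder_path>[^:]*): (?P<filename>.*)$")
m = p.match("Only in Api_git/Api/folder A: new.txt")

output = None
if m:  # match() returns None when the line does not fit the pattern
    output = "Folder Path: {folder_path}\nFilename: {filename}".format(**m.groupdict())
    print(output)
```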

Upvotes: 4
