Alexander Borochkin
Alexander Borochkin

Reputation: 4621

Extract word form string using regex word boundaries in python

Suppose I have such a file name and I want to extract part of it as a string in Python

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('\b_[A-Z]{2}\b')
print(re.findall(rgx, fn))

Expected out put [DE], but actual out is [].

Upvotes: 1

Views: 840

Answers (6)

Daweo
Daweo

Reputation: 36540

You could use regular expression (re module) for that as already shown, however this could be done without using any imports, following way:

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
out = [i for i in fn.split('_')[1:] if len(i)==2 and i.isalpha() and i.isupper()]
print(out) # ['DE']

Explanation: I split fn at _ then discard 1st element and filter elements so only strs of length 2, consisting of letters and consisting of uppercases remain.

Upvotes: 0

MichaΕ‚ Turczyn
MichaΕ‚ Turczyn

Reputation: 37377

Try pattern: \_([^\_]+)\_[^\_\.]+\.xlsx

Explanation:

\_ - match _ literally

[^\_]+ - negated character class with + operator: match one or more times character other than _

[^\_\.]+ - same as above, but this time match characters other than _ and .

\.xlsx - match .xlsx literally

Demo

The idea is to match last pattern _something_ before extension .xlsx

Upvotes: 1

U13-Forward
U13-Forward

Reputation: 71580

Another re solution:

rgx = re.compile('_([A-Z]{1,})_')
print(re.findall(rgx, fn))

Upvotes: 0

Emma
Emma

Reputation: 27723

Your desired output seems to be DE which is in bounded with two _ from left and right. This expression might also work:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]+)_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match πŸ’šπŸ’šπŸ’š ")
else: 
    print('πŸ™€ Sorry! No matches!')

Output

YAAAY! "DE" is a match πŸ’šπŸ’šπŸ’š

Or you can add a 2 quantifier, if you might want:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]{2})_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match πŸ’šπŸ’šπŸ’š ")
else: 
    print('πŸ™€ Sorry! No matches!')

enter image description here

DEMO

Upvotes: 1

Jan
Jan

Reputation: 43169

You could use

(?<=_)[A-Z]+(?=_)

This makes use of lookarounds on both sides, see a demo on regex101.com. For tighter results, you'd need to specify more sample inputs though.

Upvotes: 2

Rakesh
Rakesh

Reputation: 82765

Use _([A-Z]{2})

Ex:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('_([A-Z]{2})')
print(rgx.findall(fn))           #You can use the compiled pattern to do findall. 

Output:

['DE']

Upvotes: 1

Related Questions