sre

Reputation: 279

How to match a token containing word characters and dots that follows a fixed phrase, up to the next space, using regex

Input text

str_ = '''abc xyz pq m_www.google.in_10 -name itel.google.in
abc xyz pq I_www.google.in_9 -name itel.google.com
abc xyz pq I_www.google.in_8 
abc xyz pq I.www_google.com_10 -name itel_google.com_9'''

I need to extract the token that follows 'abc xyz pq ' up to the next space. This token can contain \w characters and dots. I also want to extract the token that follows '-name ', if there is one. The two tokens from each line should form one list.

Expected output (as a list)

[['m_www.google.in_10', 'itel.google.in'],
 ['I_www.google.in_9', 'itel.google.com'],
 ['I_www.google.in_8', ''],
 ['I.www_google.com_10', 'itel_google.com_9']]

My attempt so far

import re
re.findall(r'abc xyz pq (\w+)\.(\w+)\.(\w+) -name? (\w+?)\.(\w+?)\.(\w+?)',str_ )

Upvotes: 4

Views: 78

Answers (2)

RomanPerekhrest

Reputation: 92854

With a specific regex pattern:

import re
from pprint import pprint

s = '''abc xyz pq m_www.google.in_10 -name itel.google.in
abc xyz pq I_www.google.in_9 -name itel.google.com
abc xyz pq I_www.google.in_8 
abc xyz pq I.www_google.com_10 -name itel_google.com_9'''

res = list(map(list, re.findall(r'\babc xyz pq (\w+[.\w]+)(?: -name (\w+[.\w]+))?', s)))
pprint(res)

The output (a list of lists):

[['m_www.google.in_10', 'itel.google.in'],
 ['I_www.google.in_9', 'itel.google.com'],
 ['I_www.google.in_8', ''],
 ['I.www_google.com_10', 'itel_google.com_9']]

Regex pattern details:

  • \b - word boundary, so abc is not matched in the middle of a longer word
  • (\w+[.\w]+) - captures one or more word characters (\w+) followed by one or more characters that are each either a dot or a word character ([.\w]+)
  • (?: ...) - marks the group as non-capturing; here it wraps another capturing group (the inner group)
  • (...)? - makes the group optional (the ? quantifier matches zero or one time)
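
To illustrate the optional group described above, here is a minimal sketch that applies the same pattern to two single lines taken from the input. The variable names (pattern, with_name, without_name) are only illustrative.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
pattern = re.compile(r'\babc xyz pq (\w+[.\w]+)(?: -name (\w+[.\w]+))?')

with_name = pattern.search('abc xyz pq m_www.google.in_10 -name itel.google.in')
without_name = pattern.search('abc xyz pq I_www.google.in_8 ')

# Both capturing groups take part in the match.
print(with_name.groups())

# The optional group did not take part, so its inner group is None here.
print(without_name.groups())

# re.findall reports a group that did not take part as '' instead of None.
print(pattern.findall('abc xyz pq I_www.google.in_8 '))
```

This is why the third row of the result contains an empty string rather than None.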

Upvotes: 3

anubhava

Reputation: 785376

You may use this regex in re.findall:

>>> import re
>>> for i in re.findall(r'abc xyz pq\s+([\w.]+)(?:\s+-name\s+([\w.]+))?', str_):
...     print(i)
...
('m_www.google.in_10', 'itel.google.in')
('I_www.google.in_9', 'itel.google.com')
('I_www.google.in_8', '')
('I.www_google.com_10', 'itel_google.com_9')

Note that re.findall returns a list of tuples here, not the list of lists you expect, but you can iterate over the result and build whatever structure you need.
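
One possible way to get a list of lists from the re.findall result is a list comprehension. This is a sketch that reuses the input from the question; the names pattern and rows are illustrative.

```python
import re

str_ = '''abc xyz pq m_www.google.in_10 -name itel.google.in
abc xyz pq I_www.google.in_9 -name itel.google.com
abc xyz pq I_www.google.in_8
abc xyz pq I.www_google.com_10 -name itel_google.com_9'''

pattern = r'abc xyz pq\s+([\w.]+)(?:\s+-name\s+([\w.]+))?'

# re.findall yields one tuple per match; convert each tuple to a list.
rows = [list(pair) for pair in re.findall(pattern, str_)]

for row in rows:
    print(row)
```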

Alternatively, you may use re.finditer and build your custom list from the match objects.
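
A minimal sketch of the re.finditer variant, assuming the same input and pattern as above. Unlike re.findall, a match object reports a missing optional group as None, so groups(default='') is used here to get an empty string instead. The names pattern and result are illustrative.

```python
import re

str_ = '''abc xyz pq m_www.google.in_10 -name itel.google.in
abc xyz pq I_www.google.in_9 -name itel.google.com
abc xyz pq I_www.google.in_8
abc xyz pq I.www_google.com_10 -name itel_google.com_9'''

pattern = re.compile(r'abc xyz pq\s+([\w.]+)(?:\s+-name\s+([\w.]+))?')

result = []
for m in pattern.finditer(str_):
    # groups(default='') replaces a group that did not match with ''.
    result.append(list(m.groups(default='')))

for row in result:
    print(row)
```

Working with match objects also gives access to positions (m.start(), m.end()) if they are needed later.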

Upvotes: 3
