Sven Dugojevic

Reputation: 35

Regex .search with grouping is not collecting groups

I am trying to search through the following list:

/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/

using this code:

next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search)) #search or .match

for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())

and I am getting the following error: AttributeError: 'str' object has no attribute 'group'.

I thought that the parentheses around the \d+ would capture the one or more digits as a group. My goal is to get the number preceding "_p/" at the end of each string.

Upvotes: 1

Views: 119

Answers (5)

Nick

Reputation: 147166

You are filtering your original list, so filter returns the original strings, not the match objects. If you want the match objects, you need to map the search over the list, then filter out the None results. For example:

next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m:m is not None, map(next_page.search, href_search))

for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())

Output:

/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/

If you only want the number part of the match, use match.group(1) instead of match.group().
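
To make the difference between group() and group(1) concrete, here is a minimal sketch. The short href_search list is a made-up stand-in for the list in the question, and the entry without a page suffix is an assumption added only to show that None results are dropped.

```python
import re

# Shortened stand-in for the question's list (assumption for illustration).
href_search = [
    "/for_sale/9_zm/2_p/",
    "/for_sale/9_zm/3_p/",
    "/for_sale/9_zm/",  # no page suffix, so search() returns None
]

next_page = re.compile(r"/(\d+)_p/$")

# map() yields a match object or None per string; keep only real matches.
matches = [m for m in map(next_page.search, href_search) if m is not None]

whole = [m.group() for m in matches]   # entire matched text, e.g. "/2_p/"
pages = [m.group(1) for m in matches]  # only the first capture group, e.g. "2"

print(whole)
print(pages)
```

A list comprehension is used here instead of filter with a lambda; both approaches keep the match objects rather than the original strings.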

Upvotes: 1

Joe Ferndz

Reputation: 8508

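You can use the regex (?<=\/)\d+(?=\_p\/$). See regex101 for an example. (The backslashes before / and _ are optional in Python; they are harmless but not required.)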

Explanation:

(?<=\/) : Look behind for /

\d+ : Look for one or more digits

(?=\_p\/$) : Look ahead for _p/ at the end of string

Because the lookbehind and lookahead do not consume any characters, a match returns only the digits matched by \d+.

You can either grab all the numbers at once from the whole text or iterate through it line by line.

Below is the code for both:

text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''

import re
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)',txt)
    print (t)

t = re.findall(r'(?<=\/)\d+(?=\_p\/)',text_line)
print (t)

The first part processes the text line by line; the second grabs all the numbers in one call.

Output of both are:

Line by line:

['2']
['3']
['6']
['7']
['8']
['2']

Grab all at once:

['2', '3', '6', '7', '8', '2']

For the second one, I left out the $ sign. Without the re.M flag, $ matches only at the end of the whole string, so keeping it would return just the last number.
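
The effect of $ on a multi-line string can be checked with a small experiment. This is only a sketch: the three-line text is a shortened, made-up stand-in for the data above, and the unescaped form of the pattern is used since / and _ need no escaping in Python.

```python
import re

# Shortened stand-in for the multi-line text (assumption for illustration).
text = "/a/9_zm/2_p/\n/a/9_zm/3_p/\n/a/9_zm/6_p/"

# Without re.MULTILINE, $ matches only at the very end of the string.
end_only = re.findall(r"(?<=/)\d+(?=_p/$)", text)

# With re.MULTILINE, $ also matches just before each newline.
per_line = re.findall(r"(?<=/)\d+(?=_p/$)", text, flags=re.MULTILINE)

# Dropping $ matches "/<digits>_p/" anywhere in the text.
anywhere = re.findall(r"(?<=/)\d+(?=_p/)", text)

print(end_only, per_line, anywhere)
```

Keeping $ together with re.MULTILINE is the stricter choice, since it still requires _p/ to sit at the end of a line.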

Upvotes: 0

Luiz Amaral

Reputation: 57

The filter function only removes the items that don't match the regex and returns the remaining strings unchanged, e.g.:

>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']

If you want the match object then a list comprehension could do the trick:

>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example]  # Get the matches
[None,
 <re.Match object; span=(3, 5), match='45'>,
 None,
 <re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None]  # Filter None values
['45', '123']
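
As a possible variation on the nested comprehension above, the assignment expression (the "walrus" operator, which requires Python 3.8 or later) lets each line be searched once and filtered in the same comprehension. The name trailing_digits is made up for this sketch.

```python
import re

example = ["abc", "def45", "67ghi", "123"]
my_match = re.compile(r"\d+$")

# (m := ...) binds the search result to m; a None result is falsy,
# so lines without trailing digits are skipped.
trailing_digits = [m.group() for line in example if (m := my_match.search(line))]

print(trailing_digits)
```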

Upvotes: 0

bb1

Reputation: 7863

If href_search is a single multi-line string (findall raises a TypeError when given a list), you can try this:

import re

# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$',  re.M)
matches = next_page.findall(href_search)
print(matches)

It gives:

['2', '3', '6', '7', '8', '2']
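
The snippet above passes href_search straight to findall, which works when it is one string. If href_search is a list of strings, as the question describes, one option is to join the items with newlines first. The two-item list below is a shortened stand-in, and the names joined and pages are made up for this sketch.

```python
import re

# Shortened stand-in for the question's list (assumption for illustration).
href_search = ["/for_sale/9_zm/2_p/", "/for_sale/9_zm/3_p/"]

# findall() needs a string, so join the list items with newlines.
joined = "\n".join(href_search)

# re.M lets $ match at the end of every line, not just the last one.
next_page = re.compile(r"/(\d+)_p/$", re.M)

# With a single capture group, findall() returns the group text directly.
pages = next_page.findall(joined)

print(pages)
```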

Upvotes: 0

Kris

Reputation: 23569

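I think findall should do the trick, assuming href_search is a single multi-line string and next_page is compiled with re.M (otherwise $ matches only at the end of the whole string and you get just the last number):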

next_page.findall(href_search)  # ['2', '3', '6', '7', '8', '2']

Alternatively, you could split the lines and then search them individually:

matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))

matches  # ['2', '3', '6', '7', '8', '2']

Upvotes: 0
