Reputation: 35
I am trying to search through the following list:
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
using this code:
next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search)) #search or .match
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())
and am getting the following error: AttributeError: 'str' object has no attribute 'group'.
I thought that the parentheses around the \d+ would group the one or more digits. My goal is to get the number preceding "_p/" at the end of the string.
Upvotes: 1
Views: 119
Reputation: 147166
You are filtering your original list, so what filter returns are the original strings, not the match objects. If you want the match objects, you need to map the search over the list and then filter out the None results. For example:
next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m:m is not None, map(next_page.search, href_search))
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())
Output:
/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/
If you only want the number part of the match, use match.group(1) instead of match.group().
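To see the difference between the two calls, here is a minimal sketch. The short href_search list is a made-up stand-in for the list in the question, and the names whole and page_numbers are only for illustration.

```python
import re

# Shortened stand-in for the list of hrefs in the question (assumed data).
href_search = [
    "/for_sale/0-150000_price/9_zm/2_p/",
    "/for_sale/0-150000_price/9_zm/3_p/",
    "/for_sale/0-150000_price/9_zm/",  # no page suffix, so search() returns None
]

next_page = re.compile(r'/(\d+)_p/$')

# map() yields match objects (or None); keep only the real matches.
matches = [m for m in map(next_page.search, href_search) if m is not None]

whole = [m.group() for m in matches]                # the full matched text
page_numbers = [int(m.group(1)) for m in matches]   # only the captured digits

print(whole)         # ['/2_p/', '/3_p/']
print(page_numbers)  # [2, 3]
```

Converting with int() is optional; it is only useful if the page number is needed for arithmetic, such as computing the next page.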
Upvotes: 1
Reputation: 8508
You can use the regex (?<=\/)\d+(?=\_p\/$). See regex101 for an example.
Explanation:
(?<=\/) : Look behind for /
\d+ : Look for one or more digits
(?=\_p\/$) : Look ahead for _p/ at the end of the string
If there is a match, only the \d+ value is returned.
You can either write the code to grab all the data at once or iterate through them line by line and get the data you need.
Below is the code for both:
text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''
import re
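# Line by line: the $ anchors the match at the end of each line
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)', txt)
    print(t)
# All at once: no $, so every /N_p/ in the whole string matches
t = re.findall(r'(?<=\/)\d+(?=\_p\/)', text_line)
print(t)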
The first part processes the text line by line; the second grabs everything in one shot.
The output of both is:
Line by line:
['2']
['3']
['6']
['7']
['8']
['2']
Grab all at once:
['2', '3', '6', '7', '8', '2']
For the second one, I left out the $ sign: without the re.M flag it would only match at the very end of the whole string, and we need to grab the number from every line.
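One caveat with dropping the $: the pattern then also matches a /N_p/ segment in the middle of a path. A rough sketch of an alternative is to keep the $ and pass re.M so that it anchors at the end of each line. The two-line text below is invented for illustration and is not from the question.

```python
import re

# Made-up two-line input; the first line has a /5_p/ segment in the middle.
text = "/a/5_p/b/2_p/\n/a/9_zm/3_p/"

# No anchor: every /N_p/ matches, including the one in the middle of line 1.
unanchored = re.findall(r'(?<=/)\d+(?=_p/)', text)

# $ with re.M: only a /N_p/ at the end of a line matches.
anchored = re.findall(r'(?<=/)\d+(?=_p/$)', text, flags=re.M)

print(unanchored)  # ['5', '2', '3']
print(anchored)    # ['2', '3']
```

For the URLs in the question both versions give the same result, since _p/ only appears at the end of each line.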
Upvotes: 0
Reputation: 57
The filter function only removes the lines that don't match the regex and returns the original strings, e.g.:
>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']
If you want the match objects, then a list comprehension could do the trick:
>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example] # Get the matches
[None,
<re.Match object; span=(3, 5), match='45'>,
None,
<re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None] # Filter None values
['45', '123']
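On Python 3.8 or later, the nested comprehension can possibly be flattened with an assignment expression. This is only a sketch of the same idea using the example list above; the name found is made up here.

```python
import re

example = ["abc", "def45", "67ghi", "123"]
my_match = re.compile(r"\d+$")

# The walrus operator binds the match inside the condition,
# so no nested comprehension is needed to drop the None values.
found = [m.group() for line in example if (m := my_match.search(line))]

print(found)  # ['45', '123']
```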
Upvotes: 0
Reputation: 7863
You can try this, assuming href_search is a single multi-line string rather than a list:
import re
# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$', re.M)
matches = next_page.findall(href_search)
print(matches)
It gives:
['2', '3', '6', '7', '8', '2']
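To illustrate what re.M changes, here is a small sketch. It assumes the hrefs start out as a list of strings (as in the question) and joins them into one string first; href_list, href_text and the shortened paths are stand-ins, not the original data.

```python
import re

# Assumed stand-in data: a few hrefs as a list of strings.
href_list = ["/9_zm/2_p/", "/9_zm/3_p/", "/9_zm/6_p/"]
href_text = "\n".join(href_list)  # one multi-line string

pattern = r'/(\d+)_p/$'

# Without re.M, $ only matches at the very end of the whole string.
without_m = re.findall(pattern, href_text)

# With re.M, $ also matches at the end of every line.
with_m = re.findall(pattern, href_text, flags=re.M)

print(without_m)  # ['6']
print(with_m)     # ['2', '3', '6']
```

Because the pattern has exactly one capturing group, findall returns the captured digits directly rather than the full /N_p/ text.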
Upvotes: 0
Reputation: 23569
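I think findall should do the trick, provided href_search is a single multi-line string and next_page is compiled with re.M so that $ matches at the end of each line: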
next_page.findall(href_search) # ['2', '3', '6', '7', '8', '2']
Alternatively, you could split the lines and then search them individually:
matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))
matches # ['2', '3', '6', '7', '8', '2']
Upvotes: 0