Reputation: 35
The code below works up to the first print(salary_range) call. This is the code:
url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
salaries = soup.find_all("h4", class_="tw-mb-0")
markup2 = str(salaries[0])
soup2 = BeautifulSoup(str(salaries[0]), 'html.parser')
salary_range = soup2.get_text().strip()
print(salary_range) #output: "10 000 – 16 000 PLN"
# error on line below
bottom_salary = re.search(r"^(\d{0,2} ?\d{3})", salary_range).group(1)
print(bottom_salary)
bottom_salary_int = re.sub(" ", "", bottom_salary)
print(bottom_salary_int)
Why doesn't re.search() find any match? I've tried many other regular expressions, but none of them ever matches, and I always get the error AttributeError: 'NoneType' object has no attribute 'group'
Upvotes: 1
Views: 82
Reputation: 11080
The character you think is a space is not actually a space; it is a non-breaking space (NBSP, U+00A0). The two look the same but are different characters: an NBSP renders like a regular space, but a line is never wrapped at it. See this small diagram:
10 000 – 16 000 PLN
  ^   ^^
 NBSP SP ... same deal here
To match the non-breaking space instead, specify it by its hex value, 0xA0, written as the escape \xa0 in the pattern. Like this:
from bs4 import BeautifulSoup
import re
import requests
url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
salaries = soup.find_all("h4", class_="tw-mb-0")
markup2 = str(salaries[0])
soup2 = BeautifulSoup(str(salaries[0]), 'html.parser')
salary_range = soup2.get_text().strip()
print(salary_range)
bottom_salary = re.search(r"^(\d{0,2}\xa0?\d{3})", salary_range).group(1)
print(bottom_salary)
# The match contains an NBSP, so a literal " " would remove nothing; \s matches the NBSP too.
bottom_salary_int = re.sub(r"\s", "", bottom_salary)
print(bottom_salary_int)
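The difference can be reproduced without hitting the website. This is only a minimal sketch: the sample string is an assumed stand-in for the scraped text, with the NBSP and the dash written as explicit escapes, and the variable names are illustrative.

```python
import re

# Assumed stand-in for the scraped text: \xa0 is the NBSP, \u2013 is the dash.
sample = "10\xa0000 \u2013 16\xa0000 PLN"

# Pattern from the question: a literal space never matches the NBSP.
with_space = re.search(r"^(\d{0,2} ?\d{3})", sample)

# Pattern with the NBSP escape: matches the first number including its separator.
with_nbsp = re.search(r"^(\d{0,2}\xa0?\d{3})", sample)

print(with_space)                # None, which is why .group(1) raised AttributeError
print(repr(with_nbsp.group(1)))  # the match still contains the NBSP

# \s matches the NBSP in str patterns, so this strips the separator.
bottom_salary = int(re.sub(r"\s", "", with_nbsp.group(1)))
print(bottom_salary)
```

Checking the result of re.search() for None before calling .group() avoids the AttributeError when a pattern unexpectedly fails.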
If you're trying to match a space, but the regular space character doesn't match, then it might be an NBSP instead. You can also tell from the website's source code: an NBSP is usually encoded there as &nbsp; instead of a regular space.
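When you only have the extracted text and not the page source, one way to check is to inspect the string itself. The sketch below is illustrative: the sample string is an assumption standing in for the scraped value, and describe_whitespace is a hypothetical helper, not part of any library.

```python
import unicodedata

# Assumed stand-in for the scraped text: \xa0 is the NBSP, \u2013 is the dash.
salary_range = "10\xa0000 \u2013 16\xa0000 PLN"

# repr() escapes non-printable characters, so the NBSP shows up as \xa0.
print(repr(salary_range))


def describe_whitespace(text):
    """Return (index, Unicode name) for every whitespace character in text."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace()
    ]


for index, name in describe_whitespace(salary_range):
    print(index, name)
```

Any whitespace reported as NO-BREAK SPACE rather than SPACE will not match a literal space in a regular expression.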
Upvotes: 1
Reputation: 25073
In addition, if you prefer not to name the non-breaking space explicitly, simply change the pattern to (\d+.\d+) or (\d+\s\d+) to get your group. The ^ anchor is also not needed in this specific case.
. matches any character (except a newline):
re.search(r"(\d+.\d+)", e.get_text()).group(1)
\s matches any whitespace character (space, tab, newline and, for str patterns in Python 3, the non-breaking space):
re.search(r"(\d+\s\d+)", e.get_text()).group(1)
To remove the separator, simply split() and join() the match; split() without arguments splits on any whitespace, including the NBSP:
''.join(re.search(r"(\d+.\d+)", e.get_text()).group(1).split())
import requests, re
from bs4 import BeautifulSoup
url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.content, "html.parser")  # name the parser explicitly to avoid a warning
for e in soup.find_all("h4", class_="tw-mb-0"):
    print(''.join(re.search(r"(\d+.\d+)", e.get_text()).group(1).split()))
10000
9000
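If both ends of the range are needed as integers, the same idea extends to re.findall(). This is a rough sketch under assumptions: parse_amounts is a hypothetical helper name, and the sample strings only imitate the scraped text (\xa0 is the NBSP, \u2013 the dash).

```python
import re


def parse_amounts(text):
    """Return every number in text as an int, ignoring whitespace thousands separators."""
    # One or more digits, optionally followed by groups of
    # "one whitespace character + exactly three digits" (\s also matches the NBSP).
    groups = re.findall(r"\d+(?:\s\d{3})*", text)
    # split() without arguments splits on any whitespace, NBSP included.
    return [int("".join(group.split())) for group in groups]


print(parse_amounts("10\xa0000 \u2013 16\xa0000 PLN"))  # [10000, 16000]
print(parse_amounts("9 000 PLN"))                        # [9000]
```

Returning a list keeps the helper usable both for ranges and for single fixed salaries.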
Upvotes: 1