I have a CSV file of scraped data in which the prices are stored as raw HTML. I would like to keep only the number and the euro sign, and I am trying to use html2text to do this. (If you have a better alternative, please say so!) For example, one cell in the CSV looks like this:
<p class="price ">
€1,750
<span class="type">
/month
</span>
<span class="inclusive">
(ex.)
</span>
</p>
I thought about using unescape from html2text, but importing it raises an ImportError. This is the code I would use:
import pandas as pd
import html2text
from html2text import unescape
df = pd.read_csv('filename.csv')
print(df.head())
df.Price = df.Price.apply(unescape, unicode_snob=True)
but it gives me the error:
ImportError: cannot import name 'unescape' from 'html2text' (/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/html2text/__init__.py)
Upvotes: 0
Views: 233
Reputation: 1099
Assuming your prices are stored in prices.html in the format you described, you can use BeautifulSoup to solve your problem. It has functions dedicated to extracting data from HTML.
You can install it by running pip install bs4.
#!/usr/bin/env python
from bs4 import BeautifulSoup

prices = []
# Read the file as UTF-8 so the euro sign is decoded correctly
with open("prices.html", encoding="utf-8") as prices_file:
    # Create a soup based on the file
    soup = BeautifulSoup(prices_file, "html.parser")
    # Find all <p> tags whose "class" attribute contains "price"
    price_html_tags = soup.find_all("p", attrs={"class": "price"})
    # For each <p> tag, take its first text node and strip the whitespace
    # (string=True is the current name of the older text=True argument)
    prices = [tag.find(string=True).strip() for tag in price_html_tags]
print(prices)
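Since the question has the HTML inside a CSV column rather than in a single file, the same idea can be applied cell by cell. The following is only a minimal sketch: the helper name extract_price and the one-row sample frame are made up for illustration, and the column name Price is taken from the question's code. In practice the frame would come from pd.read_csv('filename.csv').

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_price(cell_html):
    """Hypothetical helper: return the first text node of the price <p> tag."""
    # Empty CSV cells arrive as NaN (a float), so skip anything that is not a string
    if not isinstance(cell_html, str):
        return None
    soup = BeautifulSoup(cell_html, "html.parser")
    tag = soup.find("p", attrs={"class": "price"})
    if tag is None:
        return None
    text = tag.find(string=True)
    return text.strip() if text is not None else None


# Stand-in for pd.read_csv("filename.csv"): one cell shaped like the question's example
sample_html = (
    '<p class="price ">\n€1,750\n<span class="type">\n/month\n</span>\n'
    '<span class="inclusive">\n(ex.)\n</span>\n</p>'
)
df = pd.DataFrame({"Price": [sample_html]})

# Replace each HTML cell with just the price text
df["Price"] = df["Price"].apply(extract_price)
print(df)
```

Cells that are empty or that contain no matching <p> tag come back as None here, so they can be filtered or inspected afterwards instead of raising an error partway through the column.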
Upvotes: 1