user11642562
Delete HTML code and leave only the content? (html2text error)

I have a CSV file of scraped data in which the prices are stored as HTML. I would like to keep only the number and the euro sign, and I am trying to use html2text to do this. (If you have a better alternative, please say so!) For example, one cell in the CSV looks like this:

<p class="price ">

        €1,750 

        <span class="type">
                            /month
                    </span>
        <span class="inclusive">
                    (ex.)
                </span>

    </p>

I thought about using unescape from html2text, but I get an import error for unescape. This is the code I would use:

import pandas as pd
import html2text
from html2text import unescape 

df = pd.read_csv('filename.csv')

print(df.head())

df.Price = df.Price.apply(unescape, unicode_snob=True)

but it gives me the error:

ImportError: cannot import name 'unescape' from 'html2text' (/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/html2text/__init__.py)

Upvotes: 0

Views: 233

Answers (1)

Pierre

Reputation: 1099

Assuming your prices are stored in a file named prices.html in the format you described, you can use BeautifulSoup to solve your problem. It is a library dedicated to extracting data from HTML.

You can install it by running pip install beautifulsoup4 (pip install bs4 also works, since that package simply pulls in beautifulsoup4).

#!/usr/bin/env python

from bs4 import BeautifulSoup

prices = []
# Read the file as UTF-8 so the euro sign is decoded correctly
with open("prices.html", encoding="utf-8") as prices_file:
    # Create a soup based on the file
    soup = BeautifulSoup(prices_file, "html.parser")

    # Find all <p> tags whose "class" attribute includes "price"
    # (the trailing space in class="price " does not affect the match)
    price_html_tags = soup.find_all("p", class_="price")

    # For each <p> tag, take its first text node (the price itself,
    # before the nested <span> tags) and strip the surrounding whitespace
    prices = [tag.find(string=True).strip() for tag in price_html_tags]

print(prices)
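
Since the question starts from a CSV column rather than a standalone HTML file, the same idea can be wrapped in a small helper that handles one cell at a time. This is only a sketch: the function name extract_price is made up for illustration, and the sample cell is copied from the question.

```python
from bs4 import BeautifulSoup


def extract_price(cell_html):
    """Return the price text of one HTML cell, or None if no price tag is found.

    Hypothetical helper, not part of BeautifulSoup or pandas.
    """
    # str() guards against non-string cells such as NaN
    soup = BeautifulSoup(str(cell_html), "html.parser")
    tag = soup.find("p", class_="price")
    if tag is None:
        return None
    # The first text node is the price; the nested <span> texts come later
    text = tag.find(string=True)
    return text.strip() if text else None


# Sample cell taken from the question
sample_cell = """<p class="price ">

        €1,750 

        <span class="type">
                            /month
                    </span>
        <span class="inclusive">
                    (ex.)
                </span>

    </p>"""

print(extract_price(sample_cell))  # prints €1,750
```

With the DataFrame from the question, this could presumably be applied as df["Price"] = df["Price"].apply(extract_price), assuming the column is named Price as in the question's code.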

Upvotes: 1
