Reputation: 78
I've searched through a few threads and either I'm getting an error or I'm not getting the expected result when trying either the pandas replace method or the regex re.sub method(python 3.x).
I'm pulling in html data and due to the odd tagging nature I can't extract the data I need. For example each row looks like below
<div class="song">
<p><span class="small">07/06 4:21 AM</span> - <span class="small">Title:</span> Crazy For You - <span class="small">Artist:</span> Scars On 45
<span class="small"><a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Ddigital-music&field-keywords=Crazy For You+Scars On 45&tag=wt897fmrafomu-20" target="_blank">Buy Song</a> </span>
I'm using the code below to pull in html data and I want to remove a large chunk of the text to pull out the time/date (ex: 07/06 4:21 AM), artist (ex: Scars on 45), and song (ex: Crazy For You). I'm encountering either errors or the code not working as expected when I try either of the last three lines.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np
html = urlopen("http://wtmd.org/radio/RecentSongs.html")
soup = BeautifulSoup(html.read())
Songs=soup.select('div.song')
#data=np.asarray(Songs)
df = pd.DataFrame({'col1':Songs})
df['col1']=df['col1'].apply(str)
#errors below
df['col1']=df['col1'].replace("<div class=\"song\">",",") #this does not get replaced
df['col1']=re.sub("<div class=\"song\">",",",df['col1']) #this throws TypeError: expected string or buffer
df['col1']=re.sub("<(.*?)>",",",df['col1']) #this throws TypeError: expected string or buffer
I've tried these methods both with and without using the
.apply(str)
method, but neither seem to work.
I've tried a few different ways of escaping the quotes in the replace function, (ie using """ and ' to define the find part). Any ideas or insights are greatly appreciated!
Upvotes: 0
Views: 74
Reputation: 2127
This will do the substitution you're looking for:
df['col1'].str.replace('<div class="song">', '')
Your .replace()
method is trying to find an exact string match to replace rather than a subsitution based on existing parameters: Pandas.DataFrame.replace(). The .str.replace()
uses the built in string method in pandas which achieves what you were looking for: Pandas - Working with Text Data.
This would also work:
df['col1'].replace('<div class="song">', '',regex=True)
What HappyLeapSecond has provided though is the better approach to scraping the data.
Upvotes: 0
Reputation: 880419
Don't try to parse the HTML with regex. Extract the data using BeautifulSoup, and then pump the data into a DataFrame:
from bs4 import BeautifulSoup
import pandas as pd
content = '''
<div class="song">
<p><span class="small">07/06 4:21 AM</span> - <span class="small">Title:</span> Crazy For You - <span class="small">Artist:</span> Scars On 45
<span class="small"><a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Ddigital-music&field-keywords=Crazy For You+Scars On 45&tag=wt897fmrafomu-20" target="_blank">Buy Song</a> </span>
'''
soup = BeautifulSoup(content)
data = list()
for p in soup.select('div.song p'):
row = list(p.stripped_strings)
date = row[0]
title = row[3].strip('- ')
artist = row[5]
data.append([date, title, artist])
df = pd.DataFrame(data, columns=['date', 'title', 'artist'])
df['date'] = pd.to_datetime(df['date'])
print(df)
yields
date title artist
0 2015-07-06 04:21:00 Crazy For You Scars On 45
Upvotes: 1