How do I resolve an error while removing parts of a string in a pandas DataFrame?

Question

I've searched through a few threads, but I either get an error or don't get the expected result when trying the pandas replace method or the regex re.sub function (Python 3.x).

I'm pulling in HTML data, and because of the odd tagging I can't extract the data I need. For example, each row looks like this:


07/06 4:21 AM     -     Title:   Crazy For You   -     Artist:  Scars On 45    
 Buy Song

I'm using the code below to pull in the HTML data. I want to strip out most of the text and keep only the date/time (e.g., 07/06 4:21 AM), the artist (e.g., Scars On 45), and the song (e.g., Crazy For You). Each of the last three lines either raises an error or does not behave as expected.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

html = urlopen("http://wtmd.org/radio/RecentSongs.html")
soup = BeautifulSoup(html.read())
Songs=soup.select('div.song')
#data=np.asarray(Songs)

df = pd.DataFrame({'col1':Songs})
df['col1']=df['col1'].apply(str)

#errors below

df['col1']=df['col1'].replace("",",") #this does not get replaced

df['col1']=re.sub("",",",df['col1'])  #this throws TypeError: expected string or buffer

df['col1']=re.sub("<(.*?)>",",",df['col1'])  #this throws TypeError: expected string or buffer

I've tried these methods both with and without the .apply(str) call, but neither approach seems to work.

I've tried a few different ways of escaping the quotes in the replace function (i.e., using """ and ' to delimit the search string). Any ideas or insights are greatly appreciated!

unutbu · Accepted Answer

Don't try to parse the HTML with regex. Extract the data using BeautifulSoup, and then pump the data into a DataFrame:

from bs4 import BeautifulSoup
import pandas as pd

content = '''

07/06 4:21 AM     -     Title:   Crazy For You   -     Artist:  Scars On 45    
 Buy Song 
'''

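# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(content, 'html.parser')

data = []
for p in soup.select('div.song p'):
    # stripped_strings yields each text fragment with surrounding whitespace removed
    row = list(p.stripped_strings)
    date = row[0]
    title = row[3].strip('- ')
    artist = row[5]
    data.append([date, title, artist])
df = pd.DataFrame(data, columns=['date', 'title', 'artist'])
# The strings carry no year, so the year in the result is a parser default
df['date'] = pd.to_datetime(df['date'])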
print(df)

yields

                 date          title       artist
0 2015-07-06 04:21:00  Crazy For You  Scars On 45
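
As a side note on the errors in the question: re.sub operates on a single string, so handing it a whole column raises the TypeError, and Series.replace without regex=True only swaps cells whose entire value equals the search text. Below is a minimal sketch of the element-wise alternative, assuming pandas is installed. The tags in the sample row are made up for illustration (only the visible text comes from the question), and extracting with BeautifulSoup as shown above is likely still the more robust route for real pages.

```python
import re
import pandas as pd

# Hypothetical row: the tags are an assumption, only the visible text is from the question.
s = pd.Series([
    "<p>07/06 4:21 AM <b>-</b> Title: <b>Crazy For You</b> "
    "<b>-</b> Artist: <b>Scars On 45</b></p>"
])

# re.sub expects one str, so passing a Series raises TypeError.
error_message = None
try:
    re.sub(r"<.*?>", ",", s)
except TypeError as exc:
    error_message = str(exc)

# Without regex=True, Series.replace only matches whole cell values, so nothing changes.
unchanged = s.replace("<b>", ",")

# The .str accessor applies the pattern to every element of the column.
stripped = s.str.replace(r"<.*?>", "", regex=True)

# Named groups in str.extract become the columns of the resulting DataFrame.
pattern = (
    r"(?P<date>\d{2}/\d{2} \d{1,2}:\d{2} [AP]M)\s*-\s*"
    r"Title:\s*(?P<title>.*?)\s*-\s*"
    r"Artist:\s*(?P<artist>.*?)\s*$"
)
fields = stripped.str.extract(pattern)
print(fields)
```

The pattern is tied to the "date - Title: ... - Artist: ..." layout seen in the example row, so rows that deviate from it would come back as NaN rather than raising.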
