Reputation: 125
I am having some trouble creating a pandas df from lists I generate while scraping data from the web. Here I am using beautifulsoup to pull a few pieces of information about local farms from localharvest.org (farm name, city, and description). I am able to scrape the data effectively, creating a list of objects on each pass. The trouble I'm having is outputting these lists into a tabular df.
My complete code is as follows:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
name = item.contents[1].text
fname.append(name)
city = item.contents[3].text
fcity.append(city)
desc = item.find_all("div", {'class': 'short-desc'})[0].text
fdesc.append(desc)
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
Interestingly, the print(df)
function shows that all three lists have been passed to the dataframe. But the resultant .CSV output contains only a single column of values (fcity) with the fname and fdesc column labels present. Interstingly, If I do something crazy like try to force tab delineated output with df.to_csv('farmdata.csv', sep='\t')
, I get a single column with jumbled output, but it appears to at least be passing the other elements of the dataframe.
Thanks in advance for any input.
Upvotes: 0
Views: 331
Reputation: 29711
It works for me:
# Taking a few slices of each substring of a given string after stripping off whitespaces
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df
fcity fdesc fname
0 South Portland, ME Gromaine Farm is pro Gromaine Farm
1 Newport, ME We are a diversified Parker Family Farm
2 Unity, ME The Buckle Farm is a The Buckle Farm
3 Kenduskeag, ME Visit wiseacresfarm. Wise Acres Farm
4 Winterport, ME Winter Cove Farm is Winter Cove Farm
5 Albion, ME MISTY BROOK FARM off Misty Brook Farm
6 Dover-Foxcroft, ME We want you to becom Ripley Farm
7 Madison, ME Hide and Go Peep Far Hide and Go Peep Far
8 Etna, ME Fail Better Farm is Fail Better Farm
9 Pittsfield, ME We are a family farm Snakeroot Organic Fa
Maybe you had a lot of empty spaces which was misinterpreted by the default delimiter(,) and hence picked up fcity
column as it contained(,) in it which led to the ordering getting affected.
Upvotes: 1
Reputation: 3003
Try stripping out the newline and space characters:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
name = item.contents[1].text.split()
fname.append(' '.join(name))
city = item.contents[3].text.split()
fcity.append(' '.join(city))
desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
fdesc.append(' '.join(desc))
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
Upvotes: 1
Reputation: 151
Consider, instead of using lists of the information for each farm entity that you scrape, to use a list of dictionaries, or a dict of dicts. eg:
[{name:farm1, city: San Jose... etc},
{name: farm2, city: Oakland...etc}]
Now you can call Pandas.DataFrame.from_dict()
on the above defined list of dicts.
Pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
An answer that might describe this solution in more detail: Convert Python dict into a dataframe
Upvotes: 1