JeremyD
JeremyD

Reputation: 125

Trouble creating pandas dataframe from lists

I am having some trouble creating a pandas df from lists I generate while scraping data from the web. Here I am using beautifulsoup to pull a few pieces of information about local farms from localharvest.org (farm name, city, and description). I am able to scrape the data effectively, creating a list of objects on each pass. The trouble I'm having is outputting these lists into a tabular df.

My complete code is as follows:

import requests
from bs4 import BeautifulSoup
import pandas

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)


data = soup.find_all("div", {'class': 'membercell'})

fname = []
fcity = []
fdesc = []

for item in data:
    name = item.contents[1].text
    fname.append(name)
    city = item.contents[3].text
    fcity.append(city)
    desc = item.find_all("div", {'class': 'short-desc'})[0].text
    fdesc.append(desc)

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})

print (df)

df.to_csv('farmdata.csv')

Interestingly, the print(df) function shows that all three lists have been passed to the dataframe. But the resultant .CSV output contains only a single column of values (fcity) with the fname and fdesc column labels present. Interstingly, If I do something crazy like try to force tab delineated output with df.to_csv('farmdata.csv', sep='\t'), I get a single column with jumbled output, but it appears to at least be passing the other elements of the dataframe.

Thanks in advance for any input.

Upvotes: 0

Views: 331

Answers (3)

Nickil Maveli
Nickil Maveli

Reputation: 29711

It works for me:

# Taking a few slices of each substring of a given string after stripping off whitespaces
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df

                fcity                 fdesc                 fname
0  South Portland, ME  Gromaine Farm is pro         Gromaine Farm
1         Newport, ME  We are a diversified    Parker Family Farm
2           Unity, ME  The Buckle Farm is a       The Buckle Farm
3      Kenduskeag, ME  Visit wiseacresfarm.       Wise Acres Farm
4      Winterport, ME  Winter Cove Farm is       Winter Cove Farm
5          Albion, ME  MISTY BROOK FARM off      Misty Brook Farm
6  Dover-Foxcroft, ME  We want you to becom           Ripley Farm
7         Madison, ME  Hide and Go Peep Far  Hide and Go Peep Far
8            Etna, ME  Fail Better Farm is       Fail Better Farm
9      Pittsfield, ME  We are a family farm  Snakeroot Organic Fa

Maybe you had a lot of empty spaces which was misinterpreted by the default delimiter(,) and hence picked up fcity column as it contained(,) in it which led to the ordering getting affected.

Upvotes: 1

zarak
zarak

Reputation: 3003

Try stripping out the newline and space characters:

import requests
from bs4 import BeautifulSoup
import pandas

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)


data = soup.find_all("div", {'class': 'membercell'})

fname = []
fcity = []
fdesc = []

for item in data:
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})

print (df)

df.to_csv('farmdata.csv')

Upvotes: 1

Mansweet
Mansweet

Reputation: 151

Consider, instead of using lists of the information for each farm entity that you scrape, to use a list of dictionaries, or a dict of dicts. eg:

[{name:farm1, city: San Jose... etc},
{name: farm2, city: Oakland...etc}]

Now you can call Pandas.DataFrame.from_dict() on the above defined list of dicts.

Pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

An answer that might describe this solution in more detail: Convert Python dict into a dataframe

Upvotes: 1

Related Questions