Ben Sorensen

Reputation: 51

BeautifulSoup organize data into dataframe table

I have been working with BeautifulSoup to organize some data that I am pulling from a website (HTML). I have been able to boil the data down, but I am stuck on how to:

  1. eliminate unneeded info
  2. organize the remaining data so it can be put into a pandas DataFrame

Here is the code I am working with:

import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests

headers = requests.utils.default_headers()
headers.update({
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})

url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'

page = requests.get(url,headers = headers)

soup = bs(page.text)

names = soup.body.findAll('tr')
function_names = re.findall(r'th class="\w+', str(names))
function_names = [item[10:] for item in function_names]

description = soup.body.findAll('td')
#description = re.findall('td class="\w+', str(description))

data = pd.DataFrame({'Title':function_names,'Info':description})

The error I get is that the array lengths don't match, which I know to be true. But when I uncomment the second description line, it removes the numbers I want, and even then the table isn't organized properly.

What I would like the output to look like is:

(headers)  title: location | studio | 1 BR | 2 BR | 3 BR
(new line) data :  Lehi, UT| $1,335 |$1,309|$1,454|$1,580    

That is really all that I need but I can't get BS or Pandas to do it properly.

Any help would be greatly appreciated!

Upvotes: 0

Views: 1603

Answers (1)

Martin Evans

Reputation: 46759

Try the following approach. It first extracts all of the data in the table and then transposes it (columns swapped with rows):

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text, 'lxml')
table = soup.find("table", class_="rentTrendGrid")
rows = []

for tr in table.find_all('tr'):
    rows.append([td.text for td in tr.find_all(['th', 'td'])])

#header_row = rows[0]
rows = list(zip(*rows[1:])) # transpose the table
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)

Giving you the following kind of output:

   Studio    1 BR    2 BR    3 BR
0       0     729   1,041   1,333
1  $1,335  $1,247  $1,464  $1,738
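To see what the transpose step does without hitting the website, here is a minimal offline sketch. The cell values are the ones from the output above, but the header labels ("Beds", "Avg Sq Ft", "Avg Rent") are made up for illustration, since the real table's header text isn't shown here. It also shows one possible way to turn the price strings into integers, in case you want to do arithmetic on them:

```python
import pandas as pd

# Hypothetical stand-in for the rows collected by the scraping loop:
# one list per <tr>. The header labels are assumed, not taken from the site.
rows = [
    ["Beds", "Avg Sq Ft", "Avg Rent"],
    ["Studio", "0", "$1,335"],
    ["1 BR", "729", "$1,247"],
    ["2 BR", "1,041", "$1,464"],
    ["3 BR", "1,333", "$1,738"],
]

header_row = rows[0]

# zip(*...) pairs up the i-th cell of every row, so rows become columns.
transposed = list(zip(*rows[1:]))

# The first transposed tuple holds the bedroom labels; use it as column names.
df = pd.DataFrame(transposed[1:], columns=transposed[0])

# Optional: label each row with the (assumed) header text instead of 0, 1.
df.index = header_row[1:]

# Strip "$" and "," so the rent values can be treated as integers.
prices = df.loc["Avg Rent"].str.replace(r"[$,]", "", regex=True).astype(int)

print(df)
print(prices)
```

If you only need the single row of prices, `prices` (or `df.loc[["Avg Rent"]]` for a one-row DataFrame) is probably closest to the layout you described.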

Upvotes: 1
