user10939484
user10939484

Reputation: 177

BeautifulSoup table to dataframe

Can't seem to copy over the values from the table into a dataframe correctly. If you run raw_data, it outputs a list of all the values. Any idea how to make it structured?

pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text

soup = BeautifulSoup(pop_source, 'html5lib')    
source = soup.find_all('td',class_ = 'report_data')


pop = pd.DataFrame(columns=['Zip Code','Population'])


row_data = [data.text for data in source]

temp_df = pd.DataFrame([row_data], columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])   

temp_df = temp_df[['Zip Code','Population']]   
pop = pop.append(temp_df).reset_index(drop=True)    
pop

Upvotes: 2

Views: 8822

Answers (1)

chitown88
chitown88

Reputation: 28565

Few things with the code:

  1. The code you have above actually doesn't loop through anything. In fact if you try to run this you should be getting tons of errors. There's no way to see that output of just looping same values with what you provided.

  2. The indentation is off,

  3. you reference a variable source which hasn't been defined. You reference a variable cols which is also not defined.

  4. You initialize a dataframe pop with 2 columns, and try to append a dataframe with 7 columns.

All kinds of issues happening here.

Have you considered just using straight up Pandas to form the dataframe? You could still use BeautifulSoup, but pandas can do the work for you, and it seems like you need to get a little more practice of iterating through elements with BeautifulSoup (you actually never even used .find or .find_all to locate the tags associated with the table you want.)

If you NEED the BeautifulSoup way to do it, just let me know and I can offer that way as well, but honestly, it's a lot more work than just using the .read_html() with pandas here.

import pandas as pd

url = 'http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm'

tables = pd.read_html(url)

df = tables[11][1:]
df.columns = tables[11].iloc[0]

Output:

print (df)
0     # Zip Code      ...      People / Sq. Mile National Rank
1    1.    78705      ...               11008.66          #519
2    2.    78751      ...                5822.28        #1,374
3    3.    78752      ...                5435.92        #1,528
4    4.    78741      ...                5346.47        #1,562
5    5.    78723      ...                5175.95        #1,640
6    6.    78704      ...                5001.96        #1,713
7    7.    78758      ...                4954.80        #1,730
8    8.    78702      ...                4501.98        #2,015
9    9.    78757      ...                4380.92        #2,087
10  10.    78756      ...                4298.80        #2,139
11  11.    78745      ...                4063.22        #2,295
12  12.    78753      ...                3973.96        #2,350
13  13.    78703      ...                3491.54        #2,753
14  14.    78731      ...                3031.63        #3,167
15  15.    78759      ...                2998.68        #3,199
16  16.    78727      ...                2856.67        #3,371
17  17.    78749      ...                2795.02        #3,438
18  18.    78728      ...                2640.31        #3,614
19  19.    78721      ...                2568.43        #3,690
20  20.    78722      ...                2567.53        #3,692
21  21.    78729      ...                2366.94        #3,944
22  22.    78701      ...                2326.65        #3,995
23  23.    78748      ...                1961.73        #4,504
24  24.    78750      ...                1731.01        #4,870
25  25.    78744      ...                1464.78        #5,311
26  26.    78746      ...                1152.39        #5,971
27  27.    78717      ...                1081.05        #6,119
28  28.    78739      ...                 768.80        #7,006
29  29.    78734      ...                 698.96        #7,267
30  30.    78724      ...                 555.85        #7,870
31  31.    78726      ...                 543.24        #7,940
32  32.    78733      ...                 510.92        #8,116
33  33.    78754      ...                 484.73        #8,255
34  34.    78735      ...                 474.14        #8,318
35  35.    78732      ...                 416.13        #8,702
36  36.    78742      ...                 321.40        #9,467
37  37.    78730      ...                 257.86       #10,189
38  38.    78738      ...                 213.29       #10,829
39  39.    78747      ...                 194.02       #11,173
40  40.    78736      ...                 187.88       #11,301
41  41.    78737      ...                 143.90       #12,372
42  42.    78725      ...                 116.87       #13,282
43  43.    78719      ...                  93.88       #14,377

[43 rows x 7 columns]

with BeautifulSoup

This is not the ideal way to do this. While this site is pretty straight forward with the table, tr, td tags. What you probably would want is first grab all the rows, then iterate through each row to grab the <td> tags. But you grabbed all the <td> tags in one big swoop. Which is still ok, but we would need to break that up into each row.

All I did was then break that up into groups of 7, since that's how many columns there are. Note, I'm making a huge assumption all the data is there. If it's not, then the table would be off or rows, columns would be shifted.

import requests
import pandas as pd
import bs4


# Create a function called "chunks" with two arguments, l and n:
def chunks(l, n):
    # For item i in a range that is a length of l,
    for i in range(0, len(l), n):
        # Create an index range for l of n items:
        yield l[i:i+n]



pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text

soup = bs4.BeautifulSoup(pop_source, 'html5lib')


source = soup.find_all('td',class_ = 'report_data')

pop = pd.DataFrame(columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])

row_data = [data.text for data in source]

rows_data = list(chunks(row_data, 7))

for ele in rows_data:
    temp_df = pd.DataFrame([ele], columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])
    pop = pop.append(temp_df).reset_index(drop=True)

Upvotes: 4

Related Questions