Reputation: 177
I can't seem to copy the values from the table into a dataframe correctly. If you run row_data, it outputs a flat list of all the values. Any idea how to make it structured?
import requests
import pandas as pd
from bs4 import BeautifulSoup

pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text
soup = BeautifulSoup(pop_source, 'html5lib')
source = soup.find_all('td',class_ = 'report_data')
pop = pd.DataFrame(columns=['Zip Code','Population'])
row_data = [data.text for data in source]
temp_df = pd.DataFrame([row_data], columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])
temp_df = temp_df[['Zip Code','Population']]
pop = pop.append(temp_df).reset_index(drop=True)
pop
Upvotes: 2
Views: 8822
Reputation: 28565
A few things with the code:

- The snippet above doesn't actually loop through anything, so there's no way to produce the output you describe from what you posted.
- row_data ends up as one flat list of every report_data cell on the page, so pd.DataFrame([row_data], ...) with 7 column names raises a ValueError: a single row of several hundred values can't be poured into 7 columns.
- You initialize the dataframe pop with 2 columns, then build temp_df with 7 columns and have to slice it back down before appending.

All kinds of issues happening here.
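To see the shape problem in isolation, here's a toy example (made-up values, not from the page):

import pandas as pd

# one row of 6 values cannot fill a 2-column frame:
pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'b'])
# ValueError: 2 columns passed, passed data had 6 columns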
Have you considered just using straight-up pandas to build the dataframe? You could still use BeautifulSoup, but pandas can do the work for you, and it seems like you could use a little more practice iterating through elements with BeautifulSoup (you grab every report_data cell on the page in one pass instead of using .find or .find_all to locate the table you want and walking its rows). If you NEED the BeautifulSoup way, just let me know and I can offer that as well, but honestly it's a lot more work than just using .read_html() with pandas here.
import pandas as pd

url = 'http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm'
tables = pd.read_html(url)  # parses every <table> on the page into a list of dataframes

df = tables[11][1:]              # table 11 holds the report; drop its first row...
df.columns = tables[11].iloc[0]  # ...and promote that row to the column headers
Output:
print (df)
0 # Zip Code ... People / Sq. Mile National Rank
1 1. 78705 ... 11008.66 #519
2 2. 78751 ... 5822.28 #1,374
3 3. 78752 ... 5435.92 #1,528
4 4. 78741 ... 5346.47 #1,562
5 5. 78723 ... 5175.95 #1,640
6 6. 78704 ... 5001.96 #1,713
7 7. 78758 ... 4954.80 #1,730
8 8. 78702 ... 4501.98 #2,015
9 9. 78757 ... 4380.92 #2,087
10 10. 78756 ... 4298.80 #2,139
11 11. 78745 ... 4063.22 #2,295
12 12. 78753 ... 3973.96 #2,350
13 13. 78703 ... 3491.54 #2,753
14 14. 78731 ... 3031.63 #3,167
15 15. 78759 ... 2998.68 #3,199
16 16. 78727 ... 2856.67 #3,371
17 17. 78749 ... 2795.02 #3,438
18 18. 78728 ... 2640.31 #3,614
19 19. 78721 ... 2568.43 #3,690
20 20. 78722 ... 2567.53 #3,692
21 21. 78729 ... 2366.94 #3,944
22 22. 78701 ... 2326.65 #3,995
23 23. 78748 ... 1961.73 #4,504
24 24. 78750 ... 1731.01 #4,870
25 25. 78744 ... 1464.78 #5,311
26 26. 78746 ... 1152.39 #5,971
27 27. 78717 ... 1081.05 #6,119
28 28. 78739 ... 768.80 #7,006
29 29. 78734 ... 698.96 #7,267
30 30. 78724 ... 555.85 #7,870
31 31. 78726 ... 543.24 #7,940
32 32. 78733 ... 510.92 #8,116
33 33. 78754 ... 484.73 #8,255
34 34. 78735 ... 474.14 #8,318
35 35. 78732 ... 416.13 #8,702
36 36. 78742 ... 321.40 #9,467
37 37. 78730 ... 257.86 #10,189
38 38. 78738 ... 213.29 #10,829
39 39. 78747 ... 194.02 #11,173
40 40. 78736 ... 187.88 #11,301
41 41. 78737 ... 143.90 #12,372
42 42. 78725 ... 116.87 #13,282
43 43. 78719 ... 93.88 #14,377
[43 rows x 7 columns]
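If you want numeric columns afterwards, a minimal cleanup sketch (column names taken from the printed header; the leading '#' and comma separators are assumptions based on the values shown above):

# make the printed columns numeric
df['People / Sq. Mile'] = pd.to_numeric(df['People / Sq. Mile'])
df['National Rank'] = pd.to_numeric(df['National Rank'].str.lstrip('#').str.replace(',', ''))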
with BeautifulSoup
This is not the ideal way to do it, though the site is pretty straightforward, with plain table, tr, and td tags. What you would probably want is to first grab all the rows, then iterate through each row to grab its <td> tags (see the sketch after the code below). But you grabbed every <td> tag in one big swoop, which is still OK; we just need to break that flat list back up into rows.

All I did was break it into groups of 7, since that's how many columns the table has. Note, I'm making a huge assumption that every cell is present; if any are missing, the rows and columns would be shifted.
import requests
import pandas as pd
import bs4
# Create a function called "chunks" with two arguments, l and n:
def chunks(l, n):
    # For item i in a range that is a length of l,
    for i in range(0, len(l), n):
        # Create an index range for l of n items:
        yield l[i:i+n]
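# Quick sanity check of what chunks produces (made-up values):
#   list(chunks([1, 2, 3, 4, 5, 6, 7], 3))  ->  [[1, 2, 3], [4, 5, 6], [7]]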
pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text
soup = bs4.BeautifulSoup(pop_source, 'html5lib')
source = soup.find_all('td', class_='report_data')
cols = ['#', 'Zip Code', 'Location', 'City', 'Population', 'People/Sq.Mile', 'National Rank']

row_data = [data.text for data in source]  # one flat list of every cell
rows_data = list(chunks(row_data, 7))      # regroup into rows of 7 cells

# DataFrame.append was removed in pandas 2.0, so build one frame per row
# and concatenate them once at the end
frames = []
for ele in rows_data:
    frames.append(pd.DataFrame([ele], columns=cols))
pop = pd.concat(frames, ignore_index=True)
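For completeness, the row-by-row version described above might look something like this (a sketch, assuming every data row carries exactly 7 report_data cells):

data = []
for tr in soup.find_all('tr'):
    cells = tr.find_all('td', class_='report_data')
    if len(cells) == 7:  # keep only complete data rows, skipping headers/spacers
        data.append([td.text.strip() for td in cells])

pop = pd.DataFrame(data, columns=['#', 'Zip Code', 'Location', 'City',
                                  'Population', 'People/Sq.Mile', 'National Rank'])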
Upvotes: 4