Reputation: 507
Scraping a column from Wikipedia with BeautifulSoup returns only the last row, but I want all of the rows in a list:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
    html = urlopen(site)
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {'class': 'wikitable sortable'})
    for row in table.find_all("tr")[1:]:
        col = row.find_all("td")
        if len(col) > 0:
            com = str(col[1].string.strip("\n"))
            list(com)
    com
Out: 'ZTS'
So it only shows the last row. I was expecting a list with each row's value as a string, so that I can assign the list to a new variable, for example:
"Rice", "Buffalo milk", "Cow milk", "Wheat"
Can anyone help me?
Upvotes: 1
Views: 1359
Reputation: 2168
Your method does not work because nothing is ever "added" to com: each pass through the loop overwrites it, and the result of list(com) is discarded.
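To see the difference without touching the network, here is a minimal sketch. The cells list is a hypothetical stand-in for the scraped values, not data fetched from the page, and the variable names are only for illustration.

```python
# Hypothetical stand-in for the scraped cell values (not fetched from Wikipedia).
cells = ["Rice", "Buffalo milk", "Cow milk", "Wheat"]

# Rebinding the same name on every pass keeps only the final value.
for cell in cells:
    last_only = cell

# Appending to a list created before the loop keeps every value.
collected = []
for cell in cells:
    collected.append(cell)

# list() on a string splits it into characters; it does not gather rows.
chars = list("ZTS")

print(last_only)
print(collected)
print(chars)
```

The second loop is the pattern you want: create the list once, then append inside the loop.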
One way to do what you desire is:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
    html = urlopen(site)
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {'class': 'wikitable sortable'})

    com = []  # collects one value per table row
    for row in table.find_all("tr")[1:]:  # skip the header row
        col = row.find_all("td")
        if len(col) > 0:
            temp = col[1].contents[0]  # first child of the second cell
            try:
                # If the first child is a tag (e.g. a link), take its text.
                to_append = temp.contents[0]
            except (AttributeError, IndexError):
                # Plain text has no .contents, so keep it as it is.
                to_append = temp
            com.append(to_append)
    print(com)
This will give you what you require.
Explanation
col[1].contents[0] gives the first child of the tag. .contents gives you a list of the tag's children. Here there is a single child, so we take index 0.
In some cases, the content inside the <td> tag is an <a href> tag, so I apply another .contents[0] to get the text.
In other cases it is not a link. For that I used a try/except statement: if the child extracted is plain text, it has no .contents of its own, so we get an exception and keep the child itself.
See the official documentation for details
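The two cases can be checked on a small inline snippet. This is only a sketch: the HTML below is a hypothetical two-row table, not the real Wikipedia markup, and it uses an isinstance check instead of try/except. It also shows get_text(strip=True) as an alternative that handles both cases and strips surrounding whitespace, which should help if the real cells carry a trailing newline.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical two-row table standing in for the Wikipedia markup.
html = """
<table>
  <tr><td>1</td><td><a href="/wiki/Rice">Rice</a></td></tr>
  <tr><td>2</td><td>Buffalo milk</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
for row in soup.find_all("tr"):
    cell = row.find_all("td")[1]      # second column
    first = cell.contents[0]          # first child of the cell
    if isinstance(first, NavigableString):
        # Plain text: the child is already the string we want.
        names.append(str(first))
    else:
        # A nested tag such as <a>: go one level deeper for its text.
        names.append(str(first.contents[0]))

# Alternative: get_text() collects the text of a cell whether or not it is linked.
alt_names = [row.find_all("td")[1].get_text(strip=True)
             for row in soup.find_all("tr")]

print(names)
print(alt_names)
```

Both lists should hold the same values here; get_text() is usually the shorter option when only the visible text is needed.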
Upvotes: 2