malina

Reputation: 881

Python Data Scraper

I wrote the following code:

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    table = soup.find_all("table", class_="responsive airport-history-summary-table")
    tr = soup.find_all("tr")
    td = soup.find_all("td")
    print table
            

if __name__ == "__main__":
    main()

When I print the table, I get all the HTML (td, tr, span, etc.) as well. How can I print only the content of the table (tr, td) without the HTML?
Thanks!

Upvotes: 0

Views: 141

Answers (1)

Milano

Reputation: 18745

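To get the text content of an element, call its .getText() method (the same method as .get_text()). Since find_all returns a list of elements, you have to pick one of them first (for example td[0]) and call the method on that element, not on the list.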
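A minimal sketch of that idea, using a small made-up HTML string in place of the real page so it runs without a network connection. The markup and variable names are assumptions for illustration only, and print is written in its function form rather than as the Python 2 statement used in the question:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (an assumption for this example)
html = "<table><tr><td><span>Max Temp</span></td><td>45</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of elements
cells = soup.find_all("td")

# Pick one element, then ask it for its text; the nested <span> tag is dropped
first_cell_text = cells[0].get_text()

# Or collect the text of every cell
all_cell_texts = [cell.get_text() for cell in cells]

print(first_cell_text)
print(all_cell_texts)
```

Here first_cell_text holds only the text of the first cell, with no tags around it.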

Or, for example, you can loop over the rows:

for tr in soup.find_all("tr"):
    print '>>>> NEW row <<<<'
    print '|'.join([x.getText() for x in tr.find_all('td')])

For each row, the loop above prints a marker line and then the text of the row's cells on one line, separated by |.

Note that your code finds every td and every tr in the whole document, but you probably want only those inside the table.

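If you want to look for elements inside the table only, search from the table element:

Use table.find('tr') instead of soup.find('tr'), so that BeautifulSoup looks for tr elements inside that table instead of in the whole HTML. Here table must be a single element (the result of soup.find, or one item of the list returned by soup.find_all), because the list itself has no find method.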
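The difference between searching the whole document and searching one table could be sketched as follows. The two-table HTML string and the class names "summary" and "other" are invented for this example and are not taken from the real page:

```python
from bs4 import BeautifulSoup

# Made-up document with two tables (an assumption for this example)
html = """
<table class="summary"><tr><td>Max</td><td>45</td></tr></table>
<table class="other"><tr><td>ignored</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.find returns a single element, so it can be searched further
table = soup.find("table", class_="summary")

rows_in_table = table.find_all("tr")  # rows of the chosen table only
rows_in_page = soup.find_all("tr")    # rows of every table in the document

# Text of each cell, grouped by row, for the chosen table
table_rows = [[td.get_text() for td in tr.find_all("td")] for tr in rows_in_table]

print(len(rows_in_table), len(rows_in_page))
print(table_rows)
```

Searching from table leaves out the row of the second table, while searching from soup includes it.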

Your code, modified (following your comment that the page contains more than one table):

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        print '>>>>>>> NEW TABLE <<<<<<<<<'

        trs = table.find_all("tr")

        for tr in trs:
            # for each row of current table, write it using | between cells
            print '|'.join([x.get_text().replace('\n','') for x in tr.find_all('td')])



if __name__ == "__main__":
    main()

Upvotes: 2
