tmhs

Reputation: 1070

Web scraping data, using pandas read_html to convert it to a DataFrame, and merging the datasets together

I'm fairly new to Python and having some problems with my code. I would appreciate any suggestions on what I'm trying to do.

import pandas as pd
from bs4 import BeautifulSoup
import requests

def trade():
    tickers = ["AAPL", "AMZN", "INTC", "MSFT", "SNAP"]
    for ticker in tickers:
        url = "http://finance.yahoo.com/quote/%s?p=%s" % (ticker, ticker)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        table = soup.find_all('table')[0]
        df = pd.read_html(str(table))
        print("DF")
        print(df)
        df_string = str(df)
        print(df_string)
        print(type(df_string))

When I look at df, it prints like this

[                0                1
 0  Previous Close           167.37
 1            Open           169.79
 2             Bid     157.23 x 300
 3             Ask     157.29 x 500
 4     Day's Range  169.00 - 173.09
 5   52 Week Range  134.84 - 180.10
 6          Volume         51124085
 7     Avg. Volume         33251246]

What I want to do is store each ticker's table in a DataFrame and merge them together like below. Alternatively, if there is a way, I could turn the result into a dictionary so that I could access its values more easily.

                  AAPL    AMZN    INTC     MSFT    SNAP
Previous Close
Open
Bid
Ask
Day's Range
Volume
Avg. Volume

For now, I face two problems:

  1. how to turn the table into a DataFrame and/or dictionary after using pd.read_html(str(table))
  2. how to store each ticker's results separately so they can eventually be merged together? I know how to use a for loop to read them one by one, but I don't know how to store the results that way.
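One common pattern for problem 2, sketched here with hard-coded placeholder rows standing in for the scraped Yahoo tables (no network call), is to collect each ticker's parsed table into a dict of Series and concatenate at the end:

```python
import pandas as pd

tickers = ["AAPL", "AMZN"]
frames = {}
for ticker in tickers:
    # Placeholder for the scraped table; in practice this would be
    # pd.read_html(str(table))[0] for each ticker's quote page
    raw = pd.DataFrame([["Previous Close", "167.37"], ["Open", "169.79"]])
    # Column 0 holds the labels, column 1 the values
    frames[ticker] = pd.Series(raw[1].values, index=raw[0].values, name=ticker)

# Line up the Series side by side, one column per ticker
merged = pd.concat(frames.values(), axis=1)
print(merged)
```

The dict keys also let you pull out a single ticker's table later with `frames["AAPL"]`.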

Upvotes: 1

Views: 1502

Answers (1)

piRSquared

Reputation: 294278

I'd do it like this:

import pandas as pd
from bs4 import BeautifulSoup
import requests

def fetch(t):
    url = f'http://finance.yahoo.com/quote/{t}?p={t}'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[0]
    labels, data = pd.read_html(str(table))[0].values.T
    #                                       ^
    #                             What you were missing
    # pd.read_html returned a list of 1 dataframe
    return pd.Series(data, labels, name=t)

tickers = ["AAPL","AMZN", "INTC", "MSFT", "SNAP"]
df = pd.concat(map(fetch, tickers), axis=1)

                           AAPL                 AMZN           INTC           MSFT           SNAP
Previous Close           167.37              1451.05          45.38          90.81          19.56
Open                     169.79              1466.89          45.88          91.21          19.66
Bid                157.23 x 300       1,373.50 x 100    43.29 x 100    86.19 x 100    19.01 x 300
Ask                157.29 x 500       1,376.00 x 200    43.43 x 800    86.30 x 500    19.04 x 500
Day's Range     169.00 - 173.09  1,436.84 - 1,468.94  44.95 - 45.99  90.62 - 92.72  19.58 - 20.57
52 Week Range   134.84 - 180.10    833.50 - 1,498.00  33.23 - 50.85  63.62 - 96.07  11.28 - 29.44
Volume                 51129225              5650685       23536349       27823161       40733751
Avg. Volume            33251246              4689059       34001746       28064627       25027667
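If you still want a dictionary, the merged DataFrame converts directly with `to_dict()`; a minimal sketch with hard-coded values standing in for the scraped data:

```python
import pandas as pd

# Small stand-in for the merged result above
df = pd.DataFrame(
    {"AAPL": ["167.37", "169.79"], "AMZN": ["1451.05", "1466.89"]},
    index=["Previous Close", "Open"],
)

# Nested dict keyed by ticker, then by row label
quotes = df.to_dict()
print(quotes["AAPL"]["Open"])  # '169.79'
```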

Upvotes: 1
