Oscar Lundberg

Reputation: 17

Web scraping several URLs into a pandas df

I need some help appending the results of scraping several URLs to a pandas df.

Currently I'm only getting the output from one of the URLs into the df.

I left out the URLs; if you need them I will supply them.

##libs

import bs4 
import requests
import re
from time import sleep
import pandas as pd
from bs4 import BeautifulSoup as bs
 
##webscraping targets

URLs = ["URL1","URL2","URL3"]

## Get columns

column_list = []

r1 = requests.get(URLs[0])
soup1 = bs(r1.content, "html.parser")
data1 = soup1.find_all('dl', attrs= {"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})

columns = soup1.find_all('dt')
for col in columns:
  column_list.append(col.text.strip()) # strip() removes extra space from the text   

##Get values

value_list = []

for url in URLs:
  r1 = requests.get(url)
  soup1 = bs(r1.content, "html.parser")
  data1 = soup1.find_all('dl', attrs= {"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})
  
  values = soup1.find_all('dd')
  for val in values:
    value_list.append(val.text.strip())

  
df = pd.DataFrame(list(zip(column_list, value_list)))
df.transpose()

Current output, only showing the results of one URL:

(screenshot not reproduced)

Expected output: (screenshot not reproduced)

Upvotes: 0

Views: 47

Answers (1)

Arthur Pereira

Reputation: 1559

The problem here is with your zip function: zip() only pairs values up to the length of the shortest list, in this case column_list, leaving all the other values unused.
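To see the truncation concretely, here is a minimal sketch with made-up column names and values (three columns, three URLs' worth of values):

```python
# Hypothetical data: 3 column names, but 9 scraped values (3 per URL)
columns = ["name", "price", "stock"]
values = ["A", "1", "5", "B", "2", "6", "C", "3", "7"]

pairs = list(zip(columns, values))
print(pairs)       # [('name', 'A'), ('price', '1'), ('stock', '5')]
print(len(pairs))  # 3 -- the remaining six values are silently dropped
```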

If you want to append the other values to the dataframe as well, you will have to iterate over them yourself. So change the last two lines in your code to this and it should work:

result = [[i] for i in column_list]
for i, a in enumerate(value_list):
    result[i % len(column_list)].append(a)

df = pd.DataFrame(result)
df = df.transpose()
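For example, with hypothetical data standing in for the scraped text (two dt column names, three URLs each contributing two dd values), the reshaping above collects one list per column, which transposes into one row per URL:

```python
import pandas as pd

# Hypothetical scraped data: 2 columns, 3 URLs x 2 values each
column_list = ["name", "price"]
value_list = ["A", "1", "B", "2", "C", "3"]

result = [[c] for c in column_list]
for i, v in enumerate(value_list):
    # Values repeat in column order, so i % len(column_list)
    # routes each value back to its column's list
    result[i % len(column_list)].append(v)

print(result)  # [['name', 'A', 'B', 'C'], ['price', '1', '2', '3']]

df = pd.DataFrame(result).transpose()
print(df)
```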

Upvotes: 1
