How do I append the output from BeautifulSoup to a pandas DataFrame?

Question

I am relatively new to Python. I am planning to:

a) obtain a list of URLs from https://aviation-safety.net/database/, one per year from 1919 onwards (for example, https://aviation-safety.net/database/dblist.php?Year=1919).

b) obtain the data (date, type, registration, operator, fat., location, cat) for every year from 1919 to the current year.

However, I ran into some problems and am still stuck on a).

Any form of help is appreciated, thank you so much!

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
   result = requests.get(mainurl)
   soup = BeautifulSoup(result.content, 'html.parser')
   datatable = soup.find('a', href = True)


#try clause to go through the content and grab the URLs
try:
   for row in datatable:
      cols = row.find_all("|")
      if len(cols) > 1:
         links.append(x, cols = cols)
         except: pass


#place links into numpy array
links_array = np.asarray(links)
len(links_array)


#check if links are in dataframe
df = pd.DataFrame(links_array)

df.columns = ['url']
df.head(10)

I can't seem to get the URLs.

It would be great if I could get the following:

S/N  URL
1    https://aviation-safety.net/database/dblist.php?Year=1919
2    https://aviation-safety.net/database/dblist.php?Year=1920
3    https://aviation-safety.net/database/dblist.php?Year=1921

chitown88 · Accepted Answer

You're not extracting the href attribute from the tags you are pulling. You want every tag that has a link. You were on the right track, but you need find_all, because find returns only the first match. Then iterate through those tags. I chose to check each href for the substring 'Year' and, if it is present, add the full URL to the list. Your code also never imports requests and never creates the links list before appending to it.
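To see the difference between find and find_all in isolation, here is a minimal sketch that runs without any network access. The HTML snippet is hypothetical and only stands in for the real page, whose markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption)
html = """
<a href="dblist.php?Year=1919">1919</a>
<a href="dblist.php?Year=1920">1920</a>
<a href="about.php">About</a>
<a name="top">anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns a single Tag: the first <a> that has an href
first = soup.find("a", href=True)

# find_all returns every <a> that has an href
every = soup.find_all("a", href=True)

# keep only the hrefs that contain the substring 'Year'
year_links = [a["href"] for a in every if "Year" in a["href"]]

print(first["href"])
print(len(every))
print(year_links)
```

Looping over the result of find iterates over the children of that one tag, not over the links on the page, which is why the original loop never collected any URLs.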

# import packages
import pandas as pd
from bs4 import BeautifulSoup
import requests

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    # download the page and return every <a> tag that has an href attribute
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)


#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)

Output:

df.head(10)
Out[24]: 
                                                 url
0  https://aviation-safety.net/database/dblist.ph...
1  https://aviation-safety.net/database/dblist.ph...
2  https://aviation-safety.net/database/dblist.ph...
3  https://aviation-safety.net/database/dblist.ph...
4  https://aviation-safety.net/database/dblist.ph...
5  https://aviation-safety.net/database/dblist.ph...
6  https://aviation-safety.net/database/dblist.ph...
7  https://aviation-safety.net/database/dblist.ph...
8  https://aviation-safety.net/database/dblist.ph...
9  https://aviation-safety.net/database/dblist.ph...
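The answer builds each URL with mainurl + url, which works when every href is relative to the /database/ directory. If a page mixes relative, root-relative, and absolute links, urljoin from the standard library may be a safer way to build the full URL. The sketch below uses hypothetical href values to illustrate the three cases; they are assumptions, not values copied from the site.

```python
from urllib.parse import urljoin

base = "https://aviation-safety.net/database/"

# Hypothetical href values showing the forms a page might use
hrefs = [
    "dblist.php?Year=1919",                                       # relative to the current directory
    "/database/dblist.php?Year=1920",                             # root-relative
    "https://aviation-safety.net/database/dblist.php?Year=1921",  # already absolute
]

# urljoin resolves each href against the base URL, whatever its form
full_urls = [urljoin(base, h) for h in hrefs]

for u in full_urls:
    print(u)
```

Plain string concatenation would produce a malformed URL for the second and third forms, while urljoin resolves all three to a valid address under /database/.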
