wjie08

Reputation: 445

How do I append the output from BeautifulSoup to a pandas DataFrame?

I am relatively new to Python. I am planning to:

a) obtain a list of URLs from the following URL (https://aviation-safety.net/database/) with data from the year 1919 onwards (https://aviation-safety.net/database/dblist.php?Year=1919)

b) obtain the data (date, type, registration, operator, fat., location, cat) from 1919 to the current year

However, I ran into some problems and am still stuck at a).

Any form of help is appreciated, thank you so much!

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
   result = requests.get(mainurl)
   soup = BeautifulSoup(result.content, 'html.parser')
   datatable = soup.find('a', href = True)


#try clause to go through the content and grab the URLs
try:
   for row in datatable:
      cols = row.find_all("|")
      if len(cols) > 1:
         links.append(x, cols = cols)
except: pass


#place links into numpy array
links_array = np.asarray(links)
len(links_array)


#check if links are in dataframe
df = pd.DataFrame(links_array)

df.columns = ['url']
df.head(10)


I can't seem to get the URLs.

It would be great if I could get the following:

S/N  URL
1    https://aviation-safety.net/database/dblist.php?Year=1919
2    https://aviation-safety.net/database/dblist.php?Year=1920
3    https://aviation-safety.net/database/dblist.php?Year=1921

Upvotes: 1

Views: 732

Answers (1)

chitown88

Reputation: 28565

You're not extracting the href attribute from the tags you're pulling. What you want to do is find all <a> tags that have links (which you did, but you need find_all, since find returns only the first match), then iterate through those tags. Here I just check whether each href contains the substring 'Year' and, if it does, append the full URL to the list.

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)


#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)

Output:

df.head(10)
Out[24]: 
                                                 url
0  https://aviation-safety.net/database/dblist.ph...
1  https://aviation-safety.net/database/dblist.ph...
2  https://aviation-safety.net/database/dblist.ph...
3  https://aviation-safety.net/database/dblist.ph...
4  https://aviation-safety.net/database/dblist.ph...
5  https://aviation-safety.net/database/dblist.ph...
6  https://aviation-safety.net/database/dblist.ph...
7  https://aviation-safety.net/database/dblist.ph...
8  https://aviation-safety.net/database/dblist.ph...
9  https://aviation-safety.net/database/dblist.ph...
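
For part b), here is one possible direction as a rough sketch (not tested against the live site): loop over the URLs you just collected and let pandas.read_html parse each year page's occurrence table into a DataFrame. The User-Agent header and the assumption that the occurrence list is the first <table> on each page are guesses you'd need to verify against the actual pages.

#rough sketch for part b) -- assumptions noted in comments, verify against the site
import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  #assumption: the site may reject requests without a browser-like user agent

frames = []
for url in df['url']:
    page = requests.get(url, headers=headers)
    #read_html returns one DataFrame per <table> on the page;
    #assumption: the first table is the occurrence list
    tables = pd.read_html(page.text)
    if tables:
        frames.append(tables[0])

all_years = pd.concat(frames, ignore_index=True)
print(all_years.head())

Note that years with many records may be split across several numbered pages, so you might also have to follow the pagination links on each year page. And if you want the index labelled S/N starting at 1 as in your desired output, df.index = df.index + 1 followed by df.index.name = 'S/N' will do it.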

Upvotes: 2
