Reputation: 445
I am relatively new to Python. I am planning to:
a) obtain a list of URLs from the following url (https://aviation-safety.net/database/) with data from the year 1919 onwards (https://aviation-safety.net/database/dblist.php?Year=1919).
b) obtain the data (date, type, registration, operator, fat., location, cat) from 1919 to the current year.
However, I ran into some problems and am still stuck on a).
Any form of help is appreciated, thank you so much!
#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

#start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find('a', href = True)

#try clause to go through the content and grab the URLs
try:
    for row in datatable:
        cols = row.find_all("|")
        if len(cols) > 1:
            links.append(x, cols = cols)
except: pass

#place links into numpy array
links_array = np.asarray(links)
len(links_array)

#check if links are in dataframe
df = pd.DataFrame(links_array)
df.columns = ['url']
df.head(10)
I can't seem to get the URLs.
It would be great if I could get the following:
S/N  URL
1    https://aviation-safety.net/database/dblist.php?Year=1919
2    https://aviation-safety.net/database/dblist.php?Year=1920
3    https://aviation-safety.net/database/dblist.php?Year=1921
Upvotes: 1
Views: 732
Reputation: 28565
You're not extracting the href attributes from the tags you are pulling. What you want to do is find all <a> tags with links (which you did, but you need to use find_all, as find will just return the first one it finds). Then iterate through those tags. I chose to just check whether each href contains the substring 'Year' and, if it does, put that link into the list.
#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

#start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href = True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)
Output:
df.head(10)
Out[24]:
url
0 https://aviation-safety.net/database/dblist.ph...
1 https://aviation-safety.net/database/dblist.ph...
2 https://aviation-safety.net/database/dblist.ph...
3 https://aviation-safety.net/database/dblist.ph...
4 https://aviation-safety.net/database/dblist.ph...
5 https://aviation-safety.net/database/dblist.ph...
6 https://aviation-safety.net/database/dblist.ph...
7 https://aviation-safety.net/database/dblist.ph...
8 https://aviation-safety.net/database/dblist.ph...
9 https://aviation-safety.net/database/dblist.ph...
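The substring check and the mainurl + url concatenation above work as long as every href is relative to the database page. If that ever stops being true, a slightly more defensive version could build the absolute URL with urljoin and keep only links that carry a numeric Year query parameter. The following is only a sketch: year_links is a hypothetical helper name, and the sample hrefs are made up to stand in for [a['href'] for a in soup.find_all('a', href=True)], not taken from the real page.

```python
from urllib.parse import urljoin, urlparse, parse_qs

def year_links(hrefs, base_url):
    """Return absolute URLs for hrefs that have a numeric 'Year' query parameter."""
    links = []
    for href in hrefs:
        # urljoin resolves relative, root-relative and already-absolute hrefs
        absolute = urljoin(base_url, href)
        query = parse_qs(urlparse(absolute).query)
        years = query.get("Year", [])
        if years and years[0].isdigit():
            links.append(absolute)
    return links

# Made-up hrefs (an assumption, not the real page content)
sample_hrefs = [
    "dblist.php?Year=1919",
    "/database/dblist.php?Year=1920",
    "https://aviation-safety.net/database/dblist.php?Year=1921",
    "/about/",
]
base = "https://aviation-safety.net/database/"
links = year_links(sample_hrefs, base)
for url in links:
    print(url)
```

The resulting list can be passed to pd.DataFrame(links, columns=['url']) exactly as in the code above.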
Upvotes: 2