SecureEntrepeneur

Reputation: 97

Python BeautifulSoup - Trying to parse for names in <ol>

I'm really new to web scraping, but I'm trying to scrape the names and relevant information from this website: https://ofsistorage.blob.core.windows.net/publishlive/ConList.html

I'm stuck on how to extract the information from an ordered list tag in HTML. This is the structure I see when I inspect the page: [screenshot of the HTML structure]

This is the code I have so far:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://ofsistorage.blob.core.windows.net/publishlive/ConList.html')
soup = BeautifulSoup(source.text, 'html.parser')

# Pull all text from the vsc-initialized div
name_list = soup.find(class_='vsc-initialized')

# Pull text from all instances of the <ol> tag within the vsc-initialized div
name_list_items = name_list.find_all('ol')

# Create for loop to print out all names
for name in name_list_items:
    print(name_list.prettify())

Any suggestions would be greatly appreciated!

Upvotes: 1

Views: 889

Answers (3)

Humayun Ahmad Rajib

Reputation: 1560

from bs4 import BeautifulSoup
import requests
import pandas as pd

response = requests.get('https://ofsistorage.blob.core.windows.net/publishlive/ConList.html')
soup = BeautifulSoup(response.text, 'lxml')

# there is no vsc-initialized div, so start from the <body> tag
name_list = soup.find('body')
name_list_items = name_list.find_all('ol')
data = []

for name in name_list_items:
    # collect the text of every <li> under this <ol>
    list_items = name.find_all('li')
    list_items = [item.text for item in list_items]
    data.append(list_items)

df = pd.DataFrame(data)
print(df)

Output will be:

0   Name 6: ABBASIN 1: ABDUL AZIZ 2: n/a 3: n/a 4:...   
1   Organisation Name: HAJI BASIR AND ZARJMIL COMP...   
2   Name 6: NAVUMAU  1: ULADZIMIR  2: ULADZIMIRAVI...   
3   Name 6: AUNG 1: AUNG 2: n/a 3: n/a 4: n/a 5: n...   
4   Name 6: BIZIMANA 1: GODEFROID 2: n/a 3: n/a 4:...   
5   Name 6: ABDOULAYE 1: HISSENE 2: n/a 3: n/a 4: ...   
6   Organisation Name: BUREAU D'ACHAT DE DIAMANT E...   
7   Name 6: AHMED 1: FIRAS 2: n/a 3: n/a 4: n/a 5:...   
8   Organisation Name: CENTRE D'ETUDES ET DE RECHE...   
9   Name 6: BADEGE 1: ERIC 2: n/a 3: n/a 4: n/a 5:...   
10  Organisation Name: ADFa.k.a: (1) ADF/NALU (2) ...   
11  Name 6: EL GAMMAL 1: KHADIGA 2: MAHMOUD 3: n/a...   
12  Name 6: ABBASZADEH-MESHKINI 1: MAHMOUD 2: n/a ...   
13  Organisation Name: CYBER POLICEAddress: Tehran...   
14  Name 6: ABBASI-DAVANI 1: FEREIDOUN 2: n/a 3: n...   
15  Organisation Name: 3M MIZAN MACHINERY MANUFACT...   
16  Name 6: ABD AL-GHAFUR 1: SUNDUS 2: n/a 3: n/a ...   
17  Organisation Name: AL WASEL AND BABEL GENERAL ...   
18  Name 6: ABDELRAZAK 1: FITIWI 2: n/a 3: n/a 4: ...   
19  Organisation Name: AL-INMA HOLDING CO. FOR CON...   
20  Name 6: AG ALBACHAR 1: AHMED 2: n/a 3: n/a 4: ... 


... and so on.
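
If you also want to split each entry into its fields, here is a minimal follow-up sketch (it assumes the "Name 6:" label appears exactly as shown in the output above, which only holds for the individual-name rows):

# column 0 of df holds the first <li> text of each <ol>
entries = df[0].dropna()

# pull the surname that follows the "Name 6:" label; organisation rows come back as NaN
surnames = entries.str.extract(r'Name 6:\s*(\S+)')
print(surnames.head())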

Upvotes: 1

Andrej Kesely

Reputation: 195543

This script will get all names found on the page:

import requests
from bs4 import BeautifulSoup

url = 'https://ofsistorage.blob.core.windows.net/publishlive/ConList.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_names = []
# select each <li> whose <b> labels include "Name 6:", then take the text node
# that follows each of its first six <b> labels
for li in soup.select('li:has(b:contains("Name 6:"))'):
    all_names.append([name.find_next_sibling(text=True).strip() for name in li.select('b')[:6]])

# pretty print on screen:
from pprint import pprint
pprint(all_names)

Prints:

[['ABBASIN', 'ABDUL AZIZ', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL AHAD', 'AZIZIRAHMAN', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL AHMAD TURK', 'ABDUL GHANI', 'BARADAR', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL BASEER', 'ABDUL QADEER', 'BASIR', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL BASIR', 'NAZIR MOHAMMAD', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL GHANI', 'ABDUL GHAFAR', 'QURISHI', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL MANAN', 'ABDUL SATAR', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL QADER', 'ABDUL HAI', 'HAZEM', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL QADIR', 'AHMAD TAHA', 'KHALID', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL QUDDUS', 'SAYED ESMATULLAH', 'ASEM', 'n/a', 'n/a', 'n/a.'],
 ['ABDUL ZAHIR', 'SHAMS', 'UR-RAHMAN', 'n/a', 'n/a', 'n/a.'],
 ['ABDULLAH', 'AMIR', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['ACHEKZAI', 'ABDUL SAMAD', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['ACHEKZAI', 'ADAM KHAN', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['AGHA', 'ABDUL RAHMAN', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['AGHA', 'SAYED', 'MOHAMMAD', 'AZIM', 'n/a', 'n/a.'],
 ['AGHA', 'SAYYED GHIASSOUDDINE', 'n/a', 'n/a', 'n/a', 'n/a.'],
 ['AGHA', 'JANAN', 'n/a', 'n/a', 'n/a', 'n/a.'],

...and so on.
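
If your BeautifulSoup/soupsieve version warns that the :contains pseudo-class and the text= keyword are deprecated, this equivalent variant (a sketch assuming a reasonably recent soupsieve) avoids the warnings:

import requests
from bs4 import BeautifulSoup

url = 'https://ofsistorage.blob.core.windows.net/publishlive/ConList.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_names = []
# :-soup-contains is soupsieve's newer spelling of :contains,
# and string=True replaces the older text=True keyword
for li in soup.select('li:has(b:-soup-contains("Name 6:"))'):
    all_names.append([b.find_next_sibling(string=True).strip() for b in li.select('b')[:6]])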

Upvotes: 1

Shreyas Sreenivas

Reputation: 351

There is no div with class vsc-initialized, so you can use the body tag directly. Also, .find_all does not pull the text; it returns the whole HTML elements, so you can keep calling BeautifulSoup methods on them.

list_items then gives you the text of every li element within each ol element. To pull the individual fields out of that text you would have to use plain Python, since the page gives that information no defined structure (see the sketch after the code below).

from bs4 import BeautifulSoup
import requests

source = requests.get('https://ofsistorage.blob.core.windows.net/publishlive/ConList.html')
soup = BeautifulSoup(source.text, 'html.parser')

# Use the <body> tag directly since there is no vsc-initialized div
name_list = soup.find('body')

# Pull all instances of <ol> tag within the body
name_list_items = name_list.find_all('ol')

# Create for loop to print out all names
for name in name_list_items:
    list_items = name.find_all('li') #extract all li elements under each ol tag
    list_items = [item.text for item in list_items]
    print(list_items)
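
For the "regular Python" step, here is one minimal sketch (it assumes the "Name 6: ... 1: ... 2: ..." labelling seen in the printed text, so a value that itself contains digits followed by a colon would confuse it):

import re

# hypothetical entry, shaped like the li text printed by the loop above
entry = 'Name 6: ABBASIN 1: ABDUL AZIZ 2: n/a 3: n/a 4: n/a 5: n/a.'

# split on the labels ("Name 6:", "1:", "2:", ...) while keeping them
parts = re.split(r'\s*((?:Name )?\d+:)\s*', entry)

# the capturing group makes re.split alternate values and labels:
# ['', 'Name 6:', 'ABBASIN', '1:', 'ABDUL AZIZ', '2:', 'n/a', ...]
fields = dict(zip(parts[1::2], parts[2::2]))
print(fields)  # {'Name 6:': 'ABBASIN', '1:': 'ABDUL AZIZ', ...}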

Upvotes: 1
