Reputation: 33
I am trying to scan a bunch of Wikipedia pages for statistics about WWII.
I am using BeautifulSoup to try to get all of the statistics from the infobox column on the right-hand side of each Wikipedia page.
The code is listed below.
"links.csv" is a file with a bunch of link endings like "Battle_of_Leyte_Gulf". I have tested with the <h2>
tag and it is properly accessing all sites.
import requests
from bs4 import BeautifulSoup
import pandas

df = pandas.read_csv("links.csv")
links = df['links']

for url in links:
    # print("\n" + url + "\n")
    txt = "https://en.wikipedia.org/wiki/" + url        # build the full article URL
    page = requests.get(txt)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("br")                          # every <br> on the page
    for tag in tags:
        print(tag)
However, I noticed the text is not inside the actual <br> tag; it sits outside it, like this:
"Sixth Army: "
<br>
"≈200,000"
<br>
<span class="flagicon">...</span>
"Air and naval forces: ≈120,000"
I want to know how I can get the actual text "Sixth Army: " and "≈200,000".
Link to the page here: https://en.wikipedia.org/wiki/Battle_of_Leyte
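In case it helps, this is roughly how I am seeing that sibling structure: the quoted strings appear to be sibling nodes of each <br>, so printing previous_sibling / next_sibling inside the loop shows them as separate pieces of text. A rough single-page sketch (the values in the comments are only what I expect from the Battle of Leyte page, not verified output):

import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Battle_of_Leyte")
soup = BeautifulSoup(page.content, 'html.parser')
for tag in soup.find_all("br"):
    # The strings live next to the <br>, not inside it.
    print(repr(tag.previous_sibling))  # e.g. 'Sixth Army: '
    print(repr(tag.next_sibling))      # e.g. '≈200,000' (may be another tag instead of text)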
Upvotes: 2
Views: 75
Reputation: 84465
You could isolate the td cell and then use next_sibling:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/Battle_of_Leyte')
soup = bs(r.content, 'lxml')

# Isolate the <span> inside the infobox (.vevent) row that holds the strength
# figures, then walk its siblings to reach the bare text nodes.
visible_row = soup.select_one('.vevent tr:nth-of-type(12) td span')
print(visible_row.next_sibling)                            # text node right after the span, e.g. "Sixth Army: "
print(visible_row.next_sibling.next_sibling.next_sibling)  # skip the <br> to reach the next text node, e.g. "≈200,000"
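If the hard-coded tr:nth-of-type(12) index turns out to vary between articles, a less position-dependent variant is to find the row by its header text and take all of the cell's strings. This is only a sketch, assuming the usual military-conflict infobox layout (a "Strength" header row followed by a data row), which I have not checked against every page:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/Battle_of_Leyte')
soup = bs(r.content, 'lxml')

# Find the "Strength" header cell in the infobox, then read the text nodes
# from the cells of the row that follows it (layout assumption noted above).
header = soup.find(lambda t: t.name == 'th' and t.get_text(strip=True) == 'Strength')
if header:
    data_row = header.find_parent('tr').find_next_sibling('tr')
    for cell in data_row.find_all('td'):
        print(list(cell.stripped_strings))

stripped_strings skips the tags entirely, so only the text pieces ("Sixth Army: ", "≈200,000", ...) should come back.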
Upvotes: 1