Tanishq Vyas

Reputation: 1689

How to extract image src from the website

I scraped the table rows from the website below to get data on the spread of the coronavirus.

I also want to extract the src attribute of every img tag in the table, so that I get the URL of each country's flag image along with the rest of that country's data. Could someone help?

import io

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox without opening a browser window
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

driver.get("https://google.com/covid19-map/?hl=en")
# Wrap the HTML in StringIO (newer pandas versions warn on raw strings);
# the table at index 1 is the one being scraped
df = pd.read_html(io.StringIO(driver.page_source))[1]

df.to_csv("Data.csv", index=False)

driver.quit()

Upvotes: 2

Views: 337

Answers (2)

Gareth Ma

Reputation: 707

Not the most ingenious way, but since you already have the page source, how about using a regex to match the URLs of the images?

import re
# Dots are escaped so they match a literal "." rather than any character
print(re.findall(r'https://www\.gstatic\.com/onebox/sports/logos/flags/.+?\.svg', driver.page_source))

The image links are in order so it matches the order of confirmed cases - except that on my computer, the country I'm in right now is at the top of the list.

If this is not what you want, I can delete this answer.
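If you would rather read the src attributes directly than pattern-match on the URL, a small parser over the same page source is another option. This is only a minimal sketch using the standard library's html.parser; the `ImgSrcCollector` class, the `extract_img_sources` helper, and the sample markup are hypothetical names and data for illustration, not the real structure of the page.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html):
    """Return the src values of all <img> tags found in the HTML string."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical markup; with Selenium you would pass driver.page_source instead
sample_html = """
<table>
  <tr><td><img src="https://www.gstatic.com/onebox/sports/logos/flags/united_kingdom_icon_square.svg">United Kingdom</td></tr>
  <tr><td><img src="https://www.gstatic.com/onebox/sports/logos/flags/united_states_icon_square.svg">United States</td></tr>
</table>
"""
print(extract_img_sources(sample_html))
```

Note that this collects every img on the page, so on the real site you would probably still need to filter the results, for example by keeping only URLs that contain "/flags/".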

As mentioned by @Chris Doyle in the comments, this can be done even more simply by noticing that the URLs all follow the same pattern, with the ".+?" part determined by the country's name (all lowercase, words joined with underscores). You already have that information in the CSV file.

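country_name = "United Kingdom"
url = "https://www.gstatic.com/onebox/sports/logos/flags/"
# Lowercase the name and join its words with underscores
url += '_'.join(country_name.lower().split())
# Suffix taken from the flag URLs in the sample output of the other answer
url += '_icon_square.svg'
print(url)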
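To apply the same idea to every row of the CSV without pandas, something like the following could work. It is a sketch only: `flag_url` and `add_flag_column` are hypothetical helper names, the "_icon_square.svg" suffix is assumed from the sample output in Chris Doyle's answer, and the inline CSV text stands in for Data.csv.

```python
import csv
import io

FLAG_BASE = "https://www.gstatic.com/onebox/sports/logos/flags/"


def flag_url(country_name):
    """Build the flag URL, assuming the fixed naming pattern holds."""
    slug = "_".join(country_name.lower().split())
    return f"{FLAG_BASE}{slug}_icon_square.svg"


def add_flag_column(csv_text, name_column="Location"):
    """Parse CSV text and return its rows as dicts with a flag_url key added."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["flag_url"] = flag_url(row[name_column])
        rows.append(row)
    return rows


# Stand-in for the contents of Data.csv
sample_csv = "Location,Confirmed\nUnited Kingdom,29474\nUnited States,189441\n"
for row in add_flag_column(sample_csv):
    print(row["Location"], row["flag_url"])
```

The pattern is an assumption about how the flag files are named, so a location whose name does not map cleanly onto a file name would get a URL that does not resolve.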

Also be sure to check out his answer, which uses pure pandas :)

Upvotes: 3

Chris Doyle

Reputation: 11992

While Gareth's answer has already been accepted, it inspired me to write this one from a pandas point of view. Since the flag URLs follow a fixed pattern and the only part that changes is the name, we can create a new column by lowercasing the name, replacing spaces with underscores, and then inserting the result into the fixed URL pattern.

import io

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Use Chrome's own Options class and actually pass it to the driver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get("https://google.com/covid19-map/?hl=en")
df = pd.read_html(io.StringIO(driver.page_source))[1]
# Build each flag URL from the lowercased, underscore-joined location name
df['flag_url'] = df.apply(lambda row: f"https://www.gstatic.com/onebox/sports/logos/flags/{row.Location.lower().replace(' ', '_')}_icon_square.svg", axis=1)
df.to_csv("Data.csv", index=False)
driver.quit()

OUTPUT SAMPLE

Location,Confirmed,Cases per 1M people,Recovered,Deaths,flag_url
Worldwide,882068,125.18,185067,44136,https://www.gstatic.com/onebox/sports/logos/flags/worldwide_icon_square.svg
United Kingdom,29474,454.19,135,2352,https://www.gstatic.com/onebox/sports/logos/flags/united_kingdom_icon_square.svg
United States,189441,579.18,7082,4074,https://www.gstatic.com/onebox/sports/logos/flags/united_states_icon_square.svg

Upvotes: 4
