CamM

Reputation: 31

Scraping images injected by javascript in Python with Selenium

I'm trying to build a web scraper in Python on Mac OS X. As a test case, I'm loading tags and images from a MyFonts page (e.g. here). I started with BeautifulSoup, but I noticed that the site initially loads a 'blank.png' placeholder in place of the font images I'm trying to grab, and JavaScript then swaps in the real ones. I'm now trying Selenium. Can I use a WebDriverWait to wait for the change in the img src, similar to the example below, but without relying on an ID or class?

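from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ff = webdriver.Firefox()
ff.get("http://www.myfonts.com/fonts/fort-foundry/gin/")
try:
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))
finally:
    ff.quit()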

Ideally this should wait until the img src is no longer "*/blank.png", since the element doesn't change class or get a consistent name. Or should I just wait until the page finishes loading entirely? The scraper has to go through a lot of these pages, so I'm trying to keep it fairly quick.

I'm very new to Python, so any help would be greatly appreciated.

Upvotes: 3

Views: 1355

Answers (2)

Padraic Cunningham

Reputation: 180391

I second what Alex said regarding legality, but you could also get the font sample images by mimicking the Ajax request with requests and bs4:

In [16]: import requests

In [17]: from bs4 import BeautifulSoup

In [18]: data = {
   ....:     'seed': '24',
   ....:     "text": "Pangrams",
   ....:     "src": "pangram.auto",
   ....:     "size": "72",
   ....:     "fg": "000000",
   ....:     "bg": "ffffff",
   ....:     "goodies": "_2x:0",
   ....:     "w": "720",
   ....:     "i[]": ["fort-foundry/gin/regular,,720", "fort-foundry/gin/oblique,,720", "fort-foundry/gin/rough,,720",
   ....:             "fort-foundry/gin/rough-oblique,,720", "fort-foundry/gin/round,,720","fort-foundry/gin/round-oblique,,720",
   ....:             "fort-foundry/gin/lines,,720", "fort-foundry/gin/lines-oblique,,720"],
   ....:     "showimgs": "true"}

In [19]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json()

In [20]: from pprint import pprint as pp

In [21]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()), "lxml").find_all("img")]

In [22]: pp(urls)
['//samples.myfonts.net/a_91/u/af/5e840d069d35f2c8e5f7077bae7b1e.gif',
 '//samples.myfonts.net/e_91/u/d6/1d63ad993299d182ae19eddb2c41e1.gif',
 '//samples.myfonts.net/e_92/u/7c/15b8e24e4b077ae3b1c7a614afa8b5.gif',
 '//samples.myfonts.net/b_92/u/ce/63dffdda8581fc83f6fe20874714e7.gif',
 '//samples.myfonts.net/e_91/u/51/e8b7a0b5cccb2abf530b05e1d3fb04.gif',
 '//samples.myfonts.net/b_91/u/6f/a5f870c719dcf9961e753b9f4afd7e.gif',
 '//samples.myfonts.net/b_92/u/7c/94d652e4f146801e3c81f694898e07.gif',
 '//samples.myfonts.net/b_91/u/47/39fa3ab779cabd1068abbca7ce98c5.gif']

The only values you need to pass are the i[] entries; the rest can be used to change the size, background colour, etc.
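As a rough sketch of that idea, the payload can be built from a list of font paths plus whatever optional overrides you want. The helper name `build_payload` is hypothetical, and the ",,720" suffix is copied verbatim from the request above without knowing exactly what the server does with it, so treat this as an illustration rather than a description of the site's API:

```python
# Hypothetical helper: builds the form data for the Ajax request shown above.
# The ",,720" suffix is copied as-is from the captured request; its exact
# meaning on the server side is an assumption.
SUFFIX = ",,720"


def build_payload(font_paths, **options):
    """Return form data with the required "i[]" list plus optional overrides."""
    payload = {"i[]": [path + SUFFIX for path in font_paths]}
    # Optional keys seen in the captured request, e.g. size="72", bg="ffffff".
    payload.update(options)
    return payload


if __name__ == "__main__":
    data = build_payload(
        ["fort-foundry/gin/regular", "fort-foundry/gin/oblique"],
        size="72",
        bg="ffffff",
    )
    print(data)
```

The resulting dict could then be passed as `data=` to `requests.post`, as in the session above.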

So if you don't care about changing the bg, fg, size, etc., and want to get all the names using just bs4 and requests, you could take the font names from the search-result-item class and construct the Ajax request from those:

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: r = requests.get("http://www.myfonts.com/fonts/fort-foundry/gin/")

In [4]: soup = BeautifulSoup(r.content, "lxml")

# creates "fort-foundry/gin/regular,,720" etc.
In [5]: fonts = ["{},,720".format(a["href"].strip("/").split("/", 1)[1])
   ...:          for a in soup.select("div .search-result-item h4 a[href]")]

In [6]: data = {
   ...:     "i[]": fonts
   ...:      }

In [7]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json()

In [8]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").select("img[src]")]

In [9]: from pprint import pprint as pp

In [10]: pp(urls)
['//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif',
 '//samples.myfonts.net/b_92/u/06/b8ad49c563d310a97147d8220f55ab.gif',
 '//samples.myfonts.net/a_91/u/e7/8f84ce98f19e3f91ddc15304d636e7.gif',
 '//samples.myfonts.net/e_91/u/71/9769a1ab626429d63d3c779fcaa3d7.gif',
 '//samples.myfonts.net/b_92/u/65/fe416f15ea94b1f8603ddc675fd638.gif',
 '//samples.myfonts.net/b_91/u/5d/3ced9e71910bc411a0d76316d18df1.gif',
 '//samples.myfonts.net/e_92/u/cd/0df987a72bb0a43cba29b38c16b7a5.gif',
 '//samples.myfonts.net/e_91/u/88/3f80a1108fd0a075c69b09e9c21a8d.gif']
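Note that the URLs come back protocol-relative (they start with "//"), so they need a scheme before you can download them. A minimal sketch using the standard library's `urljoin`, which borrows the scheme from the page URL; the helper name `absolute_urls` is just for illustration:

```python
from urllib.parse import urljoin

# The page the samples were scraped from (taken from the question).
PAGE_URL = "http://www.myfonts.com/fonts/fort-foundry/gin/"


def absolute_urls(urls, base=PAGE_URL):
    """Resolve protocol-relative ("//host/path") URLs against the page URL."""
    return [urljoin(base, u) for u in urls]


if __name__ == "__main__":
    sample = ["//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif"]
    print(absolute_urls(sample))
    # ['http://samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif']
```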

Upvotes: 1

alecxe

Reputation: 473763

First of all, make sure what you are doing is legal: Legal page.

Wait for at least one font sample to be loaded and then proceed to extracting:

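from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `ff` is the Firefox driver from the question
# wait for at least one font sample to be loaded
wait = WebDriverWait(ff, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#overview_samples .search-result-item")))

# get font sample urls (find_elements_by_css_selector was removed in Selenium 4)
for sample in ff.find_elements(By.CSS_SELECTOR, "#overview_samples .search-result-item .sample .fontsample[title]"):
    print(sample.get_attribute("src"))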

Prints:

http://samples.myfonts.net/e_91/u/e7/19061adcc0c9ac025d0414e5ff11a1.gif
http://samples.myfonts.net/a_91/u/e5/4d795cdae0cb99d1424b13020d0f6e.gif
...
http://samples.myfonts.net/b_92/u/2c/4c21ddeb53f19f109306746dac6b24.gif
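If you specifically want to wait until no image still points at blank.png, as the question asks, `WebDriverWait.until` accepts any callable that takes the driver and returns something truthy. Below is a minimal sketch of such a condition. The names `images_loaded` and `scrape_sample_urls` are hypothetical, and the default selector is only assumed to match the sample images on this page, so adjust it to whatever the page really uses:

```python
def images_loaded(selector, placeholder="blank.png"):
    """Build a wait condition: truthy once every matched image has a real src."""
    def condition(driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        images = driver.find_elements("css selector", selector)
        srcs = [img.get_attribute("src") or "" for img in images]
        # Keep waiting while nothing matches or any image is still the placeholder.
        if srcs and not any(s.endswith(placeholder) for s in srcs):
            return srcs
        return False
    return condition


def scrape_sample_urls(url,
                       selector="#overview_samples .search-result-item .sample .fontsample",
                       timeout=10):
    # Imported here so images_loaded() can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # until() polls the condition and returns its first truthy result.
        return WebDriverWait(driver, timeout).until(images_loaded(selector))
    finally:
        driver.quit()
```

Calling `scrape_sample_urls("http://www.myfonts.com/fonts/fort-foundry/gin/")` should then return the list of src values, or raise a TimeoutException if placeholders are still present after the timeout. This avoids waiting for the whole page to finish loading, which may matter when going through many pages.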

Upvotes: 1
