bariskau

Reputation: 347

Searching for key words in articles on a website with Python

I just joined, and I started learning Python a month ago. I would like to search for key words in the articles listed on this site with Python (http://aeconf.com/may2013.htm).

Normally I click "View Abstract" manually and look for the words after "Key Words:". How can I do this automatically with Python?

Upvotes: 2

Views: 1437

Answers (1)

Ibaboiii

Reputation: 89

You should check out Selenium

pip install selenium

Here is some sample code showing what it can do. You should test it yourself.

Sample Code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Uses the Selenium 4 API. Later 4.x releases removed desired_capabilities,
# executable_path and the find_elements_by_* helpers used in Selenium 3.
options = webdriver.ChromeOptions()
options.page_load_strategy = "normal"  # wait until the page has fully loaded
service = Service(executable_path=r'C:\Users\My-PC-Name\AppData\Local\Programs\Python\Python37-32\Scripts\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)

url = "http://aeconf.com/may2000.htm"  # Link
driver.get(url)
# Collect every href first, because the elements go stale once we navigate away.
links = [x.get_attribute('href') for x in driver.find_elements(By.LINK_TEXT, 'View Abstract')]
htmls = []

for link in links:
    driver.get(link)
    Keyword = [y.text for y in driver.find_elements(By.XPATH, "//font[2]/span[@style = 'mso-bidi-font-size: 1.0pt']")]
    if not Keyword:  # dead link, or a page where the XPath matches nothing
        continue
    print(Keyword[0])
    htmls.append(driver.page_source)

driver.quit()  # close the browser when done

In this example I changed the URL to http://aeconf.com/may2000.htm. The code gets the "Key Words" for most links, but in some cases the index position of the "Key Words" element differs from one abstract page to another.

Output for the changed URL:

Fiscal decentralization; Corruption; Tax evasion.
Incentive mechanism design; Walrasian allocations; Implementation.
Debt and equity flows; Asymmetric information; Bankruptcy cost; Market failures; Corrective taxation.
Transitory volatility; Price formation; Exogenous liquidity demand.
Investment horizon; Beta; Size; Book-to-market equity; CAPM.
G11, G13.    # Here the printed "Key Words" are not correct
Portfolio constraints; Stochastic income; Relaxation-projection methods.
Foreign aid; Foreign borrowing; Capital accumulation.
Entrepreneurial ability; Asymmetric information; Liquidity constraints.
Contract; Human capital; Labor.
Endogenous structure of the division of labor; Dual economy; Endogenous trade policy regime.

If you change the 'url' variable in my sample code back to your original link, there are more cases where the index position changes, and even the first link is a dead link. As a challenge, I'll let you figure that out yourself :-) Other modules can do the same job as Selenium. I hope this gets you more interested in browser automation, web scraping, web crawlers and more.
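
As one example of doing the link-collecting step without Selenium, here is a minimal sketch that uses only Python's standard library. It assumes the issue page is static HTML and that every abstract link is a plain <a> tag whose visible text is "View Abstract". The names AbstractLinkParser and find_abstract_links are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbstractLinkParser(HTMLParser):
    """Collect the href of every <a> whose visible text equals link_text."""

    def __init__(self, base_url, link_text="View Abstract"):
        super().__init__()
        self.base_url = base_url
        self.link_text = link_text
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces before comparing.
            text = " ".join("".join(self._text).split())
            if text == self.link_text:
                # Turn relative hrefs into absolute URLs.
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_abstract_links(html, base_url):
    """Return absolute URLs of all "View Abstract" links found in html."""
    parser = AbstractLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

You would feed it the HTML of the issue page (downloaded with urllib.request, for instance) and then fetch each returned link in turn.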

Just a tip (probably not much of one): you only have to change the index position used on the 'Keyword' variable to get the desired "Key Words".
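
An alternative to hunting for the right index is to search the page text for the "Key Words:" label itself. The sketch below is only an illustration under assumptions: it assumes the abstract page's visible text contains "Key Words:" and that, when more text follows, the next label starts with "JEL Classification". If that second label is missing, everything up to the end of the text is taken. The name extract_keywords is hypothetical.

```python
import re

# Assumption: the key words run from "Key Words:" up to a following
# "JEL Classification" label, or to the end of the text if there is none.
KEYWORDS_RE = re.compile(
    r"Key\s*Words\s*:\s*(.+?)(?=JEL\s+Classification|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def extract_keywords(page_text):
    """Return the key words found after "Key Words:", or [] if absent."""
    match = KEYWORDS_RE.search(page_text)
    if match is None:
        return []
    raw = " ".join(match.group(1).split())  # collapse line breaks
    # Split on ";" and drop surrounding spaces and the trailing period.
    return [kw.strip(" .") for kw in raw.split(";") if kw.strip(" .")]
```

With Selenium you could pass it the visible text of the page, for example driver.find_element(By.TAG_NAME, "body").text, in place of picking an element by index.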

Upvotes: 1
