Shin Yami

Reputation: 9

Python: Perform Google Search and extract only the content from the individual top 10 results

I am trying to write a script which performs a Google search for the input keyword and returns only the content from the top 10 URLs.

Note: Content specifically refers to the content that is being requested by the searched term and is found in the body of the returned URLs.

I am done with the search and top-10 URL retrieval part. Here is the script:

from google import search  # pip install google

# `keyword` is the search term; collect the top 10 result URLs into a list
top_10_links = list(search(keyword, tld='com.in', lang='en', stop=10))

However, I am unable to retrieve only the content from the links without knowing their structure. I can scrape content from a particular site by finding the class etc. of the tags using dev tools, but I cannot figure out how to get the content from the top 10 result URLs, since every searched term returns different URLs (and different sites use different CSS selectors), so it would be pretty hard to find the CSS class of the required content. Here is the sample code to extract content from a particular site.

from bs4 import BeautifulSoup

content_dict = {}
for i, page in enumerate(links, start=1):
    print(i, ' @ link: ', page)
    article_html = get_page(page)  # get_page() returns the page's HTML
    soup = BeautifulSoup(article_html, 'lxml')
    # only works when the target site wraps its article body in this class
    content = soup.find('div', {'class': 'entry-content'}).get_text()
    content_dict[page] = content

However, the CSS class changes from site to site. Is there some way I can get this script working and extract the desired content?

Upvotes: 0

Views: 742

Answers (1)

pythad

Reputation: 4267

You can't do scraping without knowing the structure of what you're scraping. But there is a package that does something similar: take a look at newspaper.
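As a rough illustration, here is a minimal sketch of how newspaper could replace the per-site CSS-selector lookup, assuming `top_10_links` already holds the URLs from your search step (on Python 3 the package is installed as newspaper3k):

from newspaper import Article  # pip install newspaper3k

content_dict = {}
for page in top_10_links:
    article = Article(page)
    article.download()   # fetch the page HTML
    article.parse()      # let newspaper locate the main article body
    content_dict[page] = article.text  # extracted body text, no CSS class needed

Note that newspaper relies on heuristics to find the main article body, so it may miss or truncate content on some sites.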

Upvotes: 1
