Reputation: 37

How to save a webpage's text content as a text file using Python

I wrote this Python script:

    from string import punctuation
    from collections import Counter
    import urllib
    from stripogram import html2text
    myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7") 
    html_string = myurl.read()
    text = html2text( html_string )
    file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
    file.write(text)
    file.close()

With this script I don't get the expected output, only some HTML code.

I want to save all of the webpage's text content to a text file.

I also tried urllib2 and bs4, but didn't get the results I wanted.

I don't want the output as HTML structure.

I want all the text data from the webpage.

Upvotes: 1

Answers (3)

underwater ranged weapon

Reputation: 405

You don't need to write any complicated algorithm to extract data from search results. Google has an API for this.
Here is an example:
https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
To use it, you must first register with Google for an API key.
You can find all the information here:
https://developers.google.com/api-client-library/python/start/get_started

Upvotes: 0

user3860618

Reputation: 135

    import urllib

    urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
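
The call above is Python 2. As a minimal sketch, assuming Python 3, the same function lives in `urllib.request`. The helper name `save_page` is hypothetical and only used for illustration. Note that this saves the raw response (the HTML source), not the extracted text:

```python
import urllib.request


def save_page(url, dest):
    """Download `url` and write the raw response bytes to `dest`.

    Returns the path of the file that was written.
    """
    # urlretrieve returns a (filename, headers) tuple.
    path, _headers = urllib.request.urlretrieve(url, dest)
    return path


# Example call (needs network access, so it is left commented out):
# save_page("http://www.example.com/test.html", "test.txt")
```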

Upvotes: 0

gsus

Reputation: 139

What do you mean by "webpage text"? It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not so easy to solve, because parsing an HTML document can be very complex, especially for JavaScript-heavy pages. It starts with deciding whether a string between "<" and ">" is a regular tag, and extends to analyzing the CSS properties changed by JavaScript.

That is why people write very large and complex rendering engines for web browsers.
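
For simple, static pages, a rough approximation is still possible. The following is a minimal sketch, assuming Python 3 and only the standard library `html.parser` module. The names `VisibleTextParser` and `html_to_text` are hypothetical. It keeps text nodes and skips the contents of `<script>` and `<style>`. It does not run JavaScript or apply CSS, so text that a browser renders dynamically, or hides, will not match what you see on screen:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The result can then be written out with `open(path, "w", encoding="utf-8")`, after decoding the downloaded bytes to a string with the page's character encoding.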

Upvotes: 2
