Prajakta Dumbre

Reputation: 37

How to save a webpage's text content as a text file using Python

I wrote this Python script:

    from string import punctuation
    from collections import Counter
    import urllib
    from stripogram import html2text
    myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7") 
    html_string = myurl.read()
    text = html2text( html_string )
    file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
    file.write(text)
    file.close()

With this script I don't get the output I expect, only some HTML code.

  • I want to save all of the webpage's text content in a text file.
  • I also tried urllib2 and bs4, but I didn't get the results I wanted.
  • I don't want the output as an HTML structure.
  • I want all the text data from the webpage.

    Upvotes: 1

    Views: 3017

  • Answers (3)

    You don't need to write any complicated algorithm to extract data from search results. Google has an API for this.
    Here is an example:
    https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
    To use it, you must first register with Google for an API key.
    You can find all the information here:
    https://developers.google.com/api-client-library/python/start/get_started

    Upvotes: 0

    user3860618

    Reputation: 135

     import urllib
    
     urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
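
    The call above is the Python 2 spelling. In Python 3 the same function lives at urllib.request.urlretrieve. A minimal sketch of the equivalent follows; the helper name save_page is hypothetical, and the saved file still contains the raw HTML markup rather than only the visible text.

```python
import urllib.request


def save_page(url, path):
    """Download the response body at `url` and store it unchanged at `path`.

    `save_page` is a hypothetical helper name used only for illustration.
    The saved file holds whatever the server returned (usually HTML markup).
    """
    # urlretrieve returns a (local_filename, headers) tuple.
    saved_path, _headers = urllib.request.urlretrieve(url, path)
    return saved_path
```

    For example, save_page("http://www.example.com/test.html", "test.txt") would write the page source to test.txt in the current directory.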
    

    Upvotes: 0

    gsus

    Reputation: 139

    What do you mean by "webpage text"? It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not easy to solve, because parsing an HTML document can be very complex, especially for JavaScript-heavy pages. It starts with deciding whether a string between "<" and ">" is a real tag, and extends to analyzing CSS properties that JavaScript changes at runtime.

    That is why people write very large and complex rendering engines for web browsers.
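
    For static pages, a rough approximation is still possible without a rendering engine. The sketch below uses only the standard library's html.parser to collect text nodes while skipping the contents of script and style elements. It does not run JavaScript or apply CSS, so hidden or dynamically generated text is not handled correctly. The names VisibleTextParser and html_to_text are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:
                self._chunks.append(stripped)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html_string):
    """Return the text nodes of `html_string`, one per line."""
    parser = VisibleTextParser()
    parser.feed(html_string)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

    The returned string can then be written with open(path, "w", encoding="utf-8"). Note that html_to_text expects a str, so bytes read from a response need to be decoded first.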

    Upvotes: 2
