shubham

Reputation: 125

How can I remove the HTML content from the output?

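import urllib

# Python 2: urllib.urlopen returns a file-like object over the raw HTML
data = urllib.urlopen("https://www.python.org/")
for line in data:
    # strip() returns a new string, so keep the result
    line = line.strip()
    print line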

I am trying to make a web crawler, but when I run the above code, HTML markup gets printed as well. I only want the text portion of the web page and the hyperlinks.

Upvotes: 1

Views: 53

Answers (2)

priyanka

Reputation: 244

Use the Beautiful Soup library to build the web crawler and handle the HTML tags.
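A minimal sketch of what that could look like, assuming Python 3 and the third-party beautifulsoup4 package. The helper name extract_text_and_links and the sample HTML string are made up for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_text_and_links(html):
    """Return (visible text, list of href values) for an HTML string.

    Hypothetical helper, written only to illustrate the idea.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements; their contents are not page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text nodes, stripping surrounding whitespace.
    text = soup.get_text(separator=" ", strip=True)
    # Collect the target of every <a> element that has an href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return text, links


# Assumed sample input, standing in for a downloaded page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    '<body><h1>Welcome</h1><p>Read the <a href="/docs/">docs</a>.</p></body></html>'
)
text, links = extract_text_and_links(sample)
print(text)
print(links)
```

To run this on a live page, the HTML string could come from urllib.request.urlopen(url).read() in Python 3 instead of the sample above.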

Upvotes: 1

BeaumontTaz

Reputation: 273

A somewhat rudimentary solution would be to .split the page source on "<" and ">" and then go through the resulting list, removing every element that starts at a "<" and ends at the next ">".
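One way to read that idea, as a rough Python 3 sketch using only string methods. The function name strip_tags is made up for illustration, and the code assumes every "<" in the input opens a tag that is closed by the next ">".

```python
def strip_tags(html):
    """Drop everything from each "<" up to the next ">".

    Hypothetical helper; a rough illustration, not a real HTML parser.
    """
    first, *rest = html.split("<")
    pieces = [first]  # text that appears before the first tag
    for chunk in rest:
        # Each chunk looks like 'tagname attrs>text after the tag'.
        _tag, sep, text = chunk.partition(">")
        if sep:  # keep the text only if the tag was actually closed
            pieces.append(text)
    return "".join(pieces)


print(strip_tags("<p>Hello <b>world</b>!</p>"))  # prints: Hello world!
```

This throws away the tags entirely, including the href values inside them, so the hyperlinks asked for in the question would have to be collected separately. It also keeps the contents of script and style elements and breaks on a ">" inside an attribute value, which is why a real parser is usually the safer choice.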

Upvotes: 1
