joey

Reputation: 241

Reading information from reddit with urllib

I have the following code:

import urllib
import re

def worldnews():
    count = 0
    html = urllib.urlopen("https://www.reddit.com/r/worldnews/").readlines()

    lines = html
    for line in lines:
        if "Paris" or "Putin" in line:
            count = count + 1
            print line       

    print "Total found: ", count
    print "----------------------"

worldnews()

How can I find all reddit posts on that page with Paris or Putin in the title? And is there a way to print the title of each post to the console? When I run this now, I get a lot of HTML code back.

Upvotes: 0

Views: 65

Answers (1)

n1c9

Reputation: 2687

The best way to work with HTML in Python is BeautifulSoup. You'll need to install it and look through the documentation to find out how to do exactly what you're asking, but here is something to get you started:

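import urllib
from bs4 import BeautifulSoup

def worldnews():
    # Python 2: urllib.urlopen returns a file-like object that BeautifulSoup can read
    html = urllib.urlopen("https://www.reddit.com/r/worldnews/")
    soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
    # On this page, each post title is inside a <p class="title"> element
    titles = soup.find_all('p', {'class': 'title'})
    for i in titles:
        print(i.text)

worldnews()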

When this is run, it gives an output that looks like this:

Paris attacks ringleader dead - French officials (bbc.com)
Company which raised price of AIDS drug by 5500% reports $14m quarterly losses. (pinknews.co.uk)
Syria/IraqSyrian man kills judge at ISIS Sharia Court for beheading his brother (en.abna24.com)
Putin Puts $50 Million Bounty on Heads of Metrojet Bombers (fortune.com)

and so on for all the titles on the page. From here you should be able to figure out somewhat easily how to code the rest. :-)
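As a hint for the rest, here is a minimal sketch of the filtering step. Note that the test in the question, `"Paris" or "Putin" in line`, is always true, because the non-empty string "Paris" is truthy on its own; each keyword has to be checked separately. The function name `filter_titles` and the `sample` list are made up for illustration, and the sample just reuses a few titles from the output above so the snippet runs without a network connection.

```python
def filter_titles(titles, keywords=("Paris", "Putin")):
    """Return the titles that contain at least one of the keywords."""
    # any(...) tests every keyword on its own, which avoids the
    # always-true `"Paris" or "Putin" in title` pitfall.
    return [t for t in titles if any(k in t for k in keywords)]


# Hypothetical sample data copied from the output above; in the real script
# this would be [i.text for i in soup.find_all('p', {'class': 'title'})].
sample = [
    "Paris attacks ringleader dead - French officials (bbc.com)",
    "Company which raised price of AIDS drug by 5500% reports $14m quarterly losses. (pinknews.co.uk)",
    "Putin Puts $50 Million Bounty on Heads of Metrojet Bombers (fortune.com)",
]

matches = filter_titles(sample)
for title in matches:
    print(title)
print("Total found: %d" % len(matches))
```

The match is case-sensitive, so "paris" in lowercase would not be counted; comparing `k.lower() in t.lower()` instead would be one way to relax that.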

Upvotes: 2
