gaggina

Reputation: 5425

Analyze and grab link from an html page

I'm new to Python and I'm having some trouble doing a simple thing.

I have an HTML page and I want to analyze it and grab some links inside a specific table.

In bash I'd use lynx --source and with grep/cut I'd have no problem, but in Python I don't know how to do it.

I thought of doing something like this:

import urllib2

data = urllib2.urlopen("http://www.my_url.com")

Doing that, I get the whole HTML page.

Then I thought of doing:

for line in data.read():
    if "my_links" in line:
        print line

But it doesn't seem to work.

Upvotes: 2

Views: 726

Answers (3)

G M

Reputation: 22510

Why don't you simply use enumerate():

import urllib2

site = urllib2.urlopen('http://www.rom.on.ca/en/join-us/jobs')

for i, j in enumerate(site):
    if "http://www.ontario.ca" in j:  # j is the line
        print i + 1  # i starts at 0; line numbers in an HTML file usually start at 1, so add 1

Output:

620

Upvotes: 0

Andrey Gubarev

Reputation: 791

In the general case you need XPath for this purpose. Examples: http://www.w3schools.com/xpath/xpath_examples.asp

Python has a beautiful library called lxml: http://lxml.de/xpathxslt.html
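
For example, a minimal sketch using lxml's XPath support, assuming the target table can be identified by an id attribute (the id="my_table" below is hypothetical; adjust the XPath to match your actual page):

from lxml import html

# lxml can fetch and parse the URL directly
tree = html.parse('http://www.my_url.com')

# grab the href of every link inside that specific table
links = tree.xpath('//table[@id="my_table"]//a/@href')
for link in links:
    print link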

Upvotes: 0

pyfunc
pyfunc

Reputation: 66739

On your code issue: data.read() returns the whole page as a single string if you do not pass it an amount of data to read, so iterating over it goes character by character.

for line in data.read():

You could do:

line = data.readline()
while line:
    print line
    line = data.readline()
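
Or, since the object returned by urlopen() is file-like, you can iterate over it line by line directly; a minimal sketch based on your original code:

import urllib2

data = urllib2.urlopen("http://www.my_url.com")
for line in data:  # iterates line by line, not character by character
    if "my_links" in line:
        print line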

This next portion is not exactly an answer to your code question, but I suggest that you use BeautifulSoup.

import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.my_url.com"
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)

all_links = soup.findAll('a')  # findAll returns every <a> tag; find would return only the first
# you can also look for a specific link
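
To narrow that down to links inside one specific table, a sketch (the id="my_table" is hypothetical; use whatever attributes identify your table):

# hypothetical: assumes the page has <table id="my_table">
table = soup.find('table', {'id': 'my_table'})
for link in table.findAll('a'):
    print link['href']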

Upvotes: 1
