Reputation: 11
I'm writing a webscraper and I have a table full of links to .pdf files that I want to download, save, and later analyze. I was using beautiful soup and I had soup find all the links. They are normally beautiful soup tag objects, but I've turned them into strings. The string is actually a bunch of junk with the link text buried in the middle. I want to cut out that junk and just leave the link. Then I will turn these into a list and have python download them later. (My plan is for python to keep a list of the pdf link names to keep track of what it's downloaded and then it can name the files according to those link names or a portion thereof).
But the .pdfs come in variable name-lengths, e.g.:
and as they exist in the table, they have a bunch of junk text:
So I want to cut ("slice") the front part and the last part off of the string and just leave the string that points to my url (so what follows is the desired output for my program):
://blah/blah/blah/I_am_the_first_file.pdf
://blah/blah/blah/And_I_am_the_seond_file.pdf
As you can see, though, the second file has more characters in the string than the first. So I can't do:
string[9:40]
or whatever because that would work for the first file but not for the second.
So i'm trying to come up with a variable for the end of the string slice, like so:
string[9:x]
wherein x is the location in the string that ends in '.pdf' (and my thought was to use the string.index('.pdf') function to do this.
But is t3h fail because I get an error trying to use a variable to do this
("TypeError: 'int' object is unsubscriptable")
Probably there's an easy answer and a better way to do this other than messing with strings, but you guys are way smartert than me and I figured you'd know straight off.
Here's my full code so far:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id = 'searchResults')
#"search results" is just what the table i was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
#this says find all the links and looop over them
pdf_link_string = str(pdf_link)
#turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
if 'pdf' in pdf_link_string:
#some links in the table are .html and I don't want those, I just want the pdfs.
end_of_link = pdf_link_string.index('.pdf')
#I want to know where the .pdf file extension ends because that's the end of the link, so I'll slice backward from there
just_the_link = end_of_link[9:end_of_link]
#here, the first 9 characters are junk "a href = yadda yadda yadda". So I'm setting a variable that starts just after that junk and goes to the .pdf (I realize that I will actualy have to do .pdf + 3 or something to actually get to the end of string, but this makes it easier for now).
print just_the_link
#I debug by print statement because I'm an amatuer
the line (Second from the bottom) that reads:
just_the_link = end_of_link[9:end_of_link]
returns an error (TypeError: 'int' object is unsubscriptable
)
also, the ":" should be hyper text transfer protocol colon, but it won't let me post that b/c newbs can't post more than 2 links so I took them out.
Upvotes: 1
Views: 598
Reputation: 4617
Sounds like a job for regular expressions:
import urllib, urllib2, re
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id = 'searchResults')
#"search results" is just what the table i was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
#this says find all the links and looop over them
pdf_link_string = str(pdf_link)
#turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
if 'pdf' in pdf_link_string:
pdfURLPattern = re.compile("""://(\w+/)+\S+.pdf""")
pdfURLMatch = pdfURLPattern.search(line)
#If there is no match than search() returns None, otherwise the whole group (group(0)) returns the URL of interest.
if pdfURLMatch:
print pdfURLMatch.group(0)
Upvotes: 0
Reputation: 184200
just_the_link = end_of_link[9:end_of_link]
This is your problem, just like the error message says. end_of_link
is an integer -- the index of ".pdf" in pdf_link_string
, which you calculated in the preceding line. So naturally you can't slice it. You want to slice pdf_link_string
.
Upvotes: 1