Reputation: 183
I have the following code snippet from page source:
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
The string 'PDFObject(' is unique on the page. I want to retrieve the url value using a regular expression. In this case I need to get
http://www.site.com/doc55.pdf
Please help.
Upvotes: 2
Views: 163
Reputation: 1035
Although the other answers may appear to work, most do not take into account that the only unique thing on the page is 'PDFObject('. A much better regular expression would be the following:
PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",
It takes into account that 'PDFObject(' is unique and contains some basic URL verification.
Below is an example of how this regex could be used in Python:
>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
... id: "pdfObjectContainer",
... width: "100%",
... height: "700px",
... pdfOpenParams: {
... navpanes: 0,
... statusbar: 1,
... toolbar: 1,
... view: "FitH"
... }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'
A pure Python (no regex) alternative would be:
>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'
No regex oneliner:
>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
Upvotes: 0
Reputation: 3429
If 'PDFObject(' is the unique identifier in the page, you only have to match the first quoted content that follows it.
Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:
import re
snippet = '''
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
'''
# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)
# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)
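# A compiled pattern already carries its flags; the second argument of
# its search() method is a start position, not a flags value.
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')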
# result for both: 'http://www.site.com/doc55.pdf'
If you don't want to compile your regex because it is used only once, simply use this syntax:
re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')
Four choices; one should match your needs and taste!
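To see why the non-greedy star matters here, the following minimal sketch compares the lazy pattern with a greedy variant on a shortened copy of the snippet. The names lazy and greedy are only illustrative.

```python
import re

# Shortened copy of the page source from the question.
snippet = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

# Lazy quantifiers stop at the first quoted string after 'PDFObject('.
lazy = re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)

# Greedy quantifiers run to the end of the text and backtrack,
# so they end up on the last quoted string instead.
greedy = re.search(r'PDFObject\(.*"(.*)"', snippet, re.S).group(1)

print(lazy)    # http://www.site.com/doc55.pdf
print(greedy)  # pdf_placeholder
```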
Upvotes: 0
Reputation: 97938
Here is an alternative for solving your problem without using regex:
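url, in_object = None, False
with open('input') as f:
    for line in f:
        # Once 'PDFObject(' has been seen, stay in "inside the object" mode
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            # The URL is the first double-quoted string on the line
            url = line.split('"')[1]
            break
print(url)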
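If the page source is already in memory rather than in a file, the same scan can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the name find_pdf_url is hypothetical, and io.StringIO stands in for the opened file.

```python
import io

def find_pdf_url(lines):
    """Return the first quoted value on a 'url:' line seen after 'PDFObject(', or None."""
    in_object = False
    for line in lines:
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            return line.split('"')[1]
    return None

# Shortened copy of the page source from the question.
page = io.StringIO('''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");''')

print(find_pdf_url(page))  # http://www.site.com/doc55.pdf
```

Returning None when nothing is found lets the caller tell "no PDFObject on this page" apart from a real URL.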
Upvotes: 3
Reputation: 46365
In order to be able to find "something that happens in the line after something else", you need to match things "including the newline". For this you use the DOTALL modifier, a flag passed when the regex is compiled.
Thus the following code works:
import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print(r.findall(s))
Explanation:
r = re.compile(    compile the regular expression
r'                 raw string literal, so backslashes reach the regex engine unchanged
(?<=PDFObject)     the match I want starts right after PDFObject
.*?                then there may be some other characters...
url:               followed by the string url:
.*?                then match as little as possible (`?` makes the match non-greedy) until the next part fits
(http.*?)"         capture from http up to (but not including) the first "
',                 end of the regex string, but there's more...
re.DOTALL)         set the DOTALL flag - this means the dot matches all characters
                   including newlines. This allows the match to continue from one line
                   to the next in the .*? right after the lookbehind
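As a quick check of the role of the flag, this sketch runs the same pattern with and without re.DOTALL on a shortened copy of the page source. The names page, without_flag and with_flag are only illustrative.

```python
import re

# Shortened copy of the page source from the question.
page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

pattern = r'(?<=PDFObject).*?url:.*?(http.*?)"'

# Without DOTALL the dot stops at the newline after "PDFObject({",
# so the pattern can never reach "url:" on the next line.
without_flag = re.findall(pattern, page)

# With DOTALL the dot also matches newlines, so the match can span lines.
with_flag = re.findall(pattern, page, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['http://www.site.com/doc55.pdf']
```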
Upvotes: 0
Reputation: 103754
This works:
import re
src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print([m.group(1).strip('"') for m in
       re.finditer(r'^url:\s*(.*)[\W]$',
                   re.search(r'PDFObject\(\{(.*)', src, re.M | re.S | re.I).group(1),
                   re.M | re.I)])
prints:
['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']
Upvotes: 0
Reputation: 23364
Using a combination of look-behind and look-ahead assertions (s holds the page source):
import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'
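A self-contained sketch of the same idea follows, with a guard for the case where nothing matches (re.search returns None in that case). The helper name find_url is hypothetical. Note that this pattern keys on url: alone and does not anchor on 'PDFObject('.

```python
import re

def find_url(source):
    """Return the text between 'url:' and the closing '",', or None if absent."""
    match = re.search(r'(?<=url:).*?(?=",)', source)
    if match is None:
        return None
    # The raw match still carries the leading space and opening quote.
    return match.group().strip('" ')

# Shortened copy of the page source from the question.
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(find_url(s))                # http://www.site.com/doc55.pdf
print(find_url("no match here"))  # None
```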
Upvotes: 0