David

Reputation: 149

How to scrape a particular text from PDF using Selenium with VBA

I am working on an automation project that opens a browser, visits a URL, logs in, clicks a few links, and finally clicks a link that opens a PDF file in the browser itself. I now want to pull one line of text from that PDF into Excel as a string.

I have used the code below, courtesy of its author on GitHub. With it I can only scrape the first line of the PDF. The PDF I use is dynamic: sometimes the information I need is on the 5th line, sometimes on the 25th, and so on.

I hope I have explained this clearly; pardon any errors.

Private Sub Handle_PDF_Chrome()
Dim driver As New ChromeDriver
driver.Get "http://static.mozilla.com/moco/en-US/pdf/mozilla_privacypolicy.pdf"

' Return the first line using the plugin API (asynchronous).
Const JS_READ_PDF_FIRST_LINE_CHROME As String = _
"addEventListener('message',function(e){" & _
" if(e.data.type=='getSelectedTextReply'){" & _
"  var txt=e.data.selectedText;" & _
"  callback(txt && txt.match(/^.+$/m)[0]);" & _
" }" & _
"});" & _
"plugin.postMessage({type:'initialize'},'*');" & _
"plugin.postMessage({type:'selectAll'},'*');" & _
"plugin.postMessage({type:'getSelectedText'},'*');"

' Assert the first line
Dim firstline
firstline = driver.ExecuteAsyncScript(JS_READ_PDF_FIRST_LINE_CHROME)
Assert.Equals "Websites Privacy Policy", firstline

driver.Quit
End Sub

Upvotes: 0

Views: 767

Answers (1)

QHarr

Reputation: 84475

Assuming your code works, you need to change the regex and the index.

The regex becomes

[^\r\n]+

to retrieve all lines (ignoring empty lines) when matched with the global flag. The resulting array is zero-based, so you then index with 4 to get line 5.
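As a quick illustration of the regex and the zero-based index, here is a minimal sketch that runs on a made-up string instead of a real PDF. The names `sampleText` and `nthLine` are hypothetical and only stand in for `e.data.selectedText` and the indexing step.

```javascript
// Hypothetical sample text standing in for e.data.selectedText.
const sampleText = "Title\r\n\r\nSecond line\nThird line\n\nFourth line\nFifth line";

// Return the n-th non-empty line (1-based), or null if there are fewer lines.
function nthLine(text, n) {
  // match() with the g flag returns every non-empty line, or null if none.
  const lines = (text || "").match(/[^\r\n]+/g) || [];
  return n >= 1 && n <= lines.length ? lines[n - 1] : null;
}

console.log(nthLine(sampleText, 5));  // Fifth line
console.log(nthLine(sampleText, 25)); // null
```

Returning null for a missing line avoids the error that indexing a null match result would otherwise raise.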

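Regex explanation:

[^\r\n] is a negated character class that matches any single character except a carriage return or a line feed. The + quantifier repeats it one or more times, so each match is one non-empty line.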

addEventListener('message',function(e){
 if(e.data.type=='getSelectedTextReply'){
  var lines=(e.data.selectedText||'').match(/[^\r\n]+/g)||[];
  callback(lines[4]);
 }
});
plugin.postMessage({type:'initialize'},'*');
plugin.postMessage({type:'selectAll'},'*');
plugin.postMessage({type:'getSelectedText'},'*');
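Because the question says the wanted line moves around (5th line in one PDF, 25th in another), a fixed index may not be enough. One possible alternative, sketched here under assumptions, is to search the lines for a known label. The helper `findLine`, the sample text, and the label "Invoice No" are all hypothetical.

```javascript
// Return the first non-empty line matching `pattern`, or null when none does.
// Pass a regex WITHOUT the g flag, since test() on a global regex keeps state.
function findLine(text, pattern) {
  const lines = (text || "").match(/[^\r\n]+/g) || [];
  const hit = lines.find(function (line) { return pattern.test(line); });
  return hit === undefined ? null : hit;
}

// Hypothetical sample text; "Invoice No" is a made-up label.
const pdfText = "Header\n\nCustomer: ACME\nInvoice No: 12345\nTotal: 10.00";
console.log(findLine(pdfText, /^Invoice No/)); // Invoice No: 12345
```

If this helper were included in the injected script string, the callback line could become `callback(findLine(e.data.selectedText, /^Invoice No/));`, with the pattern adjusted to whatever text reliably marks the line you need.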

Upvotes: 2
