Reputation: 775
I believe I have finally come up with a way to extract plain text without line breaks whilst retaining intended carriage returns from PDFs using VBA, Acrobat and Word Combined.
Previous answers using either word or acrobat independently ran into their own issues. Word would occasionally omit text interpreted as images, and Acrobat sometimes would not handle complex structures of PDFs and generate a blank text file.
Having tinkered with word, I realise that it has the option to generate plain text without linebreaks as shown below. Importantly the text generated retains intended carriage returns.
Acrobat does this automatically, too, when generating a plain text file; however, with the issue of unstructured PDFs, I think word is the better bet. And also likely more controllably with VBA.
By combining the two in VBA, I believe I have omitted many of the issues. The text files generated are much more than what I have been after for the past few days. i.e. sentences are not broken with line breaks.
The VBA code below works as follows:
Update: 21/12/22 The below code uses FileFormat:=wdFormatText
which maybe more straight forward.
Sub ConvertDocumentsToTxt()
'Updated by Extendoffice 20181123
Dim xIndex As Long
Dim xFolder As Variant
Dim xFileStr As String
Dim xFilePath As String
Dim xDlg As FileDialog
Dim xActPath As String
Dim xDoc As Document
Application.ScreenUpdating = False
Set xDlg = Application.FileDialog(msoFileDialogFolderPicker)
If xDlg.Show <> -1 Then Exit Sub
xFolder = xDlg.SelectedItems(1)
xFileStr = Dir(xFolder & "\*.doc")
xActPath = ActiveDocument.Path
While xFileStr <> ""
xFilePath = xFolder & "\" & xFileStr
If xFilePath <> xActPath Then
Set xDoc = Documents.Open(xFilePath, AddToRecentFiles:=False, Visible:=False)
xIndex = InStrRev(xFilePath, ".")
Debug.Print Left(xFilePath, xIndex - 1) & ".txt"
xDoc.SaveAs Left(xFilePath, xIndex - 1) & ".txt", FileFormat:=wdFormatText, AddToRecentFiles:=False
xDoc.Close True
End If
xFileStr = Dir()
Wend
Application.ScreenUpdating = True
End Sub
So far: (Updated now improved - Same as submitted answer) I have created the following working script in VBA, which achieves these two steps:
References, Acrobat
, and Microsoft Scripting Runtime
.
Sub LoopThroughFiles()
Dim StrFile As String
Dim pdfPath As String
StrFile = Dir("C:\temp\PDFs\")
fileRoot = "C:\temp\PDFs\"
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Do While Len(StrFile) > 0
Debug.Print StrFile
n = StrFile
pdfPath = fileRoot & StrFile
Debug.Print pdfPath
'Convert to WordDoc
success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".doc")
StrFile = Dir
On Error Resume Next
oWd.Quit
'Convert to PlainText
Debug.Print pdfPath & ".doc"
success2 = GetTextFromWord(pdfPath & ".doc", n)
Loop
End Sub
'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object, success As Boolean
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
If success Then
Application.Wait (Now + TimeValue("0:00:2")) 'Helps PC have some time to go through data, can cause PC to freeze without
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
AcroXAVDoc.Close False
End If
AcroXApp.Hide
AcroXApp.Exit
ConvertPdf2 = success 'report success/failure
End Function
Function GetTextFromWord(DocStr As String, n)
Dim filePath As String
Dim fso As FileSystemObject
Dim fileStream As TextStream
Dim oWd As Object, oDoc As Object, fileRoot As String
Const wdFormatText As Long = 2, wdCRLF As Long = 0
Set fso = New FileSystemObject
Set oWd = CreateObject("word.application")
fileRoot = "C:\temp\PDFs" 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Set oDoc = Nothing
On Error Resume Next 'ignore error if no document...
Set oDoc = oWd.Documents.Open(DocStr)
On Error GoTo 0 'stop ignoring errors
Debug.Print n
If Not oDoc Is Nothing Then
filePath = fileRoot & n & ".txt" 'filename
Debug.Print filePath
oDoc.SaveAs2 Filename:=filePath, _
FileFormat:=wdFormatText, LockComments:=False, Password:="", _
AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
:=False, SaveAsAOCELetter:=False, Encoding:=1252, InsertLineBreaks:=False _
, AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0
oDoc.Close False
End If
oWd.Quit
GetTextFromWord = success2
End Function
Please note I am not good at all with VBA; much of this is stitching together answers previously provided and trying to get it to loop through. I am hoping someone with VBA experience can review this and really make it more robust.
It does work, albeit quite slowly, to generate the doc files and then text files:
I hope someone familiar with VBA can help me make this solution more robust.
The files can be downloaded here: https://1drv.ms/u/s!AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=2b6DNg
Add to a PDF folder in temp, and the code should run okay.
Please let me know if you require any more information. I think this is it after a week of questions. Just the code needs tidying up.
Finally, if anyone who comes across this knows of any program that can generate plain text without inserting line breaks but retaining carriage returns, please let me know. Acrobat would be the solution and does work for most cases but has to generate tags on some PDFs, which has failed in my case. I would be very interested in an existing program that can in Batch convert PDFs in this way.
Upvotes: 0
Views: 957
Reputation: 1
Try using below:
strTemp = Replace(FromString, vbCr, " ")
strTemp = Replace(strTemp, vbLf, " ")
strTemp = Replace(strTemp, vbNewline," ")
I use the free tool xpf reader to convert a pdf.
Upvotes: 0
Reputation: 775
Improved Answer that enables word parameters
ChangeEncoding:=1252 to 65001 for unusual characters(Added below):
Sub LoopThroughFiles()
Dim StrFile As String
Dim pdfPath As String
StrFile = Dir("C:\temp\PDFs\")
fileRoot = "C:\temp\PDFs\"
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Do While Len(StrFile) > 0
Debug.Print StrFile
n = StrFile
pdfPath = fileRoot & StrFile
Debug.Print pdfPath
'Convert to WordDoc
success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".doc")
StrFile = Dir
On Error Resume Next
oWd.Quit
'Convert to PlainText
Debug.Print pdfPath & ".doc"
success2 = GetTextFromWord(pdfPath & ".doc", n)
Loop
End Sub
'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object, success As Boolean
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
If success Then
Application.Wait (Now + TimeValue("0:00:2")) 'Helps PC have some time to go through data, can cause PC to freeze without
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
AcroXAVDoc.Close False
End If
AcroXApp.Hide
AcroXApp.Exit
ConvertPdf2 = success 'report success/failure
End Function
Function GetTextFromWord(DocStr As String, n)
Dim filePath As String
Dim fso As FileSystemObject
Dim fileStream As TextStream
Dim oWd As Object, oDoc As Object, fileRoot As String
Const wdFormatText As Long = 2, wdCRLF As Long = 0
Set fso = New FileSystemObject
Set oWd = CreateObject("word.application")
fileRoot = "C:\temp\PDFs" 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Set oDoc = Nothing
On Error Resume Next 'ignore error if no document...
Set oDoc = oWd.Documents.Open(DocStr)
On Error GoTo 0 'stop ignoring errors
Debug.Print n
If Not oDoc Is Nothing Then
filePath = fileRoot & n & ".txt" 'filename
Debug.Print filePath
oDoc.SaveAs2 Filename:=filePath, _
FileFormat:=wdFormatText, LockComments:=False, Password:="", _
AddToRecentFiles:=False, WritePassword:="", ReadOnlyRecommended:=False, _
EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
:=False, SaveAsAOCELetter:=False, Encoding:=65001, InsertLineBreaks:=False _
, AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0
oDoc.Close False
End If
oWd.Quit
GetTextFromWord = success2
End Function
Upvotes: 1