Reputation: 339
What I want to do: Read text from pdf files in a specific folder. There are multiple PDF files in this folder, and I have to loop through them to retrieve the text from each.
Problem I am having: My program is not getting the proper File information, and it is returning the error Invalid Argument. The error message is just this.
Language: Google Apps Script
My code:
//Folder ID
var myFolderID = "XXXXXXXXXXXXXXXXXXXXXXXXX";
/**
 Get PDF files from the folder and return them in an array
*/
function GetPdfFiles(){
  var pdfFiles = [];
  var files = DriveApp.getFolderById(myFolderID).getFiles();
  while(files.hasNext())
  {
    var file = files.next();
    //retrieve only pdf files (non-pdf files need to be ignored)
    if(file.getName().indexOf("pdf") >= 1)
    {
      //Add to the array the file data
      pdfFiles.push(file);
    }
  }
  return pdfFiles;
}
/**
 Do some operations to each PDF file
*/
function DoSomeOperations(pdfFiles){
  for(var i = 0; i < pdfFiles.length; i++)
  {
    //The below line of code doesn't work
    var doc = DocumentApp.openByUrl(pdfFiles[i].getUrl()); /*Error*/
    //I also tried the below code instead of the above line of code
    var doc = DocumentApp.openById(pdfFiles[i].getId()); /*Error*/
    /*Ideally, do some operation to each PDF file here */
    /*I was hoping to use something like this: */
    var textFromPdfFile = doc.getBody().getText();
    /*But I cannot get this "doc" in the first place.*/
  }
}
function Main(){
  var pdfFiles = GetPdfFiles();
  DoSomeOperations(pdfFiles);
}
Can someone tell me what I am doing wrong?
Edit: I logged the results of getId() and getUrl(), and both return a value. But it seems like it is not the actual ID or URL... I don't know what is going on.
Upvotes: 1
Views: 130
Reputation: 201398
You want to retrieve the text with doc.getBody().getText() from files whose mimeType is application/pdf. If my understanding is correct, how about this modification? Please think of this as just one of several answers.
A PDF file cannot be opened directly with DocumentApp.openByUrl() and DocumentApp.openById(). In this case, the PDF file first has to be converted to a Google Document. After that, it can be opened with DocumentApp.openByUrl() and DocumentApp.openById().
In this modified script, the Files: copy method of the Drive API is used. So before you use this script, please enable the Drive API at Advanced Google services.
//Folder ID
var myFolderID = "XXXXXXXXXXXXXXXXXXXXXXXXX";
/**
 Convert the PDF files in the folder to Google Documents and return the IDs of the converted files in an array
*/
function GetPdfFiles(){
  var pdfFiles = [];
  var files = DriveApp.getFolderById(myFolderID).getFiles();
  while(files.hasNext())
  {
    var file = files.next();
    //retrieve only pdf files (non-pdf files need to be ignored)
    if (file.getMimeType() == MimeType.PDF) { // Check the mimeType.
      // Convert PDF file to Google Document
      var id = Drive.Files.copy({mimeType: MimeType.GOOGLE_DOCS}, file.getId()).id;
      //Add to the array the ID of the converted file
      pdfFiles.push(id);
    }
  }
  return pdfFiles;
}
/**
 Do some operations to each converted file
*/
function DoSomeOperations(pdfFiles){
  for(var i = 0; i < pdfFiles.length; i++)
  {
    // var doc = DocumentApp.openByUrl(pdfFiles[i].getUrl()); // This is not used.
    // pdfFiles[i] is the ID of the converted Google Document, so it can be opened directly.
    var doc = DocumentApp.openById(pdfFiles[i]);
    // Do some operation to each converted file here, for example retrieve its text.
    var textFromPdfFile = doc.getBody().getText();
    // Drive.Files.remove(pdfFiles[i]); // If you want to delete the converted file, please use this line.
  }
}
function Main(){
  var pdfFiles = GetPdfFiles();
  DoSomeOperations(pdfFiles);
}
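As a variation on the script above, the convert, read, and delete steps can be kept together per file, so that the converted Google Document is removed even when reading it fails. This is only a minimal sketch: the function name readPdfText is hypothetical, it uses the same Drive.Files.copy, DocumentApp.openById, and Drive.Files.remove calls as the script above, and it assumes the Drive API is enabled at Advanced Google services.

```javascript
/**
 * Hypothetical helper (not part of the original script):
 * convert one PDF to a temporary Google Document, return its text,
 * and delete the temporary document afterwards.
 */
function readPdfText(pdfFileId) {
  // Copy the PDF as a Google Document, as in GetPdfFiles() above.
  var docId = Drive.Files.copy({mimeType: MimeType.GOOGLE_DOCS}, pdfFileId).id;
  try {
    // The converted copy can be opened by DocumentApp, unlike the PDF itself.
    return DocumentApp.openById(docId).getBody().getText();
  } finally {
    // Runs whether or not reading succeeded, so no converted copies pile up.
    Drive.Files.remove(docId);
  }
}
```

With this helper, the loop would pass each PDF file ID (file.getId()) to readPdfText() instead of collecting the IDs of converted documents first.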
If I misunderstood your question and this was not the direction you want, I apologize.
If my understanding is correct, how about this modification? Please modify my script above as follows.
Did you use the line Drive.Files.remove(pdfFiles[i]); in my script above? When this line is used, the converted Google Document is always deleted. In this case, Google Document files with the same filename do not accumulate. How about this?
If you don't want to use Drive.Files.remove(pdfFiles[i]);, how about the following modification? Please modify the function GetPdfFiles() as follows. With this modification, the PDF file is not converted when a Google Document with the same filename already exists.
From:
// Convert PDF file to Google Document
var id = Drive.Files.copy({mimeType: MimeType.GOOGLE_DOCS}, file.getId()).id;
//Add to the array the file data
pdfFiles.push(id);
To:
var existingFile = DriveApp.getFilesByName(file.getName().split(".")[0]);
if (!(existingFile.hasNext() && existingFile.next().getMimeType() == MimeType.GOOGLE_DOCS)) {
  // Convert PDF file to Google Document
  var id = Drive.Files.copy({mimeType: MimeType.GOOGLE_DOCS}, file.getId()).id;
  //Add to the array the file data
  pdfFiles.push(id);
}
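Please note three points about the check above: split(".")[0] cuts a name like report.v2.pdf down to report, only the first file with a matching name is inspected, and an already converted document is not added to the array, so its text is not read. The following is a hedged sketch of one way to handle these. The helper names stripExtension, findGoogleDocIdByName, and getOrCreateDocId are hypothetical, and it assumes, as the snippet above does, that the converted Google Document is named after the PDF without its extension. Like the snippet above, DriveApp.getFilesByName() searches the whole Drive, not only the folder.

```javascript
// Hypothetical helper: drop only the last extension,
// so "report.v2.pdf" becomes "report.v2" instead of "report".
function stripExtension(fileName) {
  var lastDot = fileName.lastIndexOf(".");
  return lastDot > 0 ? fileName.slice(0, lastDot) : fileName;
}

// Hypothetical helper: look through every file with this name and return
// the ID of the first Google Document, or null when there is none.
function findGoogleDocIdByName(name) {
  var matches = DriveApp.getFilesByName(name);
  while (matches.hasNext()) {
    var candidate = matches.next();
    if (candidate.getMimeType() == MimeType.GOOGLE_DOCS) {
      return candidate.getId();
    }
  }
  return null;
}

// Hypothetical helper: reuse an existing converted document if one is found,
// otherwise convert the PDF now. Either way an ID is returned.
function getOrCreateDocId(file) {
  var existingId = findGoogleDocIdByName(stripExtension(file.getName()));
  if (existingId !== null) {
    return existingId;
  }
  return Drive.Files.copy({mimeType: MimeType.GOOGLE_DOCS}, file.getId()).id;
}
```

In GetPdfFiles(), the convert-and-push lines could then become pdfFiles.push(getOrCreateDocId(file)); so that DoSomeOperations() still receives one document ID per PDF file.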
Upvotes: 2