user5509289


Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"

I'm trying to run this function within a function based loosely off of this. However, since xPDF can convert PDFs to PNGs directly, I skipped the ImageMagick conversion step. I also dropped the faulty logic in the function(i) process: pdftopng requires a root name, which is "ocrbook" in this case (so the first output file is "ocrbook-000001.png"), and the original code throws an error when it looks for a PNG named after the original PDF.

My issue is now with getting Tesseract to do anything with my PNG files. I get the error:

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.

Here is my code:

lapply(myfiles, function(i){
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  lapply(mypngs, function(z){
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(paste0(z))
  })
})

Upvotes: 3

Views: 2637

Answers (2)

Russ

Reputation: 1431

Background

It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.

In my case I was converting a bunch of faxes (about 3000 individual PDF files, most of them between 1 and 15 pages) to text. I used an apply function to store the text from each fax as a separate entry in a list (length = number of faxes = ~3000). The list was then converted to a vector, and that vector was combined with a vector of file names to make a data frame. Finally, I wrote the data frame to a CSV file. (See below for the code I used.)

The problem was I kept getting the same string of errors that you got:

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.

Followed by this error: error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid

What I think the problem is

What was weird for me was that I re-ran the code multiple times and the error always occurred on a different fax. It also seemed to occur more often when I was doing something else that used a lot of RAM or CPU (opening Microsoft Teams, etc.). I tried changing the DPI as suggested in the first answer, and that didn't seem to work.

I also noticed that while this code was running, the machine was regularly at close to 100% RAM and 50% CPU usage (according to Windows Task Manager).

When I ran this process (on a similar batch of about 3,000 faxes) on a Linux machine with significantly more RAM and CPU, I never encountered this problem.

basic_string::_M_construct null not valid appears to be a C++ error. I'm not familiar with C++, but it sounds like a bit of a catch-all error that might indicate that something that should have been created wasn't created.

Based on all that, I think the problem is that R runs out of memory, and in response the memory available to some of the underlying tesseract processes somehow gets throttled. That leaves too little memory to convert a PDF to a PNG and then extract the text, which is what throws these errors. As a result, a text blob is not created where one is expected, which leads to the final C++ error: basic_string::_M_construct null not valid. It's possible that lowering the DPI is what gave your process enough memory to complete, but maybe the fundamental underlying problem was the memory, not the DPI.
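To make the link between DPI and memory concrete, here is a rough back-of-the-envelope sketch. It assumes a US Letter page (8.5 x 11 inches) and 4 bytes per pixel for the decoded image; both numbers are assumptions for illustration, not something measured on the faxes above, and page_megabytes is a made-up helper name.

```r
# Rough size of one uncompressed page image at a given DPI.
# Assumptions (not from the post): US Letter page, 4 bytes per pixel.
page_megabytes <- function(dpi, width_in = 8.5, height_in = 11, bytes_per_pixel = 4) {
  pixels <- (width_in * dpi) * (height_in * dpi)
  pixels * bytes_per_pixel / 1024^2
}

# Compare the DPI values mentioned in this thread
round(page_megabytes(c(150, 300, 600)), 1)
```

Under those assumptions a single page needs roughly 8 MB at 150 DPI, 32 MB at 300 DPI and 128 MB at 600 DPI. The pixel count grows with the square of the DPI, so going from 150 to 600 multiplies the memory per page by 16.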

Possible workarounds

So, I'm not sure about any of what I just said, but running with that assumption, here are some ideas I came up with for people running the tesseract package in R who encounter similar problems:

  1. Switch from RStudio to Rgui. This alone solved my problem. I was able to complete the whole 3000-fax process without any errors using Rgui. Rgui also used between 100 and 400 MB of RAM instead of the 1000+ MB that RStudio used, and about 25% of CPU instead of 50%. Putting R in the path and running R from the console, or running R in the background, might reduce memory use even further.

  2. Close any memory-intensive processes while the code is running. Microsoft Teams, videoconferencing, streaming, Docker on Windows and the Windows Subsystem for Linux are all huge memory hogs.

  3. Lower the DPI. As suggested by the first answer, this would also probably reduce memory use.

  4. Break the process up. I think running my process in batches of about 500 might also have reduced the amount of working memory R has to hold before writing to file.

These are all quick and easy solutions that can be done from R without having to learn C++ or upgrade hardware. A more durable solution probably would require doing more customization of the tesseract parameters, implementing the process in C++, changing memory allocation settings for R and the operating system, or buying more RAM.
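As a minimal sketch of idea 4, the batching could look something like the following. The helper names make_batches and run_in_batches, the batch file naming, and the explicit gc() call are all assumptions for illustration; ocr_fun stands for any function that takes one file path and returns one string, such as the ocr2() function in the example code below.

```r
# Split a vector of file paths into consecutive batches of at most `batch_size`.
make_batches <- function(paths, batch_size = 500) {
  if (length(paths) == 0) return(list())
  unname(split(paths, ceiling(seq_along(paths) / batch_size)))
}

# Process one batch at a time and write each batch to its own CSV file,
# so only one batch of extracted text is held in memory at once.
run_in_batches <- function(paths, ocr_fun, out_dir = "finished_data", batch_size = 500) {
  dir.create(out_dir, showWarnings = FALSE)
  batches <- make_batches(paths, batch_size)
  for (b in seq_along(batches)) {
    # ocr_fun must return exactly one string per file
    texts <- vapply(batches[[b]], ocr_fun, character(1), USE.NAMES = FALSE)
    out <- data.frame(file_name = batches[[b]], file_text = texts)
    write.csv(out, file.path(out_dir, sprintf("faxes_batch_%03d.csv", b)), row.names = FALSE)
    rm(texts, out)
    gc()  # ask R to release memory it no longer needs
  }
  invisible(length(batches))
}
```

With about 3000 faxes and the default batch size this would write six CSV files, which can be combined afterwards. Whether it actually lowers peak memory use enough to avoid the error is untested here.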

Example Code

# Load libraries
library(tesseract)

# Create the output folder (no warning if it already exists)
dir.create("finished_data", showWarnings = FALSE)

# Define functions
ocr2 <- function(pdf_path) {

  # Tell tesseract which language to use
  eng <- tesseract("eng")

  # Convert to png first
  # pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)

  # Tell tesseract to OCR the pdf at pdf_path
  separated_pages <- tesseract::ocr(pdf_path, engine = eng)

  # Combine all the pages into one string
  combined_pages <- paste(separated_pages, collapse = "**new page**")

  # I delete png files as I go to avoid overfilling the hard drive
  # because work computer has no hard drive space :'(
  png_file_paths <- list.files(pattern = "png$")
  file.remove(png_file_paths)

  combined_pages
}

# Find pdf paths
fax_file_paths <- list.files(path = "./raw_data",
                             pattern = "pdf$",
                             recursive = TRUE)

# This converts all the pdfs to text using the OCR
faxes <- lapply(paste0("./raw_data/", fax_file_paths), ocr2)

# One row per fax: file name and extracted text
fax_table <- data.frame(file_name = fax_file_paths, file_text = unlist(faxes))

write.csv(fax_table,
          file = paste0("./finished_data/faxes_", format(Sys.Date(), "%b-%d-%Y"), "_test.csv"),
          row.names = FALSE)
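Because the error showed up on a different fax each run, one more option is to keep a single failure from aborting the whole lapply() call. The sketch below is an addition, not part of the workflow I actually ran: safely_ocr is a hypothetical helper, and it only helps if the failure reaches R as an ordinary R error (as the "error in FUN(X[[i]], ...)" message suggests).

```r
# Wrap an OCR function so that an error on one file is recorded as NA
# instead of stopping the whole run.
safely_ocr <- function(ocr_fun) {
  function(path) {
    tryCatch(
      ocr_fun(path),
      error = function(e) {
        message("OCR failed for ", path, ": ", conditionMessage(e))
        NA_character_
      }
    )
  }
}

# Usage with the code above (commented out because it needs the fax files):
# faxes  <- lapply(paste0("./raw_data/", fax_file_paths), safely_ocr(ocr2))
# failed <- fax_file_paths[is.na(unlist(faxes))]
```

The files listed in failed could then be re-run on their own, ideally after closing other memory-hungry programs.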

Upvotes: 0

user5509289


The issue, apparently, was that the DPI was set too high for Tesseract to handle. Changing the pdftopng DPI parameter from 600 to 150 appears to have corrected it. There seems to be a maximum DPI that Tesseract can work with.

I have also changed my code from a static naming convention to a more dynamic one that mirrors the original file names.

  dest <- "C:\\users\\YOURNAME\\desktop"

  # Convert the first 10 pages of each PDF to PPM images at 150 DPI
  files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
  lapply(files, function(i){
    shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i, ".pdf", " ", i)))
  })

  # Convert each PPM to TIFF with ImageMagick, then delete the PPM
  myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.ppm$", full.names = TRUE))
  lapply(myppms, function(y){
    shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
    file.remove(paste0(y, ".ppm"))
  })

  # Run Tesseract on each TIFF (writes <name>.txt), then delete the TIFF
  mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.tif$", full.names = TRUE))
  lapply(mytiffs, function(z){
    shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
    file.remove(paste0(z, ".tif"))
  })
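One thing to watch with the code above is that the file paths are pasted into the command unquoted, so a folder or file name containing a space would likely split into separate arguments. A possible variant, assuming tesseract is on the PATH, is to use system2() and quote each path on its own. The helper name tesseract_cli and its run argument are made up for this sketch.

```r
# Build (and optionally run) one tesseract call, quoting each path separately.
# run = FALSE only returns the argument vector, which is handy for checking it.
tesseract_cli <- function(image_path, out_base, run = TRUE) {
  args <- c(shQuote(image_path), shQuote(out_base))
  if (run) system2("tesseract", args = args)
  invisible(args)
}

# Inspect the arguments for a path that contains spaces, without running anything
print(tesseract_cli("C:/my files/page 1.tif", "C:/my files/page 1", run = FALSE))
```

Inside the last lapply() this would replace the shell() line with tesseract_cli(paste0(z, ".tif"), z).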

Upvotes: 1
