MeeraWhy

Reputation: 103

RStudio or R: Text Mining to Excel Project

RStudio V1.0.153

This will be a long post, so I appreciate anyone with the patience to read it and offer suggestions. I'm building a database of ~110 observations, and part of it requires data that is unfortunately only available in PDF format. I'm new to R, but thought I'd take a stab at this. I'd rather try it this way than go through hundreds of pages of PDFs to enter the data of interest manually.

Here is the source of data in PDF format (PDF Pathology Report), which I want to convert to Excel format as shown here (Sample Excel Format). Basically, my goal is to separate the "meat" of this path report from the bones as easily as possible. I understand some cleanup will always be necessary, though!

So far, I have converted the PDF to PNG using an open-source website and then used the tesseract package, which returned a character vector of length 1 that I assigned to the object "path". Then I used the tokenizers package:

words <- tokenize_words(X, lowercase = TRUE)

dput(words)
c("appropriate", "controls", "specimen", "1", "2", "old", "liver", 
  "explant", "posit", "ve", "for", "malignancy", "hepatocellular", 
  "carcinoma", "see", "synoptic", "report", "below", "advanced", 
  "stage", "chronic", "liver", "disease", "fibrosis", "staging"
)

I just don't know where to go from here. Perhaps there is a function in the tm package that can be used to weed out phrases of interest and the 3-4 words following the phrase that will have the description of interest?

Any advice would be appreciated!

Upvotes: 0

Views: 250

Answers (1)

rawr

Reputation: 20811

I don't know of a specific tool, but what you described is pretty easy to do with regular expressions:

"weed out phrases of interest and the 3-4 words following the phrase"
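To make the idea concrete before the general function, here is a minimal sketch of the core trick on a shortened copy of the text. It uses a different mechanism from the function that follows (a PCRE lookbehind with regmatches() rather than gsub()), and the variable names txt, m, and following are only illustrative.

```r
# A shortened piece of the OCR text, already joined into one string.
txt <- "malignancy hepatocellular carcinoma see synoptic report below"

# (?<=carcinoma ) requires the match to start right after "carcinoma ";
# (\\w+\\s*){1,3} then takes up to three words (greedy, so three if available).
m <- regexpr("(?<=carcinoma )(\\w+\\s*){1,3}", txt, perl = TRUE)

# regmatches() pulls out the matched text; trimws() drops the trailing space.
following <- trimws(regmatches(txt, m))
following
# [1] "see synoptic report"
```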

# words <- tokenize_words(X, lowercase = TRUE)
words <- 
  c("appropriate", "controls", "specimen", "1", "2", "old", "liver", 
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular", 
    "carcinoma", "see", "synoptic", "report", "below", "advanced", 
    "stage", "chronic", "liver", "disease", "fibrosis", "staging"
  )


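f <- function(x, phrase, n_words = 3L, upto = NULL) {
  # collapse the tokens back into a single space-separated string
  x <- paste0(x, collapse = ' ')
  # one word plus any whitespace that follows it
  word <- '\\b\\w+\\b\\s*'

  # group 1 captures the words after phrase: either everything up to upto,
  # or at most n_words words; the trailing '|.' matches every other character
  p <- if (!is.null(upto))
    sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
  else sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)

  # keep only group 1; characters matched by '.' are replaced with nothing
  trimws(gsub(p, '\\1', x))
}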

paste0(words, collapse = ' ')
# "appropriate controls specimen 1 2 old liver explant posit ve for malignancy
# hepatocellular carcinoma see synoptic report below advanced stage chronic
# liver disease fibrosis staging"

f(words, 'carcinoma')
# [1] "see synoptic report"

f(words, 'old liver', 10)
# [1] "explant posit ve for malignancy hepatocellular carcinoma see synoptic report"

f(words, 'old liver', upto = 'carcinoma')
# [1] "explant posit ve for malignancy hepatocellular"

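where n_words is the maximum number of words returned after phrase is matched; if upto is given, the function instead returns everything between phrase and upto. Both phrase and upto are pasted into the regular expression, so they are treated as patterns rather than literal text, and the result is an empty string when phrase is not found.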
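As a rough sketch of how this could feed the spreadsheet you are after, the function can be called once per field and the results written to a CSV file, which Excel opens directly. The field names and phrases in fields are hypothetical placeholders rather than anything taken from your report, and write.csv() is just one base R option for the export step; f and words are repeated so the block runs on its own.

```r
# f() and words repeated from above so this block is self-contained.
f <- function(x, phrase, n_words = 3L, upto = NULL) {
  x <- paste0(x, collapse = ' ')
  word <- '\\b\\w+\\b\\s*'
  p <- if (!is.null(upto))
    sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
  else sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)
  trimws(gsub(p, '\\1', x))
}

words <-
  c("appropriate", "controls", "specimen", "1", "2", "old", "liver",
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular",
    "carcinoma", "see", "synoptic", "report", "below", "advanced",
    "stage", "chronic", "liver", "disease", "fibrosis", "staging")

# Hypothetical field definitions: one output column per entry, each holding
# the arguments passed on to f(). Adjust the phrases to your own reports.
fields <- list(
  specimen_text = list(phrase = 'old liver', upto = 'carcinoma'),
  diagnosis     = list(phrase = 'malignancy', n_words = 2L),
  grade         = list(phrase = 'cirrhosis')  # not in this text on purpose
)

# Run f() once per field; vapply() keeps the field names on the result.
row <- vapply(fields, function(args) do.call(f, c(list(x = words), args)),
              character(1))

# f() returns "" when the phrase is not found; store that as NA instead.
row[!nzchar(row)] <- NA_character_

# One report becomes one row; rbind() the rows of several reports as needed.
report <- as.data.frame(as.list(row), stringsAsFactors = FALSE)

# Write to a temporary CSV here; use a real path for actual output.
out_file <- tempfile(fileext = '.csv')
write.csv(report, out_file, row.names = FALSE)
```

Since OCR output such as "posit ve" will still need manual cleanup, it is probably worth checking the NA cells first: they mark reports where a phrase was not found at all.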

Upvotes: 1
