Reputation: 103
RStudio V1.0.153
This will be a long post, so I appreciate anyone who has the patience to read it and offer suggestions. I'm building a database of ~110 observations, and one section of it requires data that is unfortunately only available in PDF format. I'm new to R, but thought I'd take a stab at this. I'd rather try it this way than go through hundreds of pages of PDFs and enter the data of interest manually.
Here is the source data in PDF format (PDF Pathology Report), which I'd like to convert to Excel format as shown here (Sample Excel Format). Basically, my goal is to separate the "meat" of this path report from the bones as easily as possible. I understand some cleanup will always be necessary, though!
So far, I have converted the PDF to PNG using an open-source website and then used the tesseract package, which returned a character vector of length 1 that I assigned to the object "path". Then I used the tokenizers package:
words <- tokenize_words(X, lowercase = TRUE)
dput(words)
c("appropriate", "controls", "specimen", "1", "2", "old", "liver",
"explant", "posit", "ve", "for", "malignancy", "hepatocellular",
"carcinoma", "see", "synoptic", "report", "below", "advanced",
"stage", "chronic", "liver", "disease", "fibrosis", "staging"
)
I just don't know where to go from here. Perhaps there is a function in the tm package that can be used to weed out phrases of interest and the 3-4 words following the phrase that will have the description of interest?
Any advice would be appreciated!
Upvotes: 0
Views: 250
Reputation: 20811
I don't know of a specific tool, but what you described is pretty easy to do with regular expressions:
weed out phrases of interest and the 3-4 words following the phrase
# words <- tokenize_words(X, lowercase = TRUE)
words <-
c("appropriate", "controls", "specimen", "1", "2", "old", "liver",
"explant", "posit", "ve", "for", "malignancy", "hepatocellular",
"carcinoma", "see", "synoptic", "report", "below", "advanced",
"stage", "chronic", "liver", "disease", "fibrosis", "staging"
)
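f <- function(x, phrase, n_words = 3L, upto = NULL) {
  # collapse the tokens back into one space-separated string
  x <- paste0(x, collapse = ' ')
  # one word plus any trailing whitespace
  word <- '\\b\\w+\\b\\s*'
  # capture the words after phrase: everything before upto, or at most n_words
  p <- if (!is.null(upto))
    sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
  else sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)
  # the trailing '|.' matches every other character; group 1 is empty there, so it is dropped
  trimws(gsub(p, '\\1', x))
}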
paste0(words, collapse = ' ')
# "appropriate controls specimen 1 2 old liver explant posit ve for malignancy
# hepatocellular carcinoma see synoptic report below advanced stage chronic
# liver disease fibrosis staging"
f(words, 'carcinoma')
# [1] "see synoptic report"
f(words, 'old liver', 10)
# [1] "explant posit ve for malignancy hepatocellular carcinoma see synoptic report"
f(words, 'old liver', upto = 'carcinoma')
# [1] "explant posit ve for malignancy hepatocellular"
where n_words is the number of words returned after phrase is matched; upto will basically return everything between phrase and upto.
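Since the end goal is a spreadsheet-like table, here is a minimal sketch of how several extractions could be collected into one row. The column names (carcinoma_note, liver_disease_note) and the choice of phrases are hypothetical, picked only for illustration; f is repeated from above so the snippet runs on its own.

```r
# Token vector from the question (the output of tokenize_words)
words <- c("appropriate", "controls", "specimen", "1", "2", "old", "liver",
           "explant", "posit", "ve", "for", "malignancy", "hepatocellular",
           "carcinoma", "see", "synoptic", "report", "below", "advanced",
           "stage", "chronic", "liver", "disease", "fibrosis", "staging")

# f() repeated from above, unchanged
f <- function(x, phrase, n_words = 3L, upto = NULL) {
  x <- paste0(x, collapse = ' ')
  word <- '\\b\\w+\\b\\s*'
  p <- if (!is.null(upto))
    sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
  else sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)
  trimws(gsub(p, '\\1', x))
}

# Hypothetical column names, each mapped to the phrase that introduces the value
fields <- c(carcinoma_note     = 'carcinoma',
            liver_disease_note = 'chronic liver disease')

# Run one extraction per field and bind the results into a one-row data frame
report_row <- as.data.frame(
  lapply(fields, function(p) f(words, p)),
  stringsAsFactors = FALSE
)

report_row$carcinoma_note
# [1] "see synoptic report"
report_row$liver_disease_note
# [1] "fibrosis staging"
```

Running this once per report and stacking the rows with rbind would give a table that can be written out with write.csv and opened in Excel. OCR breaks such as "posit ve" will still need manual cleanup.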
Upvotes: 1