Reputation: 648
I have a folder with many .txt files. I want to read every file, extract the text located between two markers in each one, and store the results in a .csv file.
The text to be extracted always sits between the same two markers, in one of two forms:
IMPRESSION: "text to be extracted" (Dr. Deepak Bhatt)
OR
IMPRESSION : "text to be extracted" (Dr. Deepak Bhatt)
The code I wrote below does not extract text from all files. How do I solve this?
names <- list.files(path = "C:\\Users\\Admin\\Downloads\\data\\data",
                    pattern = "*.txt", all.files = FALSE,
                    full.names = FALSE, recursive = FALSE,
                    ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
all.names <- lapply(names,readFn)
readFn <- function(i)
{
  file <- read_file(i)
  file <- gsub("[\r\n\t]", " ", file)
  extracted_txt <- rm_between(file,
                              'IMPRESSION :', '(Dr. Deepak Bhatt)',
                              extract=TRUE, trim = TRUE, clean = TRUE)
  if(is.na(extracted_txt))
  {
    extracted_txt <- rm_between(file,
                                'IMPRESSION:', '(Dr. Deepak Bhatt)',
                                extract=TRUE, trim = TRUE, clean = TRUE)
  }
}
output <- do.call(rbind,all.names)
name_of_file <- sub(".txt","",names)
final_output <- cbind(name_of_file,output)
colnames(final_output) <- c('filename','text')
write.csv(final_output,"final_output.csv",row.names=F)
EXAMPLE 1: filename = 15-1-2011.txt
The optic nerve is normal.
There is diffuse enlargement of the lacrimal gland (more marked on the left side).
IMPRESSION:
Bilateral diffuse irregular enlargement of the lacrimal gland is due to inflammatory enlargement (? Sjogerns syndrome).
The left gland is more enlarged than right.
No mass lesion or cystic lesion noted.
No evidence of retinal detachment.
(Dr. Deepak Bhatt)
(B-Scan findings are interpretation of echoes and need to be correlated clinically)
#
EXAMPLE 2: 1-12-48.txt
The ciliary body and ciliary process are normal in position and texture.
There is marked steching of the zonules.
IMPRESSION :
Left sided marked stretching of the zonules noted from 2 to 6 O’clock position.
There is absence of zonules at 3 O’clock position.
The angle is normal and the ciliary body, processes are normal in position.
(Dr. Deepak Bhatt)
(UBM findings are interpretation of echoes and need to be correlated clinically)
#### objective
OUTPUT file: final_output.csv
15-1-2011 Bilateral diffuse.....retinal detachment.
1-12-48 Left sided marked stretching of the zonules ...in position.
Upvotes: 3
Views: 438
Reputation: 4187
You can use gsub for that:
text_between_words <- "IMPRESSION: text to be extracted (Dr. Deepak Bhatt)"
gsub('IMPRESSION:\\s+(.*)\\(.*\\)', '\\1', text_between_words)
The result:
[1] "text to be extracted "
Or in combination with trimws:
trimws(gsub('IMPRESSION:(.*)\\(.*\\)', '\\1', text_between_words))
The result of that:
[1] "text to be extracted"
If there is sometimes a space between IMPRESSION and the colon, you can adapt the code to:
text_between_words2 <- "IMPRESSION : text to be extracted (Dr. Deepak Bhatt)"
trimws(gsub('IMPRESSION\\s{0,1}:(.*)\\(.*\\)', '\\1', text_between_words2))
As you can see, I added \\s{0,1} between IMPRESSION and the colon. This matches zero or one whitespace character at that position, so both spellings are covered. The result of that:
[1] "text to be extracted"
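Because gsub is vectorised, the same pattern can be applied to both spellings in one call. A minimal sketch, reusing the two example strings from above (the variable names x and res are only for illustration):

```r
# Both spellings of the marker, as they appear in the question
x <- c("IMPRESSION: text to be extracted (Dr. Deepak Bhatt)",
       "IMPRESSION : text to be extracted (Dr. Deepak Bhatt)")

# One call handles every element; \\s{0,1} tolerates the optional space
res <- trimws(gsub("IMPRESSION\\s{0,1}:(.*)\\(.*\\)", "\\1", x))
print(res)
```

Both elements of res should come out as "text to be extracted".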
If there is also text before IMPRESSION and after the closing parenthesis that should be dropped (as requested in the comments), add .* on both sides of the pattern:
text_between_words3 <- "Some Text before..... IMPRESSION: text to be extracted (Dr. Deepak Bhatt) text that should go too"
trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', text_between_words3))
The result:
[1] "text to be extracted"
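In the sample files, the signature is followed by a second parenthesised note, so the greedy (.*) in the pattern above would run up to the last opening parenthesis and capture the signature as well. If it is only that specific name (Dr. Deepak Bhatt) in the text, you can anchor on it instead: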
trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(Dr. Deepak Bhatt\\).*', '\\1', text_between_words3))
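To get from a single string to the CSV asked for in the question, the name-anchored pattern can be wrapped in a small helper and applied to every file. The following is only a sketch in base R under a few assumptions: extract_impression is a hypothetical helper name, the dot in "Dr." is escaped, and the two files written to a temporary folder are shortened stand-ins for the real reports.

```r
# Hypothetical helper: returns the text between IMPRESSION and the signature
extract_impression <- function(path) {
  # Read the file and flatten its lines into one space-separated string
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  pattern <- ".*IMPRESSION\\s{0,1}:(.*)\\(Dr\\. Deepak Bhatt\\).*"
  # Return NA when the markers are missing, so every file yields one value
  if (!grepl(pattern, txt)) return(NA_character_)
  trimws(gsub(pattern, "\\1", txt))
}

# Demo data standing in for the real folder (assumption, not the real reports)
demo_dir <- file.path(tempdir(), "impression_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("The optic nerve is normal.", "IMPRESSION:",
             "No mass lesion (? cyst) noted.",
             "(Dr. Deepak Bhatt)", "(B-Scan findings need clinical correlation)"),
           file.path(demo_dir, "15-1-2011.txt"))
writeLines(c("The angle is normal.", "IMPRESSION :",
             "Marked stretching of the zonules.",
             "(Dr. Deepak Bhatt)", "(UBM findings need clinical correlation)"),
           file.path(demo_dir, "1-12-48.txt"))

# full.names = TRUE so the files can be read regardless of the working directory
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
final_output <- data.frame(
  filename = sub("\\.txt$", "", basename(files)),
  text = vapply(files, extract_impression, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
write.csv(final_output, file.path(demo_dir, "final_output.csv"), row.names = FALSE)
```

Pointing demo_dir at the real folder (and dropping the writeLines calls) should give one row per file. The helper ends with the extracted string as its value, so a file that has no match shows up as NA instead of silently disappearing from the result.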
Upvotes: 2