pranav nerurkar

Reputation: 648

Extract text between two words from all files in a folder in R

I have a folder with many .txt files. I want to read all the files and then extract text from each file located between two words and store them in a .csv file.

Text to be extracted is always between two words

IMPRESSION:  "text to be extracted"  (Dr. Deepak Bhatt)

OR

IMPRESSION : "text to be extracted"  (Dr. Deepak Bhatt)

The code I wrote below is not extracting text from all files. How do I solve this?

    names <- list.files(path = "C:\\Users\\Admin\\Downloads\\data\\data",
                        pattern = "*.txt", all.files = FALSE,
                        full.names = FALSE, recursive = FALSE,
                        ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

    all.names <- lapply(names, readFn)

    readFn <- function(i)
    {
        file <- read_file(i)

        file <- gsub("[\r\n\t]", " ", file)

        extracted_txt <- rm_between(file,
                                    'IMPRESSION :', '(Dr. Deepak Bhatt)',
                                    extract = TRUE, trim = TRUE, clean = TRUE)

        if (is.na(extracted_txt))
        {
            extracted_txt <- rm_between(file,
                                        'IMPRESSION:', '(Dr. Deepak Bhatt)',
                                        extract = TRUE, trim = TRUE, clean = TRUE)
        }
    }


    output <- do.call(rbind,all.names)
    name_of_file <- sub(".txt","",names)
    final_output <- cbind(name_of_file,output)
    colnames(final_output) <- c('filename','text')
    write.csv(final_output,"final_output.csv",row.names=F)

EXAMPLE 1: filename = 15-1-2011.txt

The optic nerve is normal.


There is diffuse enlargement of the lacrimal gland (more marked on the left side).

IMPRESSION:

Bilateral diffuse irregular enlargement of the lacrimal gland is due to inflammatory enlargement (? Sjogerns syndrome).
The left gland is more enlarged than right.
No mass lesion or cystic lesion noted.
No evidence of retinal detachment.


(Dr. Deepak Bhatt)

(B-Scan findings are interpretation of echoes and need to be correlated clinically)
#

EXAMPLE 2: filename = 1-12-48.txt

The ciliary body and ciliary process are normal in position and texture.

There is marked steching of the zonules.


IMPRESSION :

Left sided marked stretching of the zonules noted from 2 to 6 O’clock position.
There is absence of zonules at 3 O’clock position.
The angle is normal and the ciliary body, processes are normal in position.


(Dr. Deepak Bhatt)

(UBM findings are interpretation of echoes and need to be correlated clinically) 
OBJECTIVE:
OUTPUT file: final_output.csv

15-1-2011      Bilateral diffuse.....retinal detachment.

1-12-48        Left sided marked stretching of the zonules ...in  position.

Upvotes: 3

Views: 438

Answers (1)

h3rm4n

Reputation: 4187

You can use gsub for that:

    text_between_words <- "IMPRESSION:  text to be extracted  (Dr. Deepak Bhatt)"
    gsub('IMPRESSION:\\s+(.*)\\s+\\(.*\\)', '\\1', text_between_words)

The result:

    [1] "text to be extracted "

Or in combination with trimws:

    trimws(gsub('IMPRESSION:(.*)\\(.*\\)', '\\1', text_between_words))

The result of that:

    [1] "text to be extracted"

If there is sometimes a space between IMPRESSION and :, you can adapt the code to:

    text_between_words2 <- "IMPRESSION :  text to be extracted  (Dr. Deepak Bhatt)"
    trimws(gsub('IMPRESSION\\s{0,1}:(.*)\\(.*\\)', '\\1', text_between_words2))

As you can see, I added \\s{0,1} between IMPRESSION and :. This matches zero or one whitespace character between IMPRESSION and :, so both spellings are covered. The result of that:

    [1] "text to be extracted"
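
Because gsub is vectorized, the same pattern can be checked against both header variants from the question in one call. This is only a small sketch; the vector `x` and the name `res` are made up for illustration. Note that `\\s{0,1}` is equivalent to the shorter `\\s?`.

```r
# Both header variants from the question, handled by a single pattern
x <- c("IMPRESSION:  text to be extracted  (Dr. Deepak Bhatt)",
       "IMPRESSION :  text to be extracted  (Dr. Deepak Bhatt)")

# \\s{0,1} allows an optional whitespace character before the colon
res <- trimws(gsub('IMPRESSION\\s{0,1}:(.*)\\(.*\\)', '\\1', x))
res
```

Both elements of `res` should come back as the same extracted text, with or without the space before the colon.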

To also drop any text before IMPRESSION and after the closing parenthesis, as requested in the comment below, extend the pattern with .* on both sides:

    text_between_words3 <- "Some Text before..... IMPRESSION: text to be extracted (Dr. Deepak Bhatt) text that should go too"
    trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', text_between_words3))

The result:

    [1] "text to be extracted"
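
One caveat with the generic `\\(.*\\)` delimiter: the capture group is greedy, so if the report contains more than one parenthesized remark (as the sample files in the question do), the pattern anchors on the last one. The sketch below uses a shortened, made-up report string to show the effect; `report` and `generic` are illustrative names.

```r
# Shortened stand-in for a report with several parenthesized remarks
report <- paste("Some findings. IMPRESSION: Enlargement noted (? Sjogerns syndrome).",
                "No mass lesion. (Dr. Deepak Bhatt)",
                "(B-Scan findings need clinical correlation)")

generic <- trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', report))

# The signature is still part of the captured text, because the greedy
# group ran up to the last opening parenthesis in the string
grepl("(Dr. Deepak Bhatt)", generic, fixed = TRUE)
```

A more specific closing delimiter avoids this over-capture.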

If it is only that specific name (Dr. Deepak Bhatt) in the text, you can also do:

    trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(Dr\\. Deepak Bhatt\\).*', '\\1', text_between_words3))
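
To apply this to every file in a folder, something along these lines might work. It is a base-R sketch, not a drop-in fix for the code in the question: `extract_impression` and `impressions_from_folder` are hypothetical helper names, and it assumes every report is signed with exactly "(Dr. Deepak Bhatt)". It uses `full.names = TRUE` so the files can be read regardless of the working directory, and it returns the extracted value explicitly from the helper.

```r
# Hypothetical helper: returns the text between the IMPRESSION header and the
# signature, or NA if the file does not contain both markers.
extract_impression <- function(path) {
  # Collapse the file into one string so the pattern can span several lines
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  pattern <- '.*IMPRESSION\\s{0,1}:(.*)\\(Dr\\. Deepak Bhatt\\).*'
  if (!grepl(pattern, txt)) return(NA_character_)
  out <- trimws(gsub(pattern, '\\1', txt))
  gsub('\\s+', ' ', out)  # squeeze whitespace runs left by blank lines
}

# Hypothetical wrapper: one row per .txt file in `folder`
impressions_from_folder <- function(folder) {
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  data.frame(
    filename = sub("\\.txt$", "", basename(files)),
    text = vapply(files, extract_impression, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}
```

With the folder from the question, the CSV could then be written with `write.csv(impressions_from_folder("C:/Users/Admin/Downloads/data/data"), "final_output.csv", row.names = FALSE)`. Files where the markers are missing end up with NA in the text column, which makes them easy to spot.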

Upvotes: 2
