CogNeuro123

Reputation: 33

How to extract text within brackets in Excel .CSV file in R?

I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.

so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day

Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.

Any ideas on how to do this? Or which function/package to use?

Edited to add: there are about 100 cells that contain text in this Excel sheet. I guess what confuses me is: how do I treat this entire CSV as a "string"?

Upvotes: 2

Views: 92

Answers (1)

jpsmith

Reputation: 17185

You could do this by vectorizing the seq function with Vectorize to build the indices of every word between each pair of brackets, then using stringr::word to extract the whole words at those indices:

Code

strng <- "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day"

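# Vectorized seq(): builds one from:to sequence per bracket pair
vecSeq <- Vectorize(seq.default, vectorize.args = c("from", "to"))

# Split into space-separated words once and reuse the result
words <- unlist(strsplit(strng, " "))
ixstart <- grep("\\[", words)  # words that open a bracketed span
ixend <- grep("\\]", words)    # words that close a bracketed span
spanish_ix <- unlist(vecSeq(ixstart, ixend, 1))
# Count words from the same split; counting "\\W+" runs miscounts when the
# text starts or ends with a bracket or contains apostrophes
english_ix <- setdiff(seq_along(words), spanish_ix)

spanish <- paste(stringr::word(strng, spanish_ix), collapse = " ")
english <- paste(stringr::word(strng, english_ix), collapse = " ")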

#> spanish
#[1] "[usualmente] [me levanto como a las nueve y media]"
#> english
#[1] "so maybe like I exercise and the I like either go to class online or in person like it depends on the day"
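
If it helps to see the indices, here is a rough base R sketch of the same index idea wrapped in a function, so it can be reused on more than one string. The name split_by_brackets is made up for illustration, and the sketch assumes brackets are balanced, not nested, and attached to words (no space right after "[" or right before "]").

```r
# Hypothetical helper: split one string into Spanish (bracketed) and English words
split_by_brackets <- function(x) {
  tokens  <- strsplit(x, " ", fixed = TRUE)[[1]]
  ixstart <- grep("[", tokens, fixed = TRUE)  # tokens that open a bracketed span
  ixend   <- grep("]", tokens, fixed = TRUE)  # tokens that close a bracketed span
  # One start:end sequence per bracket pair (assumes the pairs are balanced)
  spanish_ix <- unlist(Map(seq, ixstart, ixend))
  english_ix <- setdiff(seq_along(tokens), spanish_ix)
  list(
    spanish = paste(tokens[spanish_ix], collapse = " "),
    english = paste(tokens[english_ix], collapse = " ")
  )
}

strng <- "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day"

res <- split_by_brackets(strng)
res$spanish
#[1] "[usualmente] [me levanto como a las nueve y media]"
res$english
#[1] "so maybe like I exercise and the I like either go to class online or in person like it depends on the day"
```

For this string the opening brackets sit at words 2 and 4 and the closing brackets at words 2 and 11, so the Spanish indices are 2 and 4 through 11. A string without any brackets comes back with an empty Spanish part.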

Note: to remove the pesky brackets from the Spanish result, just run: spanish <- gsub("\\]|\\[", "", spanish)
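
On treating the whole CSV as a "string": you probably don't need to. Once the file is read into a data frame, the text column is a character vector, and the base R regex functions work on every cell at once. Below is a minimal sketch using a different technique from the code above (regmatches/gregexpr instead of word indices). The column name text, the sample rows, and the output file names are all made up for illustration. In practice the data frame would come from something like read.csv on your file.

```r
# Hypothetical stand-in for: df <- read.csv("your_file.csv", stringsAsFactors = FALSE)
df <- data.frame(
  text = c(
    "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise",
    "no Spanish in this cell",
    "[hola] and goodbye"
  ),
  stringsAsFactors = FALSE
)

bracketed <- "\\[[^]]*\\]"  # "[", then anything that is not "]", then "]"

# Spanish: pull every bracketed chunk out of each cell, strip the brackets, join
chunks <- regmatches(df$text, gregexpr(bracketed, df$text))
df$spanish <- vapply(
  chunks,
  function(x) paste(gsub("\\]|\\[", "", x), collapse = " "),
  character(1)
)

# English: delete the bracketed chunks, then tidy up the leftover whitespace
df$english <- trimws(gsub("[[:space:]]+", " ", gsub(bracketed, "", df$text)))

# One file per language, skipping cells that ended up empty
# (tempdir() is used here only so the sketch has no side effects)
writeLines(df$spanish[nzchar(df$spanish)], file.path(tempdir(), "spanish.txt"))
writeLines(df$english[nzchar(df$english)], file.path(tempdir(), "english.txt"))
```

This version does not depend on where the spaces fall around the brackets, but it still assumes the brackets are balanced and not nested.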

Upvotes: 1
