Reputation: 33
I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.
so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day
Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.
Any ideas on how to do this? Or which function/package to use?
Edited to add: there's about 100 cells that contain text in this Excel sheet. I guess where I'm confused is how do I treat this entire CSV as a "string"?
Upvotes: 2
Views: 92
Reputation: 17185
You could do this by Vectorize
ing the seq
function and indexing, then using stringr::word
to extract the whole words at the indices:
Example string:
strng <- "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day"
Code
strng <- "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day"
vecSeq <- Vectorize(seq.default, vectorize.args = c("to", "from"))
ixstart <- grep("\\[", unlist(strsplit(strng, " ")))
ixend <- grep("\\]", unlist(strsplit(strng, " ")))
spanish_ix <- unlist(vecSeq(ixstart, ixend, 1))
english_ix <- setdiff(1:(lengths(gregexpr("\\W+", strng)) + 1), spanish_ix)
spanish <- paste(stringr::word(strng, spanish_ix), collapse = " ")
english <- paste(stringr::word(strng, english_ix), collapse = " ")
#spanish
#[1] "[usualmente] [me levanto como a las nueve y media]"
#> english
#[1] "so maybe like I exercise and the I like either go to class #online or in person like it depends on the day"
Note to remove the pesky brackets just do: spanish <- gsub("\\]|\\[", "", spanish)
Upvotes: 1