Reputation: 5154
I have a data frame in R as follows:
library(stringr)  # provides str_count() and str_split(), used below
Text <- c("reuce FR563 323 aldk", "vard 432", "DK123 fg4d", "matten global height")
ID <- c("S1", "S2", "S3", "S4")
data <- data.frame(ID, Text)
data$noofwords <- sapply(data$Text, str_count,"[[:space:]]") +1
data$Text <- as.character(data$Text)
data$ID <- as.character(data$ID)
data
  ID                 Text noofwords
1 S1 reuce FR563 323 aldk         4
2 S2             vard 432         2
3 S3           DK123 fg4d         2
4 S4 matten global height         3
I want to extract every word from each string in the Text column into a new data.frame, along with the corresponding ID and Text values.
The following script with nested for loops does the job, but is there any way to vectorise it? It is very slow for large datasets.
keyword <- "keyword"
text <- "text"
ID <- "ID"
Index <- data.frame(keyword,text,ID)
Index[,1:3] <- as.character(Index[,1:3])
n <- nrow(data)
for (i in 1:n) {
  k <- data[i, "noofwords"]
  kwv <- str_split(data[i, "Text"], " ", n = Inf)
  kwv <- unlist(kwv, recursive = TRUE, use.names = FALSE)
  for (j in 1:k) {
    kw <- kwv[j]
    tex <- (data[i, "Text"])
    nid <- (data[i, "ID"])
    Index <- rbind(Index, c(kw, tex, nid))
  }
}
Index
   keyword                 text ID
1        1                    1  1
2    reuce reuce FR563 323 aldk S1
3    FR563 reuce FR563 323 aldk S1
4      323 reuce FR563 323 aldk S1
5     aldk reuce FR563 323 aldk S1
6     vard             vard 432 S2
7      432             vard 432 S2
8    DK123           DK123 fg4d S3
9     fg4d           DK123 fg4d S3
10  matten matten global height S4
11  global matten global height S4
12  height matten global height S4
Also, why is an extra first row containing all 1s created?
Upvotes: 3
Views: 100
Reputation: 59970
This uses the data.table package and should be relatively quick.
Do check your column types, because the example data you gave gets converted to a factor variable (so I used stringsAsFactors=FALSE when recreating it).
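For reference, here is a minimal sketch of how the example data can be recreated with character columns. It assumes the same Text and ID vectors as in the question and leaves out the noofwords column. From R 4.0.0 onwards stringsAsFactors = FALSE is already the default, so passing it explicitly only matters on older versions.

```r
# Recreate the question's data with character (not factor) columns.
Text <- c("reuce FR563 323 aldk", "vard 432", "DK123 fg4d", "matten global height")
ID <- c("S1", "S2", "S3", "S4")
data <- data.frame(ID, Text, stringsAsFactors = FALSE)

# Check the column types: both should be reported as "character".
sapply(data, class)
```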
require(data.table)
dt <- data.table( data , key = "ID" )
dt[ dt[ , list( Keyword = unlist( strsplit( Text , " " ) ) ) , by = ID ] ]
#    ID                 Text Keyword
# 1: S1 reuce FR563 323 aldk   reuce
# 2: S1 reuce FR563 323 aldk   FR563
# 3: S1 reuce FR563 323 aldk     323
# 4: S1 reuce FR563 323 aldk    aldk
# 5: S2             vard 432    vard
# 6: S2             vard 432     432
# 7: S3           DK123 fg4d   DK123
# 8: S3           DK123 fg4d    fg4d
# 9: S4 matten global height  matten
#10: S4 matten global height  global
#11: S4 matten global height  height
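If you would rather not depend on data.table, the same idea can be sketched in base R: split every string once, count the words per row, and repeat the ID and Text values that many times. This is an alternative sketch and not the data.table join above; the column names keyword, text and ID are taken from the question's Index data frame.

```r
# Base R sketch: one row per word, built in a single step without loops.
Text <- c("reuce FR563 323 aldk", "vard 432", "DK123 fg4d", "matten global height")
ID <- c("S1", "S2", "S3", "S4")
data <- data.frame(ID, Text, stringsAsFactors = FALSE)

words <- strsplit(data$Text, " ", fixed = TRUE)  # list with one character vector per row
n <- lengths(words)                              # number of words in each row

Index <- data.frame(
  keyword = unlist(words),            # all words, in row order
  text    = rep(data$Text, times = n),  # original string repeated once per word
  ID      = rep(data$ID, times = n),    # matching ID repeated once per word
  stringsAsFactors = FALSE
)
Index
```

Because the result is built in one step, there is no seed row to get rid of. In the loop version the extra first row is the initial one-row Index data frame, and the 1s are most likely the integer factor codes that as.character() returns when it is applied to a data frame of factor columns.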
Upvotes: 2