Excluding rows from a data.table based on sort order

Question

I need some help filtering a data.table in R. I had a file with millions of rows with 4 words each.

I want to remove some rows that I don't need. Each row has 4 words and a frequency.

I want to keep just the 3 with "most frequency" for each combo of the first 3 words.

Bellow are a example of the data.table and what I need as an output.

text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
          "Go away with him", "Go away with her",
          "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
          "I am the most")

frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
               0.05, 0.04,
               0.1, 0.06, 0.3, 0.05, 0.1,
               0.2)

DT <- data.table(text = text, frequency = frequency)

#Original output:
                    text frequency
 1:     Run to the hills     0.100
 2: Run to the mountains     0.090
 3:   Run to the highway     0.200
 4:       Run to the top     0.050
 5:   Run to the horizon     0.001
 6:     Go away with him     0.050
 7:     Go away with her     0.040
 8:          I am a good     0.100
 9:           I am a bad     0.060
10:         I am a uggly     0.300
11:           I am a guy     0.050
12:         I am a woman     0.100
13:         I am awesome     0.200

Output needed: (Just the top 3 frequency from same "first 3 words")

                 text frequency
  1: Go away with him      0.05     
  2: Go away with her      0.04
  3: I am a uggly          0.30  
  4: I am a woman          0.10
  5: I am a good           0.10
  6: I am the most         0.20
  7: Run to the highway    0.20
  8: Run to the hills      0.10
  9: Run to the mountains 0.09

So, I want to keep just the top 3 sorted by the frequency column: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX"

In this case, I would drop: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy"

I was thinking in something with regex, but I am kinda lost right now :-\

Rich Scriven · Accepted Answer

You can create an id column with sub() that consists of the first three words, then use that to take the top three values of frequency.

Easier done than said ...

library(data.table)

## add an id column containing only the first three words
DT[, id := sub(" \S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
  DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
#                    text frequency
# 1:     Go away with him      0.05
# 2:     Go away with her      0.04
# 3:         I am a uggly      0.30
# 4:          I am a good      0.10
# 5:         I am a woman      0.10
# 6:        I am the most      0.20
# 7:   Run to the highway      0.20
# 8:     Run to the hills      0.10
# 9: Run to the mountains      0.09

Excluding rows from a data.table based on sort order

Answers (2)

Related Questions