Icaro Bombonato
Icaro Bombonato

Reputation: 4172

Excluding rows from a data.table based on sort order

I need some help filtering a data.table in R. I had a file with millions of rows with 4 words each.

I want to remove some rows that I don't need. Each row has 4 words and a frequency.

I want to keep just the 3 with "most frequency" for each combo of the first 3 words.

Bellow are a example of the data.table and what I need as an output.

text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
          "Go away with him", "Go away with her",
          "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
          "I am the most")

frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
               0.05, 0.04,
               0.1, 0.06, 0.3, 0.05, 0.1,
               0.2)

DT <- data.table(text = text, frequency = frequency)

#Original output:
                    text frequency
 1:     Run to the hills     0.100
 2: Run to the mountains     0.090
 3:   Run to the highway     0.200
 4:       Run to the top     0.050
 5:   Run to the horizon     0.001
 6:     Go away with him     0.050
 7:     Go away with her     0.040
 8:          I am a good     0.100
 9:           I am a bad     0.060
10:         I am a uggly     0.300
11:           I am a guy     0.050
12:         I am a woman     0.100
13:         I am awesome     0.200

Output needed: (Just the top 3 frequency from same "first 3 words")

                 text frequency
  1: Go away with him      0.05     
  2: Go away with her      0.04
  3: I am a uggly          0.30  
  4: I am a woman          0.10
  5: I am a good           0.10
  6: I am the most         0.20
  7: Run to the highway    0.20
  8: Run to the hills      0.10
  9: Run to the mountains 0.09

So, I want to keep just the top 3 sorted by the frequency column: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX"

In this case, I would drop: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy"

I was thinking in something with regex, but I am kinda lost right now :-\

Upvotes: 2

Views: 243

Answers (2)

Rich Scriven
Rich Scriven

Reputation: 99351

You can create an id column with sub() that consists of the first three words, then use that to take the top three values of frequency.

Easier done than said ...

library(data.table)

## add an id column containing only the first three words
DT[, id := sub(" \\S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
  DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
#                    text frequency
# 1:     Go away with him      0.05
# 2:     Go away with her      0.04
# 3:         I am a uggly      0.30
# 4:          I am a good      0.10
# 5:         I am a woman      0.10
# 6:        I am the most      0.20
# 7:   Run to the highway      0.20
# 8:     Run to the hills      0.10
# 9: Run to the mountains      0.09

Upvotes: 4

Kyle Balkissoon
Kyle Balkissoon

Reputation: 368

DT[,group := sub(" \\S+$", "", text)]
DT[,grank:=base::rank(-frequency),by=group]
DT[grank <= 3]

Used the rank function so OP can specify what to do with ties.

Upvotes: 1

Related Questions