Reputation: 23200
I have an n by 2 object containing variable names in the 1st column and numeric values (scores) in the 2nd column:
data <- data.frame(matrix(nrow = 20, ncol = 2))
data[, 2] <- 1:20
data[, 1] <- c("example_a_1", "example_a_2", "example_a_3",
"example_b_1", "example_c_1", "example_d_1",
"example_d_2", "example_d_3", "example_f_1",
"example_g_1", "example_g_2", "example_h_1",
"example_i_1", "example_l_1", "example_o_1",
"example_j_1", "example_m_1", "example_p_1",
"example_k_1", "example_n_1")
data
X1 X2
1 example_a_1 1
2 example_a_2 2
3 example_a_3 3
4 example_b_1 4
5 example_c_1 5
6 example_d_1 6
7 example_d_2 7
8 example_d_3 8
9 example_f_1 9
10 example_g_1 10
11 example_g_2 11
12 example_h_1 12
13 example_i_1 13
14 example_l_1 14
15 example_o_1 15
16 example_j_1 16
17 example_m_1 17
18 example_p_1 18
19 example_k_1 19
20 example_n_1 20
I don't want this object to contain similar variables -- if a variable name has the same first 9 characters (in this example) as another, then it's repetitious. In those cases I only want to keep the first of the similarly named variables.
I can get a list of which variable names are repetitious like this:
rep <- as.data.frame(table(substr(data[,1], 1, 9)))
rep <- rep[rep[, 2] > 1, ]
rep
Var1 Freq
1 example_a 3
4 example_d 3
6 example_g 2
and thus identify them in a for loop or other conditional:
for(i in 1:nrow(data)){
if(substr(data[i, 1], 1, 9) %in% rep[,1]){
# What goes here?
# or what's another approach?
}
}
However, I'm not sure what logic I can use to remove the rows with repetitious names.
The final object should look like this:
data
X1 X2
1 example_a_1 1
2 example_b_1 4
3 example_c_1 5
4 example_d_1 6
5 example_f_1 9
6 example_g_1 10
7 example_h_1 12
8 example_i_1 13
9 example_l_1 14
10 example_o_1 15
11 example_j_1 16
12 example_m_1 17
13 example_p_1 18
14 example_k_1 19
15 example_n_1 20
Upvotes: 2
Views: 57
Reputation: 56169
Using dplyr:
library(dplyr)
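data <- data %>%
  group_by(my9 = substr(X1, 1, 9)) %>%  # group on the first 9 characters
  filter(row_number() == 1) %>%         # keep the first row of each group
  ungroup() %>%                         # a grouping column cannot be dropped while grouped
  select(-my9)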
Upvotes: 2
Reputation: 1905
I would create a column with the shortened name and aggregate on that column:
data$short <- substr(data[,1], 1, 9)
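agg <- aggregate(X2 ~ short, data = data, FUN = min)
I used min because you seem to be interested in the smallest score for each repeated name, which in your example is also the first row of each group. Note that agg contains only the short name and the score, sorted by short name, not the full variable name.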
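If the full names and the original row order are needed as well, one possible follow-up is to merge the aggregated result back onto the data. This is only a sketch: `kept` and `expected` are names made up for illustration, and ordering by X2 restores the original order only because the scores in the example data are increasing.

```r
# Rebuild the example data from the question (columns X1 and X2).
data <- data.frame(
  X1 = c("example_a_1", "example_a_2", "example_a_3",
         "example_b_1", "example_c_1", "example_d_1",
         "example_d_2", "example_d_3", "example_f_1",
         "example_g_1", "example_g_2", "example_h_1",
         "example_i_1", "example_l_1", "example_o_1",
         "example_j_1", "example_m_1", "example_p_1",
         "example_k_1", "example_n_1"),
  X2 = 1:20,
  stringsAsFactors = FALSE
)

# Shortened name: the first 9 characters of the variable name.
data$short <- substr(data$X1, 1, 9)

# Smallest score per shortened name; the result is sorted by `short`
# and has only the columns `short` and `X2`.
agg <- aggregate(X2 ~ short, data = data, FUN = min)

# Assumption: each (short, X2) pair identifies exactly one original row,
# so merging on both columns recovers the full name X1.
kept <- merge(data, agg, by = c("short", "X2"))

# Assumption: X2 is increasing in the original data, so sorting by X2
# restores the original row order. Keep only the original columns.
kept <- kept[order(kept$X2), c("X1", "X2")]
rownames(kept) <- NULL

kept
```

Taking the minimum score and taking the first row are only the same thing when the scores within each group happen to be increasing, as they are here.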
Upvotes: 2
Reputation: 47146
You can use duplicated():
short <- substr(data[,1], 1, 9)
i <- duplicated(short)
data <- data[!i , ]
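For anyone who wants to run this end to end, here is a self-contained sketch of the same idea on the example data from the question. The name `result` and the final row-renumbering step are additions for illustration, not part of the snippet above.

```r
# Rebuild the example data from the question (columns X1 and X2).
data <- data.frame(
  X1 = c("example_a_1", "example_a_2", "example_a_3",
         "example_b_1", "example_c_1", "example_d_1",
         "example_d_2", "example_d_3", "example_f_1",
         "example_g_1", "example_g_2", "example_h_1",
         "example_i_1", "example_l_1", "example_o_1",
         "example_j_1", "example_m_1", "example_p_1",
         "example_k_1", "example_n_1"),
  X2 = 1:20,
  stringsAsFactors = FALSE
)

# The first 9 characters act as the grouping key.
short <- substr(data$X1, 1, 9)

# duplicated() is FALSE for the first occurrence of each key and TRUE for
# every later one, so negating it keeps only the first row per key.
result <- data[!duplicated(short), ]

# Optional: renumber the rows 1..n, as in the desired output.
rownames(result) <- NULL

result
```

Because duplicated() works on the key alone, this keeps the first row in whatever order the data is in, regardless of the score values.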
Upvotes: 5