Dag
Dag

Reputation: 589

Aggregate columns by alphanumerical and numerical parameters

I have a list of individual names and dates. In my raw data file each name can occur several times on each date and I need to input this list, containing several thousand individuals, and get a file where all the individuals are listed only once for each date, i.e. a simple aggregation by name and date. The date is a 5- or 6-digit integer, and the individual name is alphanumeric, which is a problem I can't seem to get around.

Any help welcome.

Sample data:

structure(list(name = c("KA-08", "KA-08", "KA-11", "KA-11", "KA-11", 
"KA-11", "KA-12", "KA-12", "KA-12", "KB-04", "KB-04", "KB-04", 
"KB-04", "KB-04", "KB-04", "KB-04", "KB-06", "KB-06", "KB-06", 
"KB-06", "KB-06", "KB-06", "KB-06", "KB-06", "KB-06", "KB-06", 
"KB-06", "KB-06", "KB-06", "KB-06", "KB-07", "KB-07", "KB-07", 
"KB-07", "KB-11", "KB-11", "KB-11", "KB-11", "KB-11", "KB-11", 
"KB-11", "KB-11", "KB-11", "KB-11", "KB-11", "KB-11", "KC-01", 
"KC-01", "KC-01", "KC-03", "KC-03", "KC-04", "KE-01", "KE-01", 
"KE-01", "KE-01", "KE-01", "KE-01", "KE-02", "KE-02", "KE-02", 
"KE-03", "KF-01", "KF-01", "KF-01", "KF-01", "KF-01", "KG-05", 
"KG-06", "KG-06", "KG-06", "KH-01", "KH-01", "KH-01", "KH-01", 
"KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01", 
"KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01", 
"KH-01", "KH-02", "KH-02", "KH-02", "KH-03", "KH-03", "KH-04", 
"KH-04", "KH-05", "KH-05", "KH-06", "KH-06", "KH-07", "KH-07", 
"KH-07", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09", 
"NU-09", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09", 
"NU-09", "NU-09", "NU-09", "NU-09", "NW-03", "NW-03", "NW-03", 
"NY-05", "NY-05", "NY-05", "NY-05", "NY-05", "NY-05"), date = c(250188, 
250188, 250188, 250188, 250188, 250188, 250188, 290188, 290188, 
310190, 80290, 100290, 160290, 160290, 160290, 160290, 310190, 
310190, 80290, 80290, 90290, 110290, 150290, 150290, 160290, 
160290, 160290, 160290, 160290, 160290, 100290, 100290, 150290, 
160290, 100290, 100290, 100290, 100290, 100290, 110290, 150290, 
150290, 150290, 160290, 160290, 160290, 190490, 190490, 190490, 
190490, 190490, 190490, 271190, 271190, 271190, 271190, 271190, 
271190, 271190, 271190, 271190, 271190, 210188, 210188, 210188, 
210188, 161090, 90391, 90391, 90391, 90391, 90391, 90391, 90391, 
90391, 90391, 90391, 90391, 90391, 90391, 90391, 90391, 90391, 
90391, 90391, 90391, 90391, 90391, 120391, 120391, 90391, 90391, 
90391, 90391, 90391, 90391, 120391, 90391, 90391, 90391, 90391, 
90391, 90391, 90391, 211092, 211092, 211092, 211092, 211092, 
211092, 211092, 211092, 141192, 141192, 161192, 161192, 161192, 
161192, 161192, 161192, 161192, 151092, 161092, 241092, 141192, 
141192, 141192, 141192, 141192, 141192)), class = "data.frame", row.names = c(NA, 
-130L))

Upvotes: 0

Views: 19

Answers (2)

Allan Cameron
Allan Cameron

Reputation: 174348

Assuming you only have two columns to consider, you can just do unique(df):

unique(df)
     name   date
1   KA-08 250188
3   KA-11 250188
7   KA-12 250188
8   KA-12 290188
10  KB-04 310190
11  KB-04  80290
12  KB-04 100290
13  KB-04 160290
17  KB-06 310190
19  KB-06  80290
21  KB-06  90290
22  KB-06 110290
23  KB-06 150290
25  KB-06 160290
31  KB-07 100290
33  KB-07 150290
34  KB-07 160290
35  KB-11 100290
40  KB-11 110290
41  KB-11 150290
44  KB-11 160290
47  KC-01 190490
50  KC-03 190490
52  KC-04 190490
53  KE-01 271190
59  KE-02 271190
62  KE-03 271190
63  KF-01 210188
67  KF-01 161090
68  KG-05  90391
69  KG-06  90391
72  KH-01  90391
89  KH-01 120391
91  KH-02  90391
94  KH-03  90391
96  KH-04  90391
97  KH-04 120391
98  KH-05  90391
100 KH-06  90391
102 KH-07  90391
105 NU-09 211092
113 NU-09 141192
115 NU-09 161192
122 NW-03 151092
123 NW-03 161092
124 NW-03 241092
125 NY-05 141192

Upvotes: 1

Duck
Duck

Reputation: 39613

I would suggest a tidyverse approach using group_by() with the mentioned variables and counting records with n(). Data used df belongs to the dput() you included in your question:

library(tidyverse)
#Code
df1 <- df %>% group_by(name,date) %>% summarise(N=n())

Output:

# A tibble: 47 x 3
# Groups:   name [26]
   name    date     N
   <chr>  <dbl> <int>
 1 KA-08 250188     2
 2 KA-11 250188     4
 3 KA-12 250188     1
 4 KA-12 290188     2
 5 KB-04  80290     1
 6 KB-04 100290     1
 7 KB-04 160290     4
 8 KB-04 310190     1
 9 KB-06  80290     2
10 KB-06  90290     1
# ... with 37 more rows

Upvotes: 1

Related Questions