Reputation: 589
I have a list of individual names and dates. In my raw data file each name can occur several times on each date and I need to input this list, containing several thousand individuals, and get a file where all the individuals are listed only once for each date, i.e. a simple aggregation by name and date. The date is a 5- or 6-digit integer, and the individual name is alphanumeric, which is a problem I can't seem to get around.
Any help welcome.
Sample data:
structure(list(name = c("KA-08", "KA-08", "KA-11", "KA-11", "KA-11",
"KA-11", "KA-12", "KA-12", "KA-12", "KB-04", "KB-04", "KB-04",
"KB-04", "KB-04", "KB-04", "KB-04", "KB-06", "KB-06", "KB-06",
"KB-06", "KB-06", "KB-06", "KB-06", "KB-06", "KB-06", "KB-06",
"KB-06", "KB-06", "KB-06", "KB-06", "KB-07", "KB-07", "KB-07",
"KB-07", "KB-11", "KB-11", "KB-11", "KB-11", "KB-11", "KB-11",
"KB-11", "KB-11", "KB-11", "KB-11", "KB-11", "KB-11", "KC-01",
"KC-01", "KC-01", "KC-03", "KC-03", "KC-04", "KE-01", "KE-01",
"KE-01", "KE-01", "KE-01", "KE-01", "KE-02", "KE-02", "KE-02",
"KE-03", "KF-01", "KF-01", "KF-01", "KF-01", "KF-01", "KG-05",
"KG-06", "KG-06", "KG-06", "KH-01", "KH-01", "KH-01", "KH-01",
"KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01",
"KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01", "KH-01",
"KH-01", "KH-02", "KH-02", "KH-02", "KH-03", "KH-03", "KH-04",
"KH-04", "KH-05", "KH-05", "KH-06", "KH-06", "KH-07", "KH-07",
"KH-07", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09",
"NU-09", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09", "NU-09",
"NU-09", "NU-09", "NU-09", "NU-09", "NW-03", "NW-03", "NW-03",
"NY-05", "NY-05", "NY-05", "NY-05", "NY-05", "NY-05"), date = c(250188,
250188, 250188, 250188, 250188, 250188, 250188, 290188, 290188,
310190, 80290, 100290, 160290, 160290, 160290, 160290, 310190,
310190, 80290, 80290, 90290, 110290, 150290, 150290, 160290,
160290, 160290, 160290, 160290, 160290, 100290, 100290, 150290,
160290, 100290, 100290, 100290, 100290, 100290, 110290, 150290,
150290, 150290, 160290, 160290, 160290, 190490, 190490, 190490,
190490, 190490, 190490, 271190, 271190, 271190, 271190, 271190,
271190, 271190, 271190, 271190, 271190, 210188, 210188, 210188,
210188, 161090, 90391, 90391, 90391, 90391, 90391, 90391, 90391,
90391, 90391, 90391, 90391, 90391, 90391, 90391, 90391, 90391,
90391, 90391, 90391, 90391, 90391, 120391, 120391, 90391, 90391,
90391, 90391, 90391, 90391, 120391, 90391, 90391, 90391, 90391,
90391, 90391, 90391, 211092, 211092, 211092, 211092, 211092,
211092, 211092, 211092, 141192, 141192, 161192, 161192, 161192,
161192, 161192, 161192, 161192, 151092, 161092, 241092, 141192,
141192, 141192, 141192, 141192, 141192)), class = "data.frame", row.names = c(NA,
-130L))
Upvotes: 0
Views: 19
Reputation: 174348
Assuming you only have two columns to consider, you can just do unique(df)
:
unique(df)
name date
1 KA-08 250188
3 KA-11 250188
7 KA-12 250188
8 KA-12 290188
10 KB-04 310190
11 KB-04 80290
12 KB-04 100290
13 KB-04 160290
17 KB-06 310190
19 KB-06 80290
21 KB-06 90290
22 KB-06 110290
23 KB-06 150290
25 KB-06 160290
31 KB-07 100290
33 KB-07 150290
34 KB-07 160290
35 KB-11 100290
40 KB-11 110290
41 KB-11 150290
44 KB-11 160290
47 KC-01 190490
50 KC-03 190490
52 KC-04 190490
53 KE-01 271190
59 KE-02 271190
62 KE-03 271190
63 KF-01 210188
67 KF-01 161090
68 KG-05 90391
69 KG-06 90391
72 KH-01 90391
89 KH-01 120391
91 KH-02 90391
94 KH-03 90391
96 KH-04 90391
97 KH-04 120391
98 KH-05 90391
100 KH-06 90391
102 KH-07 90391
105 NU-09 211092
113 NU-09 141192
115 NU-09 161192
122 NW-03 151092
123 NW-03 161092
124 NW-03 241092
125 NY-05 141192
Upvotes: 1
Reputation: 39613
I would suggest a tidyverse
approach using group_by()
with the mentioned variables and counting records with n()
. Data used df
belongs to the dput()
you included in your question:
library(tidyverse)
#Code
df1 <- df %>% group_by(name,date) %>% summarise(N=n())
Output:
# A tibble: 47 x 3
# Groups: name [26]
name date N
<chr> <dbl> <int>
1 KA-08 250188 2
2 KA-11 250188 4
3 KA-12 250188 1
4 KA-12 290188 2
5 KB-04 80290 1
6 KB-04 100290 1
7 KB-04 160290 4
8 KB-04 310190 1
9 KB-06 80290 2
10 KB-06 90290 1
# ... with 37 more rows
Upvotes: 1