Gordon L.
Gordon L.

Reputation: 75

R/dplyr: How do I count the number of unique occurrences of an observation over time without double counting?

I have a panel-like data set that has multiple instances of the same observations over time. I have a variable that counts a specific company's appearances each individual year (e.g. Company A has 10 active machines in year XXXX), and I also want to create a variable that counts the number of machines a specific user has ever had (i.e. an all-time count).

Here is an example of a line of code I have written to do this:

### Creates a column that counts the number of unique machines by company name all time
df_panel <- df_panel %>% group_by(Company_name, Machine_name) %>% summarise(Machine_count_alltime = n())

The problem I am having with this is that it returns an output table like this:

Company Name      Machine Name      n     

Company A         Machine Y         1
Company B         Machine X1        1
Company B         Machine X2        1
Company B         Machine Z         2
Company C         Machine K         5
Company C         Machine U         2

In the table above, n is a count of how many times a specific Machine Name has appeared throughout the data years (e.g. Machine K is present in 5 of the years). I want a count of the number of unique machines each Company Name has over all the years as an additional column, not the instances of each Machine Name as it produces in the table above. It should look something like this:

Company Name      Machine Name           

Company A         Machine Y         1
Company B         Machine X1        3
Company B         Machine X2        3
Company B         Machine Z         3
Company C         Machine K         2
Company C         Machine U         2

I only included the Machine Name column in the group_by because I wanted to ensure that the specific machines were not double counted in the total all-time count variable. I've also tried using commands like tally() and sum() inside of the pipe operators, but have only been met with errors because the names are characters, or there was no applicable method for the summary, etc. Am I going about this the wrong way? Any help is greatly appreciated.

No preference for packages - dplyr, base R, etc. is fine with me.

Upvotes: 1

Views: 1250

Answers (3)

Martin Gal
Martin Gal

Reputation: 16988

A third option, since I'm not sure what the correct output should look like:

library(tidyverse)

df %>%
  group_by(`Company Name`) %>%
  distinct(`Machine Name`) %>%
  mutate(count=n())

Upvotes: 2

Siim P&#245;ldre
Siim P&#245;ldre

Reputation: 54

If i understand correctly this should work:

library(tidyverse)  
 df_panel %>% group_by(Company_name, Machine_name) %>% 
  summarise(n = length(Machine_name))

Upvotes: 1

phiggins
phiggins

Reputation: 248

Would this do the job (with dplyr::count() )?

library(tidyverse)
data <- tribble(~"Company Name", ~"Machine Name",
        "Company A", "Machine Y",
        "Company B", "Machine X1",
        "Company B", "Machine X1",
        "Company B", "Machine X1",
        "Company B", "Machine X2",
        "Company B", "Machine X2",
        "Company B", "Machine X2",
        "Company C", "Machine K",
        "Company C", "Machine K",
        "Company C", "Machine U",
        "Company C", "Machine U")

data %>% 
  count(`Company Name`, `Machine Name`)
#> # A tibble: 5 x 3
#>   `Company Name` `Machine Name`     n
#>   <chr>          <chr>          <int>
#> 1 Company A      Machine Y          1
#> 2 Company B      Machine X1         3
#> 3 Company B      Machine X2         3
#> 4 Company C      Machine K          2
#> 5 Company C      Machine U          2

Created on 2020-06-11 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions