Reputation: 1

R newbie: R equivalent to SAS proc sql, group by, having, count, remerging summary statistics

In SAS i write a code that in the log looks like:

2    proc sql;
3    create table test as
4    select
5    ID,
6    GENDER,
7    BIRTHYEAR
8    from
9    libname.dataset
10   group by ID
11   having count(*)>1
12   ;

NOTE: The query requires remerging summary statistics back with the original data.

NOTE: Table WORK.TEST created, with 0 rows and 3 columns.
13   quit;

Can I do something similar in R?

In R Studio I CAN run the code:

test <- sqlQuery(channel, query='
select
ID
from
lib.dataset
group by ID
having count(*)=1
')

But when trying code below I fail getting a data set like in SAS:

test <- sqlQuery(channel, query='
select
ID,
GENDER,
BIRTHYEAR
from
lib.dataset
group by ID
having count(*)=1
')

Instead of a data with column names and 0 rows I just get a value = chr [1:2] 42000 8120 [Microsoft]{ODBC SQL Server Driver]....

Sorry if the question is not clear.

See message for all details.

Upvotes: 0

Answers (3)

Tom

Reputation: 51621

Not really an R specific issue. PROC SQL will automatically remerge summary statistics (the COUNT() aggregate function in your query) with the detailed observations. Most SQL implementations do not support that. So if you want to use SQL you will need to do the re-merge yourself.

test <- sqlQuery(channel, query='
select a.ID, a.GENDER, a.BIRTHYEAR
from lib.dataset a
inner join
  (select ID from lib.dataset 
   group by ID having count(*) > 1
  ) b
on a.ID = b.ID
')

If you want to use the dplyr package then you will also need to do the re-merge yourself. Let's assume you have a dataframe named lib.dataset with at least the variables ID, GENDER and BIRTHYR. You can use the dplyr verbs of group_by, summarize and filter to find the ID variables with multiple observations and then use the inner_join and select verbs to find the data for those particular ID values.

library(dplyr)
test <- lib.dataset |> group_by(ID) |> summarize(n=n()) |> filter(n>1) |>
  inner_join(lib.dataset) |> select(ID,GENDER,BIRTHYR)

Upvotes: 0

Richard

Reputation: 27508

The SAS SQL query is doing automatic remerging that is proprietary to Proc SQL

You can adjust the SQL select to be either of these:

select ID, GENDER, BIRTHYEAR
from lib.dataset
where ID in 
  ( select ID from lib.dataset group by ID having count(*)=1 )

select EACH.ID, EACH.GENDER, EACH.BIRTHYEAR
from lib.dataset as EACH
join ( select ID from lib.dataset group by ID having count(*)=1 ) as SINGLE
on EACH.ID = SINGLE.ID

Upvotes: 0

Ifeanyi Idiaye

Reputation: 1116

Here is R equivalent of your SAS code:

library(dplyr)

df |> 
  select(ID,GENDER,BIRTHYEAR) |> 
  group_by(ID) |> 
  summarize(count = n())

Where df is your dataframe. select, group_by and summarize are all dplyr verbs.

You can learn more about the R dplyr package here: https://dplyr.tidyverse.org/

Upvotes: 0

R newbie: R equivalent to SAS proc sql, group by, having, count, remerging summary statistics

Answers (3)

Related Questions