Reputation: 1
In SAS i write a code that in the log looks like:
2 proc sql;
3 create table test as
4 select
5 ID,
6 GENDER,
7 BIRTHYEAR
8 from
9 libname.dataset
10 group by ID
11 having count(*)>1
12 ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.TEST created, with 0 rows and 3 columns.
13 quit;
Can I do something similar in R?
In R Studio I CAN run the code:
test <- sqlQuery(channel, query='
select
ID
from
lib.dataset
group by ID
having count(*)=1
')
But when trying code below I fail getting a data set like in SAS:
test <- sqlQuery(channel, query='
select
ID,
GENDER,
BIRTHYEAR
from
lib.dataset
group by ID
having count(*)=1
')
Instead of a data
with column names and 0 rows I just get a value = chr [1:2] 42000 8120 [Microsoft]{ODBC SQL Server Driver]...
.
Sorry if the question is not clear.
See message for all details.
Upvotes: 0
Views: 72
Reputation: 51621
Not really an R specific issue. PROC SQL will automatically remerge summary statistics (the COUNT() aggregate function in your query) with the detailed observations. Most SQL implementations do not support that. So if you want to use SQL you will need to do the re-merge yourself.
test <- sqlQuery(channel, query='
select a.ID, a.GENDER, a.BIRTHYEAR
from lib.dataset a
inner join
(select ID from lib.dataset
group by ID having count(*) > 1
) b
on a.ID = b.ID
')
If you want to use the dplyr package then you will also need to do the re-merge yourself. Let's assume you have a dataframe named lib.dataset with at least the variables ID, GENDER and BIRTHYR. You can use the dplyr verbs of group_by, summarize and filter to find the ID variables with multiple observations and then use the inner_join and select verbs to find the data for those particular ID values.
library(dplyr)
test <- lib.dataset |> group_by(ID) |> summarize(n=n()) |> filter(n>1) |>
inner_join(lib.dataset) |> select(ID,GENDER,BIRTHYR)
Upvotes: 0
Reputation: 27508
The SAS SQL query is doing automatic remerging that is proprietary to Proc SQL
You can adjust the SQL select to be either of these:
select ID, GENDER, BIRTHYEAR
from lib.dataset
where ID in
( select ID from lib.dataset group by ID having count(*)=1 )
select EACH.ID, EACH.GENDER, EACH.BIRTHYEAR
from lib.dataset as EACH
join ( select ID from lib.dataset group by ID having count(*)=1 ) as SINGLE
on EACH.ID = SINGLE.ID
Upvotes: 0
Reputation: 1116
Here is R equivalent of your SAS code:
library(dplyr)
df |>
select(ID,GENDER,BIRTHYEAR) |>
group_by(ID) |>
summarize(count = n())
Where df
is your dataframe. select
, group_by
and summarize
are all dplyr
verbs.
You can learn more about the R dplyr
package here: https://dplyr.tidyverse.org/
Upvotes: 0