Reputation: 19
I have a dataset that I am trying to read into R, but it is in .dat format. I have been given code for reading the dataset into SAS, but not for reading it into R. I am having trouble translating this into something I can use to get the data into a usable state. Does anyone have any advice? Here is the SAS code:
/* This program is to read in the SPARCS Diagnosis data table. */
OPTIONS NOCENTER NODATE FORMDLIM=' ' compress=yes pagesize=50;
/*USER INPUT NEEDED*/
%let file=".\SPARCS_Extract\SPARCS_DIAG.dat"; *Set to your path;
data SPARCS_DIAG ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile &file. delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2 /*obs = 1000*/;
informat clm_trans_id $12. ;
informat disch_yr $4. ;
informat dx_type_cd $2. ;
informat seq_id 8. ;
informat clm_type_cd $1. ;
informat upide $128. ;
informat dx_catgy_cd $2. ;
informat dx_grp_cd $3. ;
informat dx_cd $7. ;
informat poa_ind $1. ;
informat DX_VERS_TYPE_CD $5. ;
informat clm_key $12. ;
informat actv_flag $1. ;
informat ltst_flag $1. ;
informat processed_dt $8. ;
informat created_by $20. ;
informat last_updd_dt $8. ;
informat last_updd_by $20. ;
informat src_nm $30. ;
informat insert_row_dt $8. ;
informat abort_ind $1. ;
informat hiv_ind $1. ;
format clm_trans_id $12. ;
format disch_yr $4. ;
format dx_type_cd $2. ;
format seq_id 8. ;
format clm_type_cd $1. ;
format upide $128. ;
format dx_catgy_cd $2. ;
format dx_grp_cd $3. ;
format dx_cd $7. ;
format poa_ind $1. ;
format DX_VERS_TYPE_CD $5. ;
format clm_key $12. ;
format actv_flag $1. ;
format ltst_flag $1. ;
format processed_dt $8. ;
format created_by $20. ;
format last_updd_dt $8. ;
format last_updd_by $20. ;
format src_nm $30. ;
format insert_row_dt $8. ;
format abort_ind $1. ;
format hiv_ind $1. ;
input
clm_trans_id $
disch_yr $
dx_type_cd $
seq_id
clm_type_cd $
upide $
dx_catgy_cd $
dx_grp_cd $
dx_cd $
poa_ind $
DX_VERS_TYPE_CD $
clm_key $
actv_flag $
ltst_flag $
processed_dt $
created_by $
last_updd_dt $
last_updd_by $
src_nm $
insert_row_dt $
abort_ind $
hiv_ind $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Upvotes: 0
Views: 488
Reputation: 107567
The closest R equivalent for reading the .dat file is the base function read.table; read.csv (for comma-separated values) and read.delim (for tab-separated values) are wrappers around it.
Additionally, the SAS code specifies the data type of every column, with lengths: $ translates to character, and the one remaining column (seq_id) is numeric or integer. Therefore, use the colClasses argument, which can also run faster because R does not have to infer types while parsing.
Do note: R does not require lengths for strings or numbers, and R is case sensitive (i.e., DX_VERS_TYPE_CD != dx_vers_type_cd).
SPARCS_DIAG <- read.table(
"SPARCS_DIAG.dat",
sep = "|",
header = TRUE, # the SAS firstobs=2 option suggests line 1 holds the column names
colClasses = c(
"clm_trans_id" = "character",
"disch_yr" = "character",
"dx_type_cd" = "character",
"seq_id" = "integer",
"clm_type_cd" = "character",
"upide" = "character",
"dx_catgy_cd" = "character",
"dx_grp_cd" = "character",
"dx_cd" = "character",
"poa_ind" = "character",
"DX_VERS_TYPE_CD" = "character",
"clm_key" = "character",
"actv_flag" = "character",
"ltst_flag" = "character",
"processed_dt" = "character",
"created_by" = "character",
"last_updd_dt" = "character",
"last_updd_by" = "character",
"src_nm" = "character",
"insert_row_dt" = "character",
"abort_ind" = "character",
"hiv_ind" = "character"
)
)
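To see how the remaining SAS infile options might map onto read.table arguments, here is a minimal sketch on a made-up three-column file. The sample rows and the temporary file are invented for illustration, and it assumes that firstobs=2 means the first line of the real file is a header row.

```r
# Hypothetical sample data: a tiny pipe-delimited file with a header row.
tmp <- tempfile(fileext = ".dat")
writeLines(c(
  "clm_trans_id|seq_id|dx_cd",
  "000123|1|0389",
  "000124|2|\"V27|0\"",   # quoted field containing the delimiter
  "000125|3"              # short record, last field missing
), tmp)

demo <- read.table(
  tmp,
  sep = "|",
  header = TRUE,        # SAS: firstobs=2 skips the header line
  quote = "\"",         # SAS: DSD treats double quotes as field quoting
  fill = TRUE,          # SAS: MISSOVER leaves short records' trailing fields missing
  comment.char = "",    # do not treat '#' inside the data as a comment
  na.strings = "",      # read empty fields as NA
  colClasses = c(
    clm_trans_id = "character",  # keeps leading zeros
    seq_id       = "integer",
    dx_cd        = "character"
  )
)

str(demo)
```

Reading identifier and code columns as character is what keeps leading zeros intact, which is the main practical reason to carry the SAS $ informats over rather than letting R guess.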
However, seeing your comment that you already tried read.table (possibly without colClasses), note that the wrappers set some defaults that may help, such as header = TRUE, quote = "\"", and fill = TRUE. Therefore, consider using those functions and changing only the sep argument:
SPARCS_DIAG <- read.csv(
"SPARCS_DIAG.dat",
sep = "|",
colClasses = c(
"clm_trans_id" = "character",
"disch_yr" = "character",
"dx_type_cd" = "character"
# ... list the remaining columns as in the read.table() call
)
)
SPARCS_DIAG <- read.delim(
"SPARCS_DIAG.dat",
sep = "|",
colClasses = c(
"clm_trans_id" = "character",
"disch_yr" = "character",
"dx_type_cd" = "character"
# ... list the remaining columns as in the read.table() call
)
)
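Because the wrapper calls abbreviate the column list, one way to avoid retyping it is to build colClasses once from the column order in the SAS input statement and reuse it. This is only a sketch: read_sparcs_diag is a hypothetical helper name, and it assumes the header row of the file uses the same names and order as the SAS code.

```r
# Column names in the order given by the SAS input statement.
diag_cols <- c(
  "clm_trans_id", "disch_yr", "dx_type_cd", "seq_id", "clm_type_cd", "upide",
  "dx_catgy_cd", "dx_grp_cd", "dx_cd", "poa_ind", "DX_VERS_TYPE_CD", "clm_key",
  "actv_flag", "ltst_flag", "processed_dt", "created_by", "last_updd_dt",
  "last_updd_by", "src_nm", "insert_row_dt", "abort_ind", "hiv_ind"
)

# Every column is character except seq_id, the only informat without `$`.
diag_classes <- setNames(rep("character", length(diag_cols)), diag_cols)
diag_classes["seq_id"] <- "integer"

# Hypothetical helper. read.delim already defaults to header = TRUE,
# quote = "\"" and fill = TRUE, so only sep, colClasses and na.strings
# are overridden here.
read_sparcs_diag <- function(path) {
  read.delim(path, sep = "|", colClasses = diag_classes, na.strings = "")
}
```

With that in place, the import reduces to SPARCS_DIAG <- read_sparcs_diag("SPARCS_DIAG.dat"), and str(SPARCS_DIAG) is a quick way to confirm that the types came through as intended.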
Upvotes: 1