Lana Meijinhos
Lana Meijinhos

Reputation: 185

How to open a text (.txt) file from SAS in R?

I have a text (.txt) file that apparently was created on SAS. I have no idea how SAS works, but I need to open the text file in R so it reads as a dataframe with variables names and all.

The file can be downloaded here: https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=pense/2009/microdados/ . It's a zip file called "PENSE_2009_microdados.zip". There are 3 folders:

  1. Dados - with "Base_Pense Completa_2009.txt"
  2. Documentação - with the data dictionary
  3. Input SAS - with a .csv file that I could not understand
  4. A "read me" file that explains that the "Dados" folder contains the data and the "Documentação" folder contains "SAS's layout and reading input".

Does anybody know how to "transform" this data into a dataframe that makes sense? Thanks in advance.

Upvotes: 0

Views: 204

Answers (1)

rawr
rawr

Reputation: 20811

It's a fixed-width file, each line contains all variables in the documentation without spacing, and you provide the number of characters that make up each column.

The input sas file (this should be a .sas, not sure why it would be excel format) tells you the variable names and their formats/widths.

You can also get this from the documentation excel file plus the format of factor variables, eg, B01P02 has 5 unique values coded 1 thru 5.

doc <- readxl::read_excel('PENSE_2009_microdados/Documenta‡ֶo/Dicionario_Base Pense_Completa.xls')
## the widths for each column
wid <- doc$TÉRMINO - doc$INICIO + 1

dat <- read.fwf(
  'PENSE_2009_microdados/Dados/Base_Pense Completa_2009.txt',
  ## also add variable names
  widths = wid, col.names = doc$CAMPO,
  # n = 1000,
  ## some options for importing NAs/trimming white space
  na.strings = c('NA', '.', ''), strip.white = TRUE
)

## add labels if wanted
dat <- Hmisc::upData(dat, labels = setNames(doc$DESCRICAO, doc$CAMPO))

str(dat)

Here is how you can use the formats. They are not standardized, so not all will be fixed this way:

x <- doc$DOMÍNIO[doc$CAMPO %in% 'B01P02']
# [1] "1- Branca\n 2- Preta(Negra)\n 3- Parda(Mulata) \n 4- Amarela(Oriental) \n 5- Indígena"

fmt <- read.table(text = x, sep = '-', strip.white = TRUE)
x_fmt <- factor(dat$B01P02, fmt$V1, fmt$V2)

table(old = dat$B01P02, new = x_fmt)
#    new
# old Branca Preta(Negra) Parda(Mulata) Amarela(Oriental) Indígena
#   1    309            0             0                 0        0
#   2      0          167             0                 0        0
#   3      0            0           436                 0        0
#   4      0            0             0                27        0
#   5      0            0             0                 0       30

Upvotes: 4

Related Questions