Reputation: 71
I'm attempting to read this file into R: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/21447# (the commoncontent2012.tab file)
When I use read.delim(), everything at first seems OK. However, only about two-thirds of the expected observations are imported. When I use read.table(), it imports the correct number of rows, but there are other problems with the column names.
Upvotes: 0
Views: 248
Reputation: 263352
The file I think you mean is not a tab-separated file, despite what the website might lead you to assume. It is a Stata-formatted file with a '.dta' extension, so use read.dta from package foreign:
require(foreign)
inp <- read.dta("~/Downloads/commoncontent2012.dta")
str(inp)
# a really "wide" file
'data.frame': 54535 obs. of 479 variables:
$ V101 : int 162390854 162397903 162377974 164027062 164852532 166088596 162312322 162347328 162138459 162263731 ...
$ V103 : num 0.213 0.572 0.371 0.511 0.788 ...
$ comptype : Factor w/ 13 levels "Windows Desktop",..: 2 1 1 1 2 1 1 1 2 2 ...
$ inputzip : int NA NA 92637 NA NA NA 33914 NA NA NA ...
$ birthyr : int 1928 1947 1923 1967 1944 1956 1937 1931 1956 1954 ...
$ gender : Factor w/ 4 levels "Male","Female",..: 1 1 2 2 1 1 2 1 1 1 ...
$ educ : Factor w/ 8 levels "No HS","High school graduate",..: 6 5 6 3 6 5 3 2 3 6 ...
$ race : Factor w/ 10 levels "White","Black",..: 1 1 1 1 3 1 1 1 1 1 ...
$ hispanic : Factor w/ 4 levels "Yes","No","Skipped",..: 2 2 2 2 NA 2 2 2 2 2 ...
$ votereg : Factor w/ 5 levels "Yes","No","Don't know",..: 1 1 1 1 1 1 1 1 1 1 ...
$ regzip : int NA NA NA NA NA NA NA NA NA NA ...
# snipped the rest of the output
But then I also looked at the file named dataverse.zip, which when expanded included a commoncontent2012.tab file. When read with read.delim I get:
> inp2 <- read.delim("~/Downloads/dataverse_files/commoncontent2012.tab")
> str(inp2)
'data.frame': 30140 obs. of 479 variables:
$ V101 : int 162390854 162397903 162377974 164027062 164852532 166088596 162312322 162347328 162138459 162263731 ...
$ V103 : num 0.213 0.572 0.371 0.511 0.788 ...
$ comptype : int 2 1 1 1 2 1 1 1 2 2 ...
$ inputzip : int NA NA 92637 NA NA NA 33914 NA NA NA ...
$ birthyr : Factor w/ 78 levels "__NA__","1918",..: 12 31 7 51 28 40 21 15 40 38 ...
$ gender : int 1 1 2 2 1 1 2 1 1 1 ...
$ educ : int 6 5 6 3 6 5 3 2 3 6 ...
$ race : int 1 1 1 1 3 1 1 1 1 1 ...
# rest of output deleted
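Note that this is 30140 rows against the 54535 in the .dta version, which fits the "about two-thirds" in the question. One common cause of rows going missing in read.delim is a stray quote character inside a text field: everything up to the next quote, newlines included, gets read as a single value. That is only an assumption here, since I have not confirmed it for this file. A minimal sketch on a made-up four-row file (the file and its contents are hypothetical, not the CCES data):

```r
# Toy tab-separated file (hypothetical data, not the CCES file):
# rows 1 and 3 each contain one stray double quote.
tmp <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tcomment",
  "1\t\"hello",
  "2\tfine",
  "3\tbye\" now",
  "4\tok"
), tmp)

# read.delim() defaults to quote = "\"", so the stray quotes open and
# close a quoted string that spans several physical lines.
with_quotes <- read.delim(tmp)

# Turning quote handling off reads each physical line as its own row.
no_quotes <- read.delim(tmp, quote = "")

# Number of data lines in the file, excluding the header.
n_lines <- length(readLines(tmp)) - 1

nrow(with_quotes)  # compare this against n_lines
nrow(no_quotes)
n_lines
```

If nrow() with quote = "" matches the line count on the real file, quoting was probably the culprit. count.fields(file, sep = "\t", quote = "") can then confirm that every line has the same number of fields.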
So how does this compare with what you expected to be in these files, or with what you are seeing? You didn't say precisely what your problems were.
Upvotes: 1