Shivaraj Nesargi

Reputation: 139

How to read .data file into R

I have tried to load the data from http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data into R using the following piece of code:

hData <- read.table(file.choose(), sep = "\t", dec = ",", fileEncoding = "UTF-16")

but it's not reading the data in correctly. The data has 76 attributes, and the details are given here: http://archive.ics.uci.edu/ml/datasets/Heart+Disease.

Can someone tell me what I am doing wrong?

Upvotes: 0

Views: 1570

Answers (1)

alistaire

Reputation: 43354

The file contains extra line breaks within each record, which is what trips up read.table. If you strip them out with a regex, you can read it in:

# read file into a single string
x <- readr::read_file('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')

# or in base, x <- paste(readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')), collapse = '\n')

# gsub out line breaks that follow numbers (not "name") and read data
df <- read.table(text = gsub('(\\d)\\n', '\\1 ', x))

head(df, 2)
##     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
## 1 1254  0 40  1  1  0  0 -9  2 140   0 289  -9  -9  -9   0  -9  -9   0  12  16  84   0   0   0
## 2 1255  0 49  0  1  0  0 -9  3 160   1 180  -9  -9  -9   0  -9  -9   0  11  16  84   0   0   0
##   V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
## 1   0   0 150  18  -9   7 172  86 200 110 140  86   0   0   0  -9  26  20  -9  -9  -9  -9  -9
## 2   0   0  -9  10   9   7 156 100 220 106 160  90   0   0   1   2  14  13  -9  -9  -9  -9  -9
##   V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71
## 1  -9  -9  -9  -9  -9  -9  12  20  84   0  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9   1   1   1
## 2  -9  -9  -9  -9  -9  -9  11  20  84   1  -9  -9   2  -9  -9  -9  -9  -9  -9  -9   1   1   1
##   V72 V73 V74 V75  V76
## 1   1   1  -9  -9 name
## 2   1   1  -9  -9 name
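
To see what the regex is doing without downloading anything, here is a minimal sketch on a made-up string. The toy text (two records of six fields, each wrapped over two lines and ending in "name") is an assumption for illustration only; it is not the real file layout.

```r
# Toy stand-in for the real file (hypothetical data, 6 fields per record).
toy <- "1 0 40\n1 1 name\n2 0 49\n0 1 name\n"

# Replace only the newlines that directly follow a digit with a space.
# The newline after "name" is untouched, so it still ends each record.
joined <- gsub('(\\d)\\n', '\\1 ', toy)

toy_df <- read.table(text = joined)
dim(toy_df)
```

The same idea carries over to the real data only because every record there ends in a non-numeric field.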

If the last field of each record didn't happen to be a conveniently different data type, you could instead use scan to read everything into one vector, then split and reassemble it:

# download data and split into a character vector
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'), character())

# split and assemble data.frame
df <- data.frame(split(x, 1:76), stringsAsFactors = FALSE)

# fix types
df[] <- lapply(df, type.convert, as.is = TRUE)
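
As a rough sketch of the same scan-and-split idea on a small scale, the snippet below uses a made-up string with six fields per record (an assumption for illustration; the real file has 76). It also adds a length check, which is not in the code above but may be worth having, since split would otherwise recycle the grouping vector with only a warning.

```r
# Hypothetical toy data: 2 records x 6 fields, wrapped over several lines.
toy <- "1 0 40\n1 1 name\n2 0 49\n0 1 name\n"
n_fields <- 6

# scan ignores line breaks and returns one long character vector
vals <- scan(text = toy, what = character(), quiet = TRUE)

# guard: the number of values must be a whole number of records
stopifnot(length(vals) %% n_fields == 0)

# deal the values out into n_fields columns, then fix the types
toy_df <- data.frame(split(vals, rep_len(seq_len(n_fields), length(vals))),
                     stringsAsFactors = FALSE)
toy_df[] <- lapply(toy_df, type.convert, as.is = TRUE)
str(toy_df)
```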

Or pass scan a template of the types in a single row (its what argument) so it reads directly into a list:

x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'), 
          c(replicate(75, numeric()), list(character())))

df <- as.data.frame(x)
names(df) <- paste0('V', 1:76)    # replace ugly names

If getting the type structure correct is too complicated, read everything in as character with replicate(76, character()) and use type.convert like the previous option.
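
A minimal sketch of that all-character variant, again on made-up data with six fields per record rather than 76 (the toy string and the n_fields name are assumptions for illustration):

```r
# Hypothetical toy data: 2 records x 6 fields, wrapped over several lines.
toy <- "1 0 40\n1 1 name\n2 0 49\n0 1 name\n"
n_fields <- 6

# a list of n_fields empty character vectors tells scan to read
# n_fields character fields per record, across line breaks
rec <- scan(text = toy,
            what = replicate(n_fields, character(), simplify = FALSE),
            quiet = TRUE)

toy_df <- as.data.frame(rec, stringsAsFactors = FALSE)
names(toy_df) <- paste0('V', seq_len(n_fields))

# let R work out the column types afterwards
toy_df[] <- lapply(toy_df, type.convert, as.is = TRUE)
str(toy_df)
```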

Alternatively, use readLines, split the lines into groups of 10 (one group per record), paste each group back into a single line, and pass the result to read.table:

x <- readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'))

df <- read.table(text = paste(sapply(split(x, 
                                           rep(seq(length(x) / 10), each = 10)), 
                                     paste, collapse = ' '), collapse = '\n'))
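
This last approach relies on every record occupying the same number of lines. A small sketch on made-up lines, with two lines per record instead of 10 (the data and the lines_per_record name are assumptions for illustration), shows the grouping step in isolation:

```r
# Hypothetical toy lines: each record is spread over 2 lines.
toy_lines <- c("1 0 40", "1 1 name", "2 0 49", "0 1 name")
lines_per_record <- 2

# guard: the grouping only makes sense for a whole number of records
stopifnot(length(toy_lines) %% lines_per_record == 0)

# 1 1 2 2 ...: one group id per record
groups <- rep(seq_len(length(toy_lines) / lines_per_record),
              each = lines_per_record)

# paste each group into one line, then parse as usual
rows <- vapply(split(toy_lines, groups), paste, character(1), collapse = ' ')
toy_df <- read.table(text = paste(rows, collapse = '\n'))
dim(toy_df)
```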

Upvotes: 3
