Reputation: 139
I have tried to load the data from http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data into R using the following code:
hData <- read.table(file.choose(), sep = "\t", dec = ",", fileEncoding = "UTF-16")
but it's not loading the data correctly. The data set has 76 attributes, and the details are given here: http://archive.ics.uci.edu/ml/datasets/Heart+Disease.
Can someone tell me what I am doing incorrectly?
Upvotes: 0
Views: 1570
Reputation: 43354
The file contains extra line breaks that are causing issues: each record is wrapped across multiple lines. If you strip them out with a regex, you can read it in:
# read file into a single string
x <- readr::read_file('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')
# or in base, x <- paste(readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')), collapse = '\n')
# gsub out line breaks that follow numbers (not "name") and read data
df <- read.table(text = gsub('(\\d)\\n', '\\1 ', x))
head(df, 2)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
## 1 1254 0 40 1 1 0 0 -9 2 140 0 289 -9 -9 -9 0 -9 -9 0 12 16 84 0 0 0
## 2 1255 0 49 0 1 0 0 -9 3 160 1 180 -9 -9 -9 0 -9 -9 0 11 16 84 0 0 0
## V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
## 1 0 0 150 18 -9 7 172 86 200 110 140 86 0 0 0 -9 26 20 -9 -9 -9 -9 -9
## 2 0 0 -9 10 9 7 156 100 220 106 160 90 0 0 1 2 14 13 -9 -9 -9 -9 -9
## V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71
## 1 -9 -9 -9 -9 -9 -9 12 20 84 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 1 1 1
## 2 -9 -9 -9 -9 -9 -9 11 20 84 1 -9 -9 2 -9 -9 -9 -9 -9 -9 -9 1 1 1
## V72 V73 V74 V75 V76
## 1 1 1 -9 -9 name
## 2 1 1 -9 -9 name
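To see what the regex is doing without downloading anything, here is a minimal sketch on a toy string. The values and the name toy are made up for illustration (two records, each wrapped over two lines and ending in name); they are not taken from hungarian.data.

```r
# toy input: two records, each wrapped over two lines and ending in "name"
# (invented values, not from the real file)
toy <- "1 2 3\n4 5 name\n6 7 8\n9 10 name\n"

# a newline that follows a digit is a wrap inside a record, so turn it into a space;
# the newline after "name" is untouched and still ends the row
fixed <- gsub('(\\d)\\n', '\\1 ', toy)

df_toy <- read.table(text = fixed)
dim(df_toy)  # 2 rows, 6 columns
```

This only works because every record ends in a non-digit; if the last field were numeric, the regex would glue all the records into one row.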
If there doesn't happen to be a conveniently different data type at the end of each record, you can use scan to read every value into one vector, then split it into 76 columns and reassemble:
# download data and split into a character vector
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'), character())
# split and assemble data.frame
df <- data.frame(split(x, 1:76), stringsAsFactors = FALSE)
# fix types
df[] <- lapply(df, type.convert, as.is = TRUE)
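The same scan and split idea on a small made-up input, as a rough sketch. Here each record has 6 fields instead of 76, and x_toy and df_toy2 are illustrative names. The point is that split recycles 1:6 along the vector, so the first group collects the 1st, 7th, ... values, which is exactly the first column.

```r
# toy input: 2 records of 6 fields each, wrapped arbitrarily (invented values)
toy <- "1 2 3\n4 5 name\n6 7 8\n9 10 name\n"

# scan ignores line breaks and returns one flat character vector of 12 values
x_toy <- scan(text = toy, what = character(), quiet = TRUE)

# 1:6 is recycled over the 12 values, so group k holds column k of every record
df_toy2 <- data.frame(split(x_toy, 1:6), stringsAsFactors = FALSE)

# everything is character at this point; convert each column to its natural type
df_toy2[] <- lapply(df_toy2, type.convert, as.is = TRUE)
str(df_toy2)
```

This relies on the total number of values being an exact multiple of the number of columns; a missing or extra field anywhere shifts everything after it.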
Or pass scan a template of the types in a single row (its what argument) so that it reads directly into a list:
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'),
c(replicate(75, numeric()), list(character())))
df <- as.data.frame(x)
names(df) <- paste0('V', 1:76) # replace ugly names
If getting the type structure correct is too complicated, read everything in as character with replicate(76, character()) and use type.convert as in the previous option.
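A small sketch of the typed-template version, again on invented data with 6 fields per record rather than 76. The names toy, template and df_toy3 are assumptions for the example. When what is a list, scan treats each list element as one field of a record and, by default, lets a record run over several lines.

```r
# toy input: 2 records, 5 numeric fields followed by one character field
toy <- "1 2 3\n4 5 name\n6 7 8\n9 10 name\n"

# one list element per field: 5 numerics, then a character
# (simplify = FALSE makes it explicit that replicate should return a list)
template <- c(replicate(5, numeric(), simplify = FALSE), list(character()))

# scan fills the template field by field and returns a list of 6 vectors
x_toy <- scan(text = toy, what = template, quiet = TRUE)

df_toy3 <- as.data.frame(x_toy, stringsAsFactors = FALSE)
names(df_toy3) <- paste0('V', 1:6)  # replace the auto-generated names
str(df_toy3)
```

The upside over the split approach is that the types are fixed up front, so a stray non-numeric value in a numeric field raises an error instead of silently turning the column into character.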
Alternatively, use readLines to read the file line by line, split the lines into groups of 10 (one group per record), and paste each group back together into a single row so that read.table can parse it:
x <- readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'))
df <- read.table(text = paste(sapply(split(x,
rep(seq(length(x) / 10), each = 10)),
paste, collapse = ' '), collapse = '\n'))
Upvotes: 3