Reputation: 20643
I am trying to collect publicly available datasets from UCI repository for R. I understand there are lots of datasets already usable with several R packages such as mlbench.
But there are still several datasets I will need from UCI repository.
Here is a trick I learned:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)
But this does not retrieve the header (variable name) information. That information is in the *.names file, in plain text. Any idea how I can programmatically get the header information as well?
Upvotes: 3
Views: 1996
Reputation: 44614
I suspect you'll have to use regular expressions to accomplish this. Here's an ugly but general solution that should work on a variety of *.names files, assuming their format is similar to the one you posted.
names.file.url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names'
names.file.lines <- readLines(names.file.url)
# get run lengths of consecutive lines containing a colon.
# then find the position of the subgrouping that has a run length
# equal to the number of columns in credit and sum the run lengths up
# to that value to get the index of the last line in the names block.
end.of.names <- with(rle(grepl(':', names.file.lines)),
sum(lengths[1:match(ncol(credit), lengths)]))
# extract those lines
names.lines <- names.file.lines[(end.of.names - ncol(credit) + 1):end.of.names]
# extract the names from those lines
names <- regmatches(names.lines, regexpr('(\\w)+(?=:)', names.lines, perl=TRUE))
# [1] "A1" "A2" "A3" "A4" "A5" "A6" "A7" "A8" "A9" "A10" "A11"
# [12] "A12" "A13" "A14" "A15" "A16"
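If you want to check the run-length logic without downloading anything, it can be exercised on a few in-memory lines shaped like crx.names (the sample lines and the column count below are invented for illustration):

```r
# Hypothetical lines mimicking the structure of a *.names file:
# some prose, a blank line, then one "name: values" line per column.
names.file.lines <- c(
  "Attribute Information:",
  "",
  "A1: b, a.",
  "A2: continuous.",
  "A3: continuous."
)
n.cols <- 3  # stands in for ncol(credit)

# Runs of consecutive lines containing a colon; the run whose length
# equals the column count is the attribute block.
end.of.names <- with(rle(grepl(':', names.file.lines)),
                     sum(lengths[1:match(n.cols, lengths)]))
names.lines <- names.file.lines[(end.of.names - n.cols + 1):end.of.names]

# Pull out the word immediately before each colon.
extracted <- regmatches(names.lines,
                        regexpr('(\\w)+(?=:)', names.lines, perl = TRUE))
extracted
# [1] "A1" "A2" "A3"
```

The key point is that "Attribute Information:" also contains a colon, but it forms a run of length 1, so matching on the run length (the number of columns) skips it.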
Upvotes: 3
Reputation: 6410
I'm guessing that the "Attribute Information" section must contain the names in the specific file you pointed to. Here is a very, very dirty solution. It exploits the fact that there is a pattern: each name is followed by a colon. So we split the strings on ":" using scan, and then grab the names from the raw vector:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)
url.names="http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names"
mess <- scan(url.names, what="character", sep=":")
# the names are located at positions 31 to 61, at every second place in the vector
mess.names <- mess[seq(31,61,2)]
names(credit) <- mess.names
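The positions 31 to 61 are specific to crx.names and were found by inspecting the vector by hand. The odd-position trick itself can be sanity-checked on a small in-memory example (the attribute lines below are invented, and textConnection stands in for the URL):

```r
# Two fake "name: description" lines in place of the downloaded file.
fake.names.text <- "A1: b, a.\nA2: continuous."

# Splitting on ":" makes scan() alternate name / description,
# so the names sit at the odd positions of the result.
mess <- scan(textConnection(fake.names.text), what = "character", sep = ":")
mess.names <- mess[seq(1, length(mess), 2)]
# mess.names is c("A1", "A2")
```

For a different *.names file you would need to inspect `mess` and adjust the index range accordingly.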
Upvotes: 1