Reputation: 89
There are a few values that do not import correctly when performing this read.table:
hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)
Specifically, a few rows have the industry_code and industry_name joined as a single value in the industry_code column (not sure why). Since each industry_code is 4 digits, my approach to split and correct them is:
for (i in 1:nrow(hs.industry)) {
  if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
    hs.industry$industry_name[i] <- gsub("[[:digit:]]", "", hs.industry$industry_code[i])
    hs.industry$industry_code[i] <- gsub("[^0-9]", "", hs.industry$industry_code[i])
  }
}
I feel this is terribly inefficient, but I'm not sure what approach would be better.
Thanks!
Upvotes: 0
Views: 76
Reputation: 11957
The problem is that lines 29 and 30 (rows 28 and 29, if we're not counting the header) have a formatting error. They use 4 spaces instead of a proper tab character. A bit of extra data cleaning is needed.
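One way to confirm which lines are malformed is to count the tab-separated fields per line. The sketch below uses a few toy lines in place of the real file; the two-column layout and the sample rows are assumptions for illustration, not the actual contents of hs.industry.

```r
# Toy stand-in for the raw file (assumed layout, not the real data).
raw <- c(
  "industry_code\tindustry_name",
  "1400\tNonmetallic minerals",
  "1411    Dimension stone",           # 4 spaces instead of a tab
  "1422    Crushed and broken stone"   # 4 spaces instead of a tab
)

# Count tab-separated fields on each line; quoting is disabled as in the question.
con <- textConnection(raw)
n_fields <- count.fields(con, sep = "\t", quote = "")
close(con)

# Lines whose field count differs from the header's are suspect.
bad <- which(n_fields != n_fields[1])
bad
raw[bad]
```

On the real file, the same check could be run on the output of readLines to see which line numbers need attention.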
Use readLines to read in the raw text, correct the formatting error, and then read in the cleaned table:
# read in each line of the file as an element of a character vector
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')
# replace any run of 4 or more spaces with a tab character
hs.industry <- gsub(' {4,}', '\t', hs.industry)
# collapse the lines into a single string, separated by newline characters (\n)
hs.industry <- paste(hs.industry, collapse = '\n')
# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '')
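To try the idea without downloading anything, the same clean-then-parse steps can be run on a few toy lines. This is only a sketch: the sample rows and the two-column layout are made up, it assumes the bad separator is a run of four or more spaces, and colClasses = "character" is an optional choice to keep the codes as text.

```r
# Toy stand-in for the raw file (assumed layout, not the real data).
raw <- c(
  "industry_code\tindustry_name",
  "1400\tNonmetallic minerals",
  "1411    Dimension stone",
  "1422    Crushed and broken stone"
)

# Turn each run of 4+ spaces into a tab so every line has the same separator.
cleaned <- gsub(" {4,}", "\t", raw)

# read.table accepts a character vector of lines through its text argument.
toy <- read.table(text = cleaned, sep = "\t", header = TRUE, quote = "",
                  colClasses = "character")
str(toy)
```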
Upvotes: 4
Reputation: 3369
You should not have to loop through each row. Instead, identify only the problematic entries and apply gsub to those:
replace_indx <- which(nchar(hs.industry$industry_code) > 4)
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])
I also used "\\d+\\s+" to improve the string replacement, so that the spaces after the code are removed along with the digits. Compare:
gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx])
# [1] " Dimension stone" " Crushed and broken stone"
gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
# [1] "Dimension stone" "Crushed and broken stone"
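As a variation, the patterns can be anchored to the start of the string so that any digits appearing later in an industry name are left alone. The sketch below runs on a toy data frame; the rows and the name `toy` are made up for illustration, and it assumes every joined value starts with the numeric code followed by whitespace.

```r
# Toy data frame mimicking the import problem (assumed values, not the real data).
toy <- data.frame(
  industry_code = c("1400", "1411    Dimension stone", "1422    Crushed and broken stone"),
  industry_name = c("Nonmetallic minerals", "", ""),
  stringsAsFactors = FALSE
)

# Logical index of rows where code and name were joined.
bad <- nchar(toy$industry_code) > 4

# Fill in the name first (it is derived from the joined value),
# then trim the code column down to its leading digits.
toy$industry_name[bad] <- sub("^\\d+\\s+", "", toy$industry_code[bad])
toy$industry_code[bad] <- sub("^(\\d+).*$", "\\1", toy$industry_code[bad])

toy
```

The order matters here: the name has to be extracted before the code column is overwritten.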
Upvotes: 1