Reputation: 345
The code below downloads census data from the United States census, names the columns and aims to split the column called FIPS into two. The FIPS column is numeric. The first two characters in position 1 and 2 should go into one column, StateFIPS and the last two characters in position 4 and 5 will make up the CountyFIPS column. The character in the 3rd position will be discarded. The problem I run into is that leading zero's are deleted.
In a previous post, I provided only a segment of code to learn how to split the string, which helped. However, when I applied it to my bigger code chunk it did not work. How do I prevent the deletion of leading zero's while splitting a string in the in the code below?
#State census data from 1990 to 1999
censusneeded<-seq(90,99,1)
for(i in 1:length(censusneeded)){
URL <- paste("https://www.census.gov/popest/data/intercensal/st-co/tables/STCH-Intercensal/STCH-icen19",censusneeded[i],".txt", sep="")
destfile <- paste(censusneeded[i],"statecensus.txt", sep="")
download.file(URL, destfile)
}
#Data fields Year, FIPS Code, FIPS code county, Age Group, Race-Sex, Ethnic Origin, POP
#We need to give names to the columns and separate the FIPS State Code and FIPS Code county
cleancensus_1990_1999 <- function(statecensus){
colnames(statecensus_90_99) <- c("Year", "FIPS", "AgeGroup", "RaceSex",
"HispanicStatus","Population")#label the columns
##separate the FIPS column into a column of State FIPS code and County FIPS code by
x <- c(as.character(statecensus_90_99$FIPS))
# x <- as.vector(as.character(statecensus_90_99$FIPS)) #I thought converting the column to a character and vector would prevent the drop of leading zeros when splitting the string
newfips <- lapply(2:3,function(i) if(i==2) str_sub(x,end=i) else str_sub(x,i+1))
StateFIPS <- newfips[[1]]
#StateFIPS <- substr(x, 1, 2) # 2nd attempt also doesn't work
CountyFIPS <- newfips[[2]]
#CountyFIPS <- str_sub(x,4,5) #2nd attempt also did not work because it drops leading zeros.
return(statecensus)
}
#lets apply the cleaning to census 90 to 99
for(i in 1:length(censusneeded)){
statecensus <- read.table(paste(censusneeded[i],"statecensus.txt", sep=""))
newcensus <- cleancensus_1990_1999(statecensus)
write.csv(newcensus, paste(censusneeded[i],"state1990_1999.txt", sep=""))
}
Thank you!
Upvotes: 0
Views: 375
Reputation: 13108
I rewrite your function so that it returns the original dataframe, plus two additional columns for StateFIPS
and CountyFIPS
(side note: do you really only want a 2-character CountyFIPS? So 06001 (Alameda County, CA) and 06101 (Sutter County, CA) will have the same CountyFIPS of "01").
cleancensus <- function(d) {
colnames(d) <- c("Year", "FIPS", "AgeGroup", "RaceSex",
"HispanicStatus","Population")
d$FIPS <- sprintf("%05d", d$FIPS)
d$StateFIPS <- substr(d$FIPS, 1, 2)
d$CountyFIPS <- substr(d$FIPS, 4, 5)
d
}
Try out the function:
data_url <- "https://www.census.gov/popest/data/intercensal/st-co/tables/STCH-Intercensal/STCH-icen1999.txt"
statecensus <- read.table(url(data_url))
d <- cleancensus(statecensus)
head(d)
# Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1 99 01001 0 1 1 218 01 01
# 2 99 01001 0 2 1 239 01 01
# 3 99 01001 1 1 1 947 01 01
# 4 99 01001 1 2 1 928 01 01
# 5 99 01001 2 1 1 1460 01 01
# 6 99 01001 2 2 1 1355 01 01
It behaves as expected (leading zeros are retained). Now, suppose we write it to csv, and read it back:
write.csv(d, "~/Desktop/census99.csv", row.names = FALSE)
d <- read.csv("~/Desktop/census99.csv")
head(d)
# Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1 99 1001 0 1 1 218 1 1
# 2 99 1001 0 2 1 239 1 1
# 3 99 1001 1 1 1 947 1 1
# 4 99 1001 1 2 1 928 1 1
# 5 99 1001 2 1 1 1460 1 1
# 6 99 1001 2 2 1 1355 1 1
The leading zeros are gone. This is because read.csv
coerces character vectors to numeric where it can. There are (at least) two ways to solve this:
sprintf
. Use the sprintf
function to pad the numbers with leading zeros, so e.g. calling sprintf("%03d", 7)
-- take an integer value ("d") and make it 3 characters wide, padding with leading 0s when necessary -- returns "007":
d$FIPS <- sprintf("%05d", d$FIPS)
d$StateFIPS <- sprintf("%02d", d$StateFIPS)
d$CountyFIPS <- sprintf("%02d", d$CountyFIPS)
Specify the column classes when you read in the data:
d <- read.csv("~/Desktop/census99.csv",
colClasses = c("numeric", # Year
"character", # FIPS
rep("numeric", 4), # AgeGroup..Population
rep("character", 2) # StateFIPS, CountyFIPS
)
)
head(d)
# Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1 99 01001 0 1 1 218 01 01
# 2 99 01001 0 2 1 239 01 01
# 3 99 01001 1 1 1 947 01 01
# 4 99 01001 1 2 1 928 01 01
# 5 99 01001 2 1 1 1460 01 01
# 6 99 01001 2 2 1 1355 01 01
Upvotes: 3