Meli
Meli

Reputation: 345

deletion of leading zeros in string split in R

The code below downloads census data from the United States census, names the columns and aims to split the column called FIPS into two. The FIPS column is numeric. The first two characters in position 1 and 2 should go into one column, StateFIPS and the last two characters in position 4 and 5 will make up the CountyFIPS column. The character in the 3rd position will be discarded. The problem I run into is that leading zero's are deleted.

In a previous post, I provided only a segment of code to learn how to split the string, which helped. However, when I applied it to my bigger code chunk it did not work. How do I prevent the deletion of leading zero's while splitting a string in the in the code below?

 #State census data from 1990 to 1999 

 censusneeded<-seq(90,99,1)

 for(i in 1:length(censusneeded)){

   URL <- paste("https://www.census.gov/popest/data/intercensal/st-co/tables/STCH-Intercensal/STCH-icen19",censusneeded[i],".txt", sep="")
   destfile <- paste(censusneeded[i],"statecensus.txt", sep="")
   download.file(URL, destfile)
 }


 #Data fields Year, FIPS Code, FIPS code county, Age Group, Race-Sex, Ethnic Origin, POP

 #We need to give names to the columns and separate the FIPS State Code and FIPS Code county

 cleancensus_1990_1999 <- function(statecensus){   
 colnames(statecensus_90_99) <- c("Year", "FIPS", "AgeGroup", "RaceSex", 
                                 "HispanicStatus","Population")#label the columns

 ##separate the FIPS column into a column of  State FIPS code and County FIPS code by 

   x <- c(as.character(statecensus_90_99$FIPS)) 
 # x <- as.vector(as.character(statecensus_90_99$FIPS)) #I thought converting the column to a character and vector would prevent  the drop of leading zeros when splitting the string

newfips <- lapply(2:3,function(i) if(i==2) str_sub(x,end=i) else str_sub(x,i+1)) 
StateFIPS <- newfips[[1]]
#StateFIPS <- substr(x, 1, 2) # 2nd attempt also doesn't work
CountyFIPS <- newfips[[2]]
#CountyFIPS <- str_sub(x,4,5) #2nd attempt also did not work because it drops leading zeros.
return(statecensus)

}

#lets apply the cleaning to census 90 to 99

for(i in 1:length(censusneeded)){
   statecensus <- read.table(paste(censusneeded[i],"statecensus.txt", sep=""))
   newcensus <- cleancensus_1990_1999(statecensus)
   write.csv(newcensus, paste(censusneeded[i],"state1990_1999.txt", sep=""))
  }

Thank you!

Upvotes: 0

Views: 375

Answers (1)

Weihuang Wong
Weihuang Wong

Reputation: 13108

I rewrite your function so that it returns the original dataframe, plus two additional columns for StateFIPS and CountyFIPS (side note: do you really only want a 2-character CountyFIPS? So 06001 (Alameda County, CA) and 06101 (Sutter County, CA) will have the same CountyFIPS of "01").

cleancensus <- function(d) {
    colnames(d) <- c("Year", "FIPS", "AgeGroup", "RaceSex", 
        "HispanicStatus","Population")
    d$FIPS <- sprintf("%05d", d$FIPS)
    d$StateFIPS <- substr(d$FIPS, 1, 2)
    d$CountyFIPS <- substr(d$FIPS, 4, 5)
    d
}

Try out the function:

data_url <- "https://www.census.gov/popest/data/intercensal/st-co/tables/STCH-Intercensal/STCH-icen1999.txt"
statecensus <- read.table(url(data_url))
d <- cleancensus(statecensus)
head(d)
#   Year  FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1   99 01001        0       1              1        218        01         01
# 2   99 01001        0       2              1        239        01         01
# 3   99 01001        1       1              1        947        01         01
# 4   99 01001        1       2              1        928        01         01
# 5   99 01001        2       1              1       1460        01         01
# 6   99 01001        2       2              1       1355        01         01

It behaves as expected (leading zeros are retained). Now, suppose we write it to csv, and read it back:

write.csv(d, "~/Desktop/census99.csv", row.names = FALSE)
d <- read.csv("~/Desktop/census99.csv")
head(d)
#   Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1   99 1001        0       1              1        218         1          1
# 2   99 1001        0       2              1        239         1          1
# 3   99 1001        1       1              1        947         1          1
# 4   99 1001        1       2              1        928         1          1
# 5   99 1001        2       1              1       1460         1          1
# 6   99 1001        2       2              1       1355         1          1

The leading zeros are gone. This is because read.csv coerces character vectors to numeric where it can. There are (at least) two ways to solve this:

  1. sprintf. Use the sprintf function to pad the numbers with leading zeros, so e.g. calling sprintf("%03d", 7) -- take an integer value ("d") and make it 3 characters wide, padding with leading 0s when necessary -- returns "007":

    d$FIPS <- sprintf("%05d", d$FIPS)
    d$StateFIPS <- sprintf("%02d", d$StateFIPS)
    d$CountyFIPS <- sprintf("%02d", d$CountyFIPS)
    
  2. Specify the column classes when you read in the data:

    d <- read.csv("~/Desktop/census99.csv",
        colClasses = c("numeric",  # Year
            "character",           # FIPS
            rep("numeric", 4),     # AgeGroup..Population
            rep("character", 2)    # StateFIPS, CountyFIPS
            )
        )
    head(d)
    #   Year  FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
    # 1   99 01001        0       1              1        218        01         01
    # 2   99 01001        0       2              1        239        01         01
    # 3   99 01001        1       1              1        947        01         01
    # 4   99 01001        1       2              1        928        01         01
    # 5   99 01001        2       1              1       1460        01         01
    # 6   99 01001        2       2              1       1355        01         01
    

Upvotes: 3

Related Questions