Douglas K
Douglas K

Reputation: 81

Replacing expression within data.table

I'm running the following code below to retrieve a data set, which unfortunately uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and go on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. Help in solving this would be greatly appreciated. The first four columns are integers setting out the date and the unique ID, so I would only need to correct the other columns. Thanks in advance you all!

library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google Mobility - State - Daily.csv",stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]

Upvotes: 0

Views: 146

Answers (1)

DaveTurek
DaveTurek

Reputation: 1297

I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.

Use readLines to read a character vector.

Use gsub to change . to NA.

Use fread to read as a data.table.

    library(data.table)
    gmd <- readLines("Google Mobility - State - Daily.csv")
    gmd[c(2,3120)]
    
#    [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
#    [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"  
    
    gmd <- gsub(",\\.,",",NA,",gmd)
    gmd <- gsub(",\\.$",",NA",gmd)
    gmd[c(2,3120)]
    
#    [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
#    [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"  
    
    google_mobility_data <- fread(text=gmd)
    google_mobility_data[c(1,3119)]
    
#   year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020     2  24         1                   0.00286                 -0.00714    0.0557                0.060         0.0129         0.00857                 NA
#2: 2020     4  25        10                  -0.38400                 -0.19100        NA               -0.479        -0.4410         0.17900             -0.213
    
    summary(google_mobility_data)

EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.

However, at least with the data file downloaded as I did, this worked in one line - as suggested by @MichaelChirico:

google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
   year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020     2  24         1                   0.00286                 -0.00714    0.0557                0.060         0.0129         0.00857                 NA
#2: 2020     4  25        10                  -0.38400                 -0.19100        NA               -0.479        -0.4410         0.17900             -0.213

Upvotes: 1

Related Questions