zsad512
zsad512

Reputation: 881

Converting different columns to different formats

I have a df in R that I have loaded using:

data <- fread("Data/LuminateDataExport_UTDP2_011818.csv", colClasses = 'character', stringsAsFactors = FALSE)

I did this because I had to perform certain operations like stripping "$", etc.

Now, I am trying to convert the columns into the appropriate formats without having to as._ each column individually...

The structure of the current df is:

> str(data)
Classes ‘data.table’ and 'data.frame':  196879 obs. of  32 variables:
 $ city             : chr  "" "" "" "" ...
 $ company_goal     : chr  "" "" "" "" ...
 $ company_name     : chr  "" "" "" "" ...
 $ event_date       : chr  "5/14/2016" "9/26/2015" "9/12/2015" "6/3/2017" ...
 $ event_year       : chr  "FY 2016" "FY 2016" "FY 2016" "FY 2017" ...
 $ fundraising_goal : chr  "250" "200" "350" "0" ...
 $ name             : chr  "Heart Walk 2015-2016 St. Louis MO" "Heart Walk 2015-2016 Canton, OH" "Heart Walk 2015-2016 Dallas, TX" "FDA HW 2016-2017 Albany, NY WO-65355" ...
 $ participant_id   : chr  "2323216" "2273391" "2419569" "4088558" ...
 $ state            : chr  "" "OH" "TX" "" ...
 $ street           : chr  "" "" "" "" ...
 $ team_average     : chr  "176" "123" "306" "47" ...
 $ team_captain     : chr  "No" "No" "Yes" "No" ...
 $ team_count       : chr  "7" "6" "4" "46" ...
 $ team_id          : chr  "152788" "127127" "45273" "179207" ...
 $ team_member_goal : chr  "0" "0" "0" "0" ...
 $ team_name        : chr  "Team Clayton" "Cardiac Crusaders" "BIS - Team Myers" "Independent Walkers" ...
 $ team_total_gifts : chr  "1,230 " "738" "1,225 " "2,145 " ...
 $ zip              : chr  "" "" "" "" ...
 $ gifts_count      : chr  "2" "1" "2" "1" ...
 $ registration_gift: chr  "No" "No" "No" "No" ...
 $ participant_gifts: chr  "236" "218" "225" "0" ...
 $ personal_gift    : chr  "0" "0" "0" "250" ...
 $ total_gifts      : chr  "236" "218" "225" "250" ...
 $ match_code       : chr  "UX000" "UX000" "UX000" "UX000" ...
 $ tap_level        : chr  "X" "X" "X" "X" ...
 $ tap_desc         : chr  "" "" "" "" ...
 $ tap_lifed        : chr  "" "" "" "" ...
 $ medage_cy        : chr  "0" "0" "0" "0" ...
 $ divindx_cy       : chr  "0" "0" "0" "0" ...
 $ medhinc_cy       : chr  "0" "0" "0" "0" ...
 $ meddi_cy         : chr  "0" "0" "0" "0" ...
 $ mednw_cy         : chr  "0" "0" "0" "0" ...
 - attr(*, ".internal.selfref")=<externalptr> 

Now, as a first step- I am trying to convert all of the numbers to_numeric.

I have tried every one of the solutions found here but none of them have worked.

The errors I keep getting are:

Error in [.data.table(data, , cols) : j (the 2nd argument inside [...]) is a single symbol but column name 'cols' is not found. Perhaps you intended DT[,..cols] or DT[,cols,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.

AND

Error in [.data.table(data, cols) : When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

Here some more info on the data:

> dput(data[1:6, 1:11])
structure(list(city = c("", "", "", "", "", ""), company_goal = c("", 
"", "", "", "", ""), company_name = c("", "", "", "", "", ""), 
    event_date = c("5/14/2016", "9/26/2015", "9/12/2015", "6/3/2017", 
    "5/6/2017", "10/17/2015"), event_year = c("FY 2016", "FY 2016", 
    "FY 2016", "FY 2017", "FY 2017", "FY 2016"), fundraising_goal = c("250", 
    "200", "350", "0", "0", "100"), name = c("Heart Walk 2015-2016 St. Louis MO", 
    "Heart Walk 2015-2016 Canton, OH", "Heart Walk 2015-2016 Dallas, TX", 
    "FDA HW 2016-2017 Albany, NY WO-65355", "FDA HW 2016-2017 New Haven, CT WO-66497", 
    "Heart Walk 2015-2016 Puget Sound, WA"), participant_id = c("2323216", 
    "2273391", "2419569", "4088558", "4527010", "2424207"), state = c("", 
    "OH", "TX", "", "", "WA"), street = c("", "", "", "", "", 
    ""), team_average = c("176", "123", "306", "47", "0", "97"
    )), .Names = c("city", "company_goal", "company_name", "event_date", 
"event_year", "fundraising_goal", "name", "participant_id", "state", 
"street", "team_average"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x10200c378>)

Suggestions please?

(Once I do this, I will also have to convert different columns to factors, etc)

Upvotes: 0

Views: 1171

Answers (1)

Bradley Mecham
Bradley Mecham

Reputation: 41

I realize this is an older question that you're probably not working on anymore, but since it's one of the first questions that comes up when people search for simultaneously formatting multiple columns as numeric in R, I thought I'd add a thought.

Regarding the first part of your question--how to identify which columns are numeric, which columns are dates, which columns are factors, etc.--I do not have a good answer, particularly because factors can originally be character or otherwise, but then be designated as factor instead. Deciding which ones to convert is largely up to the designer. If there are no entries that are acceptably NA, you could use the logic here to determine which columns should be formatted numeric. Once you have decided which columns to convert . . .

I'm guessing that your second error appears because you're using data tables slightly differently than the current syntax requires. You can find instructions for changing a selection of columns using data table syntax in one of the later answers on this post:

Coerce multiple columns to factors at once

In that post, they coerce a set of columns to factor; the same process works for coercing to numeric.

To keep it simple, you specify the columns you want (using numeric values or column names or otherwise--In your case, this value will be programmatically assigned using whatever logic and rules you apply to divide your data into groups). E.g.,

colsToConvert <- c(6,11,13)

or

colsToConvert <- c("fundraising_goal","team_average","team_count")

Then you use an lapply command and use the SDcols subset specification:

data[, (colsToConvert) := lapply(.SD, as.numeric), .SDcols = colsToConvert]

That should do your conversion. Repeat this process for as many data types as you prefer, changing the formatting from as.numeric to whichever type is appropriate.

Upvotes: 3

Related Questions