user9903833
user9903833

Reputation: 137

How to loop through multiple data sets removing specific characters from specified columns in r

I have 25 data sets each is structured the same. Each contains many rows and 7 columns. Column 6 contains data that should be numerical but is not numerical. They are not numerical because the numbers contain commas i.e. 100000 is 100,000.

I can manually resolve this in each data set by removing the comma and then specifying that the data is numerical using the following code

df$column_6 <- gsub("[,]" , "", df$column_6)
df$column_6 <- as.numerical(df$column_6)

However as there are 25 data sets I would like to loop through them doing this however I am unable to do this.

Additionally because column 6 has a different name in each data set I would prefer to specify column 6 without using its name like below

df[6] <- gsub("[,]" , "", df[6])

however this doesn't seem to work.

My code is as follows

list_of_dfs = c(df1, df2, ..... , df25)

for (i in list_of_dfs) {
  i[6] <- gsub("[,]" , "", i[6])
  i[6] <- as.numerical(i[6])
}

Does anyone have any advice on how to do this

Upvotes: 0

Views: 254

Answers (4)

rahul
rahul

Reputation: 591

The data table way

test<-data.table(col1=c('100,00','100','100,000'),col2=c('90','80,00','60'))
    col1  col2
 100,00    90
 100      80,00
 100,000  60

your list of data frames

testList<-list(test,test)

assume u want to correct col2 in this case but want to use index as reference

removeNonnumeric<-function(x){return(as.numeric(gsub(',','',x)))}
data<-function(x){return(x[,lapply(.SD,removeNonnumeric),.SDcols=names(x)[2],by=col1])}

removeNonnumeirc removes the "," from the columns and data accesses each data table in the testList and calls "removeNonnumeric" on them output is a list of data tables which is created by merging these 2 functions in an "lapply"

 lapply(testList,data)

Upvotes: 0

Gregor Thomas
Gregor Thomas

Reputation: 145765

Your code is close, but has a few problems:

  • the result never gets assigned back the the list.
  • as.numerical is a typo, it needs to be as.numeric
  • i[6] doesn't work because you need to specify that it's the 6th column you want: i[, 6]. See here for details on [ vs [[.
  • c(df1, df2) doesn't actually create a list of data frames

Try this instead:

## this is bad, it will make a single list of columns, not of data frames
# list_of_dfs = c(df1, df2, ..... , df25)

# use this instead
list_of_dfs = list(df1, df2, ..... , df25)
# or this
list_of_dfs = mget(ls(pattern = "df"))

for (i in seq_along(list_of_dfs)) {
  list_of_dfs[[i]][, 6] <- as.numeric(gsub("[,]" , "", list_of_dfs[[i]][, 6]))
}

We can do a bit better, gsub uses pattern-matching regular expressions by default, using the fixed = TRUE argument instead will be quite a bit faster:

for (i in seq_along(list_of_dfs)) {
  list_of_dfs[[i]][, 6] <- as.numeric(gsub(",", "", list_of_dfs[[i]][, 6], fixed = TRUE))
}

And we could use lapply instead of a for loop for slightly shorter code:

list_of_dfs[[i]] <- lapply(list_of_dfs, function(x) {
    x[, 6] = as.numeric(gsub("," , "", x[, 6], fixed = TRUE))
    return(x)
})

Upvotes: 2

AndS.
AndS.

Reputation: 8110

Try this out. You put all dataframes in a list, then you make the column numeric. Instead of gsub I use readr::parse_number. I'll also include a practice set for illustration.

library(tidyverse)

df1 <- data_frame(id = rep(1,3), num = c("10,000", "11,000", "12,000"))
df2 <- data_frame(id = rep(2,3), num = c("13,000", "14,000", "15,000"))
df3 <- data_frame(id = rep(3,3), num = c("16,000", "17,000", "18,000"))

list(df1, df2, df3) %>% map(~mutate(.x, num = parse_number(num)))
#> [[1]]
#> # A tibble: 3 x 2
#>      id   num
#>   <dbl> <dbl>
#> 1     1 10000
#> 2     1 11000
#> 3     1 12000
#> 
#> [[2]]
#> # A tibble: 3 x 2
#>      id   num
#>   <dbl> <dbl>
#> 1     2 13000
#> 2     2 14000
#> 3     2 15000
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>      id   num
#>   <dbl> <dbl>
#> 1     3 16000
#> 2     3 17000
#> 3     3 18000

Created on 2018-09-20 by the reprex package (v0.2.0).

Upvotes: 1

SmitM
SmitM

Reputation: 1376

Part of the answer has been sourced from here: Looping through list of data frames in R

In your case, you can do the following:

list_of_dfs = list(df1, df2, ..... , df25)
lapply(list_of_dfs, function(x) { x[, 6] <- as.integer(gsub("," , "", x[, 6])) })

Upvotes: 0

Related Questions