Maharero

Reputation: 248

R: Using function arguments to update elements in a data frame

I want the matching elements in my data frame to be replaced with the name of the argument I pass to the function. At the moment, they are replaced with the parameter name I used when defining the function instead. (I'm finding it hard to explain; hopefully my code and picture will clarify this a bit!)

Project_assign <- function(prjct) {
  Truth_vector <- is.element((giraffe[,1]),(prjct[,1]))
  giraffe[which(Truth_vector),5] <- 'prjct'
  assign('giraffe' , giraffe , envir= .GlobalEnv)
}
Project_assign(spine_hlfs)

This mostly works, but the elements get replaced with prjct instead of spine_hlfs: https://i.sstatic.net/uuPnv.png

If I can get this to work as intended, I will next create a vector with all the project names and use lapply with this function, saving me a lot of manual work every few months. I am relatively new to R, so any explanations are much appreciated.

Upvotes: 1

Views: 200

Answers (2)

Uwe

Reputation: 42544

As far as I understand the OP's intentions from the many comments, they want to update the giraffe data frame with the names of many other data frames wherever runkey matches.

This can be achieved by combining the other data frames into one data.table object, treating the data frame names as data, and finally updating giraffe in a join.

Sample Data

According to the OP, giraffe consists of 500 rows and 5 columns, including runkey and project. project is initialized here as a character column for the subsequent join with the data frame names.

set.seed(123L) # required for reproducible data
giraffe <- data.frame(runkey = 1:500,
                      X2 = sample.int(99L, 500L, TRUE),
                      X3 = sample.int(99L, 500L, TRUE),
                      X4 = sample.int(99L, 500L, TRUE),
                      project = "",
                      stringsAsFactors = FALSE)

Then there are a number of data frames which contain only one column, runkey. According to the OP, the runkey sets are disjoint, i.e., the combined set of all runkey values does not contain any duplicates.

spine_hlfs <- data.frame(runkey = c(1L, 498L, 5L))
ir_dia     <- data.frame(runkey = c(3L, 499L, 47L, 327L))
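The disjointness assumption matters because a runkey that appears in two data frames would make the assigned project name ambiguous. A minimal sketch of how one might check it before joining, using only base R (the names all_keys and keys_are_disjoint are introduced here for illustration):

```r
# Assumed sample data, mirroring the two data frames defined above
spine_hlfs <- data.frame(runkey = c(1L, 498L, 5L))
ir_dia     <- data.frame(runkey = c(3L, 499L, 47L, 327L))

# Collect all runkey values from the query data frames into one vector
all_keys <- unlist(lapply(list(spine_hlfs, ir_dia), `[[`, "runkey"))

# anyDuplicated() returns 0 when no value occurs more than once
keys_are_disjoint <- anyDuplicated(all_keys) == 0L
keys_are_disjoint
```

If the check returns FALSE, the duplicated keys would need to be resolved first, since later matches would otherwise silently overwrite earlier ones.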

Proposed solution

# specify names of data frames
df_names <- c("spine_hlfs", "ir_dia")
# create named list of data frames 
df_list <- mget(df_names)
# update on join 
library(data.table)
setDT(giraffe)[rbindlist(df_list, idcol = "df.name"), on = "runkey", project := df.name][]
     runkey X2 X3 X4    project
  1:      1  2 44 63 spine_hlfs
  2:      2 73 99 77           
  3:      3 43 20 18     ir_dia
  4:      4 73 12 40           
  5:      5  2 25 96 spine_hlfs
 ---                           
496:    496 75 45 84           
497:    497 24 63 43           
498:    498 33 53 81 spine_hlfs
499:    499  1 33 16     ir_dia
500:    500 99 77 41

Explanation

setDT() coerces giraffe to data.table. rbindlist(df_list, idcol = "df.name") creates a combined data.table from the list of data frames, thereby filling the df.name column with the names of the list elements:

      df.name runkey
1: spine_hlfs      1
2: spine_hlfs    498
3: spine_hlfs      5
4:     ir_dia      3
5:     ir_dia    499
6:     ir_dia     47
7:     ir_dia    327

This intermediate result is joined on runkey with giraffe. The project column is updated with the contents of df.name only for matching rows.
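For readers unfamiliar with update joins, the effect can be imitated in base R with match(). This is only an illustration of the result on a small made-up data set, not of how data.table implements the join; the lookup, pos and matched names are assumptions introduced for the sketch.

```r
# Small stand-in for giraffe with an empty character project column
giraffe <- data.frame(runkey = 1:6, project = "", stringsAsFactors = FALSE)

# Stand-in for the combined lookup table produced by rbindlist()
lookup <- data.frame(df.name = c("spine_hlfs", "ir_dia", "spine_hlfs"),
                     runkey  = c(1L, 3L, 5L),
                     stringsAsFactors = FALSE)

# Position of each giraffe runkey in the lookup table (NA if there is no match)
pos <- match(giraffe$runkey, lookup$runkey)
matched <- !is.na(pos)

# Update project only for the rows that found a match
giraffe$project[matched] <- lookup$df.name[pos[matched]]
giraffe
```

Rows without a match keep their initial empty string, which corresponds to the blank project entries in the output above.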

Alternative solution

This loops over df_names and performs repeated joins which update giraffe in place:

setDT(giraffe)
for (x in df_names) giraffe[get(x), on = "runkey", project := x]
giraffe[]
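The loop works because the data frame names are handled as strings and the objects are looked up with get(). The OP's function instead wrote the quoted literal 'prjct' into the column. A base R sketch closer to the original Project_assign could capture the argument's name with deparse(substitute()); the helper project_assign and its label parameter are hypothetical, and it returns an updated copy rather than assigning into .GlobalEnv.

```r
# Small made-up data for illustration
giraffe    <- data.frame(runkey = 1:6, project = "", stringsAsFactors = FALSE)
spine_hlfs <- data.frame(runkey = c(1L, 5L))
ir_dia     <- data.frame(runkey = 3L)

# Hypothetical helper: label defaults to the expression passed as prjct
project_assign <- function(target, prjct, label = NULL) {
  # Capture the argument's name before doing anything else with prjct
  if (is.null(label)) label <- deparse(substitute(prjct))
  target$project[target$runkey %in% prjct$runkey] <- label
  target
}

# Direct call: the label is taken from the argument expression
giraffe <- project_assign(giraffe, spine_hlfs)

# Looping over names: pass the name explicitly, because inside a loop or
# lapply() the argument expression is no longer the data frame's name
for (nm in c("ir_dia")) {
  giraffe <- project_assign(giraffe, get(nm), label = nm)
}
giraffe
```

Passing the label explicitly in the loop is a deliberate choice: deparse(substitute()) only recovers a useful name when the data frame is passed directly by name.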

Upvotes: 0

Maurits Evers

Reputation: 50678

Sounds like a simple replace based on matching entries between a (list of) query dataframes and a subject dataframe.

Here is an example based on some simulated data.

I first simulate data for the subject dataframe:

# Sample data; column 5 (project) will hold the name of the matching query dataframe
giraffe <- data.frame(
    runkeys = 1:500,
    col1 = runif(500),
    col2 = runif(500),
    col3 = runif(500),
    project = NA_character_,
    stringsAsFactors = FALSE);

I then simulate runkeys data for 2 query dataframes:

spine_hlfs <- data.frame(
    runkeys = c(44, 260, 478));
ir_dia <- data.frame(
    runkeys = c(10, 20, 30))

The query dataframes are stored in a list:

lst.runkeys <- list(
    spine_hlfs = spine_hlfs,
    ir_dia = ir_dia);

To flag runkeys entries present in any of the query dataframes, we can use a for loop to match runkeys entries from every query dataframe:

# This is the critical step: it loops through the list of query dataframes
# and flags matching runkeys in giraffe with the name of the query dataframe
for (i in seq_along(lst.runkeys)) {
    giraffe[match(lst.runkeys[[i]]$runkeys, giraffe$runkeys), 5] <- names(lst.runkeys)[i];
}
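One caveat with this loop: match() returns NA for any query runkey that is absent from giraffe, and to my knowledge data frame subscripted assignment rejects NA row indices. A hedged variant that drops unmatched keys first is sketched below on small made-up data; the project column name and the key 99 are assumptions chosen for the illustration.

```r
# Small subject data frame with an empty project column
giraffe <- data.frame(runkeys = 1:10, project = NA_character_,
                      stringsAsFactors = FALSE)

# Query data frames; runkey 99 deliberately does not exist in giraffe
lst.runkeys <- list(
    spine_hlfs = data.frame(runkeys = c(2, 4, 99)),
    ir_dia     = data.frame(runkeys = 7))

for (nm in names(lst.runkeys)) {
    # Row positions of the query runkeys in giraffe (NA where absent)
    idx <- match(lst.runkeys[[nm]]$runkeys, giraffe$runkeys)
    # Keep only the keys that were actually found
    idx <- idx[!is.na(idx)]
    giraffe$project[idx] <- nm
}
giraffe
```

Iterating over names(lst.runkeys) instead of numeric positions also avoids the separate names(lst.runkeys)[i] lookup.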

This is the output of the subject dataframe after matching runkeys entries. I'm only showing rows where entries in column 5 were replaced.

giraffe[grep("(spine_hlfs|ir_dia)", giraffe[, 5]), ];
10       10 0.7401977 0.005703928 0.6778921     ir_dia
20       20 0.7954076 0.331462567 0.7637870     ir_dia
30       30 0.5772808 0.183716142 0.6984193     ir_dia
44       44 0.9701355 0.655736489 0.4917452 spine_hlfs
260     260 0.1893012 0.600140166 0.0390346 spine_hlfs
478     478 0.7655976 0.910946623 0.9779205 spine_hlfs

Upvotes: 1
