trotta
trotta

Reputation: 1226

R: change one value every row in big dataframe

I just started working with R for my master thesis and up to now all my calculations worked out as I read a lot of questions and answers here (and it's a lot of trial and error, but thats ok). Now i need to process a more sophisticated code and i can't find a way to do this.

Thats the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value every row. The new value should be the amount of the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID'-variable is the same.

As this is the third day I'm trying to solve this, I already found and tested for and apply but both are running for hours (canceled after three hours). Here is an example of one of my attempts (with for):

for (i in 1:50000) {
  Entry_ID <- Sub02[i,4] 
  SUM_Entries <- sum(Sub03$Source==Entry_ID) 
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character

  Value1 <- as.numeric(Entries_w_ID$VAL1) 
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)

  OLD_Val1 <- Sub02[i,13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val

  Sub02[i,13] <- NEW_Val

}

I know this might be a silly code, but thats the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get along with my thesis. Thank you!

Here's an example of my data-structure:

        Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9

The output I expect is an updated value 'VAL9' in every row.

Upvotes: 0

Views: 166

Answers (2)

jakub
jakub

Reputation: 5104

From what I understood so far, you need 2 things:

  1. sum up some values in one dataset
  2. add them to another dataset, using an ID variable

Besides what @yoland already contributed, I would suggest to break it down in two separate tasks. Consider these two datasets:

a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
#   x id
# 1 1  a
# 2 2  b

b = data.frame(values = as.character(1:4), otherid = letters[1:2], 
               stringsAsFactors = FALSE)
sapply(b, class)
#      values     otherid 
# "character" "character"

Values is character now, we need to convert it to numeric:

b$values = as.numeric(b$values)
sapply(b, class)
#    values     otherid 
# "numeric" "character"

Then sum up the values in b (grouped by otherid):

library(dplyr)

b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
#   otherid sum_values
#     <chr>      <dbl>
# 1       a          4
# 2       b          6

Then join it with a - note that identifiers are specified in c():

ab = left_join(a, b, by = c("id" = "otherid"))
ab
#   x id sum_values
# 1 1  a          4
# 2 2  b          6

We can then add the result of the sum from b to the variable x in a:

ab$total = ab$x + ab$sum_values
ab
#   x id sum_values total
# 1 1  a          4     5
# 2 2  b          6     8

(Updated.)

Upvotes: 2

yoland
yoland

Reputation: 554

From what I understand you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to safe memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here) Once you joined the data sets into one, it should be easy to create the new columns you need. e.g.: df$new <- df$old1 + df$old2

Upvotes: 0

Related Questions