R: change one value every row in big dataframe

Question

I just started working with R for my master thesis and up to now all my calculations worked out as I read a lot of questions and answers here (and it's a lot of trial and error, but thats ok). Now i need to process a more sophisticated code and i can't find a way to do this.

Thats the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value every row. The new value should be the amount of the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID'-variable is the same.

As this is the third day I'm trying to solve this, I already found and tested for and apply but both are running for hours (canceled after three hours). Here is an example of one of my attempts (with for):

for (i in 1:50000) {
  Entry_ID <- Sub02[i,4] 
  SUM_Entries <- sum(Sub03$Source==Entry_ID) 
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character

  Value1 <- as.numeric(Entries_w_ID$VAL1) 
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)

  OLD_Val1 <- Sub02[i,13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val

  Sub02[i,13] <- NEW_Val

}

I know this might be a silly code, but thats the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get along with my thesis. Thank you!

Here's an example of my data-structure:

        Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9

The output I expect is an updated value 'VAL9' in every row.

jakub · Accepted Answer

From what I understood so far, you need 2 things:

sum up some values in one dataset
add them to another dataset, using an ID variable

Besides what @yoland already contributed, I would suggest to break it down in two separate tasks. Consider these two datasets:

a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
#   x id
# 1 1  a
# 2 2  b

b = data.frame(values = as.character(1:4), otherid = letters[1:2], 
               stringsAsFactors = FALSE)
sapply(b, class)
#      values     otherid 
# "character" "character"

Values is character now, we need to convert it to numeric:

b$values = as.numeric(b$values)
sapply(b, class)
#    values     otherid 
# "numeric" "character"

Then sum up the values in b (grouped by otherid):

library(dplyr)

b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
#   otherid sum_values
#           
# 1       a          4
# 2       b          6

Then join it with a - note that identifiers are specified in c():

ab = left_join(a, b, by = c("id" = "otherid"))
ab
#   x id sum_values
# 1 1  a          4
# 2 2  b          6

We can then add the result of the sum from b to the variable x in a:

ab$total = ab$x + ab$sum_values
ab
#   x id sum_values total
# 1 1  a          4     5
# 2 2  b          6     8

(Updated.)

R: change one value every row in big dataframe

Answers (2)

Related Questions