Reputation: 4699
Consider the following dataframe in R:
TYPE VARIAVEL VALOR
A OPER_RELAC_VARIAVEL1 100
A OPER_RELAC_VARIAVEL2 200
A OPER_RELAC_VARIAVEL3 300
B OPER_RELAC_VARIAVEL1 100
B OPER_RELAC_VARIAVEL2 200
B OPER_RELAC_VARIAVEL3 300
A CLI_RELAC_VARIAVEL1 450
A CLI_RELAC_VARIAVEL2 320
A CLI_RELAC_VARIAVEL3 110
I want to take the relevance of each VALOR
based on the root of VARIAVEL
and TYPE
. I don´t have a column with the root of the VARIAVEL
, but it would be everyting before the second _
(OPER_RELAC
and CLI_RELAC
in this sample).
The expected result is:
TYPE VARIAVEL VALOR RELEVANCE
A OPER_RELAC_VARIAVEL1 100 0.167
A OPER_RELAC_VARIAVEL2 200 0.333
A OPER_RELAC_VARIAVEL3 300 0.500
B OPER_RELAC_VARIAVEL1 100 0.167
B OPER_RELAC_VARIAVEL2 200 0.333
B OPER_RELAC_VARIAVEL3 300 0.500
A CLI_RELAC_VARIAVEL1 450 0.511
A CLI_RELAC_VARIAVEL2 320 0.364
A CLI_RELAC_VARIAVEL3 110 0.125
Since, for example, 450 represents 51.1% of the total for type A
and root variable CLI_RELAC
.
I have done with this sequence of commands:
1) Generate a column with the root variable using the library stringr
dados$VARIAVEL_MAE <- str_match(dados$VARIAVEL, "^([^_]+[_][^_]+)")[,2]
Thanks to R:how to get grep to return the match, rather than the whole string
2) Summarize in a new dataframe the totals aggregating by this fresh new column
TOTAIS <- aggregate(VALOR ~ Type + VARIAVEL_MAE, data = dados, sum)
names(TOTAIS) <- c('Type', 'VARIAVEL_MAE', 'TOTAL')
3) Merge this two dataframes using what is suggested here
dados <- merge(TOTAIS, dados, by = c('Type', 'VARIAVEL_MAE'))
dados$RELEVANCIA <- dados$VALOR / dados$TOTAL;
Is there a smarter way to do it or I do need all these steps?
My question is because in R, everything I do can always be replaced by something faster and smaller.
Upvotes: 2
Views: 119
Reputation: 66819
In the OP's example, we can split on _
instead of using a regex:
library(data.table)
setDT(DT)
DT[, paste0("vnome", 1:3) := tstrsplit(VARIAVEL, "_")]
DT[, z := VALOR/sum(VALOR), by=.(TYPE, vnome1, vnome2)]
TYPE VARIAVEL VALOR vnome1 vnome2 vnome3 z
1: A OPER_RELAC_VARIAVEL1 100 OPER RELAC VARIAVEL1 0.1666667
2: A OPER_RELAC_VARIAVEL2 200 OPER RELAC VARIAVEL2 0.3333333
3: A OPER_RELAC_VARIAVEL3 300 OPER RELAC VARIAVEL3 0.5000000
4: B OPER_RELAC_VARIAVEL1 100 OPER RELAC VARIAVEL1 0.1666667
5: B OPER_RELAC_VARIAVEL2 200 OPER RELAC VARIAVEL2 0.3333333
6: B OPER_RELAC_VARIAVEL3 300 OPER RELAC VARIAVEL3 0.5000000
7: A CLI_RELAC_VARIAVEL1 450 CLI RELAC VARIAVEL1 0.5113636
8: A CLI_RELAC_VARIAVEL2 320 CLI RELAC VARIAVEL2 0.3636364
9: A CLI_RELAC_VARIAVEL3 110 CLI RELAC VARIAVEL3 0.1250000
In a more general case, mentioned by the OP as "everything before the second _
", we can use @akrun's approach from another Q&A (assuming VARIAVEL
contains no spaces):
DT[, c("vroot", "vseq") :=
tstrsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', VARIAVEL), ' ')]
DT[, z := VALOR/sum(VALOR), by=.(TYPE, vroot)]
TYPE VARIAVEL VALOR vroot vseq z
1: A OPER_RELAC_VARIAVEL1 100 OPER_RELAC VARIAVEL1 0.1666667
2: A OPER_RELAC_VARIAVEL2 200 OPER_RELAC VARIAVEL2 0.3333333
3: A OPER_RELAC_VARIAVEL3 300 OPER_RELAC VARIAVEL3 0.5000000
4: B OPER_RELAC_VARIAVEL1 100 OPER_RELAC VARIAVEL1 0.1666667
5: B OPER_RELAC_VARIAVEL2 200 OPER_RELAC VARIAVEL2 0.3333333
6: B OPER_RELAC_VARIAVEL3 300 OPER_RELAC VARIAVEL3 0.5000000
7: A CLI_RELAC_VARIAVEL1 450 CLI_RELAC VARIAVEL1 0.5113636
8: A CLI_RELAC_VARIAVEL2 320 CLI_RELAC VARIAVEL2 0.3636364
9: A CLI_RELAC_VARIAVEL3 110 CLI_RELAC VARIAVEL3 0.1250000
Data:
DT = structure(list(TYPE = c("A", "A", "A", "B", "B", "B", "A", "A",
"A"), VARIAVEL = c("OPER_RELAC_VARIAVEL1", "OPER_RELAC_VARIAVEL2",
"OPER_RELAC_VARIAVEL3", "OPER_RELAC_VARIAVEL1", "OPER_RELAC_VARIAVEL2",
"OPER_RELAC_VARIAVEL3", "CLI_RELAC_VARIAVEL1", "CLI_RELAC_VARIAVEL2",
"CLI_RELAC_VARIAVEL3"), VALOR = c(100L, 200L, 300L, 100L, 200L,
300L, 450L, 320L, 110L)), .Names = c("TYPE", "VARIAVEL", "VALOR"
), row.names = c(NA, -9L), class = "data.frame")
Upvotes: 2