Reputation: 117
I wrote a function to process a large data set. This is my first time writing an R function. It works well if I have read in only one .csv file, i.e. there is just one data frame in the workspace. However, if there are two or more data frames, it does not work. It seems the function picks up other data frames even though the data frame to use is passed in explicitly.
My function is:
Sun <- function(data,A) {
  data$V4 <- as.character(data$V4)
  dt_1 <- data %>%
    mutate(V4 = str_replace(V4, " ",""))
  dt_2 <- dt_1 %>%
    mutate(V4 = str_replace(V4, "//",""))
  dt_3 <- dt_2 %>%
    mutate(V4 = str_replace(V4, " ","("))
  dt_4 <- str_split_fixed(dt_3$V4 , ";", 100)
  dt_5 <- data.frame(dt,dt_4)
  dt_6 <- dt_5 %>%
    mutate_at(vars(X1:X100), ~ substr(., 1,4))
  dt_7 <- data.table(dt_6[!sapply(dt_6, function(x) all(x == ""))])
  DT <- dt_7[,-1]
  col_names <- tail(names(DT), -4)
  co <- DT[,
    sapply(
      A,
      function (code) { pmin(1, rowSums(.SD == code, na.rm=T)) },
      simplify=F, USE.NAMES=T
    ),
    .SDcols=col_names
  ]
}
For example, if I have two data frames named DF1 and DF2 in the R workspace at the same time, the following call gives a wrong result, and I am confused about why:
Sun(DF1, A)
DF1 is like:
V1 V2 V3 V4
1 id1 2012.09.28 E05B63/14(2006.01);E05B47/00(2006.01)
2 id2 2010.08.20 G01B5/14(2006.01);G01B5/02(2006.01)
3 id3 2009.01.08 H02J3/00(2006.01);G01R23/02(2006.01)
DF2 is, for example:
V1 V2 V3 V4
1 id1 2012.09.28 A05B63/14;E05B47/00(2006.01)
2 id2 2010.08.20 D01B5/14
3 id3 2009.01.08 H02J3/00(2006.01);G01R23/02(2006.01)
A is a vector of codes, as below:
A01B A02B A03B A04B A05B G01B H02J G01R E05B
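For anyone who wants to reproduce this, the printed inputs can be rebuilt roughly as follows. This is only a sketch: it assumes that V1 is the row index shown at the start of each printed row and that the remaining columns are plain character strings, neither of which is stated in the post.

```r
# Assumption: V1 is the leading 1, 2, 3 of each printed row; only V4 is used by Sun().
DF1 <- data.frame(
  V1 = 1:3,
  V2 = c("id1", "id2", "id3"),
  V3 = c("2012.09.28", "2010.08.20", "2009.01.08"),
  V4 = c("E05B63/14(2006.01);E05B47/00(2006.01)",
         "G01B5/14(2006.01);G01B5/02(2006.01)",
         "H02J3/00(2006.01);G01R23/02(2006.01)"),
  stringsAsFactors = FALSE
)

DF2 <- data.frame(
  V1 = 1:3,
  V2 = c("id1", "id2", "id3"),
  V3 = c("2012.09.28", "2010.08.20", "2009.01.08"),
  V4 = c("A05B63/14;E05B47/00(2006.01)",
         "D01B5/14",
         "H02J3/00(2006.01);G01R23/02(2006.01)"),
  stringsAsFactors = FALSE
)

# The codes to look for, in the order shown above.
A <- c("A01B", "A02B", "A03B", "A04B", "A05B", "G01B", "H02J", "G01R", "E05B")
```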
Reputation: 3888
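All I did was "optimize" the code, and the function now works correctly. Again, I think the issue is the use of an undefined dt inside the function: because dt is never created inside Sun, R looks it up in the enclosing (global) environment, so the result depends on whatever object named dt happens to exist in the workspace rather than on the data argument.
As I mentioned in the comments: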
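library(data.table)
library(stringr)

Sun <- function(data, A) {
  # Work on a local copy; every later step refers to this dt, never to a global one
  dt <- data.table(data)
  # Strip spaces and "//" from V4, then split on ";" into a 100-column character matrix
  dt[, V4 := str_replace_all(as.character(V4), c(" |//" = "", "//" = ""))][,
    str_split_fixed(V4, ";", 100)
  ] -> splits
  # Keep only the first four characters of each code (substr preserves the matrix shape)
  data.table(substr(splits, 1, 4)) -> splits
  # Drop the columns that are empty in every row
  splits[, which(sapply(.SD, function(x) all(!nzchar(x))))] -> rem
  splits[, (rem) := NULL]
  # One 0/1 column per code in A: 1 if the code occurs anywhere in the row
  splits[,
    sapply(
      A,
      function(code) { pmin(1, rowSums(.SD == code, na.rm = TRUE)) },
      simplify = FALSE, USE.NAMES = TRUE
    )]
}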
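The last step of the function is the part that builds the 0/1 columns. A base-R sketch of the same idea on a tiny made-up matrix may make it easier to follow; the names m, codes and ind are illustrative only and do not appear in the original code.

```r
# Hypothetical 3 x 2 matrix of already-truncated codes, one row per record.
m <- matrix(c("E05B", "E05B",
              "G01B", "G01B",
              "H02J", "G01R"),
            ncol = 2, byrow = TRUE)

# Hypothetical subset of codes to flag.
codes <- c("A05B", "G01B", "E05B")

# For each code: count how often it occurs in each row, then cap the count at 1.
# With a character vector, sapply() uses the codes themselves as column names.
ind <- sapply(codes, function(code) pmin(1, rowSums(m == code, na.rm = TRUE)))
ind
```

Inside Sun the same expression runs on .SD with simplify = FALSE, so the result stays a list of columns and data.table returns it as a table.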
> Sun(DF1, A)
   A01B A02B A03B A04B A05B G01B H02J G01R E05B
1:    0    0    0    0    0    0    0    0    1
2:    0    0    0    0    0    1    0    0    0
3:    0    0    0    0    0    0    1    1    0
> Sun(DF2, A)
   A01B A02B A03B A04B A05B G01B H02J G01R E05B
1:    0    0    0    0    1    0    0    0    1
2:    0    0    0    0    0    0    0    0    0
3:    0    0    0    0    0    0    1    1    0
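To see why an undefined dt makes the original function depend on the workspace, here is a minimal sketch of the same mistake. The functions buggy and fixed and the small data frames are hypothetical stand-ins, not the code from the question.

```r
# `dt` is a free variable here: it is neither an argument nor a local,
# so R searches the enclosing environments for it.
buggy <- function(data) {
  nrow(dt)
}

# The fix is simply to use the argument that was passed in.
fixed <- function(data) {
  nrow(data)
}

DF1 <- data.frame(x = 1:3)
DF2 <- data.frame(x = 1:5)
dt <- DF2          # a leftover object in the workspace

buggy(DF1)         # 5: silently uses the global dt (DF2), not DF1
fixed(DF1)         # 3: uses the data frame that was actually passed
```

If no object called dt exists in the workspace at all, the name resolves to stats::dt, the t-distribution density function, which is presumably why the original sometimes fails outright instead of returning a wrong table.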