Jamie Varney

Reputation: 47

R - Using sum and match to find the first occurrence of a high frequency

I have several data frames in wide format imported from dbf files. Every column is a date and every row is an observation, so for each day I have between 500 and 2000 observations, depending on the size of the geographic shape I am looking at. For reproducibility, I created two dummy data frames with a range of values I may see in my actual data frames.

Data1<- data.frame(replicate(10, sample(0:1000, 20, rep= TRUE)))

Data<- data.frame(replicate(10, sample(0:1000, 20, rep= TRUE)))

Since I have many of these data frames I have put them in a list so I can run functions on many at once.

filenames<- mget(ls(pattern= 'Data'))
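One thing to watch with this pattern: 'Data' is matched anywhere in the object name, so derived objects whose names also contain "Data" would end up in the list too. A minimal sketch, assuming the raw data frames are named "Data" optionally followed by digits (that naming rule and the sandbox environment `env` are assumptions for illustration):

```r
# Hypothetical sandbox environment so the sketch does not touch the global workspace
env <- new.env()
assign("Data",  data.frame(a = 1:3), envir = env)
assign("Data1", data.frame(a = 4:6), envir = env)
# A derived result whose name also contains "Data"
assign("Datacount", data.frame(.id = "a", V1 = 3), envir = env)

# Unanchored pattern: matches every name containing "Data"
loose <- ls(env, pattern = "Data")

# Anchored pattern: "Data" followed only by optional digits
strict <- ls(env, pattern = "^Data[0-9]*$")

# Named list holding only the raw data frames
frames <- mget(strict, envir = env)
```

Here `loose` picks up all three objects, while `strict` keeps only `Data` and `Data1`.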

Now my issue is that I am trying to write a function to count the number of occurrences in each column where values are within the range 0-100. I can accomplish this with

library(plyr)
Datacount<- ldply(Data, function(x) length(which(x>=0 & x<=100))) 
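For comparison, the same per-column count can probably be obtained in base R without plyr. This is only a sketch: the fixed seed and the names `counts` and `counts_which` are assumptions added for illustration.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is repeatable
Data <- data.frame(replicate(10, sample(0:1000, 20, replace = TRUE)))

# Comparing a data frame with a number yields a logical matrix;
# colSums() then counts the TRUE values in each column.
counts <- colSums(Data >= 0 & Data <= 100)

# Same counts via the length(which(...)) idiom, column by column
counts_which <- sapply(Data, function(x) length(which(x >= 0 & x <= 100)))
```

Both vectors are named by column, so the position of a column in `counts` corresponds to its position in the data frame.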

Then I need to find the first column (date) in which this count is greater than 10% of the total number of observations per column. So for a data frame with 20 observations, I want the first date where the number of cells between 0 and 100 is greater than 2. I previously accomplished this using apply (where "V1" is the name of the column containing the counts):

Datamatch<- apply (Datacount["V1"]>2,2,function(x) match (TRUE,x))

My question is whether there is a way to combine these functions into one process that I can run either in a for loop over "filenames" or with one of the lapply family of functions.

For detail, here is an example of a function I built to run across each row of a data frame. It gives the column index of the last date where each row's value is <= 100. I then used lapply to loop over all data frames in my list and append the result of the function to each original data frame.

icein<- function(dataframe){
  dataframe$icein<- apply(dataframe, 1, function(x){tail(which(x<=100), 1)})
  dataframe
}
list2env(lapply(filenames, icein), envir= .GlobalEnv)
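One edge case with the row-wise function: if a row has no value <= 100, `tail(which(...), 1)` returns an empty vector, so `apply` no longer returns a plain vector. A hedged sketch of a variant that returns NA for such rows (the name `icein_safe` and the `cutoff` argument are hypothetical, not part of the original function):

```r
# Variant of the row-wise lookup that returns NA when no value qualifies
icein_safe <- function(df, cutoff = 100) {
  # Logical matrix: TRUE where a cell is at or below the cutoff
  below <- as.matrix(df) <= cutoff
  df$icein <- apply(below, 1, function(r) {
    idx <- which(r)
    # Last qualifying column index, or NA if the row has none
    if (length(idx) > 0L) max(idx) else NA_integer_
  })
  df
}

# Small hand-made example: row 1 qualifies in columns 1 and 3, row 2 never does
toy <- data.frame(a = c(50, 500), b = c(500, 500), c = c(10, 500))
res <- icein_safe(toy)
```

In this toy example `res$icein` is 3 for the first row and NA for the second.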

Upvotes: 1

Views: 302

Answers (1)

akrun

Reputation: 886938

After loading all the 'Data' objects into a list, loop over the list with map. For each data frame, take the mean of the logical vector between(., 0, 100) in every column (this is the proportion of values in range), compare it with the threshold n, unlist the resulting one-row data.frame, wrap it with which to get the position indexes, and extract the first one.

library(dplyr)
library(purrr)
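n <- 0.1  # threshold: the question asks for more than 10% of the observations
mget(ls(pattern= 'Data')) %>%
      map_int(~ .x %>% 
                  # proportion of values within 0-100 per column, compared with n
                  summarise_all(~ mean(between(., 0, 100)) > n) %>% 
                  unlist %>%
                  which %>%
                  # position of the first column that qualifies
                  first)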
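If a dependency-free version is preferred, the same idea can be sketched in base R. This is an illustration rather than a drop-in replacement: the function name `first_exceeding`, its arguments, and the hand-made list `frames` are all assumptions.

```r
# Index of the first column whose share of values in [lower, upper] exceeds prop
first_exceeding <- function(df, lower = 0, upper = 100, prop = 0.1) {
  m <- as.matrix(df)
  in_range <- m >= lower & m <= upper
  # colMeans() of a logical matrix gives the proportion of TRUE per column;
  # match() returns the first position that is TRUE, or NA if there is none
  match(TRUE, colMeans(in_range) > prop)
}

# Hand-made data frames with 20 observations per column
frames <- list(
  A = data.frame(d1 = rep(500, 20),
                 d2 = c(rep(50, 3), rep(500, 17)),   # 3 of 20 in range (15%)
                 d3 = c(rep(50, 5), rep(500, 15))),  # 5 of 20 in range (25%)
  B = data.frame(d1 = rep(500, 20),
                 d2 = rep(500, 20),
                 d3 = rep(500, 20))                  # nothing in range
)

result <- sapply(frames, first_exceeding)
```

For these inputs `result` is 2 for `A` (the second column is the first to exceed 10%) and NA for `B`, where no column qualifies.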

Upvotes: 1
