D3nz13
D3nz13

Reputation: 74

R Select only rows equal to another row

I'm struggling with a potencially silly problem, however I've yet to figure out how to solve it.

Shortly - I have a dataframe containing string and numerical columns. String columns define a key. I have to find all the keys and then for each key filter the dataframe so that the string columns are equal to the key and later do some calculations on the numerical columns. The problem is that the string column names are not fixed (it's supposed to be a generic function and a dataframe can have different number and names of string columns).

My solution so far (it works but it's painfully slow, I'm trying to make it vectorized so that it's faster):

only_string_columns <- df %>% select_if(is.character) # get only the string columns
only_numerical_columns <- df%>% select_if(is.numeric)
  
string_columns_names <- names(only_string_columns)
numerical_columns_names <- names(only_numerical_columns)

all_keys <- unique(only_string_columns) # get all unique keys

for(i in 1:nrow(all_keys)) {
    
    current_key <- all_keys[i, ]
    
    mask <- foreach(d=iter(only_string_columns, by='row'), .combine=rbind) %dopar% identical(unlist(d), unlist(current_key)) # create a boolean mask for rows that are equal to that specific key row
    
    selected_rows <- df[mask[,], ] # select only those that are equal to the key row

Previously I was trying to create a mask like this:

mask <- only_string_columns == current_key
OR
mask <- which(only_string_columns == current_key)

But they raised this error: ‘==’ only defined for equally-sized data frames

An example of what a dataframe could look like:

  country category value       date
1     USA        A     1 some_date1
2      UK        B     1 some_date2
3      UK        A     3 some_date3
4     USA        A     4 some_date4
5      UK        A     2 some_date5
6     USA        B     3 some_date6

Then the string columns are country and category and value is numerical.
Selected only string columns:

  country category 
1     USA        A 
2      UK        B 
3      UK        A 
4     USA        A 
5      UK        A 
6     USA        B

Unique keys:

  country category
1     USA        A
2      UK        B
3      UK        A
4     USA        B

So the first unique key is:

  country category
1     USA        A

After filtering, I should get this output:

  country category 
1     USA        A 
4     USA        A 

Using only the filtered rows in the original df:

  country category value       date
1     USA        A     1 some_date1
4     USA        A     4 some_date4

Note: simple subsetting like (df$country == key$country) & (df$category == key$category) will not work because another dataframe that in the future could be passed as an argument could have different string columns names (so instead of a country it could be a continent, market, anything).

Edit: Later, after the filtering is done, the pairs (date, value) are created. So I match each date with a corresponding numerical value. In that case it would be:

(some_date1, 1), (some_date4, 4)

The pairs have to be created for each key, so for the next key (UK, B) that would be:

(some_date2, 1)

Upvotes: 0

Views: 159

Answers (1)

Onyambu
Onyambu

Reputation: 79228

you are looking for the split function;

split(df[3:4], df[1:2], sep='_')
$UK_A
  value       date
3     3 some_date3
5     2 some_date5

$USA_A
  value       date
1     1 some_date1
4     4 some_date4

$UK_B
  value       date
2     1 some_date2

$USA_B
  value       date
6     3 some_date6

I am not sure what you neeed to do after this, though note that you can work on your whole data without splitting. ie using group_by:

Upvotes: 1

Related Questions