Reputation: 74
I'm struggling with a potencially silly problem, however I've yet to figure out how to solve it.
Shortly - I have a dataframe containing string and numerical columns. String columns define a key. I have to find all the keys and then for each key filter the dataframe so that the string columns are equal to the key and later do some calculations on the numerical columns. The problem is that the string column names are not fixed (it's supposed to be a generic function and a dataframe can have different number and names of string columns).
My solution so far (it works but it's painfully slow, I'm trying to make it vectorized so that it's faster):
only_string_columns <- df %>% select_if(is.character) # get only the string columns
only_numerical_columns <- df%>% select_if(is.numeric)
string_columns_names <- names(only_string_columns)
numerical_columns_names <- names(only_numerical_columns)
all_keys <- unique(only_string_columns) # get all unique keys
for(i in 1:nrow(all_keys)) {
current_key <- all_keys[i, ]
mask <- foreach(d=iter(only_string_columns, by='row'), .combine=rbind) %dopar% identical(unlist(d), unlist(current_key)) # create a boolean mask for rows that are equal to that specific key row
selected_rows <- df[mask[,], ] # select only those that are equal to the key row
Previously I was trying to create a mask like this:
mask <- only_string_columns == current_key
OR
mask <- which(only_string_columns == current_key)
But they raised this error:
‘==’ only defined for equally-sized data frames
An example of what a dataframe could look like:
country category value date
1 USA A 1 some_date1
2 UK B 1 some_date2
3 UK A 3 some_date3
4 USA A 4 some_date4
5 UK A 2 some_date5
6 USA B 3 some_date6
Then the string columns are country and category and value is numerical.
Selected only string columns:
country category
1 USA A
2 UK B
3 UK A
4 USA A
5 UK A
6 USA B
Unique keys:
country category
1 USA A
2 UK B
3 UK A
4 USA B
So the first unique key is:
country category
1 USA A
After filtering, I should get this output:
country category
1 USA A
4 USA A
Using only the filtered rows in the original df:
country category value date
1 USA A 1 some_date1
4 USA A 4 some_date4
Note: simple subsetting like (df$country == key$country) & (df$category == key$category) will not work because another dataframe that in the future could be passed as an argument could have different string columns names (so instead of a country it could be a continent, market, anything).
Edit: Later, after the filtering is done, the pairs (date, value) are created. So I match each date with a corresponding numerical value. In that case it would be:
(some_date1, 1), (some_date4, 4)
The pairs have to be created for each key, so for the next key (UK, B) that would be:
(some_date2, 1)
Upvotes: 0
Views: 159
Reputation: 79228
you are looking for the split
function;
split(df[3:4], df[1:2], sep='_')
$UK_A
value date
3 3 some_date3
5 2 some_date5
$USA_A
value date
1 1 some_date1
4 4 some_date4
$UK_B
value date
2 1 some_date2
$USA_B
value date
6 3 some_date6
I am not sure what you neeed to do after this, though note that you can work on your whole data without splitting. ie using group_by
:
Upvotes: 1