user1758995
user1758995

Reputation: 31

Filtering data in a data frame

I have a data frame that looks like this:

S1State S1Value S2State S2Value
NSW     20      VIC     30
WA      30      NSW     20

I would like to filter and select the state(from S1State and S2State) that has the maximum value(from S1Value and S2Value). The result should look like this:

SState  SValue
VIC     30
WA      30

I am new to R and have been experimenting with dplyr.

Upvotes: 1

Views: 146

Answers (3)

thelatemail
thelatemail

Reputation: 93813

Just to show that this is far from impossible with standard R tools:

nams <- c("State","Value")
tmp  <- reshape(dt, direction="long", varying=lapply(nams, grep, x=names(dt)),
                v.names=nams, timevar=NULL)
tmp[with(tmp, Value == ave(Value, id, FUN=max)),]
#    State Value id
#2.1    WA    30  2
#1.2   VIC    30  1

Upvotes: 2

www
www

Reputation: 39154

I assume that the OP may have more states in the data frame, such as S3State, S4State, ...

The following solutions are based on this assumption, trying to be able to process more than one states. If there are only two states, the approach proposed by @lebelinoz is simple and straightforward.

Solution 1

A solution using functions from dplyr and tidyr. dt2 is the final output.

# Load packages
library(dplyr)
library(tidyr)

# Process the data
dt2 <- dt %>%
  gather(Num, Value, contains("Value")) %>%
  gather(State, Name, contains("State")) %>%
  # Only keep records with the same state number
  filter(substring(Num, 1, 2) == substring(State, 1, 2)) %>%
  mutate(Group = substring(Num, 1, 2)) %>%
  group_by(Group) %>%
  filter(Value == max(Value)) %>%
  ungroup() %>%
  select(SState = Name, SSValue = Value)

Solution 2

A solution using functions from dplyr, purrr, and stringr. I loaded the package tidyverse for the first two packages. Again, dt2 is the final output.

# Load packages
library(tidyverse)
library(stringr)

# Extract the column names
Col <- colnames(dt)

# Extract state numbers
ColNum <- Col %>%
  str_extract(pattern = "[0-9]") %>%
  unique()

# Design a function to process the data
dt_process <- function(pattern, dt){
  dt2 <- dt %>%
    # Extract columns based on a pattern (numbers)
    select(dplyr::contains(pattern)) %>%
    # Rename the columns
    rename_all(~sub(pattern, "", .)) %>%
    # Filter the maximum row
    filter(SValue == max(SValue))
  return(dt2)
}

# Apply the dt_process function
dt_list <- map(.x = ColNum, .f = dt_process, dt = dt)

# Bind all data frames
dt2 <- bind_rows(dt_list) %>% arrange(SState)

Data Preparation

# Create example data frame
    dt <- read.table(text = "S1State S1Value S2State S2Value
                     NSW     20      VIC     30
                     WA      30      NSW     20",
                     header = TRUE, stringsAsFactors = FALSE)

Upvotes: 1

lebelinoz
lebelinoz

Reputation: 5068

The answer I was hinting at is as follows:

library(dplyr)
dt <- read.table(text = "S1State S1Value S2State S2Value
                 NSW     20      VIC     30
                 WA      30      NSW     20",
                 header = TRUE, stringsAsFactors = FALSE)
answer = dt %>% 
  mutate(SState = ifelse(S1Value > S2Value, S1State, S2State), 
         SValue = ifelse(S1Value > S2Value, S1Value, S2Value)) %>%
  select(SState, SValue)

Upvotes: 2

Related Questions