Fernanda
Fernanda

Reputation: 313

Data Transformation in R - Dummies

I want to work with 4 national football teams (England, Belgium, Germany and France) and n dates

Date        Matches
16.03       England X Brazil
16.03       Belgium X Argentina
16.03       Chile X Japan
16.03       Uruguay X Germany
16.03       Italy x France
17.03       South Korea X India
17.03       Germany X France
17.03       Poland  X Belgium
17.03       Colombia X Russia
18.03       South Africa X Mexico
18.03       China X Japon
18.03       Brazil X Venezuela
...          ...

DESIRED DATAFRAME with dummies. When dummy = 1 the team played. When dummy = 0 the team didnt played (in the day). Important: Only one date per row.

Date   Dummy_england   Dummy_belgium    Dummy_germany    Dummy_France
16.03  1               1                1                1
17.03  0               1                1                1
18.03  0               0                0                0

Thank you very much!!

Upvotes: 3

Views: 80

Answers (2)

brunomp
brunomp

Reputation: 44

Dummy_england <- tapply(Matches, Date, function(x) {
                      grepl("England", paste(x, collapse = ''))*1
                })

Upvotes: 0

akrun
akrun

Reputation: 887118

We can use tidyverse methods.

  1. Extract the selected team from each row of 'Matches' with str_extract
  2. Keep only the rows where there is a match i.e. NA rows are removed with filter
  3. Use pivot_wider to reshape from long to 'wide' after selecting the columns of interest - specify the values_fn as length and values_fill as 0 to change the default NA to 0
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
   mutate(team = str_c('Dummy_', str_extract(Matches, 
       regex('England|Belgium|Germany|France', ignore_case = TRUE)))) %>%
   filter(complete.cases(team)) %>%
   select(-Matches) %>%
   pivot_wider(names_from = team, values_from = team,  
          values_fn = length, values_fill = 0)

-output

# A tibble: 2 x 5
   Date Dummy_England Dummy_Belgium Dummy_Germany Dummy_France
  <dbl>         <int>         <int>         <int>        <int>
1  16.0             1             1             1            1
2  17.0             0             1             1            0

If we want to keep the 'Date' where there are no matches, use complete

df1 %>%
   mutate(team = str_c('Dummy_', str_extract(Matches, 
    regex('England|Belgium|Germany|France', ignore_case = TRUE)))) %>%
   filter(complete.cases(team)) %>%
   select(-Matches) %>%
   pivot_wider(names_from = team, values_from = team, 
        values_fn = length, values_fill = 0) %>% 
   complete(Date = unique(df1$Date), fill = list(Dummy_England = 0,
         Dummy_Belgium = 0, Dummy_Germany = 0, Dummy_France = 0))

-output

# A tibble: 3 x 5
   Date Dummy_England Dummy_Belgium Dummy_Germany Dummy_France
  <dbl>         <dbl>         <dbl>         <dbl>        <dbl>
1  16.0             1             1             1            1
2  17.0             0             1             1            0
3  18.0             0             0             0            0

data

df1 <- structure(list(Date = c(16.03, 16.03, 16.03, 16.03, 16.03, 17.03, 
17.03, 17.03, 17.03, 18.03, 18.03, 18.03), Matches = c("England X Brazil", 
"Belgium X Argentina", "Chile X Japan", "Uruguay X Germany", 
"Italy x France", "South Korea X India", "Germany X France", 
"Poland  X Belgium", "Colombia X Russia", "South Africa X Mexico", 
"China X Japon", "Brazil X Venezuela")), class = "data.frame", row.names = c(NA, 
-12L))

Upvotes: 2

Related Questions