mishytkakyiv

Reputation: 11

How to create a data group (factor variables) in my dataframe based on categorical variables #R

I want to create a factor variable in my data frame based on a categorical variable. My data:

 # A tibble: 159 x 3
   name.country           gpd rate_suicide
   <chr>                <dbl>        <dbl>
 1 Afghanistan          2129.          6.4
 2 Albania             12003.          5.6
 3 Algeria             11624.          3.3
 4 Angola               7103.          8.9
 5 Antigua and Barbuda 19919.          0.5
 6 Argentina           20308.          9.1
 7 Armenia             10704.          5.7
 8 Australia           47350.         11.7
 9 Austria             52633.         11.4
10 Azerbaijan          14371.          2.6
# ... with 149 more rows

I want to create a factor variable, region, with the following levels:

region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))

I want to do this with the dplyr package, choosing the factor level based on name.country, but my attempt doesn't work. Example:

 if (new_data$name.country[new_data$name.country == "N"]) {
  mutate(new_data, region_ = region[1])
} 

How can I solve this problem?

Upvotes: 1

Views: 1165

Answers (2)

Len Greski

Reputation: 10865

Here is a working example that combines data from the question with a file of country and region information from GitHub. H/T to Luke Duncalfe for maintaining the region data, which is:

...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.

regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)

textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"

data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>% 
     left_join(.,regionData,by = c("country" = "name"))

...and the output:

   rowID             country   gdp suicideRate alpha.2 alpha.3 country.code
1      1         Afghanistan  2129         6.4      AF     AFG            4
2      2             Albania 12003         5.6      AL     ALB            8
3      3             Algeria 11624         3.3      DZ     DZA           12
4      4              Angola  7103         8.9      AO     AGO           24
5      5 Antigua and Barbuda 19919         0.5      AG     ATG           28
6      6           Argentina 20308         9.1      AR     ARG           32
7      7             Armenia 10704         5.7      AM     ARM           51
8      8           Australia 47350        11.7      AU     AUS           36
9      9             Austria 52633        11.4      AT     AUT           40
10    10          Azerbaijan 14371         2.6      AZ     AZE           31
      iso_3166.2   region                      sub.region intermediate.region
1  ISO 3166-2:AF     Asia                   Southern Asia                    
2  ISO 3166-2:AL   Europe                 Southern Europe                    
3  ISO 3166-2:DZ   Africa                 Northern Africa                    
4  ISO 3166-2:AO   Africa              Sub-Saharan Africa       Middle Africa
5  ISO 3166-2:AG Americas Latin America and the Caribbean           Caribbean
6  ISO 3166-2:AR Americas Latin America and the Caribbean       South America
7  ISO 3166-2:AM     Asia                    Western Asia                    
8  ISO 3166-2:AU  Oceania       Australia and New Zealand                    
9  ISO 3166-2:AT   Europe                  Western Europe                    
10 ISO 3166-2:AZ     Asia                    Western Asia                    
   region.code sub.region.code intermediate.region.code
1          142              34                       NA
2          150              39                       NA
3            2              15                       NA
4            2             202                       17
5           19             419                       29
6           19             419                        5
7          142             145                       NA
8            9              53                       NA
9          150             155                       NA
10         142             145                       NA

At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.

We can set region to a factor by adding a mutate() function to the dplyr pipeline:

data %>% 
     left_join(.,regionData,by = c("country" = "name")) %>%
     mutate(region = factor(region)) -> mergedData

At this point mergedData$region is a factor.

str(mergedData$region)
table(mergedData$region)

> str(mergedData$region)
 Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)

  Africa Americas     Asia   Europe  Oceania 
       2        2        3        2        1
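One thing worth checking after the join: left_join() matches on exact country names, so any country spelled differently in the two sources ends up with NA for region. The sketch below uses two tiny made-up tables (toyData, toyRegions, and the country "Atlantis" are hypothetical and exist only for illustration) to show how anti_join() lists the rows that found no match. I have not checked which names, if any, differ in the real data.

```r
library(dplyr)

# Hypothetical miniature tables, for illustration only
toyData <- data.frame(country = c("Albania", "Austria", "Atlantis"),
                      stringsAsFactors = FALSE)
toyRegions <- data.frame(name = c("Albania", "Austria"),
                         region = c("Europe", "Europe"),
                         stringsAsFactors = FALSE)

# Rows of toyData whose country has no match in the lookup table
unmatched <- anti_join(toyData, toyRegions, by = c("country" = "name"))
unmatched$country

# Equivalent check after the join: count rows left without a region
joined <- left_join(toyData, toyRegions, by = c("country" = "name"))
sum(is.na(joined$region))
```

With the real data, the same two calls on data and regionData would show which countries need their names fixed before converting region to a factor.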

Now the data is ready for further analysis. We will generate a table of average suicide rates by region.

library(knitr) # for kable
mergedData %>% group_by(region) %>%
     summarise(suicideRate = mean(suicideRate)) %>%
     kable(.)

...and the output:

|region   | suicideRate|
|:--------|-----------:|
|Africa   |         6.1|
|Americas |         4.8|
|Asia     |         4.9|
|Europe   |         8.5|
|Oceania  |        11.7|

When rendered in an HTML / markdown viewer, the kable() output above is displayed as a formatted table.

Upvotes: 0

Joe Erinjeri

Reputation: 1250

Here is how I would approach your problem:

  1. Create a reproducible problem (see How to make a great R reproducible example). Since you already have the data, use dput to make it easier for others to recreate your data in their environment.
dput(yourdf)

structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6, 
3.3)), class = "data.frame", row.names = c(NA, -3L))

raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6, 
3.3)), class = "data.frame", row.names = c(NA, -3L))
  2. Define vectors that specify your regions
  3. Use case_when to separate countries into regions
  4. Use as.factor to convert your character variable to a factor

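library(dplyr) # for %>%, mutate() and case_when()

# Lookup vectors: fill in the remaining countries for each region
asia=c("Afghanistan","India","...","Rest of countries in Asia")
europe=c("Albania","France","...","Rest of countries in Europe")
africa=c("Algeria","Egypt","...","Rest of countries in Africa")

df<-raw_data %>%
  # Assign a region label to each row; anything unmatched becomes "other"
  mutate(region=case_when(
    name.country %in% asia ~ "asia",
    name.country %in% europe ~ "europe",
    name.country %in% africa ~ "africa",
    TRUE ~ "other"
  )) %>%
  # Convert the character labels to a factor
  mutate(region=region %>% as.factor())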

You can check that your variable region is a factor using str:

str(df)

'data.frame':   3 obs. of  4 variables:
 $ name.country: chr  "Afghanistan" "Albania" "Algeria"
 $ gpd         : int  2129 12003 11624
 $ rate_suicide: num  6.4 5.6 3.3
 $ region      : Factor w/ 3 levels "africa","asia",..: 2 3 1
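Note that as.factor() orders the levels alphabetically, which is why the output above starts with "africa". If you want the levels in a specific order, as in your question, you can pass levels to factor() instead. A minimal sketch in base R, where region_chr and region_fct are just illustrative names:

```r
# Region labels as a character vector (same values as in the example above)
region_chr <- c("asia", "europe", "africa")

# as.factor() sorts the levels alphabetically
levels(as.factor(region_chr))

# factor() with an explicit levels argument keeps the order you choose;
# "other" is listed so that the fallback label is a valid level too
region_fct <- factor(region_chr,
                     levels = c("asia", "europe", "africa", "other"))
levels(region_fct)
```

Inside the pipeline this would replace the last step, e.g. mutate(region = factor(region, levels = c("asia", "europe", "africa", "other"))). Any value not listed in levels becomes NA, so the list should cover every label that case_when can produce.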

Upvotes: 1
