CarloF

Reputation: 11

If value in a column starts with...mutate another column with given text, in R

I'm trying to write conditional logic that sets the "city" column of a data frame to a given city name when the value in the "zipcode" column starts with a given digit.

For example: if zipcode starts with 1, set city to "NYC"; else if zipcode starts with 6, set city to "Chicago"; else if zipcode starts with 2, set city to "DC"; and so on.

From:

city              zipcode
NYC               11211
DC                20910
NYC               11104
NA                11106
NA                2008
NA                60614

To:

city              zipcode
NYC               11211
DC                20910
NYC               11104
NYC               11106
DC                2008
Chicago           60614

This is a way to deal with NA values: the logic would simply rewrite the same city where one is already present, and fill in the city name where there is an NA.

The data frame is named data.frame, and the columns are zipcode and city. Both are of type factor and have to remain so for my later models.

I want to mutate the data frame directly, as I will need it for further use.

PS: Sorry for the bad writing. I'm new to the community.

Thanks in advance!

Upvotes: 1

Views: 2476

Answers (1)

Nik

Reputation: 116

Here's a solution that might work for you.

Full code:

# load library
library(tidyverse)

# create the sample dataframe
df <- tribble(~city, ~zipcode,
              'NYC',11211,
              'DC',20910,
              'NYC', 11104,
              NA, 11106,
              NA, 2008,
              NA, 60614)

# set city from the first digit of zipcode
# (fills the NAs and rewrites existing values; other zipcodes keep their city)
df <- df %>%
  mutate(
    city = case_when(
      str_sub(zipcode, 1, 1) == '1' ~ 'NYC',
      str_sub(zipcode, 1, 1) == '2' ~ 'DC',
      str_sub(zipcode, 1, 1) == '6' ~ 'Chicago',
      TRUE ~ city
    )
  )

# convert everything to factors
df <- df %>%
  mutate(
    city = as.factor(city),
    zipcode = as.factor(zipcode)
  )

# preview the output
glimpse(df)

The output of glimpse() is:

Observations: 6
Variables: 2
$ city    <fct> NYC, DC, NYC, NYC, DC, Chicago
$ zipcode <fct> 11211, 20910, 11104, 11106, 2008, 60614

The trick I used was to first keep everything as a string or number, fill in the missing values, and only then convert to factor.
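If the columns are already factors, as in the question, and only the NA cities should be filled, the same idea can be sketched in base R. This is a minimal sketch, not a tested drop-in: the names `df`, `city_by_digit`, `city_chr`, `first_digit` and `missing` are assumptions made for illustration, and the lookup covers only the three digit-to-city pairs shown in the example data.

```r
# Hypothetical data frame with factor columns, mirroring the question's example
df <- data.frame(
  city    = factor(c("NYC", "DC", "NYC", NA, NA, NA)),
  zipcode = factor(c("11211", "20910", "11104", "11106", "2008", "60614"))
)

# Assumed lookup: first digit of the zipcode -> city name
city_by_digit <- c("1" = "NYC", "2" = "DC", "6" = "Chicago")

# Work on character copies, because a factor cannot take a value
# that is not already one of its levels (e.g. "Chicago" here)
city_chr    <- as.character(df$city)
first_digit <- substr(as.character(df$zipcode), 1, 1)

# Fill only the rows where city is missing; existing cities stay untouched.
# Digits that are not in the lookup leave the city as NA.
missing <- is.na(city_chr)
city_chr[missing] <- unname(city_by_digit[first_digit[missing]])

# Convert back to factor so the column type is the same as before
df$city <- factor(city_chr)
```

Unlike the case_when() version above, this leaves cities that are already present as they are, so a row whose existing city disagrees with its zipcode is not rewritten.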

Upvotes: 1
