Reputation: 1535
I am trying to create a look up table in R in order to get my data in the same format as the company that I am working for.
It regards different education categories that I want to merge using dplyr.
library(dplyr)
# Create data
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
tbl_df(data)
# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
# Assign lookup table
data$X1 <- lut[data$X1]
But in my output my old values are replace with the wrong ones, i.e. not the ones that I created in the lookup table. Rather it seems like the lookup table is assign randomly.
Upvotes: 6
Views: 4775
Reputation: 1850
I had a similar problem where I had a dataframe with lots of columns and one of the columns had about 95 different values. I wanted to create another column that grouped (mapped) the 95 values into something more manageable. I created a simple dataframe that had the mappings to columns as a lookup table.
In your case rather than do this step:
# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
You make a simple data frame:
subject = c("Mechanichal Engineering",
"Electric Engineering",
"Political Science"",
"Economics")
category = c("Engineering",
"Engineering",
"Social Science",
"Social Science")
lookup_table = data.frame(subject, category)
You need two libraries to do the hash table and lookup:
library(hash)
library(qdapTools)
Load in a lookup_table
dataframe with the two columns that you want to be the hash table:
In this dataframe column 1 will be the key and column 2 is the lookup value.
# make the hash table
h = hash::hash(keys = lookup_table$subject, values = lookup_table$category)
# create the categories for your education values
# find the match of education in your h table and return the value category from the h table
data$ed_category = hash_look(data$education, h, missing = data$education)
Now your output will look like the following; using the missing parameter just uses Other since it's not in the hash table but in the education column.
#> education ed_category
#> 1 Electric Engineering Engineering
#> 2 Other Other
#> 3 Other Other
#> 4 Other Other
#> 5 Other Other
#> 6 Political Science Social Science
Upvotes: 1
Reputation: 1280
I've just been trying to figure this out myself. I wasn't quite happy with most of the solutions I found, so here's what I ended up with. I added an "other" category to show that it works even if there are values not defined in the lookup table.
library(dplyr)
# Create data
education <- c("Mechanichal Engineering",
"Electric Engineering",
"Political Science",
"Economics",
"Other")
data <- data.frame(X1 = replicate(1, sample(education, 20, rep=TRUE)))
# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
data %>%
mutate(X2 = recode(X1, !!!lut))
#> X1 X2
#> 1 Electric Engineering Engineering
#> 2 Other Other
#> 3 Other Other
#> 4 Other Other
#> 5 Other Other
#> 6 Political Science Social Science
#> 7 Other Other
#> 8 Economics Social Science
#> 9 Political Science Social Science
#> 10 Electric Engineering Engineering
#> 11 Economics Social Science
#> 12 Economics Social Science
#> 13 Mechanichal Engineering Engineering
#> 14 Economics Social Science
#> 15 Political Science Social Science
#> 16 Other Other
#> 17 Other Other
#> 18 Other Other
#> 19 Mechanichal Engineering Engineering
#> 20 Political Science Social Science
Upvotes: 4
Reputation: 2722
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
lut <- list("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
lut2<-melt(lut)
data1 <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
data1$new <- lut2[match(data1$X1,lut2$L1),'value']
head(data1)
======================= ==============
X1 new
======================= ==============
Political Science Social Science
Political Science Social Science
Mechanichal Engineering Engineering
Mechanichal Engineering Engineering
Political Science Social Science
Political Science Social Science
======================= ==============
Upvotes: 2
Reputation: 1535
I found that the best way to do this is to use recode()
from the car
package
# Observe that dplyr also has a recode function, so require car after dplyr
require(dplyr)
require(car)
The data are four education categories that are sampled from.
education <- c("Mechanichal Engineering",
"Electric Engineering","Political Science","Economics")
data <- data.frame(ID = c(1:1000), X1 = replicate(1,sample(education,1000,rep=TRUE)))
Using recode()
on the data I recode the categories
lut <- data.frame(ID = c(1:1000), X2 = recode(data$X1, '"Economics" = "Social Science";
"Electric Engineering" = "Engineering";
"Political Science" = "Social Science";
"Mechanichal Engineering" = "Engineering"'))
To see if it performed correct, join the original data and the recoded data
data <- full_join(data, lut, by = "ID")
head(data)
ID X1 X2
1 1 Political Science Social Science
2 2 Economics Social Science
3 3 Electric Engineering Engineering
4 4 Political Science Social Science
5 5 Economics Social Science
6 6 Mechanichal Engineering Engineering
With recode you don't have to sort the data before recoding it.
Upvotes: 0