Marcel

Reputation: 223

Renaming character variables in a column in data frame - R

I have a data frame that contains a column called ProjectSubject. The data frame is approximately 1,000,000 rows long.

Within the ProjectSubject column, I have lots of different strings. Here is an example:

> unique(unlist(projectdf$ProjectSubject))

 [1] "Applied Learning"                       "Applied Learning, Literacy & Language"
 [3] "Literacy & Language"                    "Special Needs"
 [5] "Literacy & Language, History & Civics"  "Math & Science"
 [7] "History & Civics, Math & Science"       "Literacy & Language, Special Needs"
 [9] "Applied Learning, Special Needs"        "Health & Sports, Special Needs"
[11] "Math & Science, Literacy & Language"    "Literacy & Language, Math & Science"
[13] "Literacy & Language, Music & The Arts"  "Math & Science, Special Needs"
[15] "Health & Sports"                        "Music & The Arts"
[17] "Math & Science, Applied Learning"       "Literacy & Language, Applied Learning"
[19] "Applied Learning, Music & The Arts"     "History & Civics, Literacy & Language"
[21] "Applied Learning, Math & Science"       "Health & Sports, Math & Science"
[23] "Applied Learning, Health & Sports"      "History & Civics"
[25] "History & Civics, Music & The Arts"     "Math & Science, History & Civics"
[27] "Math & Science, Music & The Arts"       "Special Needs, Music & The Arts"
[29] "History & Civics, Applied Learning"     "History & Civics, Special Needs"

I need a succinct, non-manual way to go over the entire column in the data frame and replace a bunch of these strings with a different one. For example, I would like to replace "Applied Learning, Special Needs" with "Special Needs", or similarly replace "Applied Learning, Math & Science" with "Math".

I have about 50 unique strings much like the sample output shown above that I want to reduce to about 10 unique strings. Preferably there is a method where I don't have to manually type a line of code for each of the 50 strings.

Upvotes: 6

Views: 26746

Answers (2)

AndS.

Reputation: 8110

If you already know which strings you want to change, one solution could be to use gsub.

projectdf$ProjectSubject <- gsub("Applied Learning, Special Needs", "Special Needs", projectdf$ProjectSubject)

This would change the string "Applied Learning, Special Needs" to just "Special Needs". Writing 50 gsub calls would be tedious, so some clever regex may help to get around that issue. For example, to change any string that contains "Special Needs" anywhere to just "Special Needs":

projectdf$ProjectSubject <- gsub("^.*Special Needs.*$", "Special Needs", projectdf$ProjectSubject)
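One way to keep the pattern-based approach from turning into 50 separate calls is to hold the patterns in a named vector and loop over it. This is only a minimal sketch: the sample vector subjects and the rules mapping are assumptions made up for illustration, not taken from the question. Rules are tried in order against the original strings, so an earlier rule wins when a string matches more than one.

```r
# Sample values standing in for projectdf$ProjectSubject (assumed for illustration)
subjects <- c("Applied Learning, Special Needs",
              "Special Needs, Music & The Arts",
              "Applied Learning, Math & Science",
              "Health & Sports")

# Hypothetical rules: name = text to look for, value = replacement category
rules <- c("Special Needs"  = "Special Needs",
           "Math & Science" = "Math")

cleaned <- subjects
done <- rep(FALSE, length(subjects))  # tracks which entries already got a category

for (pattern in names(rules)) {
  # fixed = TRUE treats the pattern as a literal string rather than a regex
  hit <- !done & grepl(pattern, subjects, fixed = TRUE)
  cleaned[hit] <- rules[[pattern]]
  done <- done | hit
}

print(cleaned)
```

Strings that match no rule (here "Health & Sports") are left unchanged, which makes it easy to spot what still needs a rule.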

Upvotes: 1

lefft

Reputation: 2105

Here's a way I think is nice:

# first create some fake data that approximates your situation
set.seed(6933)

fruit_words <- c("apple", "orange", "banana", "pappels", "orong", "bernaner")

dat <- data.frame(fruit = sample(fruit_words, size=10, replace=TRUE), 
                  stringsAsFactors=FALSE)

Create a table associating each unique value of dat$fruit with the desired category/string you want to substitute for it:

fruit_lkup <- c(apple="appl", orange="orng", banana="bnna", 
                pappels="appl", orong="orng", bernaner="bnna")

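Then exploit the fact that the values in dat$fruit are exactly the names of fruit_lkup, so indexing the lookup vector by dat$fruit returns the replacement for each row: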

dat$fruit_clean <- as.character(fruit_lkup[dat$fruit])

And here's the result:

print(dat)
##       fruit   fruit_clean
## 1   pappels        appl
## 2     orong        orng
## 3     apple        appl
## 4    banana        bnna
## 5     apple        appl
## 6  bernaner        bnna
## 7  bernaner        bnna
## 8   pappels        appl
## 9  bernaner        bnna
## 10 bernaner        bnna

So really most of the work lies in creating the object you use to look up the values -- fruit_lkup.
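Applied to the column from the question, the same idea might look like the sketch below. The three-row projectdf, the two-entry subject_lkup, and the SubjectClean column name are stand-ins chosen for illustration. The sketch also shows one way to handle strings that are missing from the lookup, which otherwise come back as NA.

```r
# Small stand-in for the question's data frame (assumed for illustration)
projectdf <- data.frame(
  ProjectSubject = c("Applied Learning, Special Needs",
                     "Applied Learning, Math & Science",
                     "Health & Sports"),
  stringsAsFactors = FALSE
)

# Lookup vector: names are the original strings, values are the replacements
subject_lkup <- c("Applied Learning, Special Needs"  = "Special Needs",
                  "Applied Learning, Math & Science" = "Math")

# Index the lookup by the column; strings missing from the lookup give NA.
# If the column were a factor, wrap it in as.character() first, because
# indexing by a factor uses its integer codes rather than its labels.
new_subject <- unname(subject_lkup[projectdf$ProjectSubject])

# Keep the original string wherever no replacement was defined
projectdf$SubjectClean <- ifelse(is.na(new_subject),
                                 projectdf$ProjectSubject,
                                 new_subject)

print(projectdf)
```

Because the lookup is a single vectorized index operation, it should scale to a million-row column without a loop.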

One way to get started is to run dput(unique(dat$fruit)), paste the output into a script, and start supplying the values you want to replace.

If there are too many unique values, you could also write the unique values to a csv, and then manually add the values you want to replace after them. Then you could read in the (now) two-column csv as a data frame (say lookup_df), and create fruit_lkup with fruit_lkup <- setNames(lookup_df$new_values, lookup_df$old_values)
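The csv round trip described above could be sketched roughly as follows. The file path is a temporary file so the example is self-contained, the small dat is made up, and the manual editing step is simulated in code. In practice you would open the file in a spreadsheet or text editor and fill in the second column by hand.

```r
dat <- data.frame(fruit = c("apple", "pappels", "orong", "apple"),
                  stringsAsFactors = FALSE)

# Hypothetical file location; a temporary file keeps the sketch self-contained
lookup_path <- tempfile(fileext = ".csv")

# Step 1: write the unique values out with an empty column to fill in
lookup_df <- data.frame(old_values = unique(dat$fruit),
                        new_values = "",
                        stringsAsFactors = FALSE)
write.csv(lookup_df, lookup_path, row.names = FALSE)

# Step 2 (normally done by hand): fill in new_values and save the file.
# Here the manual edit is simulated programmatically.
lookup_df$new_values <- c("appl", "appl", "orng")
write.csv(lookup_df, lookup_path, row.names = FALSE)

# Step 3: read the edited file back and build the named lookup vector
lookup_df <- read.csv(lookup_path, stringsAsFactors = FALSE)
fruit_lkup <- setNames(lookup_df$new_values, lookup_df$old_values)

# Step 4: apply the lookup exactly as before
dat$fruit_clean <- unname(fruit_lkup[dat$fruit])
print(dat)
```

Keeping the mapping in a file also means it can be reviewed or extended later without touching the script.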

I've found this approach to be quite handy many times, in basically exactly the situation you describe.

Hope this helps ~~

Upvotes: 6
