Reputation: 223
I have a data frame that contains a column called ProjectSubject. The data frame is approximately 1,000,000 rows long.
Within the ProjectSubject column, I have lots of different strings. Here is an example:
> unique(unlist(projectdf$ProjectSubject))
 [1] "Applied Learning"                        "Applied Learning, Literacy & Language"
 [3] "Literacy & Language"                     "Special Needs"
 [5] "Literacy & Language, History & Civics"   "Math & Science"
 [7] "History & Civics, Math & Science"        "Literacy & Language, Special Needs"
 [9] "Applied Learning, Special Needs"         "Health & Sports, Special Needs"
[11] "Math & Science, Literacy & Language"     "Literacy & Language, Math & Science"
[13] "Literacy & Language, Music & The Arts"   "Math & Science, Special Needs"
[15] "Health & Sports"                         "Music & The Arts"
[17] "Math & Science, Applied Learning"        "Literacy & Language, Applied Learning"
[19] "Applied Learning, Music & The Arts"      "History & Civics, Literacy & Language"
[21] "Applied Learning, Math & Science"        "Health & Sports, Math & Science"
[23] "Applied Learning, Health & Sports"       "History & Civics"
[25] "History & Civics, Music & The Arts"      "Math & Science, History & Civics"
[27] "Math & Science, Music & The Arts"        "Special Needs, Music & The Arts"
[29] "History & Civics, Applied Learning"      "History & Civics, Special Needs"
I need a succinct, non-manual way to go over the entire column in the data frame and replace a bunch of these strings with a different one. For example, I would like to replace "Applied Learning, Special Needs" with "Special Needs", or similarly replace "Applied Learning, Math & Science" with "Math".
I have about 50 unique strings much like the sample output given above that I want to reduce to about 10 unique strings. Preferably there is a method where I don't have to manually type a line of code for each of the 50 strings.
Upvotes: 6
Views: 26746
Reputation: 8110
If you already know which strings you want to change, one solution could be to use gsub.
projectdf$ProjectSubject <- gsub("Applied Learning, Special Needs", "Special Needs", projectdf$ProjectSubject)
This would change the string "Applied Learning, Special Needs" to just "Special Needs". It may be tedious with 50 gsub calls, so some clever regex may help to get around that issue. For example, if any string contains "Special Needs" at all, change to "Special Needs":
# match the whole string, so text after "Special Needs" is dropped as well
projectdf$ProjectSubject <- gsub("^.*Special Needs.*$", "Special Needs", projectdf$ProjectSubject)
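To avoid writing one gsub call per string, the same keyword idea can be driven by a vector of keywords. This is only a minimal sketch, assuming each combined subject should collapse to the first keyword that matches, in a priority order you choose yourself. The `keywords` vector, its order, and the sample `subjects` are made up for illustration:

```r
# Hypothetical sample of ProjectSubject values
subjects <- c("Applied Learning, Special Needs",
              "Special Needs, Music & The Arts",
              "Applied Learning, Math & Science",
              "Health & Sports")

# Assumed priority order: earlier keywords win when several match
keywords <- c("Special Needs", "Math & Science", "Health & Sports")

collapsed <- subjects
done <- rep(FALSE, length(subjects))
for (kw in keywords) {
  # fixed = TRUE matches the keyword literally rather than as a regex
  hit <- !done & grepl(kw, subjects, fixed = TRUE)
  collapsed[hit] <- kw
  done <- done | hit
}
print(collapsed)
```

Strings that match none of the keywords are left unchanged, so you can inspect what remains and extend `keywords` as needed.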
Upvotes: 1
Reputation: 2105
Here's a way I think is nice:
# first create some fake data that approximates your situation
set.seed(6933)
fruit_words <- c("apple", "orange", "banana", "pappels", "orong", "bernaner")
dat <- data.frame(fruit = sample(fruit_words, size=10, replace=TRUE),
                  stringsAsFactors=FALSE)
Create a table associating each unique value of dat$fruit with the desired category/string you want to substitute for it:
fruit_lkup <- c(apple="appl", orange="orng", banana="bnna",
                pappels="appl", orong="orng", bernaner="bnna")
Then exploit the fact that dat$fruit holds the names of fruit_lkup:
dat$fruit_clean <- as.character(fruit_lkup[dat$fruit])
And here's the result:
print(dat)
##       fruit fruit_clean
## 1   pappels        appl
## 2     orong        orng
## 3     apple        appl
## 4    banana        bnna
## 5     apple        appl
## 6  bernaner        bnna
## 7  bernaner        bnna
## 8   pappels        appl
## 9  bernaner        bnna
## 10 bernaner        bnna
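The same lookup idea carries over to the ProjectSubject column from the question. Here is a rough sketch, assuming a hypothetical `subject_lkup` mapping and a tiny stand-in for projectdf. Indexing a named vector by a name it does not contain gives NA, so the sketch keeps the original string in that case:

```r
# Tiny stand-in for the real projectdf (illustrative values only)
projectdf <- data.frame(
  ProjectSubject = c("Applied Learning, Special Needs",
                     "Applied Learning, Math & Science",
                     "Health & Sports"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: names are old strings, values are replacements
subject_lkup <- c("Applied Learning, Special Needs"  = "Special Needs",
                  "Applied Learning, Math & Science" = "Math")

# as.character() guards against the column being a factor
old_subject <- as.character(projectdf$ProjectSubject)
new_subject <- unname(subject_lkup[old_subject])

# Strings missing from the lookup come back as NA; keep the original there
unmatched <- is.na(new_subject)
new_subject[unmatched] <- old_subject[unmatched]

projectdf$ProjectSubject <- new_subject
print(projectdf)
```

Only the strings you actually want to change need an entry in the lookup; everything else passes through untouched.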
So really most of the work lies in creating the object you use to look up the values: fruit_lkup.
One way to get started is just use dput(unique(dat$fruit)), then paste that into a script, and start supplying the replacement values.
If there are too many unique values, you could also write the unique values to a csv, and then manually add the replacement value for each one in a second column. Then you could read in the (now) two-column csv as a data frame (say lookup_df), and create fruit_lkup with fruit_lkup <- setNames(lookup_df$new_values, lookup_df$old_values)
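The csv round trip could look roughly like this. It is a sketch only: the file path is a temporary file, the column names old_values and new_values follow the setNames call above, and the manual editing step is simulated in code so the example runs end to end:

```r
dat <- data.frame(fruit = c("apple", "pappels", "orong", "apple"),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")  # placeholder path for illustration

# Step 1: write the unique values with an empty column to fill in
write.csv(data.frame(old_values = unique(dat$fruit), new_values = ""),
          csv_path, row.names = FALSE)

# Step 2: normally you would fill in new_values by hand in a spreadsheet;
# here that manual edit is simulated
lookup_df <- read.csv(csv_path, stringsAsFactors = FALSE)
lookup_df$new_values <- c("appl", "appl", "orng")
write.csv(lookup_df, csv_path, row.names = FALSE)

# Step 3: read the two-column csv back in and build the lookup vector
lookup_df <- read.csv(csv_path, stringsAsFactors = FALSE)
fruit_lkup <- setNames(lookup_df$new_values, lookup_df$old_values)

dat$fruit_clean <- unname(fruit_lkup[dat$fruit])
print(dat)
```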
I've found this approach to be quite handy many times, in basically exactly the situation you describe.
Hope this helps ~~
Upvotes: 6