Reputation: 298
My data set is in Korean, not English, and has more than 3,000 observations.
The data set's name is demo.
str(demo)
Each row holds information about one person.
$ 거주지역: Factor w/ 900 levels "","강원 강릉시 포남1동",..: 595 235 595 832 12 126 600 321 600 589 ...
Above is the structure of the 4th column of the data set.
I want to group people by the 4th column, which holds their addresses. The problem is that the factor has 900 levels, because every address is written out in full.
I want to assign each person to a group by province, so R needs to read the factor levels and pick out the province name from each address.
How can I do this? I have searched for a long time but could not find an answer.
Upvotes: 1
Views: 68
Reputation: 1437
Here's a possible start, though I'm not sure how it will work with non-Latin characters.
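# Toy data: the province is the last word of each address
foo <- data.frame(value = rnorm(3),
                  address = c("blah blah province1", "blah blah province2", "province3"),
                  stringsAsFactors = FALSE)
words <- strsplit(foo$address, " ")  # list with one vector of words per row
words <- do.call(rbind, words)       # matrix; shorter addresses are recycled to fill the row
foo$province <- words[, 3]           # the third column holds the province
head(foo)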
Output:
       value             address  province
1 0.01129269 blah blah province1 province1
2 0.99160104 blah blah province2 province2
3 1.59396745           province3 province3
Judging by this wiki page on South Korean address formats, if the city and province (ward?) always come at the beginning of the address, then it's a bit easier and we can avoid using rbind, which in the code above recycles shorter addresses.
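# Toy data: the city is the first word of each address and the ward the second
foo <- data.frame(value = rnorm(3),
                  address = c("seoul ward1 street", "seoul ward2 street", "not-seoul ward-something street"),
                  stringsAsFactors = FALSE)
# Split each address on spaces, then keep the first word (city) and the second (ward)
foo$city <- sapply(foo$address, function(x) strsplit(x, split = " ")[[1]][1])
foo$ward <- sapply(foo$address, function(x) strsplit(x, split = " ")[[1]][2])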
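The column in the question is a factor, and strsplit() only accepts character vectors, so the same idea needs a conversion step first. Below is a minimal sketch under some assumptions: the demo data frame and its address column are hypothetical stand-ins for the real data (the real column is 거주지역), every address other than the level shown in the question's str() output is made up for illustration, and the province is assumed to always be the first space-separated word.

```r
# Hypothetical stand-in for the asker's data; the empty string mimics the "" level
demo <- data.frame(
  id = 1:4,
  address = factor(c("강원 강릉시 포남1동", "서울 종로구 사직동",
                     "강원 강릉시 포남1동", ""))
)

# strsplit() fails on factors, so convert to character first
parts <- strsplit(as.character(demo$address), " ", fixed = TRUE)

# Keep the first word of each address; an empty address has no words and becomes NA
demo$province <- vapply(parts, function(w) w[1], character(1))

demo$province
```

With this sketch, blank addresses end up as NA in the province column, so they can be dropped or handled separately before grouping.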
Now we can also use ifelse to group by ward for addresses in Seoul and by city otherwise.
foo$group <- with(foo, ifelse(city=="seoul", ward, city))
foo
       value                         address      city           ward     group
1  1.0071995              seoul ward1 street     seoul          ward1     ward1
2  0.7192918              seoul ward2 street     seoul          ward2     ward2
3 -0.6047117 not-seoul ward-something street not-seoul ward-something not-seoul
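Once a group column like this exists, it can be turned back into a factor with far fewer levels and used for counting or summarising. The following is only a sketch: the values are invented, and the choice of mean as the summary is an assumption about what one might want to compute per group.

```r
# Made-up data with the group column already derived as above
foo <- data.frame(value = c(1.2, 0.7, -0.6, 0.3),
                  group = c("ward1", "ward2", "not-seoul", "ward1"),
                  stringsAsFactors = FALSE)

# Convert the derived column to a factor: one level per group, not per full address
foo$group <- factor(foo$group)
n_groups <- nlevels(foo$group)

# Number of people in each group
counts <- table(foo$group)

# Mean of value within each group
means <- aggregate(value ~ group, data = foo, FUN = mean)

n_groups
counts
means
```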
Upvotes: 1