Doo Hyun Shin

Reputation: 298

How to group people by address in demographic data

My data set is in Korean, not English, and has more than 3000 observations.

The data set is named demo.

str(demo)

Each row holds the information for one person.

$ 거주지역: Factor w/ 900 levels "","강원 강릉시 포남1동",..: 595 235 595 832 12 126 600 321 600 589 ...

Above is the structure of the 4th column of the data set, 거주지역 (area of residence).

I want to group people by this 4th column, which holds their addresses. The problem is that the factor has 900 levels, because each address is written out in full.

I want to assign each person to a province, so R needs to read the factor values and pick out the part of each address that identifies the province.

How can I do this? Any help would be appreciated. I have searched for a long time but could not find a solution.

Upvotes: 1

Views: 68

Answers (1)

andybega

Reputation: 1437

Here's a possible start; I'm not sure how it will work with non-Latin characters.

foo <- data.frame(value=rnorm(3), 
                  address=c("blah blah province1", "blah blah province2", "province3"),
                  stringsAsFactors=FALSE)

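# Split each address into words, then stack the word vectors into a matrix.
# Note: rbind recycles the words of shorter addresses to fill each row.
words <- strsplit(foo$address, " ")
words <- do.call(rbind, words)
# The province is the third (last) word.
foo$province <- words[, 3]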

head(foo)

Output:

       value             address  province
1 0.01129269 blah blah province1 province1
2 0.99160104 blah blah province2 province2
3 1.59396745           province3 province3

Judging by this wiki page on South Korean address formats, if the city and province (ward?) always come at the beginning of the address, then it's a bit easier, and we can avoid rbind, which in the code above recycles the words of shorter addresses.

foo <- data.frame(value=rnorm(3), 
                  address=c("seoul ward1 street", "seoul ward2 street", "not-seoul ward-something street"),
                  stringsAsFactors=FALSE)

# Split each address once, then take the first word (city) and second word (ward).
parts <- strsplit(foo$address, split=" ")
foo$city <- sapply(parts, `[`, 1)
foo$ward <- sapply(parts, `[`, 2)

Now we can use ifelse to group by ward for addresses in Seoul and by city otherwise.

foo$group <- with(foo, ifelse(city=="seoul", ward, city))
foo

       value                         address      city           ward     group
1  1.0071995              seoul ward1 street     seoul          ward1     ward1
2  0.7192918              seoul ward2 street     seoul          ward2     ward2
3 -0.6047117 not-seoul ward-something street not-seoul ward-something not-seoul
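
As a rough sketch of how this might carry over to the Korean data: assuming the province is always the first whitespace-separated word of each address (as in "강원 강릉시 포남1동" from the question), you could convert the factor to character and keep only the first word. The data frame below is a hypothetical stand-in for demo; the column name address and every row except the first are made up for illustration, and I have not tried this on the real data.

```r
# Hypothetical stand-in for the question's data: only the first address
# comes from the post, the rest are made up for illustration.
demo <- data.frame(
  address = factor(c("강원 강릉시 포남1동", "서울 종로구 사직동",
                     "", "강원 춘천시 효자동"))
)

# Convert the factor to character before doing any string handling.
addr <- trimws(as.character(demo$address))

# Split on runs of whitespace; the first word is assumed to be the province.
parts <- strsplit(addr, "[[:space:]]+")

# Take the first word, or NA when the address is empty (the "" level).
first_word <- function(p) {
  if (length(p) > 0 && nzchar(p[1])) p[1] else NA_character_
}
demo$province <- vapply(parts, first_word, character(1))

# Count people per province; empty addresses are counted as NA.
table(demo$province, useNA = "ifany")
```

With the real data you would use the 4th column (for example demo[[4]]) in place of demo$address. The 900 address levels should then collapse to one value per province, provided the first word really is the province in every row.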

Upvotes: 1
