Reputation: 516
I have a dataframe with 1000 observations belonging to n different countries. Each country has more than one observation, and the number of observations differs between countries. I need to create a column with numbers going from (1 to n-1), with each number corresponding to a different country. That is, I am creating a dummy variable and I don't care which country gets which number; I just need to create such dummies. My data look something like this:
   Region     x
1     be1 71615
4   be211 54288
5   be112 51158
6   it213 69856
8   it221 71412
9   uk222 79537
10  de101 94827
11  de10a 98273
12  dea10 92827
..     ..    ..
Each country has its own "code" in the column Region; for instance, beXXXX corresponds to Belgium, ukXXX to the United Kingdom, and so on. Hence I suppose I could exploit the first two letters in the column Region to create my dummies. I know from here that the command grep() could do the job, but I need a script that automatically switches from 1 to n-1 whenever the initial letters of the Region change.
The expected output should be like this
   Region     x Dummy
1     be1 71615     1
4   be211 54288     1
5   be112 51158     1
6   it213 69856     2
8   it221 71412     2
9   uk222 79537     3
10  de101 94827     4
11  de10a 98273     4
12  dea10 92827     4
..     ..    ..    ..
and in this case 1 corresponds to "be" (Belgium), 2 to "it" (Italy), and so on for the n countries in my sample.
Upvotes: 1
Views: 467
Reputation: 121568
Another option, using gsub, is:
gsub('.*(^[a-z]{2}).*','\\1',c('de111', 'de11a','dea11'))
[1] "de" "de" "de"
Then you use factor and as.integer as shown in the other answer.
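Putting the two steps together might look like the sketch below. It is only an illustration under a few assumptions: the data frame name `df` and the column name `Dummy` are taken from the question, the rows are the question's sample rows, and every Region is assumed to start with a two-letter lowercase country code. It uses `sub` with the anchor moved to the front of the pattern instead of the `gsub` call above; for inputs like these it should extract the same two letters.

```r
# Sample rows copied from the question (df is an illustrative name)
df <- data.frame(
  Region = c("be1", "be211", "be112", "it213", "it221",
             "uk222", "de101", "de10a", "dea10"),
  x = c(71615, 54288, 51158, 69856, 71412, 79537, 94827, 98273, 92827),
  stringsAsFactors = FALSE
)

# Keep only the two lowercase letters at the start of each Region
code <- sub("^([a-z]{2}).*", "\\1", df$Region)

# Fix the level order to order of first appearance, then take the integer codes
df$Dummy <- as.integer(factor(code, levels = unique(code)))

print(df$Dummy)
```

Passing `levels = unique(code)` matters only if the numbering should follow the order in which countries first appear. Without it, `factor` sorts the levels alphabetically, so "de" would get 2 rather than 4.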
Upvotes: 2
Reputation: 59970
How about creating a factor variable (you can show the underlying integer codes with as.integer)? We use regexec and regmatches to extract the letter codes that occur at the beginning of the Region variable (ignoring letters that occur later) and turn them into the factor...
# Data with an extra row (row number 11)
df <- read.table( text = " Region x
1 be1 71615
4 be211 54288
5 be112 51158
6 it213 69856
8 it221 71412
9 uk222 79537
11 uk222a 79537
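10 de101 94827" , header = TRUE , stringsAsFactors = FALSE )
# Extract the run of lowercase letters at the start of each Region (one list element per row)
levs <- regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) )
# Number the codes in order of first appearance
df$Country <- as.integer( factor( levs , levels = unique(levs ) ) )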
   Region     x Country
1     be1 71615       1
4   be211 54288       1
5   be112 51158       1
6   it213 69856       2
8   it221 71412       2
9   uk222 79537       3
11 uk222a 79537       3
10  de101 94827       4
unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
[1] "be" "be" "be" "it" "it" "uk" "uk" "de"
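For comparison, here is a minimal sketch of the same idea without the intermediate list. It is a variant, not the code from this answer: `regexpr` (rather than `regexec`) makes `regmatches` return a plain character vector, and `match(code, unique(code))` numbers the codes by first appearance. The vector names `region`, `code` and `country` are illustrative. It assumes every Region starts with at least one lowercase letter, because `regmatches` drops elements that do not match, which would leave the result shorter than the input.

```r
# Region values from the example data above
region <- c("be1", "be211", "be112", "it213", "it221",
            "uk222", "uk222a", "de101")

# regexpr locates the first match in each element;
# regmatches then returns the matched text as a character vector
code <- regmatches(region, regexpr("^[a-z]+", region))

# Guard against silently dropped (non-matching) elements
stopifnot(length(code) == length(region))

# Position of each code among the distinct codes, in order of first appearance
country <- match(code, unique(code))

print(country)
```

The `stopifnot` check is there because a Region without a leading lowercase letter would otherwise cause a length mismatch when the result is assigned back to a data frame column.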
Upvotes: 5