Collin Sullivan

Reputation: 33

Creating new variable and new data rows for country-conflict-year observations

I'm very new to R, still learning the very basics, and I haven't yet figured out how to perform this particular operation, but it would save me lots and lots of labor and time.

I have a dataset of international conflicts with columns for country and dates that looks something like this:

country     dates
Angola      1951-1953
Belize      1970-1972

I would like to reorganize the data to create variables for start year and end year, as well as create a year-observed (call it 'yrobs') column, so the set looks more like this:

country     yrobs  yrstart     yrend
Angola      1951     1951       1953
Angola      1952     1951       1953
Angola      1953     1951       1953
Belize      1970     1970       1972
Belize      1971     1970       1972
Belize      1972     1970       1972

Someone suggested using data frames and a double for-loop, but I got a little confused trying that. Any help would be greatly appreciated, and feel free to use dummy language, as I'm still pretty green to the programming here. Thanks much.

Upvotes: 3

Views: 1986

Answers (1)

Andrie

Reputation: 179398

No need for any for loops here. Use the power of R and its contributed packages, particularly plyr and reshape2.

library(reshape2)
library(plyr)

Create some data:

df <- data.frame(
        country =c("Angola","Belize"),
        dates = c("1951-1953", "1970-1972")
)

Use colsplit in the reshape2 package to split your dates column into two, and cbind the result to the original data frame.

df <- cbind(df, colsplit(df$dates, "-", c("start", "end")))
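If you would rather not load a package for this step, the same split can be sketched in base R with strsplit. This is only an illustration: the data frame name conflicts is made up here so that it does not overwrite df, and it assumes every dates value has the form "start-end".

```r
# Hypothetical copy of the example data (the name `conflicts` is an assumption)
conflicts <- data.frame(
  country = c("Angola", "Belize"),
  dates = c("1951-1953", "1970-1972"),
  stringsAsFactors = FALSE
)

# strsplit returns a list with one character vector per row, e.g. c("1951", "1953")
parts <- strsplit(conflicts$dates, "-", fixed = TRUE)

# Pull the first and second piece out of each element and convert to integer
conflicts$start <- as.integer(sapply(parts, `[`, 1))
conflicts$end   <- as.integer(sapply(parts, `[`, 2))
```

After this, conflicts has the same start and end columns that colsplit produces above.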

Now for the fun bit. Use ddply in package plyr to split, apply and combine (SAC). This takes df, splits it into one piece per country, and applies a function to each piece. The anonymous function inside ddply builds a small data.frame with the country and its observations; the key bit is seq(), which generates the sequence of years from start to end. The power of ddply is that it does all of the splitting, applying and combining in one step. Think of it as a loop in other languages, except that you don't need to keep track of your indexing variables.

ddply(df, .(country), function(x) {
    # x holds the rows for one country; return one row per observed year
    data.frame(
        country = x$country,
        yrobs   = seq(x$start, x$end),
        yrstart = x$start,
        yrend   = x$end
    )
})

And the results:

  country yrobs yrstart yrend
1  Angola  1951    1951  1953
2  Angola  1952    1951  1953
3  Angola  1953    1951  1953
4  Belize  1970    1970  1972
5  Belize  1971    1970  1972
6  Belize  1972    1970  1972
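One caveat worth checking against your real data: the call above groups by country alone, so a country that appears in more than one conflict would hand seq() several start and end values at once, which it does not accept. A minimal base-R sketch that expands each row independently is below. The third row (a second Angola conflict) is invented purely to illustrate the repeated-country case, and the names conflicts, n_years and expanded are assumptions, not part of the original data.

```r
# Example data; the third row is hypothetical, added only to show a repeated country
conflicts <- data.frame(
  country = c("Angola", "Belize", "Angola"),
  yrstart = c(1951L, 1970L, 1975L),
  yrend   = c(1953L, 1972L, 1976L),
  stringsAsFactors = FALSE
)

# Number of years each conflict spans, counting both endpoints
n_years <- conflicts$yrend - conflicts$yrstart + 1L

# Repeat each row once per year it covers
expanded <- conflicts[rep(seq_len(nrow(conflicts)), times = n_years), ]

# sequence(n_years) counts 1..n within each conflict; shift it to start at yrstart
expanded$yrobs <- expanded$yrstart + sequence(n_years) - 1L

# Tidy up the row names and put the columns in the requested order
rownames(expanded) <- NULL
expanded <- expanded[, c("country", "yrobs", "yrstart", "yrend")]
```

Because every row is expanded on its own, this does not depend on country being unique. With plyr you could likely get the same effect by grouping on something that identifies each conflict, such as .(country, start, end).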

Upvotes: 9
