Reputation: 725
I'm an R newbie, so please pardon the basic question.
Here's a Dropbox link to a .csv of my data.
I have data from 1990-2010 on countries. My data is wide: Each country is a row, and each year has two columns corresponding to two data sources. However, the data is not complete for some countries. For example, one country row might have NA values in the 1990-1995 columns.
I'd like to create two columns, and for each country row, I'd like the values in these columns to be the earliest non-NA value of each of the two data types.
I'd also like to create two other columns, and for each country row, I'd like the values in these columns to be the earliest non-NA year of each of the two data types.
So the last four columns would be something like:
1990, 12, 1990, 87
1990, 7, 1990, 132
1996, 22, 1996, 173
1994, 14, 1994, 124
Here is my rough semi-pseudo code attempt at what I imagine the nested for loops would look like:
for i in (number of rows){
for j in names(df){
if(is.na(df$j) == FALSE) df$earliest_year = j
}
}
How can I generate these desired four columns? Thanks!
Upvotes: 0
Views: 712
Reputation: 692
You mentioned for loop; so i tried making a for-loop. But you might want to try other R functions, like apply later on. This code is a bit lengthy, hope this helps you:
# read data; i'm assuming the first column is row name and not important
df <- read.csv("wb_wide.csv", row.names = 1)
# get names of columns for the two datasource
# here I used grep to find columns names using NY and SP pattern;
# but if the format is consistentto be alternating,
# you can use sequence of number
dataSourceA <- names(df)[grep(x = names(df), pattern = "NY")]
dataSourceB <- names(df)[grep(x = names(df), pattern = "SP")]
# create new columns for the data set
# if i understand it correctly, first non-NA data from source 1
# and source 2; and then the year of these non-NAs
df$sourceA <- vector(length = nrow(df))
df$yearA <- vector(length = nrow(df))
df$sourceB <- vector(length = nrow(df))
df$yearB <- vector(length = nrow(df))
# start for loop that will iterate per row
for(i in 1:nrow(df)){
# this is a bit nasty; but the point here is to first select columns for source A
# then determine non-NAs, after which select the first and store it in the sourceA column
df$sourceA[i] <- df[i, dataSourceA][which(!is.na(df[i , dataSourceA]))[1]]
# another nasty one; but I used gsub to clean the column name so that the year will be left
# you can also skip this and then just clean afterward
df$yearA[i] <- gsub(x = names(df[i, dataSourceA][which(!is.na(df[i , dataSourceA]))[1]]),
pattern = "^.*X", replacement = "")
# same with the first bit of code, but here selecting from source B
df$sourceB[i] <- df[i, dataSourceB][which(!is.na(df[i , dataSourceB]))[1]]
# same with the second bit for source B
df$yearB[i] <- gsub(x = names(df[i, dataSourceB][which(!is.na(df[i , dataSourceB]))[1]]),
pattern = "^.*X", replacement = "")
}
I tried to make the code specific to your example and wanted output.
Upvotes: 2