Reputation: 153
I have created df
gvkey year
1 1004 1965
2 1004 1966
3 1004 1967
4 1005 1950
5 1005 1951
...
gvkey
= company id
My dataframe contains more than 1,200 unique gvkeys
with different starting values for year
.
I want to create a column in df
called ipo_date
which equals the first entry in the year
column for a particular gvkey
.
gvkey year ipo_date
1 1004 1965 1965
2 1004 1966 1965
3 1004 1967 1965
4 1005 1950 1950
5 1005 1951 1950
...
Afterwards I want to delete all duplicate rows for ipo_date
and delete the column year
gvkey ipo_date
1 1004 1965
2 1005 1950
...
Thank you so much!!!
Upvotes: 0
Views: 47
Reputation: 771
This should do the trick:
library(tidyverse)
df %>%
group_by(gvkey) %>%
mutate(ipo_date = min(year)) %>%
select(-year) %>%
distinct()
Output:
# A tibble: 2 x 2
# Groups: gvkey [2]
gvkey ipo_date
<dbl> <dbl>
1 1004 1965
2 1005 1950
Upvotes: 1
Reputation: 1224
Here is a one line solution:
library(plyr)
df <- data.frame(gvkey = c(1004, 1004, 1004, 1005, 1005),
year = c(1965, 1966, 1967, 1950, 1951))
df_agg <- ddply(df, c('gvkey'), summarise, ipo_date = min(year))
> df_agg
gvkey ipo_date
1 1004 1965
2 1005 1950
Upvotes: 1
Reputation: 3116
library(data.table)
df = as.data.table(df)
df = df[order(gvkey,year)]
df[,`:=`(ipo_date=min(year)),by=c("gvkey")][,`:=`(year=NULL)]
df = unique(df)
Upvotes: 1
Reputation: 389135
You could select the first row of every gvkey
library(dplyr)
df %>% group_by(gvkey) %>% slice(1L) %>% rename_at(2, ~"ipo_date")
# gvkey ipo_date
# <int> <int>
#1 1004 1965
#2 1005 1950
Using the same logic in data.table
and base R, we can do
library(data.table)
setDT(df)[, .SD[1L], gvkey]
and
aggregate(year~gvkey, df, head, 1)
Upvotes: 1