I have a data frame with 6,497,651 observations of 6 variables, which I downloaded from the National Emissions Inventory website. It has the following variables:
fips SCC Pollutant Emissions type year
09001 10100401 PM25 15.14 POINT 1999
09001 10100402 PM25 234.75 POINT 1999
Where fips is the county code, SCC is the name of the source string, Pollutant is the type of pollutant (PM2.5 emissions in this case), Emissions is the amount of the pollutant emitted in tons, type is the type of source that emitted the pollutant (road, non-road, point, etc.), and year records the year, from 1999 to 2008.
I need to draw a simple line plot showing how the total level of emissions changes from year to year. The year 1999 alone has over a thousand observations, and the same goes for every other year up to 2008. I could build a separate data frame for each year holding the sum of its emissions and then row-bind those data frames, but a tidier and more efficient approach might be a for loop that calculates the sum of all the values under 'Emissions' for each year and stores the results in a new data frame. I am stuck on where to start. What syntax calculates the sum of the values for each year? I should end up with a data frame that looks something like this:
Year Emissions
Where Emissions is the sum of all emissions recorded in that specific year.
Upvotes: 1
Views: 228
Reputation: 887501
A dplyr/ggplot option. We group by 'year', get the sum of 'Emissions' using summarise, and plot with ggplot.
library(dplyr)
library(ggplot2)
df1 %>%
  group_by(year) %>%
  summarise(Emissions = sum(Emissions)) %>%
  ggplot(., aes(x = year, y = Emissions)) +
  geom_line()
Or this can be done directly within ggplot:
# 'fun' replaces the 'fun.y' argument, which is deprecated since ggplot2 3.3.0
ggplot(df1, aes(x = year, y = Emissions)) +
  stat_summary(fun = sum, geom = 'line')
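To see the shape of the summarised result before plotting, here is a minimal sketch on made-up data. The toy data frame and its values are hypothetical stand-ins for df1, not taken from the NEI file.

```r
library(dplyr)

# Hypothetical toy data standing in for the real data frame (df1)
toy <- data.frame(
  year      = c(1999, 1999, 2002, 2002, 2005),
  Emissions = c(15.14, 234.75, 10, 20, 5)
)

# One row per year, holding the total of Emissions for that year
totals <- toy %>%
  group_by(year) %>%
  summarise(Emissions = sum(Emissions))

print(totals)
```

The result has one row per distinct year and two columns, year and Emissions, which is the layout the question asks for.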
Upvotes: 0
Reputation: 1082
The data.table package is probably the most efficient package for handling things like that. The syntax to calculate the sum of emissions for every year would be as follows (assuming your data is stored in dt):
library(data.table)
# convert the data frame to a data.table
dt <- as.data.table(dt)
# sum Emissions within each year
dt[, .(Emissions = sum(Emissions)), by = year]
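One detail worth knowing: with by, data.table returns the groups in order of first appearance, whereas keyby also sorts the result by the grouping column. A small sketch on made-up data (the toy table and its values are hypothetical, not from the NEI file):

```r
library(data.table)

# Hypothetical toy data with the years deliberately out of order
dt_toy <- data.table(
  year      = c(2002, 1999, 2002, 1999, 2005),
  Emissions = c(10, 15.14, 20, 234.75, 5)
)

# Groups come back in order of first appearance: 2002, 1999, 2005
by_totals <- dt_toy[, .(Emissions = sum(Emissions)), by = year]

# keyby sorts the result by year: 1999, 2002, 2005
keyby_totals <- dt_toy[, .(Emissions = sum(Emissions)), keyby = year]

print(by_totals)
print(keyby_totals)
```

Either form yields the same totals; keyby is just convenient when the summary table should read in chronological order.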
Upvotes: 1