user5250820

Using FOR loop for finding out the sum of variables?

I have a data frame with 6,497,651 observations of 6 variables, which I got from the National Emissions Inventory website. It has the following variables:

fips    SCC       Pollutant     Emissions    type    year
09001   10100401  PM25          15.14        POINT   1999
09001   10100402  PM25          234.75       POINT   1999

Here, fips is the county code, SCC is the name of the source string, Pollutant is the type of pollutant (PM2.5 in this case), Emissions is the amount of the pollutant emitted in tons, type is the type of source that emitted the pollutant (road, non-road, point, etc.), and year records the year, from 1999 to 2008.

I have to draw a simple line plot showing how the level of emissions changes from year to year. The year 1999 alone has over a thousand observations, and the same goes for the other years up to 2008. The problem is not difficult in itself: I could build a new data frame for each year containing the sum of all the emissions recorded, and then row-bind those subsetted data frames. A more efficient and tidier way might be to use a for loop that calculates the sum of all the values under 'Emissions' for each year and stores the results in a new data frame, but I am stuck on where to start. What is the exact syntax for calculating the sum of the values for each year? I should end up with a data frame that looks something like this:

Year    Emissions

where Emissions is the sum of all emissions recorded in that specific year.

Upvotes: 1

Answers (2)

akrun

Reputation: 887501

A dplyr/ggplot option. We group by 'year', get the sum of 'Emissions' using summarise and plot with ggplot.

library(dplyr)
library(ggplot2)

df1 %>%
  group_by(year) %>%
  summarise(Emissions = sum(Emissions)) %>%
  ggplot(aes(x = year, y = Emissions)) +
  geom_line()

Or this can be done directly within ggplot

# ggplot2 >= 3.3.0 uses `fun`; older versions used the now-deprecated `fun.y`
ggplot(df1, aes(x = year, y = Emissions)) +
  stat_summary(fun = sum, geom = "line")
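If you would rather not load any packages, the same group-and-sum step can be sketched in base R with aggregate(). This is only an illustration: the toy values below are made up, and df1 and totals are just names assumed for the example.

```r
# Toy data with the same two columns used in the question
# (values are invented for illustration only).
df1 <- data.frame(
  year      = c(1999, 1999, 2002, 2002, 2005),
  Emissions = c(15.14, 234.75, 10, 20, 5)
)

# Sum Emissions within each year; the result has one row per year,
# sorted by year, with columns `year` and `Emissions`.
totals <- aggregate(Emissions ~ year, data = df1, FUN = sum)
print(totals)
```

The resulting totals data frame has one row per year, so it can be handed to ggplot with geom_line() in the same way as the summarised data above.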

Upvotes: 0

Maksim Gayduk

Reputation: 1082

The data.table package is probably the most efficient way to handle this kind of task. The syntax for calculating the sum of emissions for every year looks like this (assuming your data is stored in dt):

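library(data.table)
dt <- as.data.table(dt)  # convert the data frame to a data.table
# one row per year, holding the summed emissions
dt[, .(Emissions = sum(Emissions)), by = year]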
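For comparison, the explicit for loop the question asks about could look roughly like the sketch below. It uses plain base R on a made-up toy data frame; years, totals and result are names assumed for the example, and on the full data set the grouped one-liner is likely to be both shorter and faster.

```r
# Toy stand-in for the question's data (values are invented for illustration).
dt <- data.frame(
  year      = c(1999, 1999, 2002, 2002, 2005),
  Emissions = c(15.14, 234.75, 10, 20, 5)
)

years  <- sort(unique(dt$year))
totals <- numeric(length(years))  # preallocate one slot per year

for (i in seq_along(years)) {
  # Sum the emissions of the rows that belong to the i-th year.
  totals[i] <- sum(dt$Emissions[dt$year == years[i]])
}

# Collect the results in the Year / Emissions layout from the question.
result <- data.frame(Year = years, Emissions = totals)
print(result)
```

Preallocating totals avoids growing a vector inside the loop, which matters more as the number of groups increases.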

Upvotes: 1
