Vectorize a manipulation of a portion of DataFrame

Question

This is for an assignment. I'm not asking for it to be solved for me; I want to solve it in the most Pythonic way (what is the point of writing 100 for loops when techniques like vectorization exist?). I'm also stuck at one point and don't understand why my code isn't working.

The Task

My dataset is a subset of data from the National Centers for Environmental Information, consisting of daily climate records from thousands of land stations. A sample is shown below. I'm supposed to clean the data and plot the temperature for each of the 365 days of the year.

df
    ID          Date        Element Data_Value
0   USW00094889 2014-11-12  TMAX    2.2
1   USC00208972 2009-04-29  TMIN    5.6
2   USC00200032 2008-05-26  TMAX    27.8
3   USC00205563 2005-11-11  TMAX    13.9
4   USC00200230 2014-02-27  TMAX    -10.6
5   USW00014833 2010-10-01  TMAX    19.4
6   USC00207308 2010-06-29  TMIN    14.4
7   USC00203712 2005-10-04  TMAX    28.9
8   USW00004848 2007-12-14  TMIN    -1.6
9   USC00200220 2011-04-21  TMAX    7.2

df.shape
(165085, 4)

My Strategy

1- Split the DF into two DFs, one for Element='TMAX' and the other for Element='TMIN', because I couldn't find a way to group by 'Date' and 'Element' and get the result for each in a separate column with one vectorized command.

2- Group by 'Date' and aggregate 'Data_Value' with MAX for DFMAX and MIN for DFMIN.

3- Merge both DFs with how='outer', using the index of both (left_index=True, right_index=True).

< 1st checkpoint: is there a single command that is considered Pythonic and professional for this task, instead of these 3 steps? >
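To make the checkpoint concrete, here is a minimal sketch on made-up rows (the sample values and the names `tmax`, `tmin`, `merged`, `stats`, and `wide` are hypothetical, not taken from the real dataset). It first follows steps 1 to 3 as described, then shows one candidate for a shorter form: group by both 'Date' and 'Element', aggregate with max and min, and unstack 'Element' into columns. Whether this counts as the most Pythonic form is a judgment call.

```python
import pandas as pd

# Made-up sample rows in the same layout as the data above
df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'B', 'A', 'A', 'A'],
    'Date': ['2008-02-28', '2008-02-28', '2008-02-28', '2008-02-28',
             '2008-03-01', '2008-03-01', '2009-03-01'],
    'Element': ['TMAX', 'TMAX', 'TMIN', 'TMIN', 'TMAX', 'TMIN', 'TMAX'],
    'Data_Value': [5.0, 7.0, -3.0, -1.0, 10.0, 0.0, 12.0],
})
df['Date'] = pd.to_datetime(df['Date'])

# Steps 1-3 as described: split, aggregate per date, outer-merge on the index
tmax = df[df['Element'] == 'TMAX'].groupby('Date')['Data_Value'].max().to_frame('TMAX')
tmin = df[df['Element'] == 'TMIN'].groupby('Date')['Data_Value'].min().to_frame('TMIN')
merged = pd.merge(tmax, tmin, how='outer', left_index=True, right_index=True)

# Candidate shorter form: one groupby, then move 'Element' into the columns
stats = df.groupby(['Date', 'Element'])['Data_Value'].agg(['max', 'min']).unstack('Element')
# Keep only the daily max of TMAX and the daily min of TMIN
wide = stats.loc[:, [('max', 'TMAX'), ('min', 'TMIN')]]
wide.columns = ['TMAX', 'TMIN']

print(wide)
```

On this sample, a date that has only TMAX readings keeps its row and gets NaN in the TMIN column, which matches what the outer merge produces.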

4- Remove leap days.

5- Add the 'Day_of_Year' column, which holds the day-of-year number. This is where I'm stuck: a leap year has 366 days, so March 1st has Day_of_Year=60 in a normal year but Day_of_Year=61 in a leap year. As a result, the final plot will not be correct, because every day after February in a leap year is shifted by one.
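Steps 4 and 5 can be sketched without a lambda, assuming the frame is indexed by a DatetimeIndex. The sample dates and the names `leap_day` and `after_feb_in_leap_year` are hypothetical. The idea is to drop February 29th, then subtract one from the day number of every date after February in a leap year, so that March 1st is day 60 in every year.

```python
import numpy as np
import pandas as pd

# Made-up dates covering a normal year and the leap year 2008
idx = pd.to_datetime(['2007-03-01', '2008-02-28', '2008-02-29',
                      '2008-03-01', '2008-12-31'])
df = pd.DataFrame({'TMAX': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Step 4: remove leap days (February 29th)
leap_day = (df.index.month == 2) & (df.index.day == 29)
df = df[~leap_day].copy()

# Step 5: day of year, shifted back by one for dates after February in leap years
after_feb_in_leap_year = df.index.is_leap_year & (df.index.month > 2)
df['Day_of_Year'] = np.asarray(df.index.dayofyear) - after_feb_in_leap_year.astype(int)

print(df)
```

With this adjustment the column runs from 1 to 365 in every year, so leap and normal years line up on the same x-axis.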

I tried to apply a lambda function to renumber the days for that specific leap year, but it raises an error. The code and error are below.

df['2008']['Day_of_Year'] = df['2008']['Day_of_Year'].apply(lambda x: x for x in range(1, 366))

At this point, df.index holds date objects, which is why df['2008'] is used to select that year.

Error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
      4 #            
      5 
----> 6 df['2008']['Day_of_Year'] = df['2008']['Day_of_Year'].apply(lambda x: x for x in range(1, 366))

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2292             else:
   2293                 values = self.asobject
-> 2294                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2295 
   2296         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()

TypeError: 'generator' object is not callable
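
A likely reading of this error, sketched in plain Python without pandas: inside the call parentheses, `lambda x: x for x in range(1, 366)` is parsed as a generator expression that yields lambdas, so `apply` receives a generator instead of a function. The names `arg`, `is_generator`, `is_callable`, and `message` below are hypothetical.

```python
import types

# The same expression that was passed to apply(), wrapped in its own parentheses
arg = (lambda x: x for x in range(1, 366))

is_generator = isinstance(arg, types.GeneratorType)  # it is a generator object
is_callable = callable(arg)                           # and it cannot be called

# Calling it raises the same TypeError message seen in the traceback
try:
    arg(1)
    message = None
except TypeError as exc:
    message = str(exc)

print(is_generator, is_callable, message)
```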

Thanks in advance for any help.
