Reputation: 23
I have this assignment, and I'm not asking for it to be solved for me. I want to solve it in the most Pythonic way: what is the point of programming with 100 for loops when techniques like vectorization exist? I'm also stuck at a certain point and don't know why my code isn't working.
The Task: My dataset is a subset of the data from The National Centers for Environmental Information, namely daily climate records from thousands of land stations. A sample is shown below. I'm supposed to clean the data and plot the temperature against each of the 365 days of the year.
df
ID Date Element Data_Value
0 USW00094889 2014-11-12 TMAX 2.2
1 USC00208972 2009-04-29 TMIN 5.6
2 USC00200032 2008-05-26 TMAX 27.8
3 USC00205563 2005-11-11 TMAX 13.9
4 USC00200230 2014-02-27 TMAX -10.6
5 USW00014833 2010-10-01 TMAX 19.4
6 USC00207308 2010-06-29 TMIN 14.4
7 USC00203712 2005-10-04 TMAX 28.9
8 USW00004848 2007-12-14 TMIN -1.6
9 USC00200220 2011-04-21 TMAX 7.2
df.shape
(165085, 4)
My Strategy
1- Split the DataFrame into two: one for Element='TMAX' and one for Element='TMIN', because I couldn't find a way to group by 'Date' and 'Element' and get each result in a separate column with one vectorized command.
2- Group by 'Date' and aggregate 'Data_Value' with max for the TMAX frame and with min for the TMIN frame.
3- Merge both DataFrames with how='outer', left_index=True and right_index=True.
< 1st checkpoint: is there a single command that is considered Pythonic and professional for this task instead of these 3 steps? >
4- Remove leap days.
5- Add a 'Day_of_Year' column holding the day-of-year number. Here I'm stuck: a leap year has 366 days, so March 1st has Day_of_Year=60 in a normal year but Day_of_Year=61 in a leap year. As a result, the final plot will be incorrect, because every day after February is shifted by one in a leap year.
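Regarding the 1st checkpoint, steps 1 to 3 can probably be collapsed into one groupby followed by an unstack. The following is only a sketch on a small made-up sample with the same column names as the question's df; the name daily is an assumption introduced here for illustration.

```python
import pandas as pd

# Made-up sample in the question's column layout (not the real dataset).
df = pd.DataFrame({
    "ID": ["A", "B", "A", "B", "A", "A", "B"],
    "Date": ["2008-03-01", "2008-03-01", "2008-03-01", "2008-03-01",
             "2008-03-02", "2008-03-02", "2008-03-02"],
    "Element": ["TMAX", "TMAX", "TMIN", "TMIN", "TMAX", "TMIN", "TMIN"],
    "Data_Value": [10.0, 12.5, -1.0, 0.5, 8.0, -3.0, -2.0],
})
df["Date"] = pd.to_datetime(df["Date"])

# Group by date and element, compute both aggregates, then move the
# element level into the columns: columns become (aggregate, element) pairs.
stats = (
    df.groupby(["Date", "Element"])["Data_Value"]
      .agg(["min", "max"])
      .unstack("Element")
)

# Keep the daily maximum of TMAX and the daily minimum of TMIN.
daily = pd.DataFrame({
    "TMAX": stats[("max", "TMAX")],
    "TMIN": stats[("min", "TMIN")],
})
```

A date that has readings for only one element ends up with NaN in the other column, which is the same behaviour an outer merge would give.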
I tried to apply a lambda function to change the values for this specific leap year, but it raises an error. The code and the error are below.
df['2008']['Day_of_Year'] = df['2008']['Day_of_Year'].apply(lambda x: x for x in range(1, 366))
At this point, df.index holds datetime objects, which is why df['2008'] is used to select the rows for that year.
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-47-f60e8731fc55> in <module>()
4 #
5
----> 6 df['2008']['Day_of_Year'] = df['2008']['Day_of_Year'].apply(lambda x: x for x in range(1, 366))
/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()
TypeError: 'generator' object is not callable
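The error message points at how Python parses the argument: `lambda x: x for x in range(1, 366)` written as the only argument of a call is a generator expression that yields lambdas, so apply receives a generator instead of a function. The sketch below shows this with a hypothetical helper named identity, then one possible way to write the update with a real callable and a single .loc assignment. The small frame and its values are made up for illustration, and it assumes 29 February has already been dropped.

```python
import types
import pandas as pd

def identity(arg):
    """Hypothetical helper that returns its single argument unchanged."""
    return arg

# Same argument as in the failing call: Python reads it as a generator
# expression producing lambdas, not as one lambda.
received = identity(lambda x: x for x in range(1, 366))
is_generator = isinstance(received, types.GeneratorType)  # True

# Made-up frame: 2008 is a leap year, 29 February is assumed already removed.
idx = pd.to_datetime(["2008-02-28", "2008-03-01", "2008-12-31", "2009-03-01"])
df = pd.DataFrame({"Day_of_Year": [59, 61, 366, 60]}, index=idx)

# apply needs exactly one callable; assigning through one .loc call
# avoids the chained df[...][...] assignment, which may write to a copy.
rows_2008 = df.index.year == 2008
df.loc[rows_2008, "Day_of_Year"] = df.loc[rows_2008, "Day_of_Year"].apply(
    lambda d: d - 1 if d > 59 else d
)
```

Note that selecting rows with df['2008'] depends on the pandas version; to my knowledge newer releases expect df.loc['2008'] for this kind of string-based row selection.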
I'd appreciate any help. Thanks in advance.
Upvotes: 0
Views: 47
Reputation: 1
You can convert your "Date" column to a timestamp:
df["Date"] = pd.to_datetime(df["Date"])
Then you can set that same column as the index:
df = df.set_index("Date")
And finally plot the "Data_Value":
df["Data_Value"].plot()
If you want to plot TMAX and TMIN separately, then:
df["Data_Value"][df["Element"] == "TMAX"].plot()
df["Data_Value"][df["Element"] == "TMIN"].plot()
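The plots above use the calendar date on the x-axis. To plot against the day of the year as the question asks, one possible approach is to drop 29 February and then shift every later day in a leap year back by one, all without loops. This is a sketch on made-up readings; the names leap_day, after_feb_in_leap_year and record_high are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Made-up readings in the question's column layout (not the real dataset).
df = pd.DataFrame({
    "Date": ["2007-03-01", "2008-02-29", "2008-03-01", "2008-03-01", "2008-12-31"],
    "Element": ["TMAX", "TMAX", "TMAX", "TMAX", "TMAX"],
    "Data_Value": [4.0, 9.0, 6.0, 7.5, -2.0],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

# Drop every 29 February.
leap_day = np.asarray((df.index.month == 2) & (df.index.day == 29))
df = df[~leap_day].copy()

# Day of year on a 365-day calendar: in leap years, days after February
# are one too high, so subtract 1 for exactly those rows.
after_feb_in_leap_year = (
    np.asarray(df.index.is_leap_year) & np.asarray(df.index.month > 2)
)
df["Day_of_Year"] = (
    np.asarray(df.index.dayofyear) - after_feb_in_leap_year.astype(int)
)

# Highest TMAX seen on each calendar day across all years.
record_high = (
    df[df["Element"] == "TMAX"].groupby("Day_of_Year")["Data_Value"].max()
)
```

Calling record_high.plot() would then draw one value per day of the year, and the same pattern with min() on the TMIN rows gives the lower curve.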
Upvotes: 0