Reputation: 16185
I have time series data that looks like:
1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022
What I want is a list of years,
years = list['1998'.'1999','2000','2001']
So I can use that list to know what years I can query against in that dataframe. Not all dataframes will have the same years in it.
data = pd.read_csv(str(inFileName), index_col=0, parse_dates=True, header=None)
#data.iloc[:, 0]
print(pd.DatetimeIndex(data.iloc[:, 0]).year)
#print(data.iloc[:, 0])
#years = list(data.index)
#print(years)
for x in years:
I am trying so many things, but not succeeding. Can someone explain to me how to solve a problem like this?
Edit 1: After some advice, I am doing this:
data = pd.read_csv(str(inFileName), parse_dates=[0], header=None)
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
print(year_list)
for x in year_list:
newDF = data[x]
newDF.head()
print(newDF.head(5))
and I get the list: [2017, 2018, 2019]
but I cannot create a new dataframe from the list. I want to create a new dataframe for each value in the list. I get errors:
[2017, 2018, 2019]
Traceback (most recent call last):
File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 2017
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./massageSM.py", line 123, in <module>
main(sys.argv[1:])
File "./massageSM.py", line 33, in main
newDF = data[x]
File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 2017
I am using this:
data = pd.read_csv("RHE.SM", parse_dates=[0], header=None)
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
print(year_list)
for x in year_list:
df = pd.DataFrame({'years':year_list})
print(df.head(5))
and it produces output:
[2017, 2018, 2019]
years
0 2017
1 2018
2 2019
years
0 2017
1 2018
2 2019
years
0 2017
1 2018
2 2019
but what I want is to create: dataframe with just 2017 dataframe with just 2018 dataframe with just 2019
but I can't hard code this because other files wont contain the same years. I need to make a list of what years are available and iterate through it.
I have also tried:
data = pd.read_csv("RHE.SM", header=None, parse_dates=[0])
year_list = data[0].dt.year.unique().tolist()
print(year_list)
data.index = pd.DatetimeIndex(data[0])
print(type(data.index))
print(data.index)
for x in year_list:
print(x)
newDF = data[x]
#newDF.head()
#print(newDF.head(5))
I get the following output, which starts good but then I get an error creating the newDF.
[2017, 2018, 2019]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
DatetimeIndex(['2017-10-02 10:15:00', '2017-10-02 10:30:00',
'2017-10-02 10:45:00', '2017-10-02 11:00:00',
'2017-10-02 11:15:00', '2017-10-02 11:30:00',
'2017-10-02 11:45:00', '2017-10-02 12:00:00',
'2017-10-02 12:15:00', '2017-10-02 12:30:00',
...
'2019-01-03 14:45:00', '2019-01-03 15:00:00',
'2019-01-03 15:15:00', '2019-01-03 15:30:00',
'2019-01-03 15:45:00', '2019-01-03 16:00:00',
'2019-01-03 16:30:00', '2019-01-03 16:45:00',
'2019-01-03 17:15:00', '2019-01-03 18:30:00'],
dtype='datetime64[ns]', name=0, length=8685, freq=None)
2017
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3077 try:
-> 3078 return self._engine.get_loc(key)
3079 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 2017
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-19-f31493ccbf2a> in <module>
9 for x in year_list:
10 print(x)
---> 11 newDF = data[x]
12 #newDF.head()
13
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2487 res = cache.get(item)
2488 if res is None:
-> 2489 values = self._data.get(item)
2490 res = self._box_item_values(item, values)
2491 cache[item] = res
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3078 return self._engine.get_loc(key)
3079 except KeyError:
-> 3080 return self._engine.get_loc(self._maybe_cast_indexer(key))
3081
3082 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 2017
Upvotes: 4
Views: 4295
Reputation: 372
I have not tested this but I think it will work for you.
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
It first converts the first column to a DateTime format. Then it creates a new column with only the year component of each DateTime. Finally, it will output a list of every unique value in that column.
If you also want to convert the resulting list to a new dataframe simply add this line after:
df = pd.DataFrame({'years':year_list})
edit If you want to convert each individual item in the list to a new dataframe you could add this instead:
df = []
for x in year_list:
df.append(pd.DataFrame({'years':[x]}))
Upvotes: 2
Reputation: 5460
If you want to separate a dataframe into individual dataframes by year, you can do the following:
dfs = {
year: sub_df.drop(columns=["year"])
for year, sub_df in data.assign(year=lambda df: df[0].dt.year)\
.groupby("year")
}
Out:
{1998: 0 1 2 3 4 5
0 1998-01-02 09:30:00 0.4298 0.4337 0.4258 0.4317 6426369,
1999: 0 1 2 3 4 5
1 1999-01-02 09:45:00 0.4317 0.4337 0.4258 0.4298 10589080,
2000: 0 1 2 3 4 5
2 2000-01-02 10:00:00 0.4298 0.4337 0.4278 0.4337 9507980,
2001: 0 1 2 3 4 5
3 2001-01-02 10:15:00 0.4337 0.4416 0.4298 0.4416 13639022}
If you want to iterate through and write the individual dfs
to a separate CSV, you can do the following:
for year, df in dfs.items():
filename = "base_name_{}.csv".format(year)
df.to_csv(filename, index=False)
In principle, you'd want a basename based off the original file name.
Upvotes: 1
Reputation: 2329
The easiest thing in your case would be:
data = pd.read_csv(inFileName, header=None, parse_dates=[0])
data[0].dt.year.unique().tolist()
This makes use of the datetime accessor which is fast and vectorized
Upvotes: 0
Reputation: 863
First, You need to make sure that You are extracting years from datetime
type. Assuming that You know the name of Your column, where the dates are stored, You do this:
df['datetime'] = pd.to_datetime(df['datetime'])
df['year'] = df['datetime'].apply(lambda x: x.year)
If the dates are in index, You do the following:
df['datetime'] = pd.to_datetime(df.reset_index()['index'])
df['datetime'] = pd.to_datetime(df['datetime'])
df['year'] = df['datetime'].apply(lambda x: x.year)
The first line takes values from index and puts them into a column named 'index' by default. The second one converts data into datetime
format.
After this is done, You extract unique years:
years = df['year'].unique().tolist()
Upvotes: 0