Reputation: 5097
I have some code below that imports some time series csv files into a dataframe and changes the column names of the dataframe to 'date' for the column with the dates of the time series and the other columns are set to the names of the files they have come from. All good so far. Now I want to read in data that is between two pre set dates. This is where I am having a problem. I cannot get the code to only return the dataframe from startDate to endDate and remove the other rows of data.
I have had various goes at this but I cannot get the filter to work. Please see the current version of my code below:
def getTimeseriesData4(DataPath,columnNum,startDate,endDate):
colNames = ['date']
path = DataPath
filePath = path, "*.csv"
allfiles = glob.glob(os.path.join(path, "*.csv"))
for fname in allfiles:
name = os.path.splitext(fname)[0]
name = os.path.split(name)[1]
colNames.append(name)
dataframes = [pd.read_csv(fname, header=None,usecols=[0,columnNum]) for fname in allfiles]
#this is the part where I am trying to filter out the data I do not need. So dataframes would only have data between the startDate and the endDate
dataframes = dataframes.set_index(['date'])
print(dataframes.loc[startDate:endDate])
timeseriesData = reduce(partial(pd.merge, on=0, how='outer'), dataframes)
timeseriesData.columns=colNames
return timeseriesData
below is a sample of the data I am importing
date BBG.BBG.AUDEUR.FX BBG.BBG.CADEUR.FX BBG.BBG.CHFEUR.FX \
0 01/01/2001 0.5932 0.7084 0.6588
1 02/01/2001 0.5893 0.7038 0.6576
2 03/01/2001 0.6000 0.7199 0.6610
3 04/01/2001 0.5972 0.7021 0.6563
4 05/01/2001 0.5973 0.6972 0.6532
5 08/01/2001 0.5987 0.7073 0.6562
6 09/01/2001 0.5972 0.7095 0.6565
7 10/01/2001 0.5923 0.7105 0.6548
8 11/01/2001 0.5888 0.7029 0.6512
9 12/01/2001 0.5861 0.7013 0.6494
10 15/01/2001 0.5870 0.7064 0.6492
11 16/01/2001 0.5892 0.7047 0.6497
12 17/01/2001 0.5912 0.7070 0.6507
13 18/01/2001 0.5920 0.7015 0.6544
14 19/01/2001 0.5953 0.7083 0.6535
so if I set the startDate to be '02/01/2001' and the endDate to be '05/01/2001'
the code would return:
date BBG.BBG.AUDEUR.FX BBG.BBG.CADEUR.FX BBG.BBG.CHFEUR.FX \
0 02/01/2001 0.5893 0.7038 0.6576
1 03/01/2001 0.6000 0.7199 0.6610
2 04/01/2001 0.5972 0.7021 0.6563
3 05/01/2001 0.5973 0.6972 0.6532
So instead of returning all the data imported from the CSV files, the code would return data between the startDate and the endDate.
Upvotes: 2
Views: 2977
Reputation: 394041
Convert the dtype
to datetime
using pd.to_datetime
:
In [98]:
df['date'] = pd.to_datetime(df['date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 4 columns):
date 15 non-null datetime64[ns]
BBG.BBG.AUDEUR.FX 15 non-null float64
BBG.BBG.CADEUR.FX 15 non-null float64
BBG.BBG.CHFEUR.FX 15 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 600.0 bytes
You can then pass your dates as filtering criteria to create a boolean mask:
In [97]:
df[(df['date'] >= '02/01/2001') & (df['date'] <= '05/01/2001')]
Out[97]:
date BBG.BBG.AUDEUR.FX BBG.BBG.CADEUR.FX BBG.BBG.CHFEUR.FX
1 2001-02-01 0.5893 0.7038 0.6576
2 2001-03-01 0.6000 0.7199 0.6610
3 2001-04-01 0.5972 0.7021 0.6563
4 2001-05-01 0.5973 0.6972 0.6532
Upvotes: 3