Reputation: 932
I have a question concerning reading bits and pieces of a csv file. When just reading the file using
pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0])
I get:
EUR 1Y 2Y 3Y
0 2013-09-25 0,198 0,307 0,485
1 2013-09-26 0,204 0,318 0,497
2 2013-09-27 0,204 0,306 0,487
3 2013-09-28 0,204 0,306 0,487
4 USD 1Y 2Y 3Y
5 2013-09-25 0,462 0,571 0,749
6 2013-09-26 0,468 0,582 0,761
7 2013-09-27 0,468 0,57 0,751
8 2013-09-28 0,468 0,57 0,751
As you can see, the data is arranged date-wise, and the data sets sit in chunks one after another (in this case the USD data comes straight after the EUR data). The currency label trips things up a bit, and everything ends up in one single DataFrame.
What I would like to have are two separate data frames, as
EUR 1Y 2Y 3Y
0 2013-09-25 0,198 0,307 0,485
1 2013-09-26 0,204 0,318 0,497
2 2013-09-27 0,204 0,306 0,487
3 2013-09-28 0,204 0,306 0,487
USD 1Y 2Y 3Y
0 2013-09-25 0,462 0,571 0,749
1 2013-09-26 0,468 0,582 0,761
2 2013-09-27 0,468 0,57 0,751
3 2013-09-28 0,468 0,57 0,751
That is, I would like to separate each currency data set from one another.
Any suggestions?
Upvotes: 3
Views: 2411
Reputation: 879103
Here is an alternative approach to the problem. It reads the csv into a single DataFrame and then uses a bit of data-wrangling to create a currency column:
currency 1Y 2Y 3Y
date
2013-09-25 EUR 0,198 0,307 0,485
2013-09-26 EUR 0,204 0,318 0,497
2013-09-27 EUR 0,204 0,306 0,487
2013-09-28 EUR 0,204 0,306 0,487
2013-09-25 USD 0,462 0,571 0,749
2013-09-26 USD 0,468 0,582 0,761
2013-09-27 USD 0,468 0,57 0,751
2013-09-28 USD 0,468 0,57 0,751
You could then "split" the DataFrame into smaller DataFrames according to the currency using groupby:
groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)
import numpy as np
import pandas as pd
df = pd.read_table('data', sep=';', na_values=[''], thousands='.', decimal=',',
                   names=['date', '1Y', '2Y', '3Y'])
mask = df['date'].str.contains(r'^\s*\D')   # 1
df['currency'] = (df['date']
                  .where(mask, np.nan)      # 2
                  .fillna(method='ffill'))  # 3
df = df.loc[~mask]                          # 4
print(df)
groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)
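If you want each currency's rows as a named DataFrame rather than iterating, GroupBy.get_group (or a dict comprehension over the groups) works too. A minimal self-contained sketch, where the tiny frame stands in for the wrangled df built above:

```python
import pandas as pd

# Stand-in for the wrangled DataFrame: dates as the index,
# a 'currency' column marking which chunk each row came from.
df = pd.DataFrame({
    'currency': ['EUR', 'EUR', 'USD', 'USD'],
    '1Y': [0.198, 0.204, 0.462, 0.468],
}, index=pd.to_datetime(['2013-09-25', '2013-09-26',
                         '2013-09-25', '2013-09-26']))

groups = df.groupby('currency')
eur = groups.get_group('EUR')   # just the EUR rows
usd = groups.get_group('USD')   # just the USD rows

# Or collect every chunk into a dict in one go:
frames = {key: grp for key, grp in groups}
```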
Use str.contains to find the values in df['date'] which begin with a non-digit. These values are presumed to be currencies, so the mask is True on those rows.
In [120]: mask
Out[120]:
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: date, dtype: bool
df['date'].where(mask, np.nan) returns a Series equal to df['date'] where the mask is True, and np.nan otherwise. Forward-fill the NaNs with the currency values:
In [123]: df['date'].where(mask, np.nan).fillna(method='ffill')
Out[123]:
0 EUR
1 EUR
2 EUR
3 EUR
4 EUR
5 USD
6 USD
7 USD
8 USD
9 USD
Name: date, dtype: object
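As an aside, recent pandas versions deprecate fillna(method='ffill') in favour of Series.ffill(). A minimal self-contained sketch of the same wrangling with that spelling (the toy frame here stands in for the raw data read from the file):

```python
import numpy as np
import pandas as pd

# Toy version of the raw frame: currency header rows mixed in with data rows.
df = pd.DataFrame({'date': ['EUR', '2013-09-25', '2013-09-26',
                            'USD', '2013-09-25', '2013-09-26'],
                   '1Y': [np.nan, 0.198, 0.204, np.nan, 0.462, 0.468]})

mask = df['date'].str.contains(r'^\s*\D')    # True on the currency header rows
df['currency'] = df['date'].where(mask).ffill()  # where() fills False rows with NaN; ffill propagates
df = df.loc[~mask]                               # drop the header rows themselves
```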
Finally, df.loc[~mask] selects the rows where the mask is False, thus removing the currency header rows.
Upvotes: 3
Reputation: 393893
Use the nrows and skiprows parameters of read_csv.
So for the first DataFrame, read just the first 4 rows:
eur = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], nrows=4)
and for the second DataFrame, skip the first 5 rows:
usd = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], skiprows=5)
This should work.
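The two calls can be checked end-to-end on an in-memory copy of the file. Note the row counts below are adjusted for this shortened sample (2 data rows per chunk instead of 4, so nrows=2 and skiprows=3):

```python
import io
import pandas as pd

# In-memory stand-in for the semicolon-separated file from the question,
# shortened to 2 data rows per currency chunk.
data = io.StringIO(
    "EUR;1Y;2Y;3Y\n"
    "2013-09-25;0,198;0,307;0,485\n"
    "2013-09-26;0,204;0,318;0,497\n"
    "USD;1Y;2Y;3Y\n"
    "2013-09-25;0,462;0,571;0,749\n"
    "2013-09-26;0,468;0,582;0,761\n"
)

# First chunk: the EUR header plus the next 2 data rows.
eur = pd.read_csv(data, sep=';', decimal=',', parse_dates=[0], nrows=2)
data.seek(0)  # rewind before re-reading
# Second chunk: skip the 3 EUR lines so the USD row becomes the header.
usd = pd.read_csv(data, sep=';', decimal=',', parse_dates=[0], skiprows=3)
```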
Upvotes: 1