gussilago
gussilago

Reputation: 932

Reading chunks of csv file in Python using pandas

I have a question concerning reading bits and pieces of a csv file. When just reading the file using

pd.read_csv(path,sep=';',na_values=[''],thousands='.',decimal=',',date_parser=[0])

I get:

     EUR     1Y     2Y     3Y
0  2013-09-25  0,198  0,307  0,485
1  2013-09-26  0,204  0,318  0,497
2  2013-09-27  0,204  0,306  0,487
3  2013-09-28  0,204  0,306  0,487
4         USD     1Y     2Y     3Y
5  2013-09-25  0,462  0,571  0,749
6  2013-09-26  0,468  0,582  0,761
7  2013-09-27  0,468   0,57  0,751
8  2013-09-28  0,468   0,57  0,751

As you can see, the data is arranged date-wise, and each data set is in chunks after one another (in this case the USD-data comes straight after the EUR-data). The currency-label tricks up things a bit, and the data becomes one single data frame.

What I would like to have are two separate data frames, as

     EUR     1Y     2Y     3Y
0  2013-09-25  0,198  0,307  0,485
1  2013-09-26  0,204  0,318  0,497
2  2013-09-27  0,204  0,306  0,487
3  2013-09-28  0,204  0,306  0,487

     USD     1Y     2Y     3Y
0  2013-09-25  0,462  0,571  0,749
1  2013-09-26  0,468  0,582  0,761
2  2013-09-27  0,468   0,57  0,751
3  2013-09-28  0,468   0,57  0,751

That is, I would like to separate each currency data set from one another.

Any suggestions?

Upvotes: 3

Views: 2411

Answers (2)

unutbu
unutbu

Reputation: 879103

Here is an alternative approach to the problem. It reads the csv into a single DataFrame and then uses a bit of data-wrangling to create a currency column:

           currency     1Y     2Y     3Y
date                                    
2013-09-25      EUR  0,198  0,307  0,485
2013-09-26      EUR  0,204  0,318  0,497
2013-09-27      EUR  0,204  0,306  0,487
2013-09-28      EUR  0,204  0,306  0,487
2013-09-25      USD  0,462  0,571  0,749
2013-09-26      USD  0,468  0,582  0,761
2013-09-27      USD  0,468   0,57  0,751
2013-09-28      USD  0,468   0,57  0,751

You could then "split" the DataFrame into smaller DataFrames according to the currency using groupby:

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)

import numpy as np
import pandas as pd

df = pd.read_table('data',sep=';',na_values=[''],thousands='.',decimal=',',
                   names=['date', '1Y', '2Y', '3Y'])
mask = df['date'].str.contains('^\s*\D')              # 1
df['currency'] = (df['date']
                  .where(mask, np.nan)                # 2
                  .fillna(method='ffill'))            # 3
df = df.loc[~mask]                                    # 4

print(df)    

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)

  1. Use str.contains to find values in df['date'] which begin with a non-digit. The values are presumed to be currencies. The mask is True on these rows.

    In [120]: mask
    Out[120]: 
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6    False
    7    False
    8    False
    9    False
    Name: date, dtype: bool
    
  2. df['date'].where(mask, np.nan) returns a Series, equal to df['date'] where the mask is True, and np.nan otherwise.
  3. Forward-fill in the nans with the currency values

    In [123]: df['date'].where(mask, np.nan).fillna(method='ffill')
    Out[123]: 
    0    EUR
    1    EUR
    2    EUR
    3    EUR
    4    EUR
    5    USD
    6    USD
    7    USD
    8    USD
    9    USD
    Name: date, dtype: object
    
  4. Select only those rows where the mask is False, thus removing the header rows.

Upvotes: 3

EdChum
EdChum

Reputation: 393893

Use the nrows and skiprows parameter to read_csv

So for first dataframe read just the first 4 rows:

eur = pd.read_csv(path,sep=';',na_values=[''],thousands='.',decimal=',',date_parser=[0], nrows=4)

and subsequent dataframe skip first 5 rows:

usd = pd.read_csv(path,sep=';',na_values=[''],thousands='.',decimal=',',date_parser=[0], skiprows=5)

should work

Upvotes: 1

Related Questions