edb500
edb500

Reputation: 323

Pandas Read_CSV incorrectly reads numbers

Method1

def getAndBuildDatafrmeFromCsvBasic(filename):
    colTypes = {'Open': 'float64', 'High': 'float64', 'Low': 'float64', 'Close': 'float64', 'Volume': 'float64'}
    dfEurUsd2017 = pd.read_csv(filename, delimiter=",", index_col='Gmt time', dtype=colTypes, parse_dates=['Gmt time'])
    return dfEurUsd2017

Method2

def getAndBuildDatafrmeFromCsv(filename):
    df = pd.read_csv(filename, header=None)
    df.columns = ['date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
    df.index = df['date']
    df = df[['Open', 'High', 'Low', 'Close', 'Volume']]
    return df

Results Method1

                        Open     High      Low    Close   Volume
Gmt time                                                        
2017-12-04 23:00:00  1.06672  1.06699  1.06636  1.06698  1889.56

Results Method2

                        Open     High      Low    Close   Volume
Gmt time                                                        
2017-12-04 23:00:00  1.18686  1.18699  1.18666  1.18682  2004.46

Why does method1 incorrectly parse the values for Open,High,Low,Close,Volume? Method2 produces the correct output. I'm concerned why both methods output completely different numerical values, even the volume is different. Yet the csv file is the same.

Rows from CSV

04.12.2017 23:00:00.000,1.18686,1.18699,1.18666,1.18682,2004.4599999999998
04.12.2017 23:30:00.000,1.18682,1.18706,1.18652,1.18681,1242.68
05.12.2017 00:00:00.000,1.18681,1.18691,1.18639,1.18653,2666.81
05.12.2017 00:30:00.000,1.18653,1.18726,1.18650,1.18709,3567.2400000000007
05.12.2017 01:00:00.000,1.18708,1.18750,1.18707,1.18738,3105.4699999999993
05.12.2017 01:30:00.000,1.18738,1.18744,1.18691,1.18732,3561.5
05.12.2017 02:00:00.000,1.18732,1.18766,1.18704,1.18740,2706.6400000000003

Upvotes: 1

Views: 950

Answers (1)

Anton vBR
Anton vBR

Reputation: 18916

I added dayfirst=True and your code works fine.

What pandas version are u using? And where would that falsy data come from?

import pandas as pd

data = '''\
Gmt time,Open,High,Low,Close,Volume
04.12.2017 23:00:00.000,1.18686,1.18699,1.18666,1.18682,2004.4599999999998
04.12.2017 23:30:00.000,1.18682,1.18706,1.18652,1.18681,1242.68
05.12.2017 00:00:00.000,1.18681,1.18691,1.18639,1.18653,2666.81
05.12.2017 00:30:00.000,1.18653,1.18726,1.18650,1.18709,3567.2400000000007
05.12.2017 01:00:00.000,1.18708,1.18750,1.18707,1.18738,3105.4699999999993
05.12.2017 01:30:00.000,1.18738,1.18744,1.18691,1.18732,3561.5
05.12.2017 02:00:00.000,1.18732,1.18766,1.18704,1.18740,2706.6400000000003
'''

with open('test.csv', 'w') as f:
    f.write(data)

def getAndBuildDatafrmeFromCsvBasic(filename):
    colTypes = {'Open': 'float64', 'High': 'float64', 'Low': 'float64', 'Close': 'float64', 'Volume': 'float64'}
    dfEurUsd2017 = pd.read_csv(filename, delimiter=",", index_col='Gmt time', dtype=colTypes, parse_dates=['Gmt time'], dayfirst=True)
    return dfEurUsd2017

print(getAndBuildDatafrmeFromCsvBasic('test.csv'))

Returns:

                        Open     High      Low    Close   Volume
Gmt time                                                        
2017-12-04 23:00:00  1.18686  1.18699  1.18666  1.18682  2004.46
2017-12-04 23:30:00  1.18682  1.18706  1.18652  1.18681  1242.68
2017-12-05 00:00:00  1.18681  1.18691  1.18639  1.18653  2666.81
2017-12-05 00:30:00  1.18653  1.18726  1.18650  1.18709  3567.24
2017-12-05 01:00:00  1.18708  1.18750  1.18707  1.18738  3105.47
2017-12-05 01:30:00  1.18738  1.18744  1.18691  1.18732  3561.50
2017-12-05 02:00:00  1.18732  1.18766  1.18704  1.18740  2706.64

Upvotes: 1

Related Questions