Reputation: 433
I've stumbled into a small issue with the pandas pd.read_csv function:
I've downloaded a very large amount of data in the form of csv.gzip files, and I'd rather keep them compressed on my computer because of the tremendous amount of space they take up.
To load them into Python I've been using the usual pd.read_csv function with the compression='gzip' argument. Pandas manages to read the csv with the correct number of columns and the correct index length, but the data itself is completely garbled:
tick = pd.read_csv(r"D:\Finance python\Data\EUR_USD\Tick\2015\1.csv.gz", compression='gzip')
tick.head()
Out[30]:
D Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
Would anyone have an idea of what I'm doing wrong when I try to read the file?
Pandas clearly recognizes that the data is in gzip form, but I have no idea why it doesn't manage to extract it correctly.
Thanks
The data that I'm trying to read: https://tickdata.fxcorporate.com/EURUSD/2015/1.csv.gz
Upvotes: 2
Views: 2988
Reputation: 1319
A quick look into the original csv file shows that it contains null characters (^@), which is why pandas cannot parse it correctly.
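If you want to confirm this from Python first, here is a minimal check (a sketch, assuming the file is named 1.csv.gz and sits in the working directory): decompress a small chunk and count the NUL bytes.
import gzip

# Decompress just the first kilobyte of the file (file name assumed here)
with gzip.open('1.csv.gz', 'rb') as f:
    chunk = f.read(1024)

print(chunk[:80])            # NUL bytes show up as \x00 in the output
print(chunk.count(b'\x00'))  # a non-zero count confirms the embedded null characters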
You can clean up those characters with a shell command:
gzip -dc 1.csv.gz | tr -d '\0' | gzip > 1_clean.csv.gz
gzip -dc decompresses the file to stdout
tr -d '\0' deletes the null characters
gzip compresses it back into a gzipped file
After that pandas should be able to read it correctly.
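For example, reading the cleaned file back (assuming it is named 1_clean.csv.gz, as in the command above):
import pandas as pd

# pandas infers gzip from the .gz extension, but compression='gzip' makes it explicit
tick = pd.read_csv('1_clean.csv.gz', compression='gzip')
print(tick.head())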
UPDATE
In case you don't have access to a shell, you can still use Python to do the trick, although it will be slower:
import gzip

# Decompress the original file into memory
with gzip.open('1.csv.gz', 'rb') as f:
    data = f.read()

# Strip the null characters and write the result back as a new gzipped file
with gzip.open('1_clean.csv.gz', 'wb') as f:
    f.write(data.decode('utf-8').replace('\x00', '').encode('utf-8'))
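If you would rather not write a second file at all, a variation of the same idea (a sketch, using the same decode/replace approach as above) is to clean the text in memory and hand it straight to pandas:
import gzip
import io
import pandas as pd

# Decompress, drop the null characters, and parse the cleaned text directly
with gzip.open('1.csv.gz', 'rb') as f:
    text = f.read().decode('utf-8').replace('\x00', '')

tick = pd.read_csv(io.StringIO(text))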
Upvotes: 1