Reputation: 792
I am trying to find, and count, data in an HDF5 file. The file has a data column called IAS. The strange thing is that I get completely different answers doing this in two different ways that I think should be equivalent and should give identical results. Here is version one, where I test for > 120 inside the loop:
import pandas as pd
import numpy as np

store = pd.HDFStore('AllData.h5', 'r')
a = store.select('mydata', iterator=True, chunksize=50000)
bins = np.arange(0.00, 1.5, .02)
# Fill in a blank counting Series:
counts_total = pd.value_counts(pd.cut([], bins, include_lowest=True))
for chunk in a:
    counts = pd.value_counts(pd.cut(abs(chunk[chunk['IAS'] > 120.0]['LatAc']), bins, include_lowest=True))
    counts_total = counts + counts_total
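(Aside: the accumulation step works because the Series returned by value_counts on a cut result carries every bin as a label, so the addition aligns bin-by-bin. A self-contained toy illustration with made-up values; I use Series(...).value_counts() here, which is equivalent to pd.value_counts:)

```python
import pandas as pd
import numpy as np

bins = np.arange(0.00, 1.5, .02)

# Empty template: every bin is present with a count of zero
empty = pd.Series(pd.cut([], bins, include_lowest=True)).value_counts()

# Two "chunks" of made-up |LatAc| values
c1 = pd.Series(pd.cut([0.05, 0.05, 0.30], bins, include_lowest=True)).value_counts()
c2 = pd.Series(pd.cut([0.05, 0.90], bins, include_lowest=True)).value_counts()

# Addition aligns on the interval labels, so bins accumulate correctly
total = empty + c1 + c2
print(total.sum())  # 5 observations in total
```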
In this second version, I use a where clause in the select to find the rows I want, then count them inside the loop. It runs MUCH faster, but seems to return only a single chunk's worth of data. I am not sure which one is correct, or why they give different answers.
import pandas as pd
import numpy as np

store = pd.HDFStore('AllData.h5', 'r')
a = store.select('mydata', iterator=True, chunksize=50000, where='IAS > 120.0')
bins = np.arange(0.00, 1.5, .02)
# Fill in a blank counting Series:
counts_total2 = pd.value_counts(pd.cut([], bins, include_lowest=True))
for chunk in a:
    counts = pd.value_counts(pd.cut(abs(chunk['LatAc']), bins, include_lowest=True))
    counts_total2 = counts + counts_total2
If I just look at the number of rows returned, the second version returns all of its rows in a single chunk (the data file has 2.8 million rows).
Is there something wrong with the second version? I think I am using "where" correctly, but it is not giving the expected results.
And, just to clarify my main goal: I am trying to bin my LatAc column data in the cases where IAS exceeds 120, and to ignore the rest of the data.
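To make that goal concrete, here is the filter-then-bin logic on a tiny in-memory DataFrame (the column names match my data, but the values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy stand-in for one chunk of the real data
df = pd.DataFrame({'IAS': [100.0, 125.0, 130.0, 90.0, 150.0],
                   'LatAc': [0.05, -0.10, 0.30, 1.20, 0.55]})

bins = np.arange(0.00, 1.5, .02)
# Keep only rows where IAS exceeds 120, then bin |LatAc|
filtered = df.loc[df['IAS'] > 120.0, 'LatAc'].abs()
counts = pd.cut(filtered, bins, include_lowest=True).value_counts()
print(counts.sum())  # 3 rows survive the IAS filter
```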
To add a bit more information. I am using PyTables 3.2.1 (I had seen the warning on the Pandas IO Tools page about an indexing bug in prior versions).
The data that I am reading is made up of a number of csv files that I parsed and appended to an HDF5 file. It seems as though the second approach, using "where", only returns data from the last few appended files, while the first approach returns much more data.
After seeing this bug: https://github.com/pydata/pandas/issues/5913 I wondered if I had run into something similar, although in my case I did not provide expectedrows. So, I decided to try running ptrepack anyway, and now I am getting an error from that:
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/ptrepack --chunkshape=auto --propindexes --keep-source-filters --complib blosc --complevel 7 /Volumes/Untitled/AllData.h5 /Volumes/Untitled/AllData_repack.h5
Problems doing the copy from '/Volumes/Untitled/AllData.h5:/ (RootGroup) ''' to '/Volumes/Untitled/AllData_repack.h5:/ (RootGroup) '''
The error was --> <class 'tables.exceptions.HDF5ExtError'>: HDF5 error back trace
File "H5F.c", line 522, in H5Fcreate
unable to create file
File "H5Fint.c", line 1022, in H5F_open
unable to truncate a file which is already open
End of HDF5 error back trace
Unable to open/create file '/Volumes/Untitled/pytables-powkl3.tmp'
The destination file looks like:
/Volumes/Untitled/AllFlightLogs_repack.h5 (File) ''
Last modif.: 'Wed Aug 19 09:37:28 2015'
Object Tree:
/ (RootGroup) ''
/mydata (Group) ''
/mydata/table (Table(2772065,), shuffle, blosc(7)) ''
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/ptrepack", line 9, in <module>
load_entry_point('tables==3.2.1', 'console_scripts', 'ptrepack')()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 525, in main
use_hardlinks=True)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 251, in copy_children
raise RuntimeError("Please check that the node names are not "
RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that root_uep is not fooling you.
So, it seems that something is wrong with the HDF5 file, but I am not sure what this error means or how to correct it.
To answer questions from the comments: store.mydata.info() returns:
Int64Index: 2772065 entries, 0 to 1640
Data columns (total 63 columns):
DateTimeGMT datetime64[ns]
AtvWpt object
Latitude float64
Longitude float64
AltB float64
BaroA float64
AltMSL float64
OAT float64
IAS float64
GndSpd float64
VSpd float64
Pitch float64
Roll float64
LatAc float64
NormAc float64
HDG float64
TRK float64
volt1 float64
volt2 float64
amp1 float64
amp2 float64
FQtyL float64
FQtyR float64
E1_FFlow float64
E1_FPres float64
E1_OilT float64
E1_OilP float64
E1_Torq float64
E1_NP float64
E1_NG float64
E1_ITT float64
E2_Torq float64
E2_NP float64
E2_NG float64
E2_ITT float64
AltGPS float64
TAS float64
HSIS object
CRS float64
NAV1 float64
NAV2 float64
COM1 float64
COM2 float64
HCDI float64
VCDI float64
WndSpd float64
WndDr float64
WptDst float64
WptBrg float64
MagVar float64
AfcsOn float64
RollM object
PitchM object
RollC float64
PichC float64
VSpdG float64
GPSfix object
HAL float64
VAL float64
HPLwas float64
HPLfd float64
VPLwas float64
SrcFile object
dtypes: datetime64[ns](1), float64(56), object(6)
memory usage: 1.3+ GB
I generated the file like this:
for file in files:
    print(file + " Num: " + str(file_num) + " of: " + str(len(files)))
    file_num = file_num + 1
    in_pd = read_file(file)
    head, tail = path.split(file)
    in_pd["SrcFile"] = tail
    in_pd.to_hdf('AllData.h5', 'mydata', mode='a', append=True, complib='blosc',
                 complevel=7, data_columns=Search_cols, min_itemsize={'SrcFile': 30})
where the function read_file just reads a csv, does some parsing and returns a DataFrame.
Finally, ptdump -av returns this:
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/mydata (Group) ''
/mydata._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['DateTimeGMT', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'TAS', 'SrcFile'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
metadata := [],
nan_rep := 'nan',
non_index_axes := [(1, ['DateTimeGMT', 'AtvWpt', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'OAT', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'TRK', 'volt1', 'volt2', 'amp1', 'amp2', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'E2_Torq', 'E2_NP', 'E2_NG', 'E2_ITT', 'AltGPS', 'TAS', 'HSIS', 'CRS', 'NAV1', 'NAV2', 'COM1', 'COM2', 'HCDI', 'VCDI', 'WndSpd', 'WndDr', 'WptDst', 'WptBrg', 'MagVar', 'AfcsOn', 'RollM', 'PitchM', 'RollC', 'PichC', 'VSpdG', 'GPSfix', 'HAL', 'VAL', 'HPLwas', 'HPLfd', 'VPLwas', 'SrcFile'])],
pandas_type := 'frame_table',
pandas_version := '0.15.2',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'DateTimeGMT', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'TAS', 'SrcFile']]
/mydata/table (Table(2772065,), shuffle, blosc(7)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(32,), dflt=0.0, pos=1),
"values_block_1": StringCol(itemsize=6, shape=(5,), dflt='', pos=2),
"DateTimeGMT": Int64Col(shape=(), dflt=0, pos=3),
"Latitude": Float64Col(shape=(), dflt=0.0, pos=4),
"Longitude": Float64Col(shape=(), dflt=0.0, pos=5),
"AltB": Float64Col(shape=(), dflt=0.0, pos=6),
"BaroA": Float64Col(shape=(), dflt=0.0, pos=7),
"AltMSL": Float64Col(shape=(), dflt=0.0, pos=8),
"IAS": Float64Col(shape=(), dflt=0.0, pos=9),
"GndSpd": Float64Col(shape=(), dflt=0.0, pos=10),
"VSpd": Float64Col(shape=(), dflt=0.0, pos=11),
"Pitch": Float64Col(shape=(), dflt=0.0, pos=12),
"Roll": Float64Col(shape=(), dflt=0.0, pos=13),
"LatAc": Float64Col(shape=(), dflt=0.0, pos=14),
"NormAc": Float64Col(shape=(), dflt=0.0, pos=15),
"HDG": Float64Col(shape=(), dflt=0.0, pos=16),
"FQtyL": Float64Col(shape=(), dflt=0.0, pos=17),
"FQtyR": Float64Col(shape=(), dflt=0.0, pos=18),
"E1_FFlow": Float64Col(shape=(), dflt=0.0, pos=19),
"E1_FPres": Float64Col(shape=(), dflt=0.0, pos=20),
"E1_OilT": Float64Col(shape=(), dflt=0.0, pos=21),
"E1_OilP": Float64Col(shape=(), dflt=0.0, pos=22),
"E1_Torq": Float64Col(shape=(), dflt=0.0, pos=23),
"E1_NP": Float64Col(shape=(), dflt=0.0, pos=24),
"E1_NG": Float64Col(shape=(), dflt=0.0, pos=25),
"E1_ITT": Float64Col(shape=(), dflt=0.0, pos=26),
"TAS": Float64Col(shape=(), dflt=0.0, pos=27),
"SrcFile": StringCol(itemsize=30, shape=(), dflt='', pos=28)}
byteorder := 'little'
chunkshape := (125,)
autoindex := True
colindexes := {
"E1_OilT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_NG": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"NormAc": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Pitch": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_Torq": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_FPres": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"HDG": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Longitude": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Latitude": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Roll": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"IAS": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_FFlow": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_ITT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"FQtyL": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"LatAc": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"TAS": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"AltMSL": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"BaroA": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"SrcFile": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"AltB": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"DateTimeGMT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_OilP": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_NP": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"VSpd": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"FQtyR": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"GndSpd": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/flights/table._v_attrs (AttributeSet), 147 attributes:
[AltB_dtype := 'float64',
AltB_kind := ['AltB'],
AltB_meta := None,
AltMSL_dtype := 'float64',
AltMSL_kind := ['AltMSL'],
AltMSL_meta := None,
BaroA_dtype := 'float64',
BaroA_kind := ['BaroA'],
BaroA_meta := None,
CLASS := 'TABLE',
DateTimeGMT_dtype := 'datetime64',
DateTimeGMT_kind := ['DateTimeGMT'],
DateTimeGMT_meta := None,
E1_FFlow_dtype := 'float64',
E1_FFlow_kind := ['E1_FFlow'],
E1_FFlow_meta := None,
E1_FPres_dtype := 'float64',
E1_FPres_kind := ['E1_FPres'],
E1_FPres_meta := None,
E1_ITT_dtype := 'float64',
E1_ITT_kind := ['E1_ITT'],
E1_ITT_meta := None,
E1_NG_dtype := 'float64',
E1_NG_kind := ['E1_NG'],
E1_NG_meta := None,
E1_NP_dtype := 'float64',
E1_NP_kind := ['E1_NP'],
E1_NP_meta := None,
E1_OilP_dtype := 'float64',
E1_OilP_kind := ['E1_OilP'],
E1_OilP_meta := None,
E1_OilT_dtype := 'float64',
E1_OilT_kind := ['E1_OilT'],
E1_OilT_meta := None,
E1_Torq_dtype := 'float64',
E1_Torq_kind := ['E1_Torq'],
E1_Torq_meta := None,
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_10_FILL := 0.0,
FIELD_10_NAME := 'GndSpd',
FIELD_11_FILL := 0.0,
FIELD_11_NAME := 'VSpd',
FIELD_12_FILL := 0.0,
FIELD_12_NAME := 'Pitch',
FIELD_13_FILL := 0.0,
FIELD_13_NAME := 'Roll',
FIELD_14_FILL := 0.0,
FIELD_14_NAME := 'LatAc',
FIELD_15_FILL := 0.0,
FIELD_15_NAME := 'NormAc',
FIELD_16_FILL := 0.0,
FIELD_16_NAME := 'HDG',
FIELD_17_FILL := 0.0,
FIELD_17_NAME := 'FQtyL',
FIELD_18_FILL := 0.0,
FIELD_18_NAME := 'FQtyR',
FIELD_19_FILL := 0.0,
FIELD_19_NAME := 'E1_FFlow',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_20_FILL := 0.0,
FIELD_20_NAME := 'E1_FPres',
FIELD_21_FILL := 0.0,
FIELD_21_NAME := 'E1_OilT',
FIELD_22_FILL := 0.0,
FIELD_22_NAME := 'E1_OilP',
FIELD_23_FILL := 0.0,
FIELD_23_NAME := 'E1_Torq',
FIELD_24_FILL := 0.0,
FIELD_24_NAME := 'E1_NP',
FIELD_25_FILL := 0.0,
FIELD_25_NAME := 'E1_NG',
FIELD_26_FILL := 0.0,
FIELD_26_NAME := 'E1_ITT',
FIELD_27_FILL := 0.0,
FIELD_27_NAME := 'TAS',
FIELD_28_FILL := '',
FIELD_28_NAME := 'SrcFile',
FIELD_2_FILL := '',
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := 0,
FIELD_3_NAME := 'DateTimeGMT',
FIELD_4_FILL := 0.0,
FIELD_4_NAME := 'Latitude',
FIELD_5_FILL := 0.0,
FIELD_5_NAME := 'Longitude',
FIELD_6_FILL := 0.0,
FIELD_6_NAME := 'AltB',
FIELD_7_FILL := 0.0,
FIELD_7_NAME := 'BaroA',
FIELD_8_FILL := 0.0,
FIELD_8_NAME := 'AltMSL',
FIELD_9_FILL := 0.0,
FIELD_9_NAME := 'IAS',
FQtyL_dtype := 'float64',
FQtyL_kind := ['FQtyL'],
FQtyL_meta := None,
FQtyR_dtype := 'float64',
FQtyR_kind := ['FQtyR'],
FQtyR_meta := None,
GndSpd_dtype := 'float64',
GndSpd_kind := ['GndSpd'],
GndSpd_meta := None,
HDG_dtype := 'float64',
HDG_kind := ['HDG'],
HDG_meta := None,
IAS_dtype := 'float64',
IAS_kind := ['IAS'],
IAS_meta := None,
LatAc_dtype := 'float64',
LatAc_kind := ['LatAc'],
LatAc_meta := None,
Latitude_dtype := 'float64',
Latitude_kind := ['Latitude'],
Latitude_meta := None,
Longitude_dtype := 'float64',
Longitude_kind := ['Longitude'],
Longitude_meta := None,
NROWS := 2772065,
NormAc_dtype := 'float64',
NormAc_kind := ['NormAc'],
NormAc_meta := None,
Pitch_dtype := 'float64',
Pitch_kind := ['Pitch'],
Pitch_meta := None,
Roll_dtype := 'float64',
Roll_kind := ['Roll'],
Roll_meta := None,
SrcFile_dtype := 'string240',
SrcFile_kind := ['SrcFile'],
SrcFile_meta := None,
TAS_dtype := 'float64',
TAS_kind := ['TAS'],
TAS_meta := None,
TITLE := '',
VERSION := '2.7',
VSpd_dtype := 'float64',
VSpd_kind := ['VSpd'],
VSpd_meta := None,
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['AfcsOn', 'AltGPS', 'COM1', 'COM2', 'CRS', 'E2_ITT', 'E2_NG', 'E2_NP', 'E2_Torq', 'HAL', 'HCDI', 'HPLfd', 'HPLwas', 'MagVar', 'NAV1', 'NAV2', 'OAT', 'PichC', 'RollC', 'TRK', 'VAL', 'VCDI', 'VPLwas', 'VSpdG', 'WndDr', 'WndSpd', 'WptBrg', 'WptDst', 'amp1', 'amp2', 'volt1', 'volt2'],
values_block_0_meta := None,
values_block_1_dtype := 'string48',
values_block_1_kind := ['AtvWpt', 'GPSfix', 'HSIS', 'PitchM', 'RollM'],
values_block_1_meta := None]
Upvotes: 2
Views: 366
Reputation: 792
Apparently, when appending, the index is not updated. I wasn't able to find this documented anywhere in pandas or PyTables.
So the problem was that the file did not have a correct index when I created it. If I instead skip index creation during the appends (index=False) and build the index only after the whole HDF5 file has been written, select works as expected. Creating the file this way allows a correct search:
for file in files:
    print(file + " Num: " + str(file_num) + " of: " + str(len(files)))
    file_num = file_num + 1
    in_pd = read_file(file)
    head, tail = path.split(file)
    in_pd["SrcFile"] = tail
    in_pd.to_hdf('AllData.h5', 'mydata', mode='a', append=True, complib='blosc',
                 complevel=7, index=False, data_columns=Search_cols,
                 min_itemsize={'SrcFile': 30})

store = pd.HDFStore('AllData.h5')
store.create_table_index('mydata', columns=Search_cols, optlevel=6, kind='medium')
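Here is a minimal end-to-end sketch of this append-without-index / index-afterwards pattern, using a small throwaway file and hypothetical stand-in columns in place of my real AllData.h5 and Search_cols (requires PyTables):

```python
import os
import tempfile
import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
cols = ['IAS', 'LatAc']  # hypothetical stand-in for Search_cols

# Append several "files" worth of data without building the index
for i in range(3):
    chunk = pd.DataFrame({'IAS': np.linspace(80, 160, 100),
                          'LatAc': np.linspace(-1, 1, 100)})
    chunk.to_hdf(path, 'mydata', mode='a', append=True, format='table',
                 index=False, data_columns=cols)

# Build the index once, after all appends, then query
with pd.HDFStore(path) as store:
    store.create_table_index('mydata', columns=cols, optlevel=6, kind='medium')
    selected = store.select('mydata', where='IAS > 120.0')

print(len(selected))  # every appended chunk contributes rows
```

The key detail is index=False on every append; create_table_index is then run once at the end so that where queries see the whole table.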
Upvotes: 1