Adam

Reputation: 792

Selecting from pandas.HDFStore gives different answers

I am trying to find, and count, data in an HDF5 file. The file has a data column called IAS. The strange thing is that I am getting completely different answers doing this two different ways that I think should be equivalent and should give identical results. Here is version one, where I test for IAS > 120 only inside the loop:

import pandas as pd
import numpy as np

store = pd.HDFStore('AllData.h5', 'r')

a = store.select('mydata', iterator=True, chunksize=50000)
bins = np.arange(0.00, 1.5, .02)
# Fill in a blank counting Series (every bin present with a zero count):
counts_total = pd.value_counts(pd.cut([], bins, include_lowest=True))
for chunk in a:
    counts = pd.value_counts(pd.cut(abs(chunk[chunk['IAS'] > 120.0]['LatAc']),
                                    bins, include_lowest=True))
    counts_total = counts + counts_total

In this second version, I instead use the select's where clause to find the rows I want, and then count them inside the loop. This version runs MUCH faster, but it seems to return only a single chunk. I am not sure which one is correct, or why they give different answers.

import pandas as pd
import numpy as np

store = pd.HDFStore('AllData.h5', 'r')

a = store.select('mydata', iterator=True, chunksize=50000, where='IAS > 120.0')
bins = np.arange(0.00, 1.5, .02)
# Fill in a blank counting Series (every bin present with a zero count):
counts_total2 = pd.value_counts(pd.cut([], bins, include_lowest=True))
for chunk in a:
    counts = pd.value_counts(pd.cut(abs(chunk['LatAc']),
                                    bins, include_lowest=True))
    counts_total2 = counts + counts_total2

If I just look at the number of rows returned, the second version returns all of its rows in a single chunk (the data file has 2.8 million rows).
Is there something wrong with this second version? I think I am using "where" correctly, but it is not giving the expected results. To clarify my main goal: I am trying to bin my LatAc column data in the cases where IAS exceeds 120, and to ignore the rest of the data. For additional background, I am using PyTables 3.2.1 (I had seen the warning on the pandas IO Tools page about an indexing bug in prior versions).
The data I am reading comes from a number of csv files that I parsed and appended to an HDF5 file. It seems as though the second approach, using "where", only returns data from the last few appended files, while the first approach returns much more data.
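As a sanity check, counting the matching rows through the where clause should agree with counting them through an in-Python boolean mask applied to every chunk. Here is a sketch of that comparison, assuming the store is opened read-only as above:

import pandas as pd

store = pd.HDFStore('AllData.h5', 'r')

# Count rows matched by the PyTables where clause:
n_where = 0
for chunk in store.select('mydata', where='IAS > 120.0',
                          iterator=True, chunksize=50000):
    n_where += len(chunk)

# Count the same rows with a boolean mask over every chunk:
n_mask = 0
for chunk in store.select('mydata', iterator=True, chunksize=50000):
    n_mask += (chunk['IAS'] > 120.0).sum()

print(n_where, n_mask)  # these should agree if the on-disk index is sound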

After seeing this bug: https://github.com/pydata/pandas/issues/5913 I wondered if I had run into something similar, although in my case I did not provide expectedrows. So I decided to try running ptrepack anyway, and now I am getting an error from it:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/ptrepack --chunkshape=auto --propindexes --keep-source-filters   --complib blosc --complevel 7 /Volumes/Untitled/AllData.h5 /Volumes/Untitled/AllData_repack.h5 
Problems doing the copy from '/Volumes/Untitled/AllData.h5:/ (RootGroup) ''' to '/Volumes/Untitled/AllData_repack.h5:/ (RootGroup) '''
The error was --> <class 'tables.exceptions.HDF5ExtError'>: HDF5 error back trace

  File "H5F.c", line 522, in H5Fcreate
    unable to create file
  File "H5Fint.c", line 1022, in H5F_open
    unable to truncate a file which is already open

End of HDF5 error back trace

Unable to open/create file '/Volumes/Untitled/pytables-powkl3.tmp'
The destination file looks like:
 /Volumes/Untitled/AllData_repack.h5 (File) ''
Last modif.: 'Wed Aug 19 09:37:28 2015'
Object Tree: 
/ (RootGroup) ''
/mydata (Group) ''
/mydata/table (Table(2772065,), shuffle, blosc(7)) ''

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/ptrepack", line 9, in <module>
    load_entry_point('tables==3.2.1', 'console_scripts', 'ptrepack')()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 525, in main
    use_hardlinks=True)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 251, in copy_children
    raise RuntimeError("Please check that the node names are not "
RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that root_uep is not fooling you.

So, it seems that something is wrong with the hdf5 file, but I am not sure what this error means, or how to correct it.
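One way to check whether the source file itself is still readable is to walk it directly with PyTables. A minimal sketch, using the same path as in the ptrepack command above:

import tables

# Sketch: open the source file read-only and list every Table node
with tables.open_file('/Volumes/Untitled/AllData.h5', mode='r') as f:
    for node in f.walk_nodes('/', classname='Table'):
        print(node._v_pathname, node.nrows)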

To answer questions from the comments: store.mydata.info() returns:

Int64Index: 2772065 entries, 0 to 1640
Data columns (total 63 columns):
DateTimeGMT    datetime64[ns]
AtvWpt         object
Latitude       float64
Longitude      float64
AltB           float64
BaroA          float64
AltMSL         float64
OAT            float64
IAS            float64
GndSpd         float64
VSpd           float64
Pitch          float64
Roll           float64
LatAc          float64
NormAc         float64
HDG            float64
TRK            float64
volt1          float64
volt2          float64
amp1           float64
amp2           float64
FQtyL          float64
FQtyR          float64
E1_FFlow       float64
E1_FPres       float64
E1_OilT        float64
E1_OilP        float64
E1_Torq        float64
E1_NP          float64
E1_NG          float64
E1_ITT         float64
E2_Torq        float64
E2_NP          float64
E2_NG          float64
E2_ITT         float64
AltGPS         float64
TAS            float64
HSIS           object
CRS            float64
NAV1           float64
NAV2           float64
COM1           float64
COM2           float64
HCDI           float64
VCDI           float64
WndSpd         float64
WndDr          float64
WptDst         float64
WptBrg         float64
MagVar         float64
AfcsOn         float64
RollM          object
PitchM         object
RollC          float64
PichC          float64
VSpdG          float64
GPSfix         object
HAL            float64
VAL            float64
HPLwas         float64
HPLfd          float64
VPLwas         float64
SrcFile        object
dtypes: datetime64[ns](1), float64(56), object(6)
memory usage: 1.3+ GB

I generated the file like this:

from os import path

import pandas as pd

file_num = 1
for file in files:
    print(file + " Num: " + str(file_num) + " of: " + str(len(files)))
    file_num = file_num + 1
    in_pd = read_file(file)
    head, tail = path.split(file)
    in_pd["SrcFile"] = tail  # tag each row with its source file name
    in_pd.to_hdf('AllData.h5', 'mydata', mode='a', append=True,
                 complib='blosc', complevel=7, data_columns=Search_cols,
                 min_itemsize={'SrcFile': 30})

where the function read_file just reads a csv, does some parsing and returns a DataFrame.
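A minimal sketch of what read_file amounts to (the real version does more parsing; treating DateTimeGMT as the only date column is an assumption here):

import pandas as pd

def read_file(fname):
    # Hypothetical minimal version; the real one does additional parsing.
    return pd.read_csv(fname, parse_dates=['DateTimeGMT'])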

Finally, ptdump -av returns this:

/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/mydata (Group) ''
  /mydata._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['DateTimeGMT', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'TAS', 'SrcFile'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    metadata := [],
    nan_rep := 'nan',
    non_index_axes := [(1, ['DateTimeGMT', 'AtvWpt', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'OAT', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'TRK', 'volt1', 'volt2', 'amp1', 'amp2', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'E2_Torq', 'E2_NP', 'E2_NG', 'E2_ITT', 'AltGPS', 'TAS', 'HSIS', 'CRS', 'NAV1', 'NAV2', 'COM1', 'COM2', 'HCDI', 'VCDI', 'WndSpd', 'WndDr', 'WptDst', 'WptBrg', 'MagVar', 'AfcsOn', 'RollM', 'PitchM', 'RollC', 'PichC', 'VSpdG', 'GPSfix', 'HAL', 'VAL', 'HPLwas', 'HPLfd', 'VPLwas', 'SrcFile'])],
    pandas_type := 'frame_table',
    pandas_version := '0.15.2',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'DateTimeGMT', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'TAS', 'SrcFile']]
/mydata/table (Table(2772065,), shuffle, blosc(7)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(32,), dflt=0.0, pos=1),
  "values_block_1": StringCol(itemsize=6, shape=(5,), dflt='', pos=2),
  "DateTimeGMT": Int64Col(shape=(), dflt=0, pos=3),
  "Latitude": Float64Col(shape=(), dflt=0.0, pos=4),
  "Longitude": Float64Col(shape=(), dflt=0.0, pos=5),
  "AltB": Float64Col(shape=(), dflt=0.0, pos=6),
  "BaroA": Float64Col(shape=(), dflt=0.0, pos=7),
  "AltMSL": Float64Col(shape=(), dflt=0.0, pos=8),
  "IAS": Float64Col(shape=(), dflt=0.0, pos=9),
  "GndSpd": Float64Col(shape=(), dflt=0.0, pos=10),
  "VSpd": Float64Col(shape=(), dflt=0.0, pos=11),
  "Pitch": Float64Col(shape=(), dflt=0.0, pos=12),
  "Roll": Float64Col(shape=(), dflt=0.0, pos=13),
  "LatAc": Float64Col(shape=(), dflt=0.0, pos=14),
  "NormAc": Float64Col(shape=(), dflt=0.0, pos=15),
  "HDG": Float64Col(shape=(), dflt=0.0, pos=16),
  "FQtyL": Float64Col(shape=(), dflt=0.0, pos=17),
  "FQtyR": Float64Col(shape=(), dflt=0.0, pos=18),
  "E1_FFlow": Float64Col(shape=(), dflt=0.0, pos=19),
  "E1_FPres": Float64Col(shape=(), dflt=0.0, pos=20),
  "E1_OilT": Float64Col(shape=(), dflt=0.0, pos=21),
  "E1_OilP": Float64Col(shape=(), dflt=0.0, pos=22),
  "E1_Torq": Float64Col(shape=(), dflt=0.0, pos=23),
  "E1_NP": Float64Col(shape=(), dflt=0.0, pos=24),
  "E1_NG": Float64Col(shape=(), dflt=0.0, pos=25),
  "E1_ITT": Float64Col(shape=(), dflt=0.0, pos=26),
  "TAS": Float64Col(shape=(), dflt=0.0, pos=27),
  "SrcFile": StringCol(itemsize=30, shape=(), dflt='', pos=28)}
  byteorder := 'little'
  chunkshape := (125,)
  autoindex := True
  colindexes := {
    "E1_OilT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_NG": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "NormAc": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "Pitch": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_Torq": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_FPres": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "HDG": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "Longitude": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "Latitude": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "Roll": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "IAS": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_FFlow": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_ITT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "FQtyL": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "LatAc": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "TAS": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "AltMSL": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "BaroA": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "SrcFile": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "AltB": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "DateTimeGMT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_OilP": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "E1_NP": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "VSpd": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "FQtyR": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "GndSpd": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /mydata/table._v_attrs (AttributeSet), 147 attributes:
   [AltB_dtype := 'float64',
    AltB_kind := ['AltB'],
    AltB_meta := None,
    AltMSL_dtype := 'float64',
    AltMSL_kind := ['AltMSL'],
    AltMSL_meta := None,
    BaroA_dtype := 'float64',
    BaroA_kind := ['BaroA'],
    BaroA_meta := None,
    CLASS := 'TABLE',
    DateTimeGMT_dtype := 'datetime64',
    DateTimeGMT_kind := ['DateTimeGMT'],
    DateTimeGMT_meta := None,
    E1_FFlow_dtype := 'float64',
    E1_FFlow_kind := ['E1_FFlow'],
    E1_FFlow_meta := None,
    E1_FPres_dtype := 'float64',
    E1_FPres_kind := ['E1_FPres'],
    E1_FPres_meta := None,
    E1_ITT_dtype := 'float64',
    E1_ITT_kind := ['E1_ITT'],
    E1_ITT_meta := None,
    E1_NG_dtype := 'float64',
    E1_NG_kind := ['E1_NG'],
    E1_NG_meta := None,
    E1_NP_dtype := 'float64',
    E1_NP_kind := ['E1_NP'],
    E1_NP_meta := None,
    E1_OilP_dtype := 'float64',
    E1_OilP_kind := ['E1_OilP'],
    E1_OilP_meta := None,
    E1_OilT_dtype := 'float64',
    E1_OilT_kind := ['E1_OilT'],
    E1_OilT_meta := None,
    E1_Torq_dtype := 'float64',
    E1_Torq_kind := ['E1_Torq'],
    E1_Torq_meta := None,
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_10_FILL := 0.0,
    FIELD_10_NAME := 'GndSpd',
    FIELD_11_FILL := 0.0,
    FIELD_11_NAME := 'VSpd',
    FIELD_12_FILL := 0.0,
    FIELD_12_NAME := 'Pitch',
    FIELD_13_FILL := 0.0,
    FIELD_13_NAME := 'Roll',
    FIELD_14_FILL := 0.0,
    FIELD_14_NAME := 'LatAc',
    FIELD_15_FILL := 0.0,
    FIELD_15_NAME := 'NormAc',
    FIELD_16_FILL := 0.0,
    FIELD_16_NAME := 'HDG',
    FIELD_17_FILL := 0.0,
    FIELD_17_NAME := 'FQtyL',
    FIELD_18_FILL := 0.0,
    FIELD_18_NAME := 'FQtyR',
    FIELD_19_FILL := 0.0,
    FIELD_19_NAME := 'E1_FFlow',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_20_FILL := 0.0,
    FIELD_20_NAME := 'E1_FPres',
    FIELD_21_FILL := 0.0,
    FIELD_21_NAME := 'E1_OilT',
    FIELD_22_FILL := 0.0,
    FIELD_22_NAME := 'E1_OilP',
    FIELD_23_FILL := 0.0,
    FIELD_23_NAME := 'E1_Torq',
    FIELD_24_FILL := 0.0,
    FIELD_24_NAME := 'E1_NP',
    FIELD_25_FILL := 0.0,
    FIELD_25_NAME := 'E1_NG',
    FIELD_26_FILL := 0.0,
    FIELD_26_NAME := 'E1_ITT',
    FIELD_27_FILL := 0.0,
    FIELD_27_NAME := 'TAS',
    FIELD_28_FILL := '',
    FIELD_28_NAME := 'SrcFile',
    FIELD_2_FILL := '',
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := 0,
    FIELD_3_NAME := 'DateTimeGMT',
    FIELD_4_FILL := 0.0,
    FIELD_4_NAME := 'Latitude',
    FIELD_5_FILL := 0.0,
    FIELD_5_NAME := 'Longitude',
    FIELD_6_FILL := 0.0,
    FIELD_6_NAME := 'AltB',
    FIELD_7_FILL := 0.0,
    FIELD_7_NAME := 'BaroA',
    FIELD_8_FILL := 0.0,
    FIELD_8_NAME := 'AltMSL',
    FIELD_9_FILL := 0.0,
    FIELD_9_NAME := 'IAS',
    FQtyL_dtype := 'float64',
    FQtyL_kind := ['FQtyL'],
    FQtyL_meta := None,
    FQtyR_dtype := 'float64',
    FQtyR_kind := ['FQtyR'],
    FQtyR_meta := None,
    GndSpd_dtype := 'float64',
    GndSpd_kind := ['GndSpd'],
    GndSpd_meta := None,
    HDG_dtype := 'float64',
    HDG_kind := ['HDG'],
    HDG_meta := None,
    IAS_dtype := 'float64',
    IAS_kind := ['IAS'],
    IAS_meta := None,
    LatAc_dtype := 'float64',
    LatAc_kind := ['LatAc'],
    LatAc_meta := None,
    Latitude_dtype := 'float64',
    Latitude_kind := ['Latitude'],
    Latitude_meta := None,
    Longitude_dtype := 'float64',
    Longitude_kind := ['Longitude'],
    Longitude_meta := None,
    NROWS := 2772065,
    NormAc_dtype := 'float64',
    NormAc_kind := ['NormAc'],
    NormAc_meta := None,
    Pitch_dtype := 'float64',
    Pitch_kind := ['Pitch'],
    Pitch_meta := None,
    Roll_dtype := 'float64',
    Roll_kind := ['Roll'],
    Roll_meta := None,
    SrcFile_dtype := 'string240',
    SrcFile_kind := ['SrcFile'],
    SrcFile_meta := None,
    TAS_dtype := 'float64',
    TAS_kind := ['TAS'],
    TAS_meta := None,
    TITLE := '',
    VERSION := '2.7',
    VSpd_dtype := 'float64',
    VSpd_kind := ['VSpd'],
    VSpd_meta := None,
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['AfcsOn', 'AltGPS', 'COM1', 'COM2', 'CRS', 'E2_ITT', 'E2_NG', 'E2_NP', 'E2_Torq', 'HAL', 'HCDI', 'HPLfd', 'HPLwas', 'MagVar', 'NAV1', 'NAV2', 'OAT', 'PichC', 'RollC', 'TRK', 'VAL', 'VCDI', 'VPLwas', 'VSpdG', 'WndDr', 'WndSpd', 'WptBrg', 'WptDst', 'amp1', 'amp2', 'volt1', 'volt2'],
    values_block_0_meta := None,
    values_block_1_dtype := 'string48',
    values_block_1_kind := ['AtvWpt', 'GPSfix', 'HSIS', 'PitchM', 'RollM'],
    values_block_1_meta := None]

Upvotes: 2

Views: 366

Answers (1)

Adam

Reputation: 792

Apparently the index is not updated when appending; I wasn't able to find this documented anywhere in pandas or PyTables. So the problem was that, when I created the file, it did not end up with a correct index. If I instead skip index creation during the appends and build the index only once, after the whole HDF5 file has been written, select works as expected. Creating the file this way gives a correct search:

from os import path

import pandas as pd

file_num = 1
for file in files:
    print(file + " Num: " + str(file_num) + " of: " + str(len(files)))
    file_num = file_num + 1
    in_pd = read_file(file)
    head, tail = path.split(file)
    in_pd["SrcFile"] = tail
    # index=False skips the per-append index update; the index is built once below.
    in_pd.to_hdf('AllData.h5', 'mydata', mode='a', append=True,
                 complib='blosc', complevel=7, index=False,
                 data_columns=Search_cols, min_itemsize={'SrcFile': 30})

store = pd.HDFStore('AllData.h5')
store.create_table_index('mydata', columns=Search_cols, optlevel=6, kind='medium')
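To confirm the index was actually built, the storer exposes the underlying PyTables table; a sketch of the check:

s = store.get_storer('mydata')
print(s.nrows)             # total rows in the table
print(s.table.colindexes)  # per-column PyTables indexes on the data columns
store.close()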

Upvotes: 1
