Reputation: 2991

Values missing when loaded from Pandas HDF5 file

I have some Twitter feed data loaded in a pandas Series that I want to store in HDF5 format. Here's a sample of it:

    >>> feeds[80:90]

    80    BØR MAN STARTE en tweet med store bokstaver? F...
    81    @NRKSigrid @audunlysbakken Har du husket Per S...
    82    Lurer på om IS har fått med seg kaoset ved Eur...
    83    synes han hørte på P3 at Opoku uttales Opoko. ...
    84    De statsbærende partiene Ap og Høyre må ta sky...
    85    April 2014. Blir MDG det nye arbeider @partiet...
    86                       MDG: Hasj for kjøtt. #valg2015
    87               Grønt skifte.. https://t.co/OuM8quaMz0
    88                    Kinderegg https://t.co/AsECmw2sV9
    89    MDG for honning, frukt og grønt. https://t.co/...
    Name: feeds, dtype: object

Whenever I load the above data from a saved HDF5 file, some values are missing and are replaced by ''. The same values reappear when I change the slice being stored. For example, when storing the rows with index 84-85:

    >>> store = pd.HDFStore('feed.hd5')
    >>> store.append('feed', feeds[84:86], min_itemsize=200, encoding='utf-8')
    >>> store.close()

When I read the file back, the value at index 84 is missing:

    >>> pd.read_hdf('feed.hd5', 'feed')

    84                                                     
    85    April 2014. Blir MDG det nye arbeider @partiet...
    Name: feeds, dtype: object

I get the same output if I do it this way instead:

    >>> feeds[84:86].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
    >>> pd.read_hdf('feed.hd5', 'feed')

But if I change the slice from [84:86] to, say, [84:87], the row at index 84 is loaded:

    >>> feeds[84:87].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
    >>> res = pd.read_hdf('feed.hd5', 'feed')
    >>> res

    84    De statsbærende partiene Ap og Høyre må ta sky...
    85    April 2014. Blir MDG det nye arbeider @partiet...
    86                       MDG: Hasj for kjøtt. #valg2015
    Name: feeds, dtype: object

But now the loaded string is missing some characters compared with the original tweet. Here is the tweet at index 84:

    >>> # Original tweet (Length: 140)
    >>> print (feeds[84])

    De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål.

    >>> # Loaded tweet (Length: 134)
    >>> print (res[84])

    De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspø

I plan to use Python 3.3.x mainly for its unicode column support in PyTables (am I wrong?), but I have not yet been able to store all the data successfully. Can anyone explain this and tell me how to avoid it?

I am using OS: Mac OS X Yosemite, Pandas: 0.16.2, Python: 3.3.5, PyTables: 3.2.0

UPDATE: I confirmed with HDFView (http://www.hdfgroup.org/products/java/hdfview/) that the data is always stored (although with the last few characters missing), but I am unable to load it successfully every time.

Thanks.

Upvotes: 1

Answers (2)

Kevad

Reputation: 2991

I have found the issue and was able to partly correct it. If someone posts a tweet that is close to 140 characters and another person retweets it, the retweet doesn't contain the full text, because a prefix such as RT @username: is prepended. The retweet then exceeds 140 characters, so it is truncated to 140, and that is how it is returned by Python Twitter API clients such as tweepy or Python Twitter Tools (the two I have tested).

Sometimes the last character of such a tweet is '…', which has a length of 1 and an ordinal value of 8230 (try chr(8230) in Python 3.x or unichr(8230) in Python 2.x). When such tweets are stored in an HDF5 file and read back via pd.read_hdf, the read fails and pandas replaces the whole tweet with ''.

This could be rectified as below:

    >>> # Replace the '…' character with an empty string; str.replace returns
    >>> # a new Series, so assign the result back
    >>> ch = chr(8230)
    >>> feeds = feeds.str.replace(ch, '')

    >>> # Store it in HDF5 now (not sure whether this preserves the encoding)
    >>> feeds.to_hdf('feed.h5', 'feed', format='table', append=True, encoding='utf-8', data_columns=True)

    >>> # Or, if you prefer this way
    >>> with pd.HDFStore('feed.h5') as store:
    ...     store.append('feed', feeds, min_itemsize=200, encoding='utf-8')

    >>> # Now read it back
    >>> pd.read_hdf('feed.h5', 'feed')
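As a quick check before replacing anything, the following sketch (plain Python 3, with a hypothetical list `sample` standing in for the `feeds` Series and a made-up retweet as its second entry) shows that `chr(8230)` is the horizontal ellipsis U+2026, that it occupies three bytes in UTF-8, and how to find the entries that end with it:

```python
# Hypothetical stand-in for the `feeds` Series; the second entry is an
# invented retweet that ends with the ellipsis character.
sample = [
    "MDG: Hasj for kjøtt. #valg2015",
    "RT @username: De statsbærende partiene Ap og Høyre\u2026",
]

ellipsis = chr(8230)  # the same character as '\u2026'

# One character in Python 3, but three bytes once encoded as UTF-8.
ellipsis_bytes = len(ellipsis.encode("utf-8"))

# Entries that end with the ellipsis, and the same data with it removed.
affected = [t for t in sample if t.endswith(ellipsis)]
cleaned = [t.replace(ellipsis, "") for t in sample]

print(ellipsis_bytes, len(affected))
```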

However, the problem still appears sometimes when the text contains other unicode characters. Passing encoding='utf-8' didn't really help, at least in my case. Any help in this regard is appreciated. :)
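One possible explanation for the truncation, offered as an assumption rather than a confirmed diagnosis: HDF5 string columns have a fixed width in bytes, while `len()` in Python 3 counts characters. The tweet at index 84 from the question is 140 characters long but needs 147 bytes in UTF-8, and cutting it at 140 bytes leaves exactly the 134 characters that were read back. The sketch below reproduces that arithmetic in plain Python; `needed_itemsize` is a hypothetical name, not a pandas parameter.

```python
# The tweet at index 84, copied from the question.
tweet = ("De statsbærende partiene Ap og Høyre må ta skylda for "
         "Miljøpartiets fremgang. Velgerne har sett at SV og V ikke "
         "vinner frem i miljøspørsmål.")

encoded = tweet.encode("utf-8")
n_chars = len(tweet)      # characters
n_bytes = len(encoded)    # bytes; æ, ø and å take two bytes each

# Cut the byte string at the character count, as a column whose width was
# derived from the character length would (an assumption about the cause).
truncated = encoded[:n_chars].decode("utf-8", errors="ignore")

# A width measured in encoded bytes instead of characters.
needed_itemsize = max(len(t.encode("utf-8")) for t in [tweet])

print(n_chars, n_bytes, len(truncated), needed_itemsize)
```

A byte-based width like this could be passed as min_itemsize when appending. Whether that also cures the rows that come back as '' is not something this sketch can show.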

Upvotes: 0

Jeff

Reputation: 128918

See the doc-string here.

You need to provide encoding='utf-8'; otherwise the data will be stored with your default Python encoding (which might or might not work). Reading will use the written encoding.
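To illustrate why the encoding matters, here is a small standard-library sketch (not pandas code) of what happens when UTF-8 bytes are decoded with a different encoding. Latin-1 is chosen here only as an example of a mismatched default:

```python
text = "Høyre må"
stored = text.encode("utf-8")      # what ends up on disk

wrong = stored.decode("latin-1")   # decoded with a mismatched encoding
right = stored.decode("utf-8")     # decoded with the encoding used to write

print(wrong)
print(right)
```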

The data

In [13]: df[84:86]
Out[13]: 
              tweet_id  username                 tweet_time                                              tweet
84  641437756275720192  @nicecap  2015-09-09T02:27:33+00:00     De statsbærende partiene Ap og Høyre må ta sky...
85  641434661391101952  @nicecap  2015-09-09T02:15:15+00:00  April 2014. Blir MDG det nye arbeider @partiet...

When appending, supply the encoding:

In [11]: store.append('feed',df[84:86],encoding='utf-8')

Supply the encoding when reading as well:

In [12]: store.select('feed',encoding='utf-8')
Out[12]: 
              tweet_id  username                 tweet_time                                              tweet
84  641437756275720192  @nicecap  2015-09-09T02:27:33+00:00     De statsbærende partiene Ap og Høyre må ta sky...
85  641434661391101952  @nicecap  2015-09-09T02:15:15+00:00  April 2014. Blir MDG det nye arbeider @partiet...

Here's how it's stored:

In [14]: store.get_storer('feed')
Out[14]: frame_table  (typ->appendable,nrows->2,ncols->4,indexers->[index])

In [15]: store.get_storer('feed').attrs
Out[15]: 
/feed._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := 'utf-8',
    index_cols := [(0, 'index')],
    info := {1: {'names': [None], 'type': 'Index'}, 'index': {}},
    levels := 1,
    metadata := [],
    nan_rep := 'nan',
    non_index_axes := [(1, ['tweet_id', 'username', 'tweet_time', 'tweet'])],
    pandas_type := 'frame_table',
    pandas_version := '0.15.2',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1']]

So I suppose this is a bug, in that the stored encoding should be used by default when reading. I created an issue here.

Upvotes: 1
