Reputation: 2991
I have some Twitter feed data loaded in a pandas Series that I want to store in HDF5 format. Here's a sample of it:
>>> feeds[80:90]
80 BØR MAN STARTE en tweet med store bokstaver? F...
81 @NRKSigrid @audunlysbakken Har du husket Per S...
82 Lurer på om IS har fått med seg kaoset ved Eur...
83 synes han hørte på P3 at Opoku uttales Opoko. ...
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider @partiet...
86 MDG: Hasj for kjøtt. #valg2015
87 Grønt skifte.. https://t.co/OuM8quaMz0
88 Kinderegg https://t.co/AsECmw2sV9
89 MDG for honning, frukt og grønt. https://t.co/...
Name: feeds, dtype: object
Whenever I try to load the above data from a saved HDF5 file, some values are missing and are replaced by ''. The same values reappear when I change the indexing. For example, when storing the rows with index 84-85:
>>> store = pd.HDFStore('feed.hd5')
>>> store.append('feed', feeds[84:86], min_itemsize=200, encoding='utf-8')
>>> store.close()
When I read the file, the value of the 84th row is now missing:
>>> pd.read_hdf('feed.hd5', 'feed')
84
85 April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object
I get the same output as above if I do it this way too:
>>> feeds[84:86].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
>>> pd.read_hdf('feed.hd5', 'feed')
But if I change the slice from [84:86] to, say, [84:87], the 84th row is now loaded.
>>> feeds[84:87].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
>>> res = pd.read_hdf('feed.hd5', 'feed')
>>> res
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider @partiet...
86 MDG: Hasj for kjøtt. #valg2015
Name: feeds, dtype: object
But now the loaded string is missing some characters compared with the original tweet. Here is the tweet in the 84th row:
>>> # Original tweet (Length: 140)
>>> print (feeds[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål.
>>> # Loaded tweet (Length: 134)
>>> print (res[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspø
I plan to use Python 3.3.x mainly for its unicode column support in PyTables (am I wrong?), but I have not yet managed to store all the data successfully. Can anyone explain this and let me know how I can avoid it?
I am using OS: Mac OS X Yosemite, Pandas: 0.16.2, Python: 3.3.5, PyTables: 3.2.0
UPDATE: I confirmed with HDFView (http://www.hdfgroup.org/products/java/hdfview/) that the data is always stored (although with the last few characters missing), but I am unable to load it successfully every time.
Thanks.
Upvotes: 1
Views: 804
Reputation: 2991
I have found the issue and was able to partly correct it. If someone posts a tweet that is close to 140 characters and another person retweets it, the retweet does not contain the full tweet, because some text such as RT @username: is prepended to retweets. As a result, the tweet is longer than 140 characters, so it is truncated to 140 and returned that way by Twitter API clients for Python such as tweepy or Python Twitter Tools (these are the two I have tested). Sometimes the last character of such a tweet is '…', which has a length of 1 and an ordinal value of 8230 (try chr(8230) in Python 3.x or unichr(8230) in Python 2.x). When such tweets are stored in an HDF5 file and read back via pd.read_hdf, the read fails and pandas replaces the whole tweet with ''.
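The properties of that character can be checked with a short snippet. This is a minimal sketch for Python 3; the variable name ellipsis_char is only illustrative. It also shows that the character takes three bytes in UTF-8, which matters for any storage that is sized in bytes rather than characters.

```python
# U+2026 HORIZONTAL ELLIPSIS, the character discussed above
ellipsis_char = chr(8230)

print(ellipsis_char == '\u2026')            # same character, written as an escape
print(len(ellipsis_char))                   # length in characters
print(len(ellipsis_char.encode('utf-8')))   # length in UTF-8 bytes
```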
This could be rectified as below:
>>> # replace that '…' character with '' (empty char)
>>> ch = chr(8230)
>>> feeds = feeds.str.replace(ch, '')  # assign the result; str.replace does not modify in place
>>> # Store it in HDF5 now... # Not sure if it preserves the encoding...
>>> feeds.to_hdf('feed.h5', 'feed', format='table', append=True, encoding='utf-8', data_columns=True)
>>> # If you prefer this way
>>> with pd.HDFStore('feed.h5') as store:
...     store.append('feed', feeds, min_itemsize=200, encoding='utf-8')
>>> # Now read it safely
>>> pd.read_hdf('feed.h5', 'feed')
However, the problem still appears sometimes when the text contains other unicode characters. Passing the encoding='utf-8' option didn't really help, at least in my case. Any help in this regard is appreciated... :)
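One check that may help narrow this down is to compare character counts with UTF-8 byte counts. The sketch below uses only the standard library and the tweet from the question; utf8_itemsize is a hypothetical helper, not a pandas function. For this tweet, cutting the encoded bytes at the character count reproduces the shortened string that was read back, which suggests (as an assumption, not something confirmed from the pandas source) that the string column was sized by character count rather than by byte count. It does not explain the rows that come back as ''.

```python
tweet = ("De statsbærende partiene Ap og Høyre må ta skylda for "
         "Miljøpartiets fremgang. Velgerne har sett at SV og V ikke "
         "vinner frem i miljøspørsmål.")

encoded = tweet.encode('utf-8')
print(len(tweet))     # number of characters
print(len(encoded))   # number of bytes; each of æ, ø, å takes two

# Simulate a fixed-width column whose width in bytes equals the
# character count of the tweet (assumed sizing, see the note above).
truncated = encoded[:len(tweet)].decode('utf-8', errors='ignore')
print(truncated)
print(len(truncated))


def utf8_itemsize(strings):
    """Hypothetical helper: widest UTF-8 encoded length, in bytes."""
    return max(len(s.encode('utf-8')) for s in strings)


print(utf8_itemsize([tweet]))
```

If byte width is indeed the cause, a min_itemsize at least as large as utf8_itemsize(feeds) would be the value to try; this is untested against pandas 0.16.2.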
Upvotes: 0
Reputation: 128918
See the doc-string here.
You need to provide encoding='utf-8', otherwise the data will be stored with your default Python encoding (which might or might not work). Reading will use the written encoding.
The data
In [13]: df[84:86]
Out[13]:
tweet_id username tweet_time tweet
84 641437756275720192 @nicecap 2015-09-09T02:27:33+00:00 De statsbærende partiene Ap og Høyre må ta sky...
85 641434661391101952 @nicecap 2015-09-09T02:15:15+00:00 April 2014. Blir MDG det nye arbeider @partiet...
When appending, supply the encoding.
In [11]: store.append('feed',df[84:86],encoding='utf-8')
Supply the encoding when reading as well
In [12]: store.select('feed',encoding='utf-8')
Out[12]:
tweet_id username tweet_time tweet
84 641437756275720192 @nicecap 2015-09-09T02:27:33+00:00 De statsbærende partiene Ap og Høyre må ta sky...
85 641434661391101952 @nicecap 2015-09-09T02:15:15+00:00 April 2014. Blir MDG det nye arbeider @partiet...
Here's how it's stored
In [14]: store.get_storer('feed')
Out[14]: frame_table (typ->appendable,nrows->2,ncols->4,indexers->[index])
In [15]: store.get_storer('feed').attrs
Out[15]:
/feed._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := 'utf-8',
index_cols := [(0, 'index')],
info := {1: {'names': [None], 'type': 'Index'}, 'index': {}},
levels := 1,
metadata := [],
nan_rep := 'nan',
non_index_axes := [(1, ['tweet_id', 'username', 'tweet_time', 'tweet'])],
pandas_type := 'frame_table',
pandas_version := '0.15.2',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1']]
So, I suppose this is a bug in that I should by default use the stored encoding when reading. I created an issue here
Upvotes: 1