RinW

Reputation: 553

Numpy split data by using specific column name?

How to specify a column for numpy to split the dataset?

Right now I'm trying to split a dataset that has the following format (this is dataitems):

{
    "tweet_id": "1234456",
    "tweet": "hello world",
    "labels": {
        "item1": 2,
        "item2": 1
    }
},
{
    "tweet_id": "567890976",
    "tweet": "testing",
    "labels": {
        "item1": 2,
        "item2": 1,
        "item3": 1,
        "item4": 1
    }
}

At the moment, the workable method is to collect just the tweet_ids in a list and split that, but I'd like to know whether there is a way to split this JSON data directly using numpy.split().

import numpy as np

TRAINPCT = 0.50
DEVPCT = 0.25
TESTPCT = 1 - TRAINPCT - DEVPCT

# Cut points: end of the train slice and end of the dev slice.
n = len(dataitems)
train, dev, test = np.split(dataitems, [int(TRAINPCT * n), int((TRAINPCT + DEVPCT) * n)])

This just throws an error:

OrderedDict([('tweet_id', '1234456'), ('tweet', "hello world"), ('labels', Counter({'item1': 2, 'item2': 1}))])],
      dtype=object) is not JSON serializable
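
The "is not JSON serializable" wording is what Python's json module reports when it is handed a NumPy array, so the split itself may have succeeded and the failure may come from a later json.dump call. A minimal sketch under that assumption; the four sample records are made up for illustration, and the conversion with .tolist() is a suggested workaround rather than something from the original code:

```python
import json
import numpy as np

# Hypothetical records in the same shape as dataitems above.
dataitems = [
    {"tweet_id": "1", "tweet": "a", "labels": {"item1": 2}},
    {"tweet_id": "2", "tweet": "b", "labels": {"item1": 1}},
    {"tweet_id": "3", "tweet": "c", "labels": {"item2": 1}},
    {"tweet_id": "4", "tweet": "d", "labels": {"item3": 1}},
]

n = len(dataitems)
cuts = [int(0.50 * n), int(0.75 * n)]

# np.split returns NumPy arrays (dtype=object here, one dict per element).
train, dev, test = np.split(np.array(dataitems, dtype=object), cuts)

# json.dumps(train) would raise TypeError because ndarray is not serializable;
# converting back to a plain list of dicts avoids that.
train_json = json.dumps(train.tolist())
```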

Thanks

Upvotes: 0

Views: 501

Answers (2)

RinW

Reputation: 553

I figured out that I couldn't do this the way I had thought, with everything in the same dataframe. What I did instead was extract only the tweet_ids into one dataframe, split them, and then match the labels from the initial dataset according to which split each tweet_id ended up in.
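
Roughly, the approach looks like the sketch below. It uses a plain list and a dict lookup instead of a dataframe to keep it short, and the sample records and the by_id name are made up for illustration:

```python
import numpy as np

# Hypothetical records in the same shape as the question's dataitems.
dataitems = [
    {"tweet_id": "1", "tweet": "a", "labels": {"item1": 2}},
    {"tweet_id": "2", "tweet": "b", "labels": {"item1": 1}},
    {"tweet_id": "3", "tweet": "c", "labels": {"item2": 1}},
    {"tweet_id": "4", "tweet": "d", "labels": {"item3": 1}},
]

TRAINPCT = 0.50
DEVPCT = 0.25

# Step 1: pull out only the ids and split those.
ids = np.array([item["tweet_id"] for item in dataitems])
n = len(ids)
cuts = [int(TRAINPCT * n), int((TRAINPCT + DEVPCT) * n)]
train_ids, dev_ids, test_ids = np.split(ids, cuts)

# Step 2: match the full records back to each split by tweet_id.
by_id = {item["tweet_id"]: item for item in dataitems}
train = [by_id[i] for i in train_ids.tolist()]
dev = [by_id[i] for i in dev_ids.tolist()]
test = [by_id[i] for i in test_ids.tolist()]
```

The resulting train, dev and test are ordinary lists of dicts, so they can be passed to json.dump without conversion.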

Upvotes: 0

mommermi

Reputation: 1052

pandas provides functionality to turn JSON data into a DataFrame object, which basically works like a table. It might be worth considering this instead of using numpy:

In [1]: from pandas.io.json import json_normalize
   ...: 
   ...: raw = [{"tweet_id": "1234456",
   ...:         "tweet": "hello world",
   ...:         "labels": {
   ...:             "item1": 2,
   ...:             "item2": 1
   ...:         }},
   ...:        {"tweet_id": "567890976",
   ...:         "tweet": "testing",
   ...:         "labels": {
   ...:             "item1": 2,
   ...:             "item2": 1,
   ...:             "item3": 1,
   ...:             "item4": 1
   ...:         }
   ...:         }]
   ...: 
   ...: df = json_normalize(raw)

In [2]: df
Out[2]: 
   labels.item1  labels.item2  labels.item3  labels.item4        tweet  \
0             2             1           NaN           NaN  hello world   
1             2             1           1.0           1.0      testing   

    tweet_id  
0    1234456  
1  567890976  
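
Once the data is in a DataFrame, the train/dev/test split can be done by row position. The following is only a sketch: it assumes pandas 1.0 or later, where json_normalize is also available as pd.json_normalize, and the four short records and the 50/25/25 cut points are made up to mirror the question:

```python
import pandas as pd

# Hypothetical records in the same shape as raw above.
raw = [
    {"tweet_id": "1", "tweet": "a", "labels": {"item1": 2, "item2": 1}},
    {"tweet_id": "2", "tweet": "b", "labels": {"item1": 1}},
    {"tweet_id": "3", "tweet": "c", "labels": {"item2": 1}},
    {"tweet_id": "4", "tweet": "d", "labels": {"item3": 1}},
]

df = pd.json_normalize(raw)

# Row positions where the train and dev slices end.
n = len(df)
train_end = int(0.50 * n)
dev_end = int(0.75 * n)

train = df.iloc[:train_end]
dev = df.iloc[train_end:dev_end]
test = df.iloc[dev_end:]
```

If the rows should be shuffled first, df.sample(frac=1) before slicing would do that; the question's code does not shuffle, so the sketch does not either.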

Upvotes: 1
