Reputation: 435
I am currently working on a project where I use sentiment analysis on Twitter posts, classifying the tweets with Sentiment140. The tool lets me classify up to 1,000,000 tweets per day, and I have collected around 750,000 tweets, so that should be fine. The only problem is that I can send a maximum of 15,000 tweets to the JSON Bulk Classification at once.
My whole code is set up and running. The only problem is that my JSON file now contains all 750,000 Tweets.
Therefore my question: What is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.
I have thought about iterating through the file, but how do I specify in the code that it should create a new file after, for example, 5,000 elements?
I would love to get some hints on what the most reasonable approach is. Thank you!
EDIT: This is the code that I have at the moment.
import itertools
import json
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')  # do your cleanup here
#print values

v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
The output gives:
["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]
in a file called: "outputbatch_0.json"
EDIT 2: This is the structure of the JSON.
{
    "data": [
        {
            "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
            "id": "1"
        },
        {
            "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u00e2\u20ac\u00a6",
            "id": "2"
        },
        {
            "text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
            "id": "3"
        }
    ]
}
Upvotes: 2
Views: 18265
Reputation: 553
I think your first idea is good. Iterate over all the tweets you have, collect them in a temporary list, and keep an index that you increment by one for every tweet. Whenever the current index modulo 5,000 equals 0, call a method that serializes the collected tweets to a string and saves them to a file with the index in the filename. When you reach the end of the tweets, do the same for the remaining rest.
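A minimal sketch of that counter-based approach, assuming the tweets sit under a "data" key in a file called Tweets.json (matching the structure in the question's second edit) and a batch size of 5,000:

import json

# Load the tweets; the filename and the "data" key follow the structure
# shown in the question's second edit.
with open('Tweets.json') as f:
    tweets = json.load(f)['data']

batch = []
file_index = 0
for count, tweet in enumerate(tweets, start=1):
    batch.append(tweet)
    if count % 5000 == 0:
        # The batch is full: write it out with the index in the filename.
        with open('outputbatch_{}.json'.format(file_index), 'w') as out:
            json.dump({'data': batch}, out)
        batch = []
        file_index += 1

if batch:
    # Write whatever is left over after the last full batch.
    with open('outputbatch_{}.json'.format(file_index), 'w') as out:
        json.dump({'data': batch}, out)

Wrapping each batch in {"data": [...]} keeps every output file in the same structure as the input file.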
Upvotes: 0
Reputation: 1121914
Use an iteration grouper; the itertools module's recipes section includes the following:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
This lets you iterate over your tweets in groups of 5000:
for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
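Note that, with the structure shown in the question's second edit, input_tweets needs to be the list under the "data" key, not the loaded dictionary itself; iterating over the dictionary yields only its keys, which is what produced the ["data", null, null, ...] output. A minimal sketch of how the pieces fit together (the filename and batch size are taken from the question; dropping the None padding on the final group is an extra step not in the recipe):

import json
from itertools import izip_longest  # named zip_longest on Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

with open('Tweets.json') as f:
    input_tweets = json.load(f)['data']   # the list of tweet objects

for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        # The last group is padded with None up to 5000 entries; drop the padding.
        json.dump([tweet for tweet in group if tweet is not None], outputfile)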
Upvotes: 8