Reputation: 65
I've streamed tweets with Tweepy and stored them in a text file. Now I'm looking to convert this file into a pandas DataFrame, but I don't know how. I've looked for similar posts here on Stack Overflow and in the pandas documentation, but I'm still not sure how to start parsing all of this data.
Answer: I solved this by reading the JSON file line by line into a list and then turning that list into a DataFrame. Thank you to everyone who helped.
import json
import pandas as pd

tweets = []
with open('tweets.txt', 'r') as f:
    for line in f:  # one JSON object per line
        tweets.append(json.loads(line))
df = pd.DataFrame(tweets)
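A minimal sketch of the same line-by-line idea, assuming the file may contain blank lines between tweets (that is an assumption, not something shown in the question; json.loads raises an error on an empty string). The helper name load_json_lines and the sample lines are made up for illustration.

```python
import json

def load_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per line.

    Blank lines are skipped (assumption: the file may contain them).
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # nothing to parse on a blank line
        records.append(json.loads(line))
    return records

# Hypothetical sample standing in for the contents of tweets.txt
sample = ['{"id": 1, "text": "hello"}\n', '\n', '{"id": 2, "text": "world"}\n']
records = load_json_lines(sample)
```

The returned list can be passed to pd.DataFrame exactly as in the answer above; with a real file, pass the open file object instead of the sample list.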
Upvotes: 2
Views: 1376
Reputation: 114
If you have multiple tweets in your JSON file (yourfile.txt), one per line, and you want to read them all into your data frame:
df = pd.read_json('yourfile.txt', lines=True)
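A small self-contained sketch of what lines=True does, using an in-memory buffer in place of yourfile.txt. The sample records are invented for illustration; real tweets have many more fields.

```python
from io import StringIO

import pandas as pd

# Two JSON objects, one per line (the layout that lines=True expects)
raw = '{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n'

# read_json accepts a file-like object as well as a path
df = pd.read_json(StringIO(raw), lines=True)
print(df)
```

Each line becomes one row, and the keys become the columns.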
Upvotes: 0
Reputation: 48
You don't have to convert your text file to JSON in order to read it as a pandas DataFrame; just do:
pd.read_json('yourfile.txt')
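and it should work, provided the file holds exactly one JSON document (if that document is a flat object of scalar values, like the toy example here, pandas may also need typ='series'). This assumes that your format is: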
{"name": "first json"}
and not:
{"name": "first json"}{"name": "second json"}
However, if you do have the second format, then you can use any of these methods (there are many more):
Iterate through the file character by character, track the open braces, create JSON objects on the go, append them to a list, and feed the list into pandas:
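def parseMultipleJSON(lines):
    # skip: current brace depth; prev: start index of the current object
    skip = prev = 0
    data = []
    lines = ''.join(lines)
    # Note: braces inside string values are counted too, so this only
    # works when the text fields contain no unbalanced '{' or '}'.
    for idx, line in enumerate(lines):
        if line == "{":
            skip += 1
        elif line == "}":
            skip -= 1
            if skip == 0:
                # Depth back to zero: one complete top-level object
                json_string = ''.join(lines[prev:idx+1])
                data.append(json.loads(json_string))
                prev = idx+1
    return data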
Or split on '}{' and add the removed braces back:
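def parseMultipleJSON2(lines):
    # Note: assumes '}{' only occurs between top-level objects and that
    # no object ends with a nested '}' right before the boundary.
    lines = ''.join(lines).split('}{')
    data = []
    for line in lines:
        if not line.endswith('}'):
            line += '}'
        if not line.startswith('{'):
            line = '{%s' % line
        data.append(json.loads(line))
    return data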
This is the same as the second solution but abbreviated:
def parseMultipleJSON3(lines):
    parts = ''.join(lines).split('}{')
    last = len(parts) - 1
    # Re-add the brace that split() removed on each side of a boundary;
    # a file with a single object is left unchanged.
    data = [json.loads(('{' if idx > 0 else '') + part + ('}' if idx < last else ''))
            for idx, part in enumerate(parts)]
    return data
Then you can call whichever one you prefer, like this:
import pandas as pd
import json

with open('yourfile.txt','r') as json_file:
    lines = json_file.readlines()
lines = [line.strip("\n") for line in lines]
#data = parseMultipleJSON(lines)
#data = parseMultipleJSON2(lines)
data = parseMultipleJSON3(lines)
df = pd.DataFrame(data)
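The brace counting and splitting above can be confused by '{', '}' or '}{' appearing inside a tweet's text. As a possible alternative (a different technique, not one of the three functions above), the standard library's json.JSONDecoder.raw_decode parses one document starting at a given position and reports where it stopped, so it can walk through back-to-back objects. This is a sketch under the assumption that the file contains only JSON objects, optionally separated by whitespace; the function name parse_concatenated_json and the sample string are hypothetical.

```python
import json

def parse_concatenated_json(text):
    """Return a list of the JSON documents found back to back in text."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    n = len(text)
    while pos < n:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            break
        # Parse one document starting at pos; pos moves to just past it
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects

# Hypothetical sample: braces inside a string and a nested object
sample = '{"name": "first json"}{"text": "braces }{ inside a string"}\n{"nested": {"a": 1}}'
records = parse_concatenated_json(sample)
```

The resulting list can be fed to pd.DataFrame in the same way as the output of the functions above.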
Upvotes: 1