Reputation: 1127
I got a file of twitter streaming data in json. Now I'm trying to load it in python:
import json
tweets_data=[]
tweets_file=open('test1.txt',"r")
for line in tweets_file:
    try:
        tweet=json.load(line)
        tweets_data.append(tweet)
    except:
        continue
print(len(tweets_data))
The result is always 0. If the try/except is removed, the error is "ValueError: Expecting value: line 2 column 1 (char 1)". However, each line of the file is valid JSON according to an online validator.
Here is a segment of test1.txt:
{"created_at":"Fri Jul 24 16:35:22 +0000 2015","id":624618886277640192,"id_str":"624618886277640192","text":"RT @nodenow: Essential Steps: Long Term Support for Node.js\nhttp:\/\/t.co\/MzPfvenwtT\n+1 micshasan #javascript","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":3290861609,"id_str":"3290861609","name":"Rajiin","screen_name":"Rajiin_07","location":"Pokhara city","url":"http:\/\/www.pokharacity.com","description":null,"protected":false,"verified":false,"followers_count":1101,"friends_count":1119,"listed_count":155,"favourites_count":2048,"statuses_count":5498,"created_at":"Wed May 20 04:58:23 +0000 2015","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"4A913C","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/617620457336893440\/3HTEKnMx_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/617620457336893440\/3HTEKnMx_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/3290861609\/1435854327","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Fri Jul 24 16:33:04 +0000 
2015","id":624618308050915328,"id_str":"624618308050915328","text":"Essential Steps: Long Term Support for Node.js\nhttp:\/\/t.co\/MzPfvenwtT\n+1 micshasan #javascript","source":"\u003ca href=\"http:\/\/ifttt.com\" rel=\"nofollow\"\u003eIFTTT\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":3243544179,"id_str":"3243544179","name":"Javascript Digest","screen_name":"nodenow","location":"","url":null,"description":null,"protected":false,"verified":false,"followers_count":1238,"friends_count":1,"listed_count":1148,"favourites_count":2,"statuses_count":130923,"created_at":"Sat May 09 15:45:13 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/597066594334941184\/Xe4tTtU8_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/597066594334941184\/Xe4tTtU8_normal.jpg","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":1,"favorite_count":0,"entities":{"hashtags":[{"text":"javascript","indices":[83,94]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/MzPfvenwtT","expanded_url":"http:\/\/bit.ly\/1LH81ly","display_url":"bit.ly\/1LH81ly","indices":[47,69]}],"user_mentions":[],"symbols":[]},"favorited":false,"retwe
eted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"javascript","indices":[96,107]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/MzPfvenwtT","expanded_url":"http:\/\/bit.ly\/1LH81ly","display_url":"bit.ly\/1LH81ly","indices":[60,82]}],"user_mentions":[{"screen_name":"nodenow","name":"Javascript Digest","id":3243544179,"id_str":"3243544179","indices":[3,11]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1437755722003"}
{"created_at":"Fri Jul 24 16:35:22 +0000 2015","id":624618888387432449,"id_str":"624618888387432449","text":"python \u041c\u043e\u0441\u043a\u0432\u0430 http:\/\/t.co\/itYJmgVvgD","source":"\u003ca href=\"http:\/\/gdepraktika.ru\" rel=\"nofollow\"\u003egdepraktika-trfnslator\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":623605809,"id_str":"623605809","name":"\u0413\u0434\u0435 \u043f\u0440\u0430\u043a\u0442\u0438\u043a\u0430?","screen_name":"gdepraktika","location":"\u0420\u043e\u0441\u0441\u0438\u044f","url":"http:\/\/gdepraktika.ru","description":"\u041f\u0440\u0430\u043a\u0442\u0438\u043a\u0430, \u0441\u0442\u0430\u0436\u0438\u0440\u043e\u0432\u043a\u0430, \u0440\u0430\u0431\u043e\u0442\u0430 \u0434\u043b\u044f \u0441\u0442\u0443\u0434\u0435\u043d\u0442\u043e\u0432, \u043e\u0431\u0443\u0447\u0435\u043d\u0438\u0435 \u0432 \u043a\u043e\u043c\u043f\u0430\u043d\u0438\u044f\u0445","protected":false,"verified":false,"followers_count":17,"friends_count":9,"listed_count":0,"favourites_count":0,"statuses_count":902069,"created_at":"Sun Jul 01 07:53:36 +0000 
2012","utc_offset":10800,"time_zone":"Moscow","geo_enabled":false,"lang":"ru","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000420815111\/bba61a6dcd4272794a4af41dd8a44cf5_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000420815111\/bba61a6dcd4272794a4af41dd8a44cf5_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"trends":[],"urls":[{"url":"http:\/\/t.co\/itYJmgVvgD","expanded_url":"http:\/\/bit.ly\/1GqpqOg","display_url":"bit.ly\/1GqpqOg","indices":[15,37]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"und","timestamp_ms":"1437755722506"}
Upvotes: 2
Views: 9591
Reputation: 1
A line may contain \r, \n, spaces, etc., so strip it first. Also, you should use json.loads() (which parses a string) instead of json.load() (which reads from a file object).
import json
tweets_data=[]
tweets_file=open('test1.txt',"r")
for line in tweets_file:
    try:
        tweet=json.loads(line.strip())
        tweets_data.append(tweet)
    except:
        continue
print(len(tweets_data))
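To make the distinction concrete, here is a minimal sketch (using made-up sample data, not the real tweet file) of the difference between the two functions: json.loads() takes a string, while json.load() takes a file-like object with a .read() method.

```python
import json
from io import StringIO

line = '{"id": 1, "text": "hello"}'

# json.loads() parses a JSON document from a str:
tweet = json.loads(line)
print(tweet["text"])  # hello

# json.load() expects a file-like object with a .read() method,
# so passing a plain string to it will not work:
tweet2 = json.load(StringIO(line))
print(tweet2 == tweet)  # True
```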
Upvotes: -1
Reputation: 1935
This is happening because there are two blank lines between the valid json lines. Just add a check for blank lines and you should be good to go.
import json
tweets_data = []
notParsed = []
tweets_file = open('test1.txt',"r")
for line in tweets_file:
    if line.strip():
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except:
            notParsed.append(line)
            continue
print(len(tweets_data))
print('Could not parse: ', len(notParsed))
This isn't required, and I'm only brushing up on my Python because of your answer, but you could condense your code as follows:
list(map(json.loads, [x for x in open('test1.txt') if x.strip()]))
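The one-liner above can be tried out like this, assuming a small hypothetical file stands in for the real test1.txt (note that in Python 3 map() is lazy, which is why the list() wrapper is needed to actually run the parsing):

```python
import json

# Hypothetical sample file for illustration; the real data lives in test1.txt.
with open('sample.jsonl', 'w') as f:
    f.write('{"id": 1}\n\n{"id": 2}\n')

# The blank-line filter drops the empty line; list() forces the lazy map.
with open('sample.jsonl') as f:
    tweets = list(map(json.loads, [x for x in f if x.strip()]))

print(len(tweets))       # 2
print(tweets[1]["id"])   # 2
```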
Upvotes: 5
Reputation: 2539
Two things. First, your segment doesn't look like valid json after all. (Copy-pasting it into the validator http://jsonlint.com confirms that.) Syntactically, there needs to be a comma between the two tweets, and the reason is that they need to be elements in some higher-level data structure (the json-equivalent of a python list == an "array," or the json-equivalent of a dict, an "object"). There can be only one root element in a json file. See this prior question: How to read a JSON file containing multiple root elements?.
Second, if you're trying to just get ordinary access to the json data structure, and aren't trying to do anything special that depends on the notion of a line (or worrying about memory management, etc.), then you don't really need to read it in line by line like this. Instead, you can just bind the whole shebang to a variable, and it will turn the json into nested lists and dicts according to the obvious syntax (i.e., the json curlybraces/brackets function the same as the python ones).
So once you get the json valid, the code can be as simple as:
import json
with open('test1.txt') as json_file:
    myjson = json.load(json_file)
then just access every element in the json via list indexes/dict keys.
This method ignores whitespace, so you should have no problem with the line breaks.
Ultimately, this suggests that the solution is to wrap your tweets in a top-level list ("array") and stick commas between them (maybe with regexes as a string operation?), then just access them by list indexing rather than going line by line.
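That wrapping step can be sketched in a few lines without regexes, assuming the raw stream is newline-delimited objects (the sample string here is made up; the real input would be read from test1.txt):

```python
import json

# Hypothetical raw stream contents standing in for test1.txt.
raw = '{"id": 1, "text": "a"}\n\n{"id": 2, "text": "b"}\n'

# Drop blank lines, join the objects with commas, and wrap in a top-level array:
wrapped = '[' + ','.join(line for line in raw.splitlines() if line.strip()) + ']'

tweets = json.loads(wrapped)
print(len(tweets))       # 2
print(tweets[0]["id"])   # 1
```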
Upvotes: 0