Skipping AttributeError while importing Twitter data into pandas

Question

I have a file of almost 1 GB that stores about 0.2 million tweets. A file this large inevitably contains some bad records. The error I get is AttributeError: 'int' object has no attribute 'items'. It occurs when I run this code:

import json
# In pandas >= 1.0 this function is available as pandas.json_normalize.
from pandas.io.json import json_normalize

raw_data_path = input("Enter the path for raw data file: ")
tweet_data_path = raw_data_path

tweet_data = []
with open(tweet_data_path, "r", encoding="utf-8") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweet_data.append(tweet)
        except:
            continue

# Keep only the records that parsed to a dictionary.
tweet_data2 = [tweet for tweet in tweet_data if isinstance(tweet, dict)]

tweets = json_normalize(tweet_data2)[["text", "lang", "place.country",
                                      "created_at", "coordinates",
                                      "user.location", "id"]]

Is there a way to skip the lines that cause this error and continue with the rest of the file?

akshat · Accepted Answer

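The issue here is not with the lines in the data but with tweet_data itself. If you check tweet_data, you will find one or more elements of type int. json_normalize only accepts a "dict or list of dicts", so every element of the list has to be a dictionary. The try/except does not catch these lines because a line containing only a number is valid JSON: json.loads parses it without error and returns an int.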

You may want to check your tweet data and remove any values other than dictionaries.
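One way to check what ended up in tweet_data is to tally the element types. The snippet below is only a sketch: sample_lines is a made-up stand-in for the raw file, and the real data may contain other types as well.

```python
import json
from collections import Counter

# Hypothetical sample lines standing in for the raw tweet file.
sample_lines = [
    '{"id": 1, "text": "hello"}',
    '1234',             # valid JSON, but parses to an int
    '"just a string"',  # valid JSON, parses to a str
    'not json at all',  # raises json.JSONDecodeError
]

tweet_data = []
for line in sample_lines:
    try:
        tweet_data.append(json.loads(line))
    except json.JSONDecodeError:
        continue

# Tally the type of every parsed element; anything that is not
# a dict would have to be removed before normalising.
type_counts = Counter(type(item).__name__ for item in tweet_data)
print(type_counts)
```

Any key other than 'dict' in the tally points at records that need to be filtered out.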

I was able to reproduce the error with the example below, adapted from the json_normalize documentation:

Working Example:

from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
       ]
json_normalize(data)

Output:

Displays the DataFrame

Reproducing Error:

from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
       1  # added an integer to the list
       ]
result = json_normalize(data)

Error:

AttributeError: 'int' object has no attribute 'items'

How to prune tweet_data (not needed if you use the updated loop below):

Before normalising, run the following:

tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]

Update (for the for loop):

for line in tweets_file:
    try:
        tweet = json.loads(line)
        if isinstance(tweet, dict): 
            tweet_data.append(tweet)
    except:
        continue
