Reputation: 11
I am trying to create a DataFrame of Twitter data. Using the Twitter API, I have a list of tweet objects (tweets) and want to populate a DataFrame with various fields from those objects, applying some other functions to the text. My current method uses a list comprehension for each column, iterating through all the tweets every time:
df = pd.DataFrame(data=[tweet.all_text for tweet in tweets], columns=["tweets"])
df.loc[:, 'id'] = np.array([tweet.id for tweet in tweets])
df.loc[:, 'len_tweet'] = np.array([len(tweet.all_text) for tweet in tweets])
df.loc[:, 'date_created'] = np.array([tweet.created_at_datetime for tweet in tweets])
df.loc[:, 'author'] = np.array([tweet.name for tweet in tweets])
df.loc[:, 'clean_tweet'] = np.array([self.clean_tweet_eng(tweet) for tweet in df.tweets])
df.loc[:, 'clean_stopwords_tweet'] = np.array([self.stopwords_clean(tweet) for tweet in df.tweets])
etc...
As I scale up the number of tweets, this becomes very slow.
I have looked at two other methods: building the dataframe by iteratively adding elements to a dictionary, and building it up one row at a time with iterrows so that the list of tweets is only cycled through once. Both seem to be slower.
What is the fastest way to achieve this?
Upvotes: 1
Views: 52
Reputation: 124
I think the simplest way would be to convert the list of tweet objects into one list of dictionaries, then load the data once:
import pandas as pd
list_of_dicts = [{'name': 'jon', 'age': 30}, {'name': 'paul', 'age': 26}]
df = pd.DataFrame(list_of_dicts)
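Applied to your case, that would look something like the sketch below: one pass over the tweets builds a dict per tweet, and the DataFrame is constructed once instead of via repeated column assignments. The attribute names (`all_text`, `created_at_datetime`, `name`) are taken from your question; the `Tweet` class and the two cleaning functions are placeholder stand-ins for the API objects and your `clean_tweet_eng` / `stopwords_clean` methods, so the example is self-contained.

```python
import pandas as pd

# Placeholder stand-ins so the sketch runs on its own. In the real code,
# the objects come from the Twitter API and the cleaning functions are
# the question's self.clean_tweet_eng / self.stopwords_clean methods.
class Tweet:
    def __init__(self, id, all_text, created_at_datetime, name):
        self.id = id
        self.all_text = all_text
        self.created_at_datetime = created_at_datetime
        self.name = name

def clean_tweet_eng(text):
    return text.lower()          # placeholder cleaning step

def stopwords_clean(text):
    return text                  # placeholder stop-word removal

tweets = [
    Tweet(1, "Hello World", "2020-01-01", "jon"),
    Tweet(2, "Pandas is fast", "2020-01-02", "paul"),
]

# Single pass: one dict per tweet, then one DataFrame construction.
records = [
    {
        "tweets": t.all_text,
        "id": t.id,
        "len_tweet": len(t.all_text),
        "date_created": t.created_at_datetime,
        "author": t.name,
        "clean_tweet": clean_tweet_eng(t.all_text),
        "clean_stopwords_tweet": stopwords_clean(t.all_text),
    }
    for t in tweets
]
df = pd.DataFrame(records)
```

This replaces seven separate comprehensions over the tweet list with one, and avoids the intermediate `np.array` conversions entirely; beyond that, the remaining cost is the per-tweet Python function calls themselves.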
Upvotes: 1