Reputation: 373
I have a dataset with UserIDs, Tweets and CreatedDates. Each UserID has multiple tweets created at different dates. I want to find the frequency of tweets, and I've written a small calculation that gives me the number of tweets per hour per UserID. I used groupby to do this; the code is as follows:
twitterDataFrame = twitterDataFrame.set_index(['CreatedAt'])
tweetsByEachUser = twitterDataFrame.groupby('UserID')
numberOfHoursBetweenFirstAndLastTweet = (tweetsByEachUser['CreatedAtForCalculations'].first() - tweetsByEachUser['CreatedAtForCalculations'].last()).astype('timedelta64[h]')
numberOfTweetsByTheUser = tweetsByEachUser.size()
frequency = numberOfTweetsByTheUser / numberOfHoursBetweenFirstAndLastTweet
When printing the value of frequency I get:
UserID
807095 5.629630
28785486 2.250000
134758540 8.333333
Now I need to go back into my big data frame (twitterDataFrame) and add these values alongside the correct UserIDs. How can I do that? I'd like to say
twitterDataFrame['frequency'] = the frequency corresponding to the UserID, i.e. look up each row's twitterDataFrame['UserID'] in the frequency values computed above.
However, I am not sure how to do this. Does anyone know how I can achieve it?
Upvotes: 1
Views: 596
Reputation: 77991
You can use a join operation on the frequency object you created, or do it in one stage:
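# Tweets per hour per user: tweet count divided by the hours between the earliest and latest tweet.
# transform broadcasts each per-user value back onto every row of that user.
timestamps = twitterDataFrame.groupby('UserID')['CreatedAtForCalculations']
hours = (timestamps.transform('max') - timestamps.transform('min')).dt.total_seconds() / 3600
twitterDataFrame['frequency'] = timestamps.transform('size') / hours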
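For the join route mentioned above, here is a minimal sketch. The sample rows are made up for illustration, and it assumes CreatedAtForCalculations already holds datetime values; the column names are taken from the question. It uses max and min rather than first and last so the result does not depend on row order, and it keeps fractional hours instead of truncating to whole hours.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
twitterDataFrame = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2],
    "CreatedAtForCalculations": pd.to_datetime([
        "2014-01-01 00:00", "2014-01-01 02:00", "2014-01-01 04:00",
        "2014-01-02 10:00", "2014-01-02 12:00",
    ]),
})

tweetsByEachUser = twitterDataFrame.groupby("UserID")["CreatedAtForCalculations"]

# Hours between each user's earliest and latest tweet.
hours = (tweetsByEachUser.max() - tweetsByEachUser.min()).dt.total_seconds() / 3600

# One value per UserID, indexed by UserID; join needs the Series to be named.
frequency = (tweetsByEachUser.size() / hours).rename("frequency")

# Attach each user's value to every row belonging to that user.
twitterDataFrame = twitterDataFrame.join(frequency, on="UserID")
```

If you would rather not rebind the frame, twitterDataFrame['frequency'] = twitterDataFrame['UserID'].map(frequency) should give the same column. Note that a user with a single tweet has a zero-hour span, so their frequency comes out as inf.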
Upvotes: 2