jjjayn
jjjayn

Reputation: 555

Python & Pandas: a proper way to fetch data from a dataframe and create a new one

This issue arises quite often and my current method isn't the optimal one.

eg.

I've a dataframe (df1) that contains some data:

date          tweet                    sent_by

2017/01/01    'OMG @user2 @POTUS'      user1
2017/01/01    'THIS IS A TWEET!'       user17
......        .....                    ......

I've a list of users:

user_list = ['user1','user2', ...]

Now I'd like to go through df1 and check whether any user (exluding the sender) is mentioned in the tweet and create a new dataframe (df2) that contains the relevant information.

My current method:

date_list = []
tweet_list = []
sent_by_list = []
user_mentioned_list = []

for i in range(len(df1)):
    for user in user_list:
        if user in df1['tweet'][i] and user != df1['sent_by'][i]:
            date_list.append(df1['date'][i])
            tweet_list.append(df1['tweet'][i])
            sent_by_list.append(df1['sent_by'][i])
            user_mentioned_list.append(user)

df2 = pd.DataFrame()
df2['date'] = date_list
df2['tweet'] = tweet_list
df2['sent_by'] = sent_by_list
df2['user_mentioned'] = user_mentioned_list

What is the best way to fetch the data and form a new dataframe?

Upvotes: 1

Views: 45

Answers (1)

jezrael
jezrael

Reputation: 863291

I think you need extract by list with | regex (or) and then compare with isin, last filter by boolean indexing:

print (df1)
         date                tweet sent_by
0  2017/01/01  'OMG @user2 @POTUS'   user1
1  2017/01/01  'OMG @user2 @POTUS'   user2
2  2017/01/01  'OMG @user2 @POTUS'   user4
3  2017/01/01  'OMG @user4 @POTUS'   user1
4  2017/01/01   'THIS IS A TWEET!'  user17

user_list = ['user1','user2']
pat = '(' + '|'.join(user_list) + ')'
df1['user_mentioned'] = df1['tweet'].str.extract(pat, expand=False)
df2 = df1[df1['user_mentioned'].isin(user_list) & (df1['user_mentioned']!= df1['sent_by'])]
print (df2)
         date                tweet sent_by user_mentioned
0  2017/01/01  'OMG @user2 @POTUS'   user1          user2
2  2017/01/01  'OMG @user2 @POTUS'   user4          user2

Detail:

print (df1)
         date                tweet sent_by user_mentioned
0  2017/01/01  'OMG @user2 @POTUS'   user1          user2
1  2017/01/01  'OMG @user2 @POTUS'   user2          user2
2  2017/01/01  'OMG @user2 @POTUS'   user4          user2
3  2017/01/01  'OMG @user4 @POTUS'   user1            NaN
4  2017/01/01   'THIS IS A TWEET!'  user17            NaN

Upvotes: 1

Related Questions