Reputation: 2377
I'm trying to filter the usernames that are being referenced in a tweet like in the following example:
Example:
tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
the desired output will be:
rt_unames = ['uname1', 'uname6']
mt_unames = ['uname2', 'uname3', 'uname4', 'uname5']
I can loop over the string character by character, as in the naïve solution below:
Naïve Solution:
def find_end_idx(tw_part):
    # Offset of the first space, dot, or comma in tw_part;
    # falls back to len(tw_part) when a delimiter is absent.
    end_space_idx = len(tw_part)
    try:
        end_space_idx = tw_part.index(' ')
    except ValueError:
        pass
    end_dot_idx = len(tw_part)
    try:
        end_dot_idx = tw_part.index('.')
    except ValueError:
        pass
    end_comma_idx = len(tw_part)
    try:
        end_comma_idx = tw_part.index(',')
    except ValueError:
        pass
    return min(end_space_idx, end_dot_idx, end_comma_idx)

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
acc = ''
rt_unames = []
mt_unames = []
for i, c in enumerate(tw):
    acc += c
    if acc[-2:] == 'RT':
        start_idx = i + 2  # skip the space after 'RT'
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx + end_idx]
        if uname not in mt_unames:
            rt_unames.append(uname)
        acc = ''
    elif acc[-1:] == '@':
        start_idx = i
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx + end_idx]
        if uname not in rt_unames:
            mt_unames.append(uname)
        acc = ''
rt_unames, mt_unames
which outputs:
(['@uname1', '@uname6'], ['@uname2', '@uname3', '@uname4', '@uname5'])
Question:
As I need to apply it to every tweet in a pandas.DataFrame, I'm looking for a more elegant and faster solution to get this outcome.
I'd appreciate any suggestions.
Upvotes: 2
Views: 183
Reputation: 21
This is my second time, so I will try to make it as easy as possible.
tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
res = tw.replace(", ", " ").split()
final = []
k = "@"
for e in res:
    # Keep only the words that start with "@"
    if e.startswith(k):
        final.append(e)
stringe = str(final).replace(",", "")
stringe = stringe.replace("[", "")
stringe = stringe.replace("]", "")
stringe = stringe.replace("'", "")
print("Result is :", stringe)
From what I can see, you already know Python, so this example should only take you a moment to follow.
Here, I use the replace function to turn every comma followed by a space (", ") into a single space, and then the split function, which separates the words on whitespace. The result is stored in res. The loop then keeps only the words that start with "@".
In the next few lines, I use the replace function to strip the unwanted characters ("[", "]", "'" and ",") from the list's string representation.
Then, I simply print the result.
Hit me up at @Vishma Pratim Das on Twitter if you don't understand something.
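The snippet above collects every mention but does not separate retweeted users from plain mentions. As a rough sketch of how the same split-into-words idea could do that, the following remembers the previous word and checks whether it is a retweet marker. The function name classify_mentions and the rt_markers parameter are made up for illustration, and it assumes mentions are separated by commas or whitespace as in the question's sample.

```python
def classify_mentions(tw, rt_markers=("RT", "TR")):
    """Split @-mentions into retweeted users and plain mentions.

    Hypothetical helper: assumes words are separated by commas or whitespace.
    """
    tokens = tw.replace(",", " ").split()
    rt_unames, mt_unames = [], []
    prev = ""
    for tok in tokens:
        if tok.startswith("@"):
            name = tok[1:]  # drop the leading "@"
            # A mention directly preceded by a retweet marker counts as a retweet.
            if prev in rt_markers:
                rt_unames.append(name)
            else:
                mt_unames.append(name)
        prev = tok
    return rt_unames, mt_unames


tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
print(classify_mentions(tw))
# (['uname1', 'uname6'], ['uname2', 'uname3', 'uname4', 'uname5'])
```

Both "RT" and "TR" are treated as markers here only because the sample tweet in the question starts with "TR".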
Upvotes: 1
Reputation: 71687
Let's try re.findall with a regex pattern:
import re
rt_unames = re.findall(r'(?<=TR |RT )@([^,]+)', tw)
mt_unames = re.findall(r'(?<!TR |RT )@([^,]+)', tw)
Similarly, you can use the str.findall method on the DataFrame column:
df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )@([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )@([^,]+)')
Result:
['uname1', 'uname6']
['uname2', 'uname3', 'uname4', 'uname5']
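One thing to watch with [^,]+: it captures everything up to the next comma, so a mention followed by a space and more text would drag that text into the result. A possible variation, assuming usernames consist only of letters, digits and underscores, is to capture \w+ instead. The tweet below is an invented example, not taken from the question.

```python
import re

# Same lookbehinds as above, but the capture stops at the first non-word character.
RT_PATTERN = re.compile(r'(?<=TR |RT )@(\w+)')
MT_PATTERN = re.compile(r'(?<!TR |RT )@(\w+)')

# Invented tweet in which mentions are not always followed by a comma.
tw = 'RT @uname1 hello @uname2, thanks. RT @uname3'

greedy = re.findall(r'(?<=TR |RT )@([^,]+)', tw)
rt = RT_PATTERN.findall(tw)
mt = MT_PATTERN.findall(tw)

print(greedy)  # ['uname1 hello @uname2', 'uname3']
print(rt)      # ['uname1', 'uname3']
print(mt)      # ['uname2']
```

The lookbehind works with the alternation because both alternatives ("TR " and "RT ") have the same fixed width, which Python's re module requires.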
Upvotes: 2
Reputation: 5757
You can use regex patterns with the apply function on the tweet column of your DataFrame:
import pandas as pd
import re
# Retweet marker followed by a username; the group is non-capturing so that
# findall returns whole matches (e.g. 'TR @uname1') instead of tuples
pattern1 = r"(?:RT|TR)\s+@[^,]+"
# Every @username, including the retweeted ones
pattern2 = r"@[^,]+"
df = pd.DataFrame(['TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'], columns=['Tweet'])
df['group1'] = df.Tweet.apply(lambda x: re.findall(pattern1, x))
df['group2'] = df.Tweet.apply(lambda x: re.findall(pattern2, x))
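If you want both lists from a single pass over each tweet, one option is a pattern with an optional retweet marker in front of the mention, wrapped in a function that apply can call. This is only a sketch: extract_mentions is a hypothetical name, and it assumes usernames match \w+ and that "RT" or "TR" appears as a separate word before a retweeted user.

```python
import re

# Optional marker (captured in group 1) followed by the username (group 2).
MENTION = re.compile(r'(?:\b(RT|TR)\s+)?@(\w+)')


def extract_mentions(tweet):
    """Return (rt_unames, mt_unames) for one tweet (hypothetical helper)."""
    rt_unames, mt_unames = [], []
    for m in MENTION.finditer(tweet):
        marker, name = m.groups()
        # marker is None when the mention is not preceded by RT/TR.
        if marker:
            rt_unames.append(name)
        else:
            mt_unames.append(name)
    return rt_unames, mt_unames


tweets = [
    'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6',
    'no mentions here',
]
results = [extract_mentions(t) for t in tweets]
print(results[0])
# (['uname1', 'uname6'], ['uname2', 'uname3', 'uname4', 'uname5'])
```

On a DataFrame, df['Tweet'].apply(extract_mentions) would then give one (rt_unames, mt_unames) tuple per row.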
Upvotes: 1
Reputation: 1506
An alternative approach using filters and list comprehension.
import re

def your_func_name(tw):
    tw_list = [x.strip() for x in tw.split(",")]
    rt_unames_raw = filter(lambda x: "@" in x and x.startswith("RT"), tw_list)
    mt_unames_raw = filter(lambda x: x.startswith("@"), tw_list)
    rt_unames = [re.sub(r"RT|@", "", uname).strip() for uname in rt_unames_raw]
    mt_unames = [re.sub("@", "", uname).strip() for uname in mt_unames_raw]
    return rt_unames, mt_unames

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
your_func_name(tw=tw)
Upvotes: 1
Reputation: 192
If the format of input string is always the same, I would do it like this:
def parse_tags(str_tags):
    rts = []
    others = []
    for tag in [tag.strip() for tag in str_tags.split(',')]:
        if tag.startswith('RT'):
            rts.append(tag[3:])
        elif tag.startswith('@'):
            others.append(tag)
    return rts, others
Upvotes: 1