Michael

Reputation: 2377

Filter user names from a string

I'm trying to filter the usernames that are being referenced in a tweet like in the following example:

Example:

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

The desired output is:

rt_unames = ['uname1', 'uname6']
mt_unames = ['uname2', 'uname3', 'uname4', 'uname5']

I could loop over the string character by character, as in the naïve solution below:

Naïve Solution:

def find_end_idx(tw_part):
    # Offset of the first delimiter (space, period, or comma) in tw_part,
    # or len(tw_part) if none of them occurs.
    end_space_idx = len(tw_part)
    try:
        end_space_idx = tw_part.index(' ')
    except ValueError:
        pass
    end_dot_idx = len(tw_part)
    try:
        end_dot_idx = tw_part.index('.')
    except ValueError:
        pass
    end_comma_idx = len(tw_part)
    try:
        end_comma_idx = tw_part.index(',')
    except ValueError:
        pass
    return min(end_space_idx, end_dot_idx, end_comma_idx)

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
acc = ''
rt_unames = []
mt_unames = []
for i, c in enumerate(tw):
    acc += c
    if acc.endswith('RT'):
        start_idx = i+2
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx+end_idx]
        if uname not in mt_unames:
            rt_unames.append(uname)
        acc = ''
    elif acc.endswith('@'):
        start_idx = i
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx+end_idx]
        if uname not in rt_unames:
            mt_unames.append(uname)
        acc = ''
rt_unames, mt_unames   

which outputs:

(['@uname1', '@uname6'], ['@uname2', '@uname3', '@uname4', '@uname5'])

Question: Since I need to apply this to every tweet in a pandas.DataFrame, I'm looking for a faster, more elegant way to get this result.

I'd appreciate any suggestions.

Upvotes: 2

Views: 183

Answers (5)

VISHMA PRATIM DAS

Reputation: 21

This is my second time, so I will try to make it as easy as possible.

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

res = tw.replace(", ", " ").split()

final = []
k = "@"
for e in res:
    if e.startswith(k):  # keep only the words that begin with "@"
        final.append(e)
stringe = str(final).replace(",", "")
stringe = stringe.replace("[", "")
stringe = stringe.replace("]", "")
stringe = stringe.replace("'", "")
print("Result is :", stringe)

From what I can see, you already know Python, so this example should take you only a little while to follow.

Here, I use the replace function to turn every comma followed by a space (", ") into a single space, and then the split function, which separates the words on whitespace. The result is stored in res. The loop then keeps only the words that start with "@".

In the next few lines, I use the replace function to strip the unwanted characters (the commas, "[", "]" and "'") from the string form of the list.

Then, I simply print the result.
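The split-based approach above collects every @-word into one list. A minimal sketch of how the same idea could be extended to separate retweet names from plain mentions, by remembering the previous word, is given below. The function name split_mentions is made up for illustration, and treating both "RT" and "TR" as retweet markers is an assumption taken from the example tweet.

```python
def split_mentions(tweet):
    """Classify @-words by whether the word right before them is RT (or TR)."""
    # Same tokenisation idea as above: commas become spaces, then split on whitespace.
    tokens = tweet.replace(",", " ").split()
    rt_unames, mt_unames = [], []
    prev = ""
    for token in tokens:
        if token.startswith("@"):
            name = token[1:]  # drop the leading "@"
            if prev in ("RT", "TR"):  # assumption: both spellings mark a retweet
                rt_unames.append(name)
            else:
                mt_unames.append(name)
        prev = token
    return rt_unames, mt_unames


tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
print(split_mentions(tw))
# (['uname1', 'uname6'], ['uname2', 'uname3', 'uname4', 'uname5'])
```

This only handles commas and whitespace as separators; other punctuation attached to a name would stay part of it.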

Hit me up at @Vishma Pratim Das on Twitter if you don't understand something.

Upvotes: 1

Shubham Sharma

Reputation: 71687

Let's try re.findall with a regex pattern:

import re

rt_unames = re.findall(r'(?<=TR |RT )@([^,]+)', tw)
mt_unames = re.findall(r'(?<!TR |RT )@([^,]+)', tw)

Similarly, you can use the str.findall method on the DataFrame column:

df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )@([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )@([^,]+)')

Result:

['uname1', 'uname6']
['uname2', 'uname3', 'uname4', 'uname5']
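The patterns above rely on a lookbehind: (?<=TR |RT ) only matches an "@" that directly follows one of the markers, and (?<!TR |RT ) only matches one that does not. Because [^,]+ runs up to the next comma, a tweet without commas between names would be captured as one long match. The sketch below is a variation, not the pattern from this answer: it assumes usernames consist only of word characters (\w), and the names RT_PATTERN, MT_PATTERN and extract_unames are made up for illustration.

```python
import re

# Assumption: a username is a run of letters, digits, or underscores.
RT_PATTERN = re.compile(r'(?<=TR |RT )@(\w+)')  # "@" directly preceded by a marker
MT_PATTERN = re.compile(r'(?<!TR |RT )@(\w+)')  # "@" not preceded by a marker


def extract_unames(tweet):
    """Return (retweeted names, mentioned names) for a single tweet."""
    return RT_PATTERN.findall(tweet), MT_PATTERN.findall(tweet)


tw = 'RT @uname1: hello @uname2 and @uname3, text RT @uname4.'
print(extract_unames(tw))
# (['uname1', 'uname4'], ['uname2', 'uname3'])
```

Python requires every alternative inside a lookbehind to have the same width, which is why both markers are written with exactly one trailing space. The same pattern strings should also work with str.findall on a column, as shown above.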

Upvotes: 2

abhilb

Reputation: 5757

You can use regex patterns with the apply function on the tweet column of your DataFrame:

import pandas as pd
import re

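# Names directly preceded by RT or TR; the capture group returns the bare name
pattern1 = r"(?:RT|TR)\s+@([^,]+)"
# Names not preceded by RT or TR (a lookbehind must have a fixed width)
pattern2 = r"(?<!RT\s)(?<!TR\s)@([^,]+)"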

df = pd.DataFrame(['TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'], columns=['Tweet'])

df['group1'] = df.Tweet.apply(lambda x: re.findall(pattern1, x))
df['group2'] = df.Tweet.apply(lambda x: re.findall(pattern2, x))
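Since apply calls a Python function once per row, both lists can also be collected in a single pass over each tweet instead of running two separate searches. The following is only a sketch under assumptions: usernames are taken to be word characters, and the names MENTION and classify_mentions are hypothetical, not part of this answer.

```python
import re

# One pattern for both cases: an optional RT/TR marker, then "@name".
# Group 1 is the marker (None if absent); group 2 is the username.
MENTION = re.compile(r'(?:\b(RT|TR)\s+)?@(\w+)')


def classify_mentions(tweet):
    """Return (rt_unames, mt_unames) using a single scan of the tweet."""
    rt_unames, mt_unames = [], []
    for match in MENTION.finditer(tweet):
        marker, name = match.groups()
        if marker:
            rt_unames.append(name)
        else:
            mt_unames.append(name)
    return rt_unames, mt_unames


tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
print(classify_mentions(tw))
# (['uname1', 'uname6'], ['uname2', 'uname3', 'uname4', 'uname5'])
```

With the DataFrame above, df.Tweet.apply(classify_mentions) would give one (rt_unames, mt_unames) tuple per row. Whether this is actually faster than two findall calls depends on the data, so it is worth timing on a sample first.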

Upvotes: 1

arabinelli

Reputation: 1506

An alternative approach using filters and list comprehension.

import re 

def your_func_name(tw):
    tw_list = [x.strip() for x in tw.split(",")]
    rt_unames_raw = filter(lambda x: "@" in x and x.startswith("RT"),tw_list)
    mt_unames_raw = filter(lambda x: x.startswith("@"),tw_list)
    rt_unames = [re.sub(r"RT|@","",uname).strip() for uname in rt_unames_raw]
    mt_unames = [re.sub("@","",uname).strip() for uname in mt_unames_raw]
    return rt_unames, mt_unames

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
your_func_name(tw=tw)

Upvotes: 1

hidden

Reputation: 192

If the format of input string is always the same, I would do it like this:

def parse_tags(str_tags):
    rts = []
    others = []
    for tag in [tag.strip() for tag in str_tags.split(',')]:
        if tag.startswith('RT'):
            rts.append(tag[3:])
        elif tag.startswith('@'):
            others.append(tag)

    return rts, others

Upvotes: 1
