Reputation: 2377

Filter user names from a string

I'm trying to extract the usernames referenced in a tweet, as in the following example:

Example:

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

The desired output is:

rt_unames = ['uname1', 'uname6']
mt_unames = ['uname2', 'uname3', 'uname4', 'uname5']

I can loop over the string character by character, as in the naïve solution below:

Naïve Solution:

def find_end_idx(tw_part):
    # Index of the first delimiter (space, dot or comma) in tw_part,
    # or len(tw_part) when no delimiter is present.
    end_idx = len(tw_part)
    for delim in (' ', '.', ','):
        idx = tw_part.find(delim)
        if idx != -1:
            end_idx = min(end_idx, idx)
    return end_idx

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
acc = ''
rt_unames = []
mt_unames = []
for i, c in enumerate(tw):
    acc += c
    if acc.endswith('RT'):
        start_idx = i+2
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx+end_idx]
        if uname not in mt_unames:
            rt_unames.append(uname)
        acc = ''
    elif acc.endswith('@'):
        start_idx = i
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx+end_idx]
        if uname not in rt_unames:
            mt_unames.append(uname)
        acc = ''
rt_unames, mt_unames

which outputs:

(['@uname1', '@uname6'], ['@uname2', '@uname3', '@uname4', '@uname5'])

Question: As I need to apply this to every tweet in a pandas.DataFrame, I'm looking for a more elegant and faster way to get this result.

I'd appreciate any suggestions.

Upvotes: 2

Answers (5)

VISHMA PRATIM DAS

Reputation: 21

This is my second time, so I will try to make it as easy as possible.

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

res = tw.replace(", ", " ").split()

final = []
k = "@"
for e in res:
    if e.startswith(k):  # keep only the words that begin with "@"
        final.append(e)
# Strip the list punctuation from the string form of the list.
stringe = str(final).replace(",", "")
stringe = stringe.replace("[", "")
stringe = stringe.replace("]", "")
stringe = stringe.replace("'", "")
print("Result is :", stringe)

From what I can see, you already know Python, so this example should only take you a moment.

Here, I use the replace function to replace every comma followed by a space (", ") with a single space, and then the split function, which separates the words on whitespace. The result is stored in res. The loop then keeps only the words that start with "@".

In the next few lines, I use the replace function to strip the unwanted characters "[", "]", "'" and "," from the string form of the list.

Then, I simply print the result.
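
The same replace, split and filter steps can be condensed a little. This is only a sketch: the names words, mentions and result are placeholders that do not appear in the answer, and, like the code above, it collects every "@" word without separating retweets from mentions.

```python
tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

# Turn the ", " separators into spaces, then split on whitespace.
words = tw.replace(", ", " ").split()

# Keep only the words that start with "@".
mentions = [w for w in words if w.startswith("@")]

# Join the words directly instead of cleaning up the str() form of a list.
result = " ".join(mentions)
print("Result is :", result)
```

Joining the list with " ".join avoids the four replace calls needed to remove the brackets, quotes and commas.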

Hit me up at @Vishma Pratim Das on twitter if you don't understand something

Upvotes: 1

Shubham Sharma

Reputation: 71687

Let's try re.findall with a regex pattern:

import re

rt_unames = re.findall(r'(?<=TR |RT )@([^,]+)', tw)
mt_unames = re.findall(r'(?<!TR |RT )@([^,]+)', tw)

Similarly, you can use the str.findall method on the DataFrame column:

df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )@([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )@([^,]+)')

Result:

['uname1', 'uname6']
['uname2', 'uname3', 'uname4', 'uname5']
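
If the same lookbehind patterns are needed outside pandas, or through Series.apply, they could be wrapped in a small function. This is a sketch rather than part of the answer above: split_mentions, RT_PATTERN and MT_PATTERN are made-up names, and \w+ is used in place of [^,]+ on the assumption that usernames consist only of letters, digits and underscores, so a mention followed by a space or other punctuation is still cut at the end of the name.

```python
import re

# The lookbehind only checks the three characters before "@";
# they are not part of the match, so the group holds just the name.
RT_PATTERN = re.compile(r'(?<=TR |RT )@(\w+)')
MT_PATTERN = re.compile(r'(?<!TR |RT )@(\w+)')

def split_mentions(tweet):
    """Return (retweeted names, mentioned names) for one tweet string."""
    return RT_PATTERN.findall(tweet), MT_PATTERN.findall(tweet)

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
rt_unames, mt_unames = split_mentions(tw)
print(rt_unames)
print(mt_unames)

# Assumed usage on a DataFrame column named 'tweet':
# df['tweet'].apply(split_mentions)
```

Python's re module requires a lookbehind to have a fixed width, which is why both alternatives ('TR ' and 'RT ') are exactly three characters long.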

Upvotes: 2

abhilb

Reputation: 5757

You can use regex patterns with the apply function on the tweet column of your DataFrame:

import pandas as pd
import re

pattern1 = r"(RT\s+@[^\,]+)|(TR\s+@[^\,]+)"
pattern2 = r"@[^\,]+"

df = pd.DataFrame(['TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'], columns=['Tweet'])

df['group1'] = df.Tweet.apply(lambda x: re.findall(pattern1, x))
df['group2'] = df.Tweet.apply(lambda x: re.findall(pattern2, x))
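
One thing to be aware of: when a pattern contains more than one capture group, re.findall returns a list of tuples, and pattern2 also matches the usernames that follow RT or TR. The sketch below shows that behaviour with plain re and one possible variant that returns bare names; the variable names grouped and all_unames are made up for illustration. Like the naïve solution in the question, it drops a name from the mentions if it also appears as a retweet.

```python
import re

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

# Two capture groups: findall returns one tuple per match, with ''
# for the group that did not take part in the match.
grouped = re.findall(r"(RT\s+@[^\,]+)|(TR\s+@[^\,]+)", tw)
print(grouped)  # [('', 'TR @uname1'), ('RT @uname6', '')]

# One capture group around the name only: findall returns plain strings.
rt_unames = re.findall(r"(?:RT|TR)\s+@([^,]+)", tw)
all_unames = re.findall(r"@([^,]+)", tw)
mt_unames = [name for name in all_unames if name not in rt_unames]
print(rt_unames)
print(mt_unames)
```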

Upvotes: 1

arabinelli

Reputation: 1506

An alternative approach using filters and list comprehension.

import re

def your_func_name(tw):
    tw_list = [x.strip() for x in tw.split(",")]
    rt_unames_raw = filter(lambda x: "@" in x and x.startswith("RT"), tw_list)
    mt_unames_raw = filter(lambda x: x.startswith("@"), tw_list)
    # Remove "RT" only as a prefix, so usernames containing "RT" stay intact.
    rt_unames = [re.sub(r"^RT\s*|@", "", uname).strip() for uname in rt_unames_raw]
    mt_unames = [re.sub("@", "", uname).strip() for uname in mt_unames_raw]
    return rt_unames, mt_unames

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
your_func_name(tw=tw)

Upvotes: 1

hidden

Reputation: 192

If the format of input string is always the same, I would do it like this:

def parse_tags(str_tags):
    rts = []
    others = []
    for tag in [tag.strip() for tag in str_tags.split(',')]:
        if tag.startswith('RT'):
            rts.append(tag[3:])
        elif tag.startswith('@'):
            others.append(tag)

    return rts, others
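
As written, the function keeps the leading "@" on each name and only recognises the "RT" prefix. A possible variant that strips the "@" and also accepts "TR" is sketched below; the rt_markers parameter is hypothetical and not part of the answer above.

```python
def parse_tags(str_tags, rt_markers=('RT', 'TR')):
    # rt_markers is an assumed parameter listing the prefixes treated as retweets.
    rts, others = [], []
    for tag in (part.strip() for part in str_tags.split(',')):
        # Split "RT @name" into the marker and the remainder.
        marker, _, rest = tag.partition(' ')
        rest = rest.strip()
        if marker in rt_markers and rest.startswith('@'):
            rts.append(rest[1:])    # drop the leading "@"
        elif tag.startswith('@'):
            others.append(tag[1:])  # drop the leading "@"
    return rts, others

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
rt_unames, mt_unames = parse_tags(tw)
print(rt_unames)
print(mt_unames)
```

Like the original, this still assumes that the items are separated by commas.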

Upvotes: 1
