Removing rows from a DataFrame based on words in a string

Question

Novice programmer here seeking help. I have a DataFrame that looks like this:

           Current
0  "Invest in $APPL, $FB and $AMZN"      
1  "Long $AAPL, Short $AMZN"              
2  "$AAPL earnings announcement soon"             
3  "$FB is releasing a new product. Will $FB's product be good?"
4  "$Fb doing good today"
5  "$AMZN high today. Will $amzn continue like this?"

I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]

Basically, I want to go through all the rows in this column of the DataFrame, keep the rows that mention only one distinct cashtag (regardless of whether it is in caps or not), and delete all others. Desired Output:

           Desired
2  "$AAPL earnings announcement soon"             
3  "$FB is releasing a new product. Will $FB's product be good?"
4  "$Fb doing good today"
5  "$AMZN high today. Will $amzn continue like this?"

I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.

for i in range(len(df)):  # len(df) - 1 would skip the last row
    print(i, end="\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df.loc[i, "Word_count"] = count  # .loc avoids chained assignment

However, if I do this I will be deleting rows that I want to keep, for example rows where a single cashtag is mentioned several times (rows 3 and 5).

How can I achieve my desired output?

tomjn · Accepted Answer

If you ever want to generalise your question to any tag, then this is a good place for a regular expression. You want to match against (\$\w+)(?!.*\1). The general structure is:

\$\w+: find a dollar sign followed by one or more letters/numbers (or an _). If you just wanted to count how many tags you had, this is all you need

e.g.

df.Current.str.count(r'\$\w+')

returns the number of tags in each row (3, 2, 1, 2, 1, 2 for rows 0 to 5 above),

but filtering on this count would also remove rows where the same tag appears more than once, so you need to add a negative lookahead:

(?!.*\1): a negative lookahead, which means don't match if the same tag appears again later in the string. As a result, only the last occurrence of each tag is counted.
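To see the lookahead at work outside pandas, here is a minimal sketch using Python's built-in re module on two of the example strings. The helper name last_occurrences is made up for illustration and is not part of any library.

```python
import re

# Pattern from the answer: a cashtag that is not repeated later in the string.
PATTERN = re.compile(r'(\$\w+)(?!.*\1)', re.I)

def last_occurrences(text):
    """Return the tags that survive the negative lookahead (hypothetical helper)."""
    return PATTERN.findall(text)

# A repeated tag is matched only at its last occurrence, so it counts once.
repeated = last_occurrences("$AMZN high today. Will $amzn continue like this?")

# Two different tags are each their own last occurrence, so both match.
distinct = last_occurrences("Long $AAPL, Short $AMZN")

print(repeated)  # ['$amzn']
print(distinct)  # ['$AAPL', '$AMZN']
```

The number of matches is therefore the number of distinct tags in the string, which is what the filter needs.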

Using this, you can then use the pandas Series.str methods, specifically Series.str.count (the re.I flag makes the match case-insensitive)

import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]

which will give you your desired output

                                             Current
2                   $AAPL earnings announcement soon
3  $FB is releasing a new product. Will $FB's pro...
4                               $Fb doing good today
5   $AMZN high today. Will $amzn continue like this?
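For anyone who wants to reproduce this end to end, the following is a self-contained sketch that rebuilds the example frame from the question and applies the same filter. The variable names (distinct_tags, result, strict) are illustrative, and the stricter pattern at the end is an assumption added here, not part of the answer above.

```python
import re
import pandas as pd

# Rebuild the example frame from the question (strings copied as shown there).
df = pd.DataFrame({"Current": [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]})

# Count only the last occurrence of each tag, ignoring case.
distinct_tags = df["Current"].str.count(r'(\$\w+)(?!.*\1)', flags=re.I)

# Keep the rows that mention exactly one distinct tag.
result = df[distinct_tags == 1]
print(result.index.tolist())  # [2, 3, 4, 5]

# Hypothetical stricter variant: the \b anchors stop a short tag such as $F
# from being treated as a repeat of a longer tag such as $FB.
strict = pd.Series(["$F up, $FB down"]).str.count(r'(\$\w+)\b(?!.*\1\b)', flags=re.I)
print(strict.tolist())  # [2]
```

One caveat with the original pattern: the lookahead looks for the captured text anywhere later in the string, so a tag that is a prefix of a later tag would be counted as a repeat. The word-boundary variant is one possible way around that if your data contains such tags.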
