user1906856

Reputation: 151

How to remove user mentions and urls in a tweet/string using python

So I want to remove all user mentions and urls in a tweet/string.

For example, if I have a tweet like this:

@username1: some tweet here, http://www.url.com, aaaaa @username2

I want to get something like this:

some tweet here, aaaaa

I want to use a regular expression, but I'm really new to Python and don't know how to do it.

Also, tweets are stored in a JSON file (a list of dictionaries), and each tweet (a dictionary) has a key called "entities" which stores information about "user_mentions", "urls", and "hashtags" in a format like the following:

{u'user_mentions': [{u'indices': [3, 18],
                     u'screen_name': u'username1',
                     u'id': 1234567,
                     u'name': u'user name 1',
                     u'id_str': u'1234567'},

                    {u'indices': [108, 116],
                     u'screen_name': u'username2',
                     u'id': 112233,
                     u'name': u'user name 2',
                     u'id_str': u'112233'}],

 u'hashtags': [],
 u'urls': [{u'url': u'http://www.url.com',
            u'indices': [83, 103],
            u'expanded_url': u'http://www.url.com',
            u'display_url': u'http://www.url.com'}]
}

Does anyone know how to remove user mentions and urls? Thanks so much!

Upvotes: 3

Views: 18605

Answers (5)

Amir Mostafa Ahmed

Reputation: 31

import re
text = "@username1: some tweet here, http://www.url.com, aaaaa @username2"
clean_text = re.sub(r'@\w+', '', text)

The output (mentions removed, URL left in place) will be:

: some tweet here, http://www.url.com, aaaaa
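
To also drop the URL and the leftover colon, the same idea can be extended. This is only a minimal sketch: it assumes mentions may be followed by a colon, that URLs start with http:// or https://, and that any punctuation glued to the end of a URL can go with it. The names no_mentions and no_urls are made up for illustration.

```python
import re

text = "@username1: some tweet here, http://www.url.com, aaaaa @username2"

# Remove @mentions together with an optional trailing colon.
no_mentions = re.sub(r"@\w+:?", "", text)
# Remove http(s) URLs; \S+ also swallows punctuation attached to the URL.
no_urls = re.sub(r"https?://\S+", "", no_mentions)
# Collapse the whitespace left behind by the removals.
clean_text = " ".join(no_urls.split())
print(clean_text)
```

For the sample tweet this should print the string the question asks for.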

Upvotes: 3

nimbous

Reputation: 1527

You can combine this into a one-liner as well, but here is a breakdown of the steps:

import re
text = '@username1: some tweet here, http://www.url.com, aaaaa @username2'
# Drop anything that starts with @, http://, https://, or www. up to the next whitespace
processed_text = re.sub(r"(?:@|https?://|www\.)\S+", "", text)
processed_text = " ".join(processed_text.split())
print(processed_text)

Output:

some tweet here, aaaaa

Upvotes: 3

Vishal

Reputation: 143

I think for the first answer it should read "entities" not "entries". Also, don't forget urls within media if you are trying to exclude that as well.

https://dev.twitter.com/overview/api/entities-in-twitter-objects

For Python 3, stripping out media urls as well:

    from itertools import chain
    from functools import reduce

    result = []
    for text, entities in ((t["text"], t["entities"]) for t in user_timeline):
        urls = (e["url"] for e in entities["urls"])
        users = ("@" + e["screen_name"] for e in entities["user_mentions"])
        media_urls = ()
        if 'media' in entities:
            media_urls = (e["url"] for e in entities["media"])
        text = reduce(lambda t, s: t.replace(s, ""), chain(urls, media_urls, users), text)
        result.append(text)

Upvotes: 2

user3759098

Reputation:

First off, I hope you are able to access the tweets:

import json
import glob

for filename in glob.glob('*.json'):
    with open("plain text - preprocess.txt", 'a') as outfile, open(filename, 'r') as f:
        for line in f:
            if line == '\n':
                continue  # skip blank lines between tweets
            tweet = json.loads(line)
            # Now do something with tweet['text']

Use a regex to remove unwanted hashtags or http links from the tweet. Here's how I did it:

import re
stringwithouthash = re.sub(r'#\w+ ?', '', tweet['text'])
stringwithoutlink = re.sub(r'http\S+', '', tweet['text'])

\S matches any character except whitespace.

\w matches letters (A-Z, a-z), digits (0-9), and the underscore.
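
A quick way to see the difference between the two classes is to run them against a made-up string. The sample text below is hypothetical and only there to show where each pattern stops matching.

```python
import re

# Hypothetical sample text, not taken from a real tweet.
sample = "check #py_thon3! http://t.co/abc?x=1 done"

# \w+ stops at the first character that is not a letter, digit, or underscore.
hashtags = re.findall(r"#\w+", sample)
# \S+ keeps going until the next whitespace character.
links = re.findall(r"http\S+", sample)

print(hashtags)
print(links)
```

The hashtag match ends before the exclamation mark, while the link match runs all the way to the next space.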

Refer to this link for more on regex.

Upvotes: 1

jfs

Reputation: 414139

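import re
from functools import reduce  # reduce is not a builtin in Python 3
from itertools import chain

result = []
# The key is "entities", as shown in the question
for text, entries in ((t["text"], t["entities"]) for t in tweets):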
    urls = (e["url"] for e in entries["urls"])
    users = ("@"+e["screen_name"] for e in entries["user_mentions"])
    text = reduce(lambda t,s: t.replace(s, ""), chain(urls, users), text)
    result.append(text)

Or using a regex (it also removes trailing non-whitespace characters):

text = re.sub(r"(?:\@|https?\://)\S+", "", text)

Or a combination of the two methods:

text = re.sub(r"(?:%s)\S*" % "|".join(map(re.escape, chain(urls, users))), "", text)
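
For reference, here is one way the combined approach might look as a self-contained script. Treat it as a sketch under assumptions: the tweet dictionary is a hypothetical one shaped like the question's example, strip_entities is a made-up helper name, and lists are used instead of generators so the values can be reused. It returns the text unchanged when there are no entities, because an empty alternation would otherwise match every run of non-whitespace.

```python
import re
from itertools import chain

# Hypothetical tweet shaped like the example in the question.
tweet = {
    "text": "@username1: some tweet here, http://www.url.com, aaaaa @username2",
    "entities": {
        "user_mentions": [{"screen_name": "username1"},
                          {"screen_name": "username2"}],
        "urls": [{"url": "http://www.url.com"}],
        "hashtags": [],
    },
}


def strip_entities(tweet):
    """Remove the URLs and @mentions listed in tweet['entities'] from the text."""
    entities = tweet["entities"]
    urls = [e["url"] for e in entities["urls"]]
    users = ["@" + e["screen_name"] for e in entities["user_mentions"]]
    targets = list(chain(urls, users))
    if not targets:
        # Nothing to remove; avoid building an empty pattern.
        return tweet["text"]
    # Match any listed entity plus whatever non-whitespace is glued to it.
    pattern = r"(?:%s)\S*" % "|".join(map(re.escape, targets))
    # Remove the matches, then collapse the leftover whitespace.
    return " ".join(re.sub(pattern, "", tweet["text"]).split())


cleaned = strip_entities(tweet)
print(cleaned)
```

Calling strip_entities on each dictionary loaded from the JSON file would give a list of cleaned texts.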

Upvotes: 12
