Reputation: 151
So I want to remove all user mentions and urls in a tweet/string.
For example, if I have a tweet like this:
@username1: some tweet here,, aaaaa @username2
I want to get something like this:
some tweet here, aaaaa
I want to use a regular expression, but I'm really new to Python and don't know how to do it.
Also, tweets are stored in a JSON file (a list of dictionaries), and each tweet (a dictionary) has a key called "entities" which stores information about "user_mentions", "urls", and "hashtags" in a format like the following:
{u'user_mentions': [{u'indices': [3, 18],
                     u'screen_name': u'username1',
                     u'id': 1234567,
                     u'name': u'user name 1',
                     u'id_str': u'1234567'},
                    {u'indices': [108, 116],
                     u'screen_name': u'username2',
                     u'id': 112233,
                     u'name': u'user name 2',
                     u'id_str': u'112233'}],
 u'hashtags': [],
 u'urls': [{u'url': u'',
            u'indices': [83, 103],
            u'expanded_url': u'',
            u'display_url': u''}]}
Does anyone know how to remove user mentions and urls? Thanks so much!
Upvotes: 3
Views: 18605
Reputation: 31
import re
text = "@username1: some tweet here,, aaaaa @username2"
# Remove every "@" followed by word characters (letters, digits, underscore)
clean_text = re.sub(r'@\w+', '', text)
The output will be:
: some tweet here,, aaaaa
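As the output shows, the bare `@\w+` pattern leaves the colon and some stray whitespace behind. Here is a minimal sketch of one way to tidy that up, assuming you also want to drop a colon that directly follows a mention. The helper name `strip_mentions` is made up for illustration.

```python
import re

def strip_mentions(text):
    # Remove "@name" plus an optional colon directly after it.
    without_mentions = re.sub(r'@\w+:?', '', text)
    # Collapse the whitespace runs left behind and trim both ends.
    return ' '.join(without_mentions.split())

print(strip_mentions("@username1: some tweet here,, aaaaa @username2"))
```

Note that this does not touch the doubled comma; that would need its own rule.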
Upvotes: 3
Reputation: 1527
You can combine this into a one-liner as well, but here is a breakdown of the steps:
import re
text = '@username1: some tweet here,, aaaaa @username2'
# Remove anything that starts with "@", "http://", "https://" or "www", up to the next whitespace
processed_text = re.sub(r"(?:@|https?://|www)\S+", "", text)
# Collapse the leftover whitespace
processed_text = " ".join(processed_text.split())
The result is:
some tweet here,, aaaaa
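The sample text above contains no URL, so here is a small sketch that exercises the URL branches too. The function name `clean_tweet` and the sample strings are invented for illustration. The pattern is a simplified form of the one above and may need tuning for real data.

```python
import re

# "@", "http://", "https://" or "www", followed by everything up to the next whitespace
PATTERN = re.compile(r"(?:@|https?://|www)\S+")

def clean_tweet(text):
    # Delete the matches, then collapse the leftover whitespace.
    return " ".join(PATTERN.sub("", text).split())

print(clean_tweet("@alice check https://example.com/page and www.example.org now"))
# The pattern is not anchored to word starts, so it also cuts into e-mail addresses:
print(clean_tweet("mail bob@example.com"))
```

If e-mail addresses matter in your data, one option is to require whitespace or start-of-string before the `@`.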
Upvotes: 3
Reputation: 143
Note that the tweet key is "entities", not "entries". Also, don't forget URLs within media if you are trying to exclude those as well.
For Python 3, stripping out media URLs as well:
from itertools import chain
from functools import reduce

# user_timeline is assumed to be the list of tweet dictionaries loaded from the JSON file
result = []
for text, entities in ((t["text"], t["entities"]) for t in user_timeline):
    urls = (e["url"] for e in entities["urls"])
    users = ("@" + e["screen_name"] for e in entities["user_mentions"])
    media_urls = ()
    if 'media' in entities:
        media_urls = (e["url"] for e in entities["media"])
    # Replace each URL and mention with the empty string, one after another
    text = reduce(lambda t, s: t.replace(s, ""), chain(urls, media_urls, users), text)
    result.append(text)
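Since each entity in the question's data also carries an "indices" pair, another option is to cut the text by position instead of by string replacement. The following is only a sketch: it assumes the indices are start/end offsets that line up with Python string indexing, and the function name `remove_entity_spans` and the sample tweet are made up for illustration.

```python
def remove_entity_spans(text, entities):
    # Collect (start, end) offsets for every mention, URL and media entity.
    spans = []
    for key in ("user_mentions", "urls", "media"):
        for entity in entities.get(key, []):
            start, end = entity["indices"]
            spans.append((start, end))
    # Cut from the end of the text first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    # Collapse the whitespace left behind.
    return " ".join(text.split())

# Hypothetical tweet, shaped like the structure shown in the question.
tweet = {
    "text": "RT @username1: see http://t.co/abc now",
    "entities": {
        "user_mentions": [{"screen_name": "username1", "indices": [3, 13]}],
        "urls": [{"url": "http://t.co/abc", "indices": [19, 34]}],
    },
}
print(remove_entity_spans(tweet["text"], tweet["entities"]))
```

Punctuation next to a mention (the colon here) is not part of the entity, so it stays in the text.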
Upvotes: 2
First off, I hope you are able to access the tweets:
import json
import glob

for filename in glob.glob('*.json'):
    with open("plain text - preprocess.txt", 'a') as outfile, open(filename, 'r') as f:
        for line in f:
            if line != '\n':  # skip blank lines, parse everything else
                tweet = json.loads(line)
                ### NOW DO SOMETHING WITH tweet['text']
Use regex to remove unwanted hashtags or http links within the tweet. Here's how I did it:
import re
stringwithouthash = re.sub(r'#\w+ ?', '', tweet['text'])
stringwithoutlink = re.sub(r'http\S+', '', tweet['text'])
\S matches any character except whitespace.
\w matches A-Z, a-z, 0-9 and the underscore (in Python 3 it also matches other Unicode word characters).
Refer to this link for more on regex.
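The two substitutions can also be chained so that a single string ends up with both hashtags and links removed. This is a small sketch on a made-up sample string. The function name `strip_hashtags_and_links` is not from any library.

```python
import re

def strip_hashtags_and_links(text):
    # Remove "#tag" plus one optional trailing space.
    text = re.sub(r'#\w+ ?', '', text)
    # Remove anything starting with "http" up to the next whitespace.
    text = re.sub(r'http\S+', '', text)
    return text.strip()

print(strip_hashtags_and_links("great day #sunny http://example.com/x"))
# \w includes the underscore, so "#my_tag" is removed as a whole:
print(strip_hashtags_and_links("#my_tag rest"))
```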
Upvotes: 1
Reputation: 414139
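from functools import reduce  # built in on Python 2, needs this import on Python 3
from itertools import chain

result = []
for text, entities in ((t["text"], t["entities"]) for t in tweets):
    urls = (e["url"] for e in entities["urls"])
    users = ("@" + e["screen_name"] for e in entities["user_mentions"])
    # Replace each URL and mention with the empty string, one after another
    text = reduce(lambda t, s: t.replace(s, ""), chain(urls, users), text)
    result.append(text)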
Or using a regex (it also removes trailing non-whitespace characters):
text = re.sub(r"(?:\@|https?\://)\S+", "", text)
Or a combination of the two methods:
text = re.sub(r"(?:%s)\S*" % "|".join(map(re.escape, chain(urls, users))), "", text)
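One thing to watch with the combined pattern: if a tweet has no URLs and no mentions, the alternation is empty, and `(?:)\S*` then matches every run of non-whitespace characters. Below is a hedged sketch that guards against that, using a made-up helper name `remove_entities` and a made-up sample tweet.

```python
import re
from itertools import chain

def remove_entities(text, entities):
    urls = [e["url"] for e in entities.get("urls", [])]
    users = ["@" + e["screen_name"] for e in entities.get("user_mentions", [])]
    # Drop empty strings: an empty alternative would match everywhere.
    needles = [s for s in chain(urls, users) if s]
    if not needles:
        return text  # nothing to remove, leave the text untouched
    # Each literal is escaped, then followed by any trailing non-whitespace.
    pattern = r"(?:%s)\S*" % "|".join(map(re.escape, needles))
    return re.sub(pattern, "", text)

entities = {
    "urls": [{"url": "http://t.co/abc"}],
    "user_mentions": [{"screen_name": "username1"}],
}
print(repr(remove_entities("@username1: hi http://t.co/abc end", entities)))
```

The result still contains the surrounding spaces. Something like `" ".join(result.split())` can collapse them if needed.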
Upvotes: 12