Reputation: 151
So I want to remove all user mentions and urls in a tweet/string.
For example, if I have a tweet like this:
@username1: some tweet here, http://www.url.com, aaaaa @username2
I want to get something like this:
some tweet here, aaaaa
I want to use a regular expression, but I'm really new to Python and don't know how to do it.
Also, tweets are stored in a JSON file (a list of dictionaries), and each tweet (a dictionary) has a key called "entities" which stores information about "user_mentions", "urls", and "hashtags" in a format like the following:
{u'user_mentions': [{u'indices': [3, 18],
                     u'screen_name': u'username1',
                     u'id': 1234567,
                     u'name': u'user name 1',
                     u'id_str': u'1234567'},
                    {u'indices': [108, 116],
                     u'screen_name': u'username2',
                     u'id': 112233,
                     u'name': u'user name 2',
                     u'id_str': u'112233'}],
 u'hashtags': [],
 u'urls': [{u'url': u'http://www.url.com',
            u'indices': [83, 103],
            u'expanded_url': u'http://www.url.com',
            u'display_url': u'http://www.url.com'}]
}
Does anyone know how to remove user mentions and urls? Thanks so much!
Upvotes: 3
Views: 18605
Reputation: 31
import re
text = "@username1: some tweet here, http://www.url.com, aaaaa @username2"
clean_text = re.sub(r'@\w+', '', text)
print(clean_text)
This removes only the mentions, so the output will be:
: some tweet here, http://www.url.com, aaaaa
Upvotes: 3
Reputation: 1527
You can combine this into a one-liner as well, but here is a breakdown of the steps:
import re
text = '@username1: some tweet here, http://www.url.com, aaaaa @username2'
# Remove anything that starts with @, http://, https:// or www. up to the next whitespace
processed_text = re.sub(r"(?:@|https?://|www\.)\S+", "", text)
# Collapse the leftover runs of whitespace
processed_text = " ".join(processed_text.split())
print(processed_text)
Output:
some tweet here, aaaaa
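To run the same two steps over every tweet in a list of dictionaries like the one in the question, a minimal sketch could look like the following. The function name clean_tweet and the sample tweets list are made up for illustration, and the "text" key is assumed to hold the tweet body.

```python
import re

# A mention, an http(s) URL, or a www. link, followed by everything
# up to the next whitespace character.
PATTERN = re.compile(r"(?:@|https?://|www\.)\S+")

def clean_tweet(text):
    """Remove mentions and URLs, then collapse runs of whitespace."""
    stripped = PATTERN.sub("", text)
    return " ".join(stripped.split())

# Hypothetical sample data shaped like the question's list of dictionaries.
tweets = [
    {"text": "@username1: some tweet here, http://www.url.com, aaaaa @username2"},
    {"text": "no mentions or links here"},
]

cleaned = [clean_tweet(t["text"]) for t in tweets]
print(cleaned)
# ['some tweet here, aaaaa', 'no mentions or links here']
```

Because the pattern ends in \S+, punctuation glued to a mention or link (the colon and comma in the first tweet) is removed along with it.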
Upvotes: 3
Reputation: 143
Note that the key in each tweet is "entities", not "entries". Also, don't forget the URLs within media if you want to exclude those as well.
https://dev.twitter.com/overview/api/entities-in-twitter-objects
For Python 3, stripping out media urls as well:
from itertools import chain
from functools import reduce
result = []
for text, entities in ((t["text"], t["entities"]) for t in user_timeline):
    urls = (e["url"] for e in entities["urls"])
    users = ("@" + e["screen_name"] for e in entities["user_mentions"])
    media_urls = ()
    if 'media' in entities:
        media_urls = (e["url"] for e in entities["media"])
    text = reduce(lambda t, s: t.replace(s, ""), chain(urls, media_urls, users), text)
    result.append(text)
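Since every entity in the question's data also carries an "indices" pair, another option might be to cut the text by position instead of by string replacement. The sketch below assumes the indices are [start, end) offsets into the same string, counted the way Python 3 indexes str; the function name strip_by_indices and the sample tweet are hypothetical, with the indices worked out by hand for that string.

```python
def strip_by_indices(text, entities):
    """Cut out every span listed under "indices" for the chosen entity kinds."""
    spans = []
    for kind in ("user_mentions", "urls", "media"):
        for entity in entities.get(kind, []):
            start, end = entity["indices"]
            spans.append((start, end))
    # Work from the end so earlier offsets stay valid after each cut.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    # Collapse the whitespace left behind by the removed spans.
    return " ".join(text.split())

# Hypothetical tweet; the indices were computed by hand for this string.
sample_text = "hi @username1 see http://www.url.com now"
sample_entities = {
    "user_mentions": [{"screen_name": "username1", "indices": [3, 13]}],
    "urls": [{"url": "http://www.url.com", "indices": [18, 36]}],
}

print(strip_by_indices(sample_text, sample_entities))
# hi see now
```

Unlike str.replace, this removes only the exact occurrences the API reported, but it depends on the text not having been modified since the indices were computed.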
Upvotes: 2
Reputation:
First, make sure you can read the tweets from the files:
import json
import glob
for filename in glob.glob('*.json'):
    with open("plain text - preprocess.txt", 'a') as outfile, open(filename, 'r') as f:
        for line in f:
            if line == '\n':
                continue  # skip blank lines
            tweet = json.loads(line)
            # now do something with tweet['text']
Use a regex to remove unwanted hashtags or http links from the tweet. Here's how I did it:
import re
stringwithouthash = re.sub(r'#\w+ ?', '', tweet['text'])
stringwithoutlink = re.sub(r'http\S+', '', tweet['text'])
\S matches any character except whitespace.
\w matches word characters: letters, digits, and the underscore (A-Z, a-z, 0-9, _).
Refer to this link for more on regex.
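The two calls above each start from tweet['text'], so neither result has both hashtags and links removed. A small sketch of chaining them, using a made-up sample string in place of tweet['text']:

```python
import re

# Hypothetical sample standing in for tweet['text'].
sample = "Great day #sun_shine2 at http://www.url.com/a?b=1 !"

# \w+ stops at the first character that is not a letter, digit or
# underscore, so the whole hashtag (plus one trailing space) is removed.
without_hash = re.sub(r"#\w+ ?", "", sample)

# \S+ runs until the next whitespace, so the whole link is removed.
# Passing in without_hash (not the original) applies both removals.
without_both = re.sub(r"http\S+", "", without_hash)

print(without_hash)
# Great day at http://www.url.com/a?b=1 !
print(" ".join(without_both.split()))
# Great day at !
```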
Upvotes: 1
Reputation: 414139
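import re
from functools import reduce  # reduce is no longer a builtin in Python 3
from itertools import chain
result = []
for text, entities in ((t["text"], t["entities"]) for t in tweets):
    urls = (e["url"] for e in entities["urls"])
    users = ("@"+e["screen_name"] for e in entities["user_mentions"])
    # Replace every URL and @mention with an empty string, one after another
    text = reduce(lambda t,s: t.replace(s, ""), chain(urls, users), text)
    result.append(text)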
Or using a regex (it also removes trailing non-whitespace characters):
text = re.sub(r"(?:\@|https?\://)\S+", "", text)
Or a combination of the two methods:
text = re.sub(r"(?:%s)\S*" % "|".join(map(re.escape, chain(urls, users))), "", text)
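One caveat with the combined version: if a tweet has no URLs and no mentions, the joined alternation is empty and the pattern reduces to (?:)\S*, which removes every word. A guarded sketch, with the hypothetical function name strip_entities and lists used instead of generators so they can be checked for emptiness:

```python
import re
from itertools import chain

def strip_entities(text, entities):
    """Remove the URLs and @mentions listed in a tweet's entities."""
    urls = [e["url"] for e in entities.get("urls", [])]
    users = ["@" + e["screen_name"] for e in entities.get("user_mentions", [])]
    literals = list(chain(urls, users))
    if not literals:
        # An empty alternation matches at every position, and \S* would
        # then swallow every word, so skip the substitution entirely.
        return text
    pattern = r"(?:%s)\S*" % "|".join(map(re.escape, literals))
    return re.sub(pattern, "", text)

# Entities trimmed down from the question's example.
entities = {
    "user_mentions": [{"screen_name": "username1"}, {"screen_name": "username2"}],
    "urls": [{"url": "http://www.url.com"}],
}
text = "@username1: some tweet here, http://www.url.com, aaaaa @username2"

print(" ".join(strip_entities(text, entities).split()))
# some tweet here, aaaaa
print(strip_entities("plain text", {"urls": [], "user_mentions": []}))
# plain text
```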
Upvotes: 12