Reputation: 53
I am currently trying to clean the urls column in my dataframe of tweets, tweets_. I want to get rid of everything after the .com, .org, .net, etc., and I want to use regex to do so. The reason is that when I go to build nodes and edges, the URLs should match up; I don't care about the part of a URL that points to a particular page. If two URLs are both from www.CNN.com, I want them to match. How do I use regex properly to do this task?
import pandas as pd
import re,os
import numpy as np
# dictionary where values are lists
from collections import defaultdict
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
tweets_=pd.read_csv('TwitterLinksNWO.csv')
#dropping the tweets with no URLs.
tweets_['urls'].replace('[]', np.nan, inplace=True)
tweets_.dropna(subset=['urls'],inplace=True)
tweets_ = tweets_.astype({'urls': str}, copy=True)  # np.str is deprecated; use the builtin str
tweets_.shape
tweets_["urls"] = tweets_["urls"].str.replace('http','')
tweets_["urls"] = tweets_["urls"].str.replace('https://','')
tweets_["urls"] = tweets_["urls"].str.replace('http://', '')
tweets_["urls"] = tweets_["urls"].str.replace('/','')
tweets_["urls"] = tweets_["urls"].str.replace(':','')
tweets_["urls"] = tweets_["urls"].str.replace('youtu.be', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('youtube', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('bitly', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('bit.ly', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('instagram', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('youtube', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('twitter', '4444')
tweets_["urls"] = tweets_["urls"].str.replace('facebook', '4444')
tweets_["urls"] = tweets_["urls"].str.replace(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', '') # re.IGNORECASE
tweets_ = tweets_[~tweets_.urls.str.contains("4444")]
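An equivalent way to drop those rows in one pass, skipping the 4444 sentinel entirely (a sketch; the site list just mirrors the replacements above):
import re
sites = ['youtu.be', 'youtube', 'bitly', 'bit.ly', 'instagram', 'twitter', 'facebook']
# re.escape keeps the dots literal; '|' joins the sites into one alternation
pattern = '|'.join(re.escape(s) for s in sites)
tweets_ = tweets_[~tweets_["urls"].str.contains(pattern, case=False, regex=True)]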
#These are some examples of the urls. I have 54,000 of them:
#['CelebVM.comKevinNash'] ['wpo.stLuyM2'] ['KevinNash']
#['wp.mep39Vlk-63t'] ['NoDQ.com'] ['seanwaltman']
#['sconservativedailypost.comanonymous-soros-ha...
#['tru.news2g0apKg'] ['individualProfile.asp?indid=977']
#['CelebVM.comKevinNash']
#['freedom-articles.toolsforfreedom.comsatanic-... ['KevinNash']
#['guillotinesmartial-lawf.beforeitsnews.comsel...
#['sblog20160515video-pope-francis-calls-worldw...
#['yournewswire.comworld-gets-behind-putins-vow...
#['crooked-hillary-received-25-million-from-new...
#['sblogsplum-linewp20161017revealed-the-vast-i...
#['2americananimals.com']
#['motherboard.vice.comreadwhy-havent-we-met-al...
#['thetruthdivision.com201609boom-jeff-sessions...['ringsid.ecWrestlingRingAccessories']
#['stributealbum64.bandcamp.comreleases']
#['jerusalem20160909leaked-memo-george-soros-fo... ['es.pn2bOaDVt']
#['foxs.pt2bTPjcI'] ['espnnow?nowId=21-0563477088191910668-4']
#['baseballthe-nwo-is-going-to-manage-an-indy-l...
#['mlbnewsmembers-of-nwo-to-manage-a-independen... ['yhoo.it2bvYYeT']
#['vigilantcitizen.comlatestnewscfr-releases-pr...
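To answer the regex question directly, one option is to extract everything up to and including the first recognised TLD instead of deleting what follows it (a minimal sketch; the TLD list in the pattern is an illustrative assumption and would need extending for other endings):
# keeps 'CelebVM.com' from 'CelebVM.comKevinNash' and gives NaN where no
# TLD is found; (?i) matches case-insensitively, .str.lower() normalises
# 'www.CNN.com' and 'www.cnn.com' to the same value
tweets_["domain"] = tweets_["urls"].str.extract(r'(?i)^(.+?\.(?:com|org|net))', expand=False).str.lower()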
Upvotes: 0
Views: 54
Reputation: 16660
Your question looks a bit like an XY problem.
I think the best way would be to use a pandas groupby
with a grouping function that extracts only the domain part of the URL. For that you can use the standard library's urllib.parse,
which parses a URL into a named tuple whose netloc
field is (roughly) the domain part:
from urllib.parse import urlparse
# a callable passed to groupby is applied to the index labels, so map the
# url values to their domain first
grouped = tweets_["urls"].groupby(tweets_["urls"].map(lambda u: urlparse(u).netloc))
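For example, to count how many tweets link to each domain (a sketch with made-up URLs; note that urlparse only fills netloc when the scheme is still present in the string):
from urllib.parse import urlparse
import pandas as pd

urls = pd.Series(["https://www.cnn.com/article-1",
                  "https://www.cnn.com/article-2",
                  "http://NoDQ.com/news"])
domains = urls.map(lambda u: urlparse(u).netloc)
print(urls.groupby(domains).count())
# NoDQ.com       1
# www.cnn.com    2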
Upvotes: 1