Steven Marsh

Reputation: 53

Cleaning every value after the .com, .org, etc. in URLs in Python using regex

I am currently trying to clean the urls column in my dataframe of tweets, tweets_. I want to get rid of everything after the .com, .org, .net, etc., and I want to use regex to do so. The reason is that when I go to build nodes and edges, the URLs need to match up; I don't care about the path to any particular page on a site. If two URLs are both from www.CNN.com, I want them to match. How do I use regex properly for this task?

import pandas as pd
import re,os
import numpy as np

# dictionary where values are list
from collections import defaultdict

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

tweets_=pd.read_csv('TwitterLinksNWO.csv')
#dropping the tweets with no URLs.
tweets_['urls'].replace('[]', np.nan, inplace=True)
tweets_.dropna(subset=['urls'],inplace=True)
tweets_ = tweets_.astype({'urls': str}, copy=True)  # np.str is deprecated
tweets_.shape

# strip the scheme and separators first; literal patterns need regex=False
# (stripping the bare 'http' first would leave '://' behind and make the
# 'https://' / 'http://' replacements no-ops)
tweets_["urls"] = tweets_["urls"].str.replace('https://', '', regex=False)
tweets_["urls"] = tweets_["urls"].str.replace('http://', '', regex=False)
tweets_["urls"] = tweets_["urls"].str.replace('/', '', regex=False)
tweets_["urls"] = tweets_["urls"].str.replace(':', '', regex=False)

# tag link shorteners and social-media domains for removal later
for domain in ('youtu.be', 'youtube', 'bitly', 'bit.ly', 'instagram', 'twitter', 'facebook'):
    tweets_["urls"] = tweets_["urls"].str.replace(domain, '4444', regex=False)

tweets_["urls"] = tweets_["urls"].str.replace(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', '', regex=True, flags=re.IGNORECASE)

tweets_ = tweets_[~tweets_.urls.str.contains("4444")]
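For reference, the collapsing described above (keep only the host, drop everything after it) can also be done in a single pass with `str.extract` instead of a chain of replacements. A minimal sketch; the series `s` and its values are made-up examples modeled on the URLs below, not the real data:

```python
import pandas as pd

# hypothetical sample URLs, with and without scheme/www prefixes
s = pd.Series([
    'https://www.CNN.com/politics/story',
    'http://NoDQ.com/page',
    'wpo.st/LuyM2',
])

# optionally strip 'https://'/'http://' and 'www.', then capture
# everything up to the first '/' (the host)
domains = s.str.extract(r'^(?:https?://)?(?:www\.)?([^/]+)', expand=False)
print(domains.tolist())  # ['CNN.com', 'NoDQ.com', 'wpo.st']
```

`str.extract` returns the captured group (the host) for each row, so no separate cleanup of schemes or paths is needed.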

# These are some examples of the urls. I have 54,000 of them.

#['CelebVM.comKevinNash'] ['wpo.stLuyM2'] ['KevinNash']
#['wp.mep39Vlk-63t'] ['NoDQ.com'] ['seanwaltman']
#['sconservativedailypost.comanonymous-soros-ha...
#['tru.news2g0apKg'] ['individualProfile.asp?indid=977']
#['CelebVM.comKevinNash']
#['freedom-articles.toolsforfreedom.comsatanic-... ['KevinNash']
#['guillotinesmartial-lawf.beforeitsnews.comsel...
#['sblog20160515video-pope-francis-calls-worldw...
#['yournewswire.comworld-gets-behind-putins-vow...
#['crooked-hillary-received-25-million-from-new...
#['sblogsplum-linewp20161017revealed-the-vast-i...
#['2americananimals.com']
#['motherboard.vice.comreadwhy-havent-we-met-al...
#['thetruthdivision.com201609boom-jeff-sessions...['ringsid.ecWrestlingRingAccessories']
#['stributealbum64.bandcamp.comreleases']
#['jerusalem20160909leaked-memo-george-soros-fo... ['es.pn2bOaDVt']
#['foxs.pt2bTPjcI'] ['espnnow?nowId=21-0563477088191910668-4']
#['baseballthe-nwo-is-going-to-manage-an-indy-l...
#['mlbnewsmembers-of-nwo-to-manage-a-independen... ['yhoo.it2bvYYeT']
#['vigilantcitizen.comlatestnewscfr-releases-pr...

Upvotes: 0

Views: 54

Answers (1)

sophros

Reputation: 16660

Your question looks a bit like an XY problem.

I think the best way would be to use pandas groupby with a grouping function that extracts only the domain part of the URL. For that you can use the standard library's urllib.parse, which parses a URL into a named tuple whose netloc field is (roughly) the domain part:

from urllib.parse import urlparse

# map each URL to its netloc, then group by that
grouped = tweets_["urls"].groupby(tweets_["urls"].map(lambda u: urlparse(u).netloc))
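To make the idea concrete, here is a self-contained sketch on a few made-up URLs (the `urls` series is illustrative, not from the asker's data). Note that `urlparse` only populates `netloc` when the URL actually has a scheme, so scheme-less strings would need a `//` prefix or other handling:

```python
from urllib.parse import urlparse

import pandas as pd

# hypothetical example URLs with schemes
urls = pd.Series([
    'https://www.cnn.com/story/one',
    'https://www.cnn.com/story/two',
    'http://nodq.com/page',
])

# extract the domain part of each URL, then group the series by it
netlocs = urls.map(lambda u: urlparse(u).netloc)
grouped = urls.groupby(netlocs)

print(grouped.size())
# nodq.com       1
# www.cnn.com    2
```

All URLs pointing at the same site now fall into one group regardless of path, which is exactly what matching nodes and edges by site requires.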

Upvotes: 1
