Reputation: 180
I have a CSV file containing the following URLs:
1;https://www.one.de
2;https://www.two.de
3;https://www.three.de
4;https://www.four.de
5;https://www.five.de
Then I load it into a pandas DataFrame df:
import pandas as pd
cols = ['nr', 'url']
df = pd.read_csv("listing.csv", sep=';', encoding="utf8", dtype=str, names=cols)
Then I would like to add another column, 'domain_name', holding the domain name for each nr:
from urllib.parse import urlsplit

def takedn(url):
    m = urlsplit(url)
    return m.netloc.split('.')[-2]

df['domain_name'] = takedn(df['url'].all())
print(df.head())
But it assigns the last domain name to every row.
Output:
   nr                   url domain_name
0   1    https://www.one.de        five
1   2    https://www.two.de        five
2   3  https://www.three.de        five
3   4   https://www.four.de        five
4   5   https://www.five.de        five
I am trying this to learn vectorizing, but it does not work as I expected. The domain_name should be one in the first row, two in the second, and so on.
Upvotes: 0
Views: 35
Reputation: 323226
The third-party tldextract package has a function for this:
import tldextract
df['domain'] = df.url.map(lambda x : tldextract.extract(x).domain)
df
   nr                   url domain_name domain
0   1    https://www.one.de        five    one
1   2    https://www.two.de        five    two
2   3  https://www.three.de        five  three
3   4   https://www.four.de        five   four
4   5   https://www.five.de        five   five
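One reason to reach for tldextract: taking the second-to-last dot-separated part of the host only works for single-label suffixes such as .de. A minimal sketch using only the standard library shows where it breaks (the helper name naive_domain and the example.co.uk URL are made up for illustration):

```python
from urllib.parse import urlsplit

def naive_domain(url):
    # Hypothetical helper: second-to-last dot-separated label of the host.
    return urlsplit(url).netloc.split(".")[-2]

print(naive_domain("https://www.one.de"))         # one
print(naive_domain("https://www.example.co.uk"))  # co (not "example")
```

tldextract looks the suffix up in the Public Suffix List instead, so its .domain should come out as example for the second URL.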
Upvotes: 1
Reputation: 30022
To operate on each element, you can use apply():
from urllib.parse import urlsplit

def takedn(url):
    m = urlsplit(url)
    return m.netloc.split('.')[-2]

df['domain_name'] = df['url'].apply(takedn)
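The original attempt failed because df['url'].all() reduces the whole column to a single value before takedn ever sees it, and that one result is then broadcast to every row. Below is a self-contained sketch with the file replaced by a small in-memory frame (an assumption, so it runs without listing.csv). It shows apply() next to a variant built on the .str accessor; the regex and the column name domain_name_str are my own choices for illustration, not anything from the question.

```python
import pandas as pd
from urllib.parse import urlsplit

# Stand-in for the CSV from the question (assumed sample data).
df = pd.DataFrame({
    "nr": ["1", "2", "3"],
    "url": ["https://www.one.de", "https://www.two.de", "https://www.three.de"],
})

def takedn(url):
    # Receives one URL string at a time when used with apply().
    return urlsplit(url).netloc.split(".")[-2]

# Element-wise: takedn is called once per row.
df["domain_name"] = df["url"].apply(takedn)

# Column-wise with the .str accessor: grab the host after "://",
# split it on dots, and take the second-to-last piece.
host = df["url"].str.extract(r"://([^/:]+)", expand=False)
df["domain_name_str"] = host.str.split(".").str[-2]

print(df)
```

Both columns should hold one, two, three. The .str version reads as column-at-a-time code, but it is not necessarily faster than apply() for plain Python strings.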
Upvotes: 1