hthomas

Reputation: 23

How do I extract the URL data from my string?

I have the following string, which contains many URL values. How do I extract the URL that follows each DataUrl key in this string, so that I get a list of URLs, for example: americanexpress.com, vice.com, chegg.com?

{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}

I have tried various FindAll expressions.

Upvotes: 0

Views: 944

Answers (2)

furas

Reputation: 142641

It looks like part of JSON data, so if you have the complete JSON data you could use the json module to load it and look up DataUrl in each dictionary.

If you only have incomplete JSON data, then you can use a regex.

text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''

import re

# capture everything between the quotes that follow 'DataUrl':
urls = re.findall("'DataUrl': '([^']*)'", text)

print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']

You can also do it with .split("{'DataUrl': '") and .split("',")

text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''

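# every piece after the first starts with a URL; the first piece is empty
urls = text.split("{'DataUrl': '")
# keep only the text before the closing quote and comma, skipping the empty piece
urls = [item.split("',")[0] for item in urls if item]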
print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']

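If you had complete and correctly formatted JSON (with " instead of ') then you could use the json module.

Here I use the complete data (wrapped in [ ] and with the missing final } added) and replace ' with " to make it valid JSON.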

text = '''[{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}}]'''
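import json

# json requires double quotes, so swap the single quotes first
# (this would break if any value itself contained an apostrophe)
text = text.replace("'", '"')

data = json.loads(text)
urls = [item['DataUrl'] for item in data]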

print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
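As an alternative to replacing the quotes, here is a minimal sketch using ast.literal_eval from the standard library. This is a different technique from the json approach above, and it assumes the string is a sequence of complete Python-style dict literals. The sample text is a shortened, hypothetical version of the question's data, with every record closed properly.

```python
import ast

# Shortened sample in the same shape as the question's data (assumption: complete records only)
text = "{'DataUrl': 'americanexpress.com', 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Global': {'Rank': '208'}}"

# Wrap in brackets so the comma-separated dicts parse as one list literal
data = ast.literal_eval("[" + text + "]")

urls = [item['DataUrl'] for item in data]
print(urls)
```

Because literal_eval understands single-quoted strings, no quote replacement is needed, so values containing a double quote or an escaped apostrophe should not be a problem. It still fails on truncated input, in which case the regex version is the safer choice.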

Upvotes: 1

LakshanSS

Reputation: 125

Python has a built-in package called json, which can be used to work with JSON data.

You can parse your JSON string into a Python object (a list of dictionaries) with json.loads and then read DataUrl from each item easily.

Please refer to https://www.w3schools.com/python/python_json.asp
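A minimal sketch of that idea, assuming the data is already valid JSON (double-quoted and wrapped in a list). The sample string is a shortened, hypothetical version of the question's data.

```python
import json

# Assumption: valid JSON input, i.e. double quotes and an enclosing list
text = '[{"DataUrl": "chegg.com", "Country": {"Rank": "98"}}, {"DataUrl": "mlb.com", "Country": {"Rank": "99"}}]'

data = json.loads(text)  # parses into a list of dicts

# Nested values are reached by chaining keys
for item in data:
    print(item["DataUrl"], item["Country"]["Rank"])

urls = [item["DataUrl"] for item in data]
print(urls)
```

Once the string is parsed, the nested fields such as the country rank are available through the same dictionary lookups as DataUrl.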

Upvotes: 1
