Reputation: 45
I want to clean my text by removing digit runs of a certain length, so I need to define a rule for it. I thought isdigit
would work, but it discards every all-digit token in the text. In my test data the last 10 digits contribute nothing to the text, so I want to remove only those. Here is what I tried:
import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here
    make_me.append([x for x in lst if not x.isdigit() and not x == ""])
df = pd.DataFrame(make_me, columns=cols)
df
res = []
for i in df.c4:
    lst = i.split("-")
    res.append([''.join(x) for x in lst if not x.isdigit()])
My attempt discards all digits in the text, including the 2018 that I want to keep. I simply want this kind of output:
tax march donald trump protest
list 2018 oscar nominations
How should I write the rule to get this output? Any ideas?
Upvotes: 0
Views: 375
Reputation: 26047
A pure Python way of doing this, without additional modules, looks like this:
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
for x in urls:
    print(' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]))
# tax march donald trump protest
# list 2018 oscar nominations
If you need the output as a list, use a list comprehension:
[' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]) for x in urls]
Upvotes: 1
Reputation: 789
In this case you have a very specific rule that would help you: just remove the trailing 10 digits, together with the hyphen in front of them, from the last interesting element.
Here, lst[-2] = lst[-2][:-11] right before the make_me.append call would do the trick (10 digits plus 1 hyphen makes 11 characters).
If you do want to do it with a regex, I'd use the end-of-string anchor, $, to make sure the digits are at the end. It would look like
    url = re.sub(r'[0-9]{10}/$', '', url)
before the url.split("/") call, after importing re, of course. This reads as:
re.sub is the substitution function in the regular expressions module: it replaces every match of the regex in the first parameter with the content of the second parameter; the third parameter is the string in which you want to make the substitution.
The regex I wrote matches "a sequence of 10 characters, each one of 0123456789, followed by a / and the end of the string". Note that it leaves the hyphen in front of the digits in place.
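As a rough sketch of the same regex idea applied to both sample URLs: the variant below also consumes the hyphen in front of the ID and makes the trailing slash optional. The names TRAILING_ID and strip_trailing_id are made up for illustration, and the pattern assumes the ID is always exactly 10 digits at the very end of the URL.

```python
import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

# Hypothetical helper names; the pattern assumes a 10-digit ID at the end,
# preceded by a hyphen and optionally followed by a single slash.
TRAILING_ID = re.compile(r'-[0-9]{10}/?$')

def strip_trailing_id(url):
    """Return the URL without its trailing '-<10 digits>/' part."""
    return TRAILING_ID.sub('', url)

# Keep only the last path segment, then turn hyphens into spaces.
slugs = [strip_trailing_id(u).rsplit('/', 1)[-1] for u in urls]
titles = [s.replace('-', ' ') for s in slugs]
print(titles)
```

Because the pattern is anchored with $, shorter digit runs inside the slug, such as the 2018 in the second URL, are left alone.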
Upvotes: 0
Reputation: 101
Assuming all your URLs share the same format, use regular expressions:
import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
news = []
regex = re.compile(r'/news/(.*)-')
for url in urls:
    extract_id = regex.search(url)
    if extract_id:
        data = extract_id.group(1)
        news.append(data.replace('-', ' '))
print(news)
Output
['tax march donald trump protest', 'list 2018 oscar nominations']
Edited format to suit the question.
Upvotes: 1
Reputation: 683
There can be many approaches to this. Use .rfind('-') to get the rightmost index of '-' and then slice your string at that position. After that you can process the string further.
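A minimal sketch of that idea, assuming every URL ends with a slug of the form words-ID followed by an optional slash. The function name slug_without_id is hypothetical, chosen only for this example.

```python
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

def slug_without_id(url):
    """Drop everything from the last '-' onward in the final path segment."""
    # Last non-empty path segment, e.g. 'tax-march-...-1202031487'
    slug = url.rstrip('/').rsplit('/', 1)[-1]
    cut = slug.rfind('-')  # index of the rightmost hyphen, or -1 if none
    if cut != -1:
        slug = slug[:cut]
    return slug.replace('-', ' ')

titles = [slug_without_id(u) for u in urls]
print(titles)
```

The check for -1 matters because rfind returns -1 when no hyphen is found, and slicing with slug[:-1] would then silently drop the last character.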
Upvotes: 0