woody

Reputation: 45

How to remove certain length of digits from text?

I want to clean my text by removing digit runs of a certain length, so I am trying to define a rule for it. I thought isdigit would be a good fit, but using it discards every digit-only token in the text. In my test, the last 10 digits of each URL contribute nothing to the text, so those are the only ones I want to remove. Here is what I tried:

import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here
    make_me.append([x for x in lst if not x.isdigit() and not x == ""])

df = pd.DataFrame(make_me, columns=cols)
df

res=[]
for i in df.c4: 
    lst=i.split("-") 
    res.append([''.join(x) for x in lst if not x.isdigit()])

My attempt discarded every digit-only token in the text, including the 2018 that I want to keep. I simply want this kind of output:

tax march donald trump protest
list 2018 oscar nominations

How should I write the rule to get this output? Any ideas?

Upvotes: 0

Views: 375

Answers (4)

Austin

Reputation: 26047

A pure Python approach, without any additional modules, looks like this:

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

for x in urls:
    print(' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]))

# tax march donald trump protest
# list 2018 oscar nominations

If you need the output as a list, use a list comprehension:

[' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]) for x in urls]

Upvotes: 1

iajrz

Reputation: 789

In this case you have a very specific rule that would help you: just remove the trailing 10 digits, plus the hyphen in front of them, from the last interesting element. Since lst[-1] is the empty string left by the trailing slash, that element is lst[-2], and lst[-2] = lst[-2][:-11] right before the make_me.append call would do the trick.
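A minimal sketch of this slicing rule, assuming every slug ends with a hyphen followed by exactly 10 digits, so that 11 characters are cut. The names SUFFIX_LEN, rows and titles are only illustrative and do not come from the question:

```python
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

# Assumption: each slug ends with '-' plus exactly 10 digits.
SUFFIX_LEN = 11  # 10 digits plus the hyphen before them

rows = []
for url in urls:
    parts = url.split("/")
    # parts[-1] is '' because of the trailing slash, so the slug is parts[-2].
    parts[-2] = parts[-2][:-SUFFIX_LEN]
    # Same filter as in the question: drop empty and digit-only pieces.
    rows.append([p for p in parts if p and not p.isdigit()])

# The last kept piece of each row is the cleaned slug.
titles = [row[-1].replace("-", " ") for row in rows]
print(titles)
```

This only holds while the numeric suffix has a fixed length; if the length can vary, cutting at the last hyphen is safer than slicing a fixed number of characters.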

If you do want to do it with a regex, I'd use the end-of-string anchor, $, to make sure the digits are at the end. After importing re, it would look like lst = re.sub('[0-9]{10}/$','',url). Note that the result is a string, and that the hyphen in front of the digits is left in place. This reads as:

re.sub is a substitution method in the regular expressions module, and it changes the matches to the regex in the first parameter with the content in the second parameter; the third parameter is the string where you want to make the substitution.

The regex I wrote matches "a sequence of 10 characters which match any of 0123456789, followed by a / and the end of the string".
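As a hedged sketch of how this substitution could be used end to end: the variant below adds a leading hyphen to the pattern (my addition, not part of the regex above) so that no dangling '-' is left behind. TRAILING_ID and strip_trailing_id are hypothetical names chosen for illustration.

```python
import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

# A hyphen, exactly 10 digits, a slash, then the end of the string.
TRAILING_ID = re.compile(r"-[0-9]{10}/$")

def strip_trailing_id(url):
    """Remove the trailing '-<10 digits>/' from a URL, if present."""
    return TRAILING_ID.sub("", url)

cleaned = [strip_trailing_id(u) for u in urls]

# Take the last path segment and turn hyphens into spaces.
titles = [c.rsplit("/", 1)[-1].replace("-", " ") for c in cleaned]
print(titles)
```

Because the pattern is anchored with $, a 4-digit token such as 2018 in the middle of the slug is left untouched.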

Upvotes: 0

thatNLPguy

Reputation: 101

Assuming all of your URLs follow the same format, use regular expressions:

import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
news = []
regex = re.compile(r'/news/(.*)-')
for url in urls:
    extract_id = regex.search(url)
    if extract_id:
        data = extract_id.group(1)
        news.append(data.replace('-',' '))

print(news)

Output

['tax march donald trump protest', 'list 2018 oscar nominations']

Edited format to suit the question.

Upvotes: 1

Tojra

Reputation: 683

There can be many approaches to this. Use .rfind('-') to get rightmost index of '-' and then slice your string. After that you can process the string further.
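A short sketch of the .rfind('-') idea, assuming the unwanted number is always the part after the last hyphen of the final path segment. The function name title_from_url is hypothetical:

```python
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

def title_from_url(url):
    """Return the last path segment without its final '-<suffix>' part."""
    # Drop the trailing slash, then keep only the last path segment.
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    cut = slug.rfind("-")  # index of the rightmost hyphen, or -1 if none
    if cut == -1:
        return slug  # no hyphen, so there is nothing to cut
    return slug[:cut].replace("-", " ")

titles = [title_from_url(u) for u in urls]
print(titles)
```

Unlike a fixed-length slice, this does not depend on the suffix being exactly 10 digits, but it also does not check that the removed part is numeric.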

Upvotes: 0
