pseudoepinephrine

Reputation: 141

Remove numbers from result; Python3

I'm creating a script that queries websites, and my results end up looking something like this:

result = """
nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43
"""

Basically, I want to remove the number that comes after each line of text. That would be easy to do with a simple pattern, but occasionally a long string is split across two separate lines. I also need to keep the numbers that are part of the actual text names.

I was thinking of checking the length of each individual line and removing those with fewer than 5 letters or digits, but I wasn't sure whether that would work, or how to do it.

Any help from you guys would be great.

Thanks! :)

Upvotes: 2

Views: 70

Answers (1)

Green Cloak Guy

Reputation: 24691

You could use regex matching: look for a link-like string (allowing for newlines) followed by a number and a newline, and capture only the link part. Then, to accommodate multi-line links, use a simple str.replace() to remove the ...\n that appears wherever a link is split across lines.

What I have in mind, given the example you've provided, is this:

import re

result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""

matches = re.findall(r'([A-Za-z0-9\n/_.-]+?)[0-9\n]+(?:\n|$)', result)
# ([A-Za-z0-9\n/_.-]+?)  capture the shortest possible run of link characters
#                        (newline included, so wrapped links still match)
# [0-9\n]+               then at least one digit or newline
# (?:\n|$)               ending with a newline or the end of the string
#                        (inside a character class, \b would mean backspace,
#                        not a boundary, so an explicit alternation is used)

# matches = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlin...\nk']

links = [link.replace('...\n', '') for link in matches]
# links = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlink']

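I'm not sure what your links look like, but I assumed [A-Za-z0-9/_.-] (alphanumerics plus /, _, ., and -) covers all the standard parts of hyperlinks. The \n needs to be in the character class as well to accommodate multi-line entries. You can modify this character class depending on what you expect your links to look like. Note that, as the output above shows, digits at the very end of a link name are consumed along with the count (nameof2ndlink120 comes out as nameof2ndlink), because [0-9\n]+ cannot tell them apart.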
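If the digits at the end of each name need to survive, a line-by-line pass is one alternative to the regex. This is only a sketch under two assumptions taken from the sample in the question: a line to drop consists solely of digits, and a wrapped name always ends its first line with "...". The function name extract_names is made up for illustration.

```python
def extract_names(result):
    """Return the name lines from `result`, dropping the digit-only lines.

    Assumptions (from the sample data, not guaranteed in general):
    - a line made up only of digits is a count and should be dropped
    - a name that wraps ends its first line with "..."
    """
    names = []
    pending = ""  # first part(s) of a wrapped name, without the "..."
    for line in result.strip().splitlines():
        line = line.strip()
        if line.endswith("..."):
            # Wrapped name: keep the text before "..." and wait for the rest.
            pending += line[:-3]
            continue
        if pending:
            # This line completes a wrapped name, even if it ends in digits.
            names.append(pending + line)
            pending = ""
        elif line and not line.isdigit():
            names.append(line)
    return names


result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""

print(extract_names(result))
# ['nameof1stlink', 'nameof2ndlink120', 'nameof3rdlink15', 'nameof4thlink143']
```

The trade-off is that a name consisting only of digits would be mistaken for a count, so this only fits if your names always contain at least one non-digit character.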

Upvotes: 2
