Reputation: 141
I'm creating a script that queries websites, and my results end up looking something like this:
result = "
nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43
"
Basically, I want to remove the numbers that come after each line of text. That would be easy for me to do in a pattern, but there is the occasional long string that takes up two separate lines. There's also the matter of needing to keep the numbers in the actual text names.
I was thinking of checking each individual line for string length and just removing those w/o 5 or more letters / numbers, but I wasn't sure if that would work, and I wasn't too sure how to do it either.
Any help from you guys would be great.
Thanks! :)
Upvotes: 2
Views: 70
Reputation: 24691
You could maybe use regex matching, looking for a link-like string (allowing for newlines) followed by a number and a newline, which you'd want to ignore. Then, to accommodate multi-line links, use a simple str.replace() to remove any occurrences of the consistent ...\n that occurs when the link is split across multiple lines.
What I have in mind, given the example you've provided, is this:
import re
result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""
matches = re.findall(r'([A-Za-z0-9\n/_.-]+?)[0-9\n]+[\n\b]', result, flags=re.M)
# ( ... )             capture this group
# [A-Za-z0-9\n/_.-]   made of these characters (newlines included, for multi-line links)
# +?                  at least one of them, but as few as possible
# [0-9\n]+            then at least one digit or newline
# [\n\b]              and ending with a newline (inside [...], \b is a backspace
#                     character, so it does not match end-of-string)
# flags=re.M          only affects ^ and $, which this pattern does not use
# matches = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlin...\nk']
links = [link.replace('...\n', '') for link in matches]
# links = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlink']
I'm not sure what your links look like, but I assumed [A-Za-z0-9/_.-] (alphanumerics plus /, _, ., and -) covers all the standard parts of hyperlinks. And \n needs to be thrown somewhere in there to accommodate multi-line entries. You can modify this character class depending on what you expect your links to look like.
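Note that the pattern above also consumes digits at the end of each name (its output has 'nameof2ndlink', not 'nameof2ndlink120'). If those digits need to be kept, as the question asks, something along these lines might work instead. It's only a sketch, not a drop-in solution: extract_links is a made-up name, and it assumes that a split name always ends its first line with ..., that the line right after it is always the continuation, and that no real name consists only of digits.

```python
def extract_links(result):
    """Drop digit-only lines and rejoin names that were split with a trailing '...'."""
    links = []
    pending = ""  # first part of a name that was split across lines
    for line in result.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("..."):
            # Assumption: a trailing '...' marks a name continued on the next line
            pending += line[:-3]
            continue
        if pending:
            # The line after a split is treated as its continuation,
            # even if it happens to be all digits
            links.append(pending + line)
            pending = ""
        elif not line.isdigit():
            # Assumption: digit-only lines are the counts to discard
            links.append(line)
    return links


result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""

print(extract_links(result))
# ['nameof1stlink', 'nameof2ndlink120', 'nameof3rdlink15', 'nameof4thlink143']
```

This avoids regex entirely, which makes it easier to adjust if the split marker or the shape of the count lines turns out to be different.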
Upvotes: 2