Amir Afianian

Reputation: 2795

How to find an exact word in a string in python

I have a list of strings in the following format:

target:

'TLS 1.2 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256'

'TLS 1 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256'

'TLS 1.1 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256'

I want to know whether a line contains the exact match 'TLS 1' (and not TLS 1.1 or TLS 1.2).

I have tried solutions from similar posts, as follows:

# returns all the lines, including TLS 1.1, TLS 1.2, ...
lines = []
for i in target:
    if re.match(r'\bTLS 1\b', i):
        lines.append(i)

also tried:

# returns nothing
lines = []
for i in target:
    if re.match(r'^TLS 1$', i):
        lines.append(i)

I also tried many other variations with search, findall, etc. How can I grab only the lines that contain an exact match of a given word?

Upvotes: 1

Views: 2446

Answers (2)

Wiktor Stribiżew

Reputation: 626802

You may consider the following approaches.

TLS as a whole word should have a word boundary right in front of it, so that part is covered in your pattern.

If 1 must be followed by whitespace or by the end of the string, it is more efficient to use a negative lookahead, (?!\S): r'\bTLS 1(?!\S)'. You may also use r'\bTLS 1(?:\s|$)', which accepts the same lines.
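As a minimal sketch of how these two patterns behave, the snippet below runs both against three shortened sample strings. The `samples` list and the variable names are made up for illustration; only the two patterns come from the answer.

```python
import re

# Hypothetical, shortened sample lines for illustration
samples = [
    "TLS 1 x67 DHE-RSA-AES128-SHA256",    # version followed by a space
    "TLS 1.1 x67 DHE-RSA-AES128-SHA256",  # version followed by ".1"
    "TLS 1",                              # version at the end of the string
]

# Negative lookahead: the next character must not be a non-whitespace character
lookahead = re.compile(r"\bTLS 1(?!\S)")
# Alternation: the next character must be whitespace, or the string must end
alternation = re.compile(r"\bTLS 1(?:\s|$)")

lookahead_hits = [s for s in samples if lookahead.match(s)]
alternation_hits = [s for s in samples if alternation.match(s)]

print(lookahead_hits)
print(alternation_hits)
```

Both lists should contain the first and third samples and skip the 'TLS 1.1' line.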

If you just want to ensure there is no digit or a fractional part after 1 use

r'\bTLS 1(?!\.?\d)'

This matches TLS 1 only when it is not followed by a digit or by . plus a digit.

Python demo:

import re
target = [
    'TLS 1.2 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256',
    'TLS 1 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256',
    'TLS 1.1 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256',
]
lines=[]
for i in target:
    if re.match(r'\bTLS 1(?!\.?\d)', i):
        lines.append(i)
print(lines)

Output:

['TLS 1 x67 DHE-RSA-AES128-SHA256 DH 2048 AES128 TLS_DHE_RSA_WITH_AES_128_CBC_SHA256']
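Note that re.match only tries the pattern at the start of each string, which works here because every line begins with TLS. If the token could appear later in a line, re.search may be the better fit. A rough sketch, assuming hypothetical lines with a made-up "cipher offered:" prefix:

```python
import re

# Hypothetical lines where the version token is not at the start
log_lines = [
    "cipher offered: TLS 1 x67 DHE-RSA-AES128-SHA256",
    "cipher offered: TLS 1.2 x67 DHE-RSA-AES128-SHA256",
    "cipher offered: TLS 10 x67 DHE-RSA-AES128-SHA256",
]

pattern = re.compile(r"\bTLS 1(?!\.?\d)")

# match() anchors at position 0, so the prefix prevents any hit
with_match = [line for line in log_lines if pattern.match(line)]
# search() scans the whole line for the first position that fits
with_search = [line for line in log_lines if pattern.search(line)]

print(with_match)
print(with_search)
```

With these inputs, with_match is empty and with_search holds only the first line, since both .2 and the trailing 0 trip the negative lookahead.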

Upvotes: 2

Tim Biegeleisen

Reputation: 521194

Wiktor commented before I posted this (not surprising), but the marker for an exact match in this case is actually a space following TLS 1. A word boundary is not specific enough, because that would also pick up things like TLS 1.1, which you don't want. So try this version:

# returns only the lines where TLS 1 is followed by whitespace
lines = []
for i in target:
    if re.match(r'\bTLS 1\s', i):
        lines.append(i)

If the TLS text could possibly be the very last thing in a line, then we can try using this:

re.match(r'\bTLS 1(?=(\s|$))', i)
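To make the difference between the two versions concrete, here is a small sketch. The strings `bare` and `spaced` are invented for illustration; the patterns are the ones shown above.

```python
import re

bare = "TLS 1"                # hypothetical line that ends right after the version
spaced = "TLS 1 x67 DHE-RSA"  # hypothetical line with a space after the version

# Requiring \s fails when nothing follows the version
needs_space = re.match(r"\bTLS 1\s", bare)
# The lookahead accepts either whitespace or the end of the string
space_or_end = re.match(r"\bTLS 1(?=(\s|$))", bare)

print(needs_space)           # None
print(space_or_end.group())  # TLS 1

# \s consumes the space, while the lookahead leaves it out of the match
consumed = re.match(r"\bTLS 1\s", spaced).group()
not_consumed = re.match(r"\bTLS 1(?=(\s|$))", spaced).group()
print(repr(consumed), repr(not_consumed))  # 'TLS 1 ' 'TLS 1'
```

The lookahead variant also keeps the matched text free of the trailing space, which matters if you later use the match itself rather than just testing for it.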

Upvotes: 2
