Reputation: 43

Python extract tags from string by string array

I am new to Python and am looking for help extracting tags from a string using an array of known tags. Let's say I have the string array ['python', 'c#', 'java', 'f#'].

And the input string "I love Java and python".

The output should be the array ['java', 'python'].

Thanks for any help.

Upvotes: 2

Answers (3)

Anton vBR

Reputation: 18906

Tags that cannot be split on whitespace

Regex solution

import re

stringarray = ['python', 'c#', 'core java', 'f#' ]
string = "I love Core Java and python"

# Escape each tag so that characters such as '+' or '.' are matched literally
pattern = '|'.join(map(re.escape, stringarray))
output = re.findall(pattern, string.lower())
# ['core java', 'python']
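One caveat with a plain alternation is that it matches substrings, so 'java' would also be found inside 'javascript'. Below is a minimal sketch of one way around that, not the only one. The function name find_tags and the extra 'java' tag in the sample list are my own additions for illustration. It uses lookarounds instead of \b, because \b does not behave as expected next to a non-word character such as '#'.

```python
import re

def find_tags(text, tags):
    """Return tags that occur in text as whole tokens, in order of appearance."""
    # Try longer tags first so 'core java' wins over 'java' when both are listed.
    ordered = sorted(tags, key=len, reverse=True)
    alternation = '|'.join(re.escape(t) for t in ordered)
    # (?<!\w) and (?!\w) require that no word character touches the match.
    pattern = re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')
    return pattern.findall(text.lower())

print(find_tags("I love Core Java, JavaScript and C#",
                ['python', 'c#', 'core java', 'java', 'f#']))
# ['core java', 'c#']
```

Here 'javascript' is skipped because the 's' directly after 'java' is a word character.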

Non-regex solution

stringarray = ['python', 'c#', 'core java', 'f#' ]
string = "I love Core Java and python"
output = [i for i in stringarray if i in string.lower()]
# ['python', 'core java']  (results follow the order of stringarray)

Tags that can be split on whitespace or another character (quicker!)

Using set and intersection

stringarray = ['python', 'c#', 'java', 'f#' ]
string = "I love Java and python"

output = list(set(string.lower().split()).intersection(stringarray))
# ['java', 'python']  (order may vary, since sets are unordered)

Short explanation: string.lower().split() lower-cases the input string and splits it into words on the default separator (whitespace). Converting the result to a set gives access to the set method intersection, which returns the elements present in both collections. Finally, we wrap the result in a list to get the desired output. Note that a set is unordered, so the order of the tags in the result is not guaranteed. As commented by Joe Iddon, this will not return repeated tags.
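If you want each tag only once but in the order it first appears in the input, one option is dict.fromkeys instead of a set intersection. This is a minimal sketch. The function name unique_tags and the sample sentence are made up for illustration, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def unique_tags(text, tags):
    """Return each matching tag once, in order of first appearance."""
    tag_set = set(tags)  # membership tests on a set are O(1) on average
    words = text.lower().split()
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(w for w in words if w in tag_set))

tags = ['python', 'c#', 'java', 'f#']
print(unique_tags("I love python and Java and python", tags))
# ['python', 'java']
```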

Counts

Are you interested in counts? Consider using collections.Counter and a dict comprehension:

from collections import Counter

count = {k: v for k, v in Counter(string.lower().split()).items() if k in stringarray}
print(count)
# {'java': 1, 'python': 1}

Upvotes: 4

sacuL

Reputation: 51335

You could use the following list comprehension, which turns your string into lowercase, then iterates through each word (after using split), and returns which ones are in your array:

arr = ['python', 'c#', 'java', 'f#' ]
s = "I love Java and python"

outp = [i for i in s.lower().split() if i in arr]

>>> outp
['java', 'python']

Or you could use regular expressions:

import re

arr = ['python', 'c#', 'java', 'f#' ]
s = "I love Java and python"

# re.escape keeps special characters in the tags (e.g. '+' or '.') literal
outp = re.findall('|'.join(map(re.escape, arr)), s.lower())

>>> outp 
['java', 'python']

Upvotes: 3

Joe Iddon

Reputation: 20414

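Turn your tags list into a set so that each membership test is O(1) on average, then use a list comprehension to check every word. The whole search is then O(n) in the number of words, rather than O(n * m) when scanning a list of m tags for each word.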

def extract(string, tags):
    tags = set(tags)
    return [w for w in string.lower().split() if w in tags]

and a test:

>>> extract('I love Java and python', ['python', 'c#', 'java', 'f#' ])
['java', 'python']
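One thing to watch with any split()-based approach is punctuation: "Java," or "python." will not equal the tag. A minimal sketch of one way to handle it follows. The name extract_clean and the TRIM character set are assumptions of mine, chosen so that '#' and '+' stay part of the word.

```python
# Assumption: punctuation to trim from both ends of each word; '#' and '+' are kept on purpose.
TRIM = '.,;:!?()[]"\''

def extract_clean(text, tags):
    """Like extract above, but ignores punctuation attached to the ends of words."""
    tag_set = set(tags)
    words = (w.strip(TRIM) for w in text.lower().split())
    return [w for w in words if w in tag_set]

print(extract_clean("I love Java, C# and (python).", ['python', 'c#', 'java', 'f#']))
# ['java', 'c#', 'python']
```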