Reputation:
I have a string built from some extracted text, and a list of keywords. I would like to run through the string and, for each sentence that matches a keyword, extract only the sentence that follows it, with the full stop removed.
String
'Test string. removing data. keyword extraction. data number. 11123. final answer.'
Here is my list of key phrases:
lst= ['Test string', 'data number']
Desired output:
['removing data', '11123']
Could someone please help me out or point me in the right direction? Thanks
Upvotes: 1
Views: 934
Reputation: 10624
Here is my suggestion:
s='Test string. removing data. keyword extraction. data number. 11123. final answer.'
temp = [i.strip() for i in s.split('.')]
res = [temp[temp.index(i)+1] for i in lst]
print(res)
Output:
['removing data', '11123']
What it does:
temp = [i.strip() for i in s.split('.')]
s.split('.') converts your string into a list of strings, split at each dot. So you get each sentence separately:
['Test string', ' removing data', ' keyword extraction', ' data number', ' 11123', ' final answer', '']
This is wrapped in a list comprehension, which creates a new list from the one above with stripped values (i.strip() removes leading and trailing whitespace). So you end up with:
['Test string', 'removing data', 'keyword extraction', 'data number', '11123', 'final answer', '']
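On the last step, temp.index(i) returns the position of each key phrase in temp, and adding 1 selects the sentence after it. Note that index() only finds the first occurrence and raises a ValueError if a key phrase is not in the list.
It is safer to do it in a more straightforward way: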
res = [temp[idx+1] for idx, val in enumerate(temp) if val in lst]
For more information on enumerate, check the documentation.
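As a rough illustration of why the enumerate form is safer, here is a minimal sketch comparing the two approaches when a key phrase is missing from the text. The function names and the 'missing phrase' keyword are made up for this example, and the bounds check is an extra guard that is not part of the answer above.

```python
s = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
temp = [part.strip() for part in s.split('.')]

def next_sentences_index(sentences, keywords):
    # list.index raises ValueError when a keyword is not in the list.
    return [sentences[sentences.index(k) + 1] for k in keywords]

def next_sentences_enumerate(sentences, keywords):
    # Absent keywords are simply skipped; the bounds check (an added
    # assumption) guards against a keyword sitting in the last position.
    return [sentences[idx + 1]
            for idx, val in enumerate(sentences)
            if val in keywords and idx + 1 < len(sentences)]

print(next_sentences_enumerate(temp, ['Test string', 'missing phrase']))
# ['removing data']

try:
    next_sentences_index(temp, ['Test string', 'missing phrase'])
except ValueError as exc:
    print('index() failed:', exc)
```

One difference to keep in mind: the index() version returns results in the order of the keyword list, while the enumerate version returns them in the order the sentences appear in the text.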
Upvotes: 1
Reputation: 12337
Use a list comprehension, re.split and enumerate:
import re
my_str = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
key_phrases = ['Test string', 'data number']
my_str_phrases = re.split(r'[.]\s*', my_str)
print([my_str_phrases[idx + 1] for idx, item in enumerate(my_str_phrases) if item in key_phrases])
# ['removing data', '11123']
Note:
[.]\s*: A literal dot (it must either be part of a character class [] or be escaped like this: \.), followed by 0 or more whitespace characters.
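To make the note about the pattern concrete, here is a small sketch, assuming the same sample string, that compares the character-class form with the escaped form and shows what goes wrong with an unescaped dot. The variable names are only for illustration.

```python
import re

my_str = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'

# Both spellings match a literal dot followed by optional whitespace.
by_class = re.split(r'[.]\s*', my_str)
by_escape = re.split(r'\.\s*', my_str)
print(by_class == by_escape)  # True

# The string ends with a dot, so the last piece is an empty string.
print(by_class[-1] == '')  # True

# An unescaped dot matches any character, so every character becomes
# a separator and only empty pieces are left.
unescaped = re.split(r'.\s*', 'a. b')
print(any(unescaped))  # False
```

The trailing empty string is harmless here, since it never equals a key phrase, but it is worth knowing about if you reuse the list elsewhere.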
Upvotes: 0
Reputation: 1789
Here's one solution. Essentially, you split the input on a dot followed by a space to make a list. Then you iterate over that list and check whether each element is one of your key phrases. If it is, you add the next element to your output list.
Code:
text = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
text_as_list = text.split('. ')
lst = ['Test string', 'data number']
result = []
# Stop one short of the end so that i + 1 is always a valid index.
for i in range(len(text_as_list) - 1):
    for item in lst:
        if text_as_list[i] == item:
            # The key phrase matched, so keep the sentence that follows it.
            result.append(text_as_list[i + 1])
print(result)
Result:
['removing data', '11123']
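One thing to watch with split('. ') is that the last sentence keeps its full stop, because the final dot has no space after it. Below is a minimal sketch of one way to handle that, under the assumption that sentences are always separated by a dot and a single space. The function name sentence_after is hypothetical, and pairing with zip is a swapped-in alternative to the index-based loop above.

```python
def sentence_after(text, keywords):
    # rstrip('.') removes the full stop left on the final sentence.
    sentences = [part.rstrip('.') for part in text.split('. ')]
    # Pair each sentence with the one after it; the last sentence has
    # no successor, so it can never cause an out-of-range lookup.
    return [nxt for cur, nxt in zip(sentences, sentences[1:]) if cur in keywords]

text = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
print(sentence_after(text, ['Test string', 'data number']))  # ['removing data', '11123']
print(sentence_after(text, ['11123']))  # ['final answer']
```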
Upvotes: 0