Reputation: 621
I have used nltk to obtain a list of tokenised keywords. The Output is
['Natural', 'Language', 'Processing', 'with', 'PythonNatural', 'Language', 'Processingwith', 'PythonNatural', 'Language', 'Processing', 'with', 'Python', 'Editor', ':', 'Production', 'Editor', ':', 'Copyeditor']
I have a text file keyword.txt which contains following keywords:
Processing
Editor
Pyscripter
Language
Registry
Python
How can I match the keywords obtained from tokenization with my keyword.txt file so that a third file is created for the matched keywords?
This is a program I have been working on, but it creates a union of these two files:
import os

with open(r'D:\file3.txt', 'w') as fout:
    keywords_seen = set()
    for filename in r'D:\File1.txt', r'D:\Keyword.txt':
        with open(filename) as fin:
            for line in fin:
                keyword = line.strip()
                if keyword not in keywords_seen:
                    # write the stripped keyword, not the raw line,
                    # to avoid doubled newlines
                    fout.write(keyword + "\n")
                    keywords_seen.add(keyword)
Upvotes: 2
Views: 155
Reputation: 78690
How can I match the keywords obtained from tokenization with my keyword.txt file so that a third file is created for the matched keywords?
Here's a simple solution, adjust the filenames as needed.
# these are the tokens:
tokens = set(['Natural', 'Language', 'Processing', 'with', 'PythonNatural', 'Language', 'Processingwith', 'PythonNatural', 'Language', 'Processing', 'with', 'Python', 'Editor', ':', 'Production', 'Editor', ':', 'Copyeditor'])

# create a set containing the keywords
with open('keywords.txt', 'r') as keywords:
    keyset = set(keywords.read().split())

# write output file
with open('matches.txt', 'w') as matches:
    for word in keyset:
        if word in tokens:
            matches.write(word + '\n')
This will produce a file matches.txt containing the words
Language
Processing
Python
Editor
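Since both collections are sets, the loop is equivalent to a set intersection with the `&` operator. A minimal sketch with the token and keyword sets inlined (no files involved, so the sets here are just illustrative):

```python
# tokens from the tokenizer (duplicates collapse automatically in a set)
tokens = {'Natural', 'Language', 'Processing', 'with', 'PythonNatural',
          'Processingwith', 'Python', 'Editor', ':', 'Production', 'Copyeditor'}

# keywords read from keyword.txt
keyset = {'Processing', 'Editor', 'Pyscripter', 'Language', 'Registry', 'Python'}

# intersection keeps only the words present in both sets
matched = keyset & tokens
print(sorted(matched))  # ['Editor', 'Language', 'Processing', 'Python']
```

You could then write `'\n'.join(sorted(matched))` to the output file in one call instead of looping.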
Upvotes: 1