Reputation: 81
I have an array of strings and a text file. I want to loop through the text file line by line and check whether each element of my array is present (they must be whole words, not substrings). I am stuck because my script only checks for the presence of the first array element. Instead, I would like it to report each array element with a note saying whether that element is present anywhere in the file.
#!/usr/bin/python
with open("/home/all_genera.txt") as file:
    generaA = []
    for line in file:
        line = line.strip('\n')
        generaA.append(line)

with open("/home/config/config2.cnf") as config_file:
    counter = 0
    for line in config_file:
        line = line.strip('\n')
        for part in line.split():
            if generaA[counter] in part:
                print(generaA[counter], "is -----> PRESENT")
            else:
                continue
    counter += 1
Upvotes: 0
Views: 121
Reputation: 77912
If I understand correctly, you want a sequence of words that are in both files. If yes, set is your friend:
def parse(f):
    return set(word for line in f for word in line.strip().split())

with open("path/to/genera/file") as f:
    source = parse(f)
with open("path/to/conf/file") as f:
    conf = parse(f)

# elements that are common to both sets
common = conf & source
print(common)
# elements that are in `source` but not in `conf`
print(source - conf)
# elements that are in `conf` but not in `source`
print(conf - source)
So to answer "I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not", you can use either the common elements or the source - conf difference to annotate your source list:
# using common elements
common = conf & source
result = [(word, word in common) for word in source]
print(result)
# using difference
diff = source - conf
result = [(word, word not in diff) for word in source]
print(result)
Both will yield the same result, and since set lookup is O(1), performance should be similar too, so I suggest the first solution (positive assertions are easier on the brain than negative ones).
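As a sanity check, both annotation styles can be compared on a small hypothetical sample (the genus names below are made up for illustration):

```python
# Two hypothetical word sets standing in for the parsed files.
source = {"acacia", "betula", "quercus"}
conf = {"betula", "quercus", "salix"}

# Style 1: membership in the intersection.
common = conf & source
by_common = {(word, word in common) for word in source}

# Style 2: absence from the difference.
diff = source - conf
by_diff = {(word, word not in diff) for word in source}

# Both annotate each source word identically.
assert by_common == by_diff
print(sorted(by_common))  # → [('acacia', False), ('betula', True), ('quercus', True)]
```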
You can of course apply further cleaning / normalisation when building the sets, e.g. if you want a case-insensitive search:
def parse(f):
    return set(word.lower() for line in f for word in line.strip().split())
Upvotes: 1
Reputation: 2945
The counter is not increasing because it's outside the for loops.
with open("/home/all_genera.txt") as myfile:  # don't shadow 'file': it's a built-in name, use myfile instead
    generaA = []
    for line in myfile:  # use .readlines() if you want a list of lines!
        generaA.append(line.strip('\n'))  # strip the newline, or the membership test below will rarely match

# if you just need to know whether the strings are present anywhere in your file, you can use .read():
with open("/home/config/config2.cnf") as config_file:
    mytext = config_file.read()

for mystring in generaA:
    if mystring in mytext:
        print(mystring, "is -----> PRESENT")
# if you want to check whether your string at line N is present in the same line N of the other file:
with open("/home/config/config2.cnf") as config_file:
    for N, line in enumerate(config_file):
        if generaA[N] in line:
            print("{0} is -----> PRESENT in line {1}".format(generaA[N], N))
I hope that everything is clear. This code could be improved in many ways, but I tried to keep it as similar to yours as possible so it is easier to understand.
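One caveat worth noting: "mystring in mytext" is a substring test, while the question asks for whole-word matches. A word-boundary regex is one way to tighten it (a sketch with made-up sample text):

```python
import re

text = "Quercus robur grows here"

# The plain substring test also matches partial words:
assert "Quercu" in text

# A \b word-boundary regex only matches whole words:
assert re.search(r"\bQuercus\b", text)
assert not re.search(r"\bQuercu\b", text)
print("whole-word check works")
```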
Upvotes: -1
Reputation: 113988
import re
from collections import Counter

# first normalize the text: lowercase everything and replace punctuation
# (anything not alphanumeric or whitespace) with a space
normalized_text = re.sub(r"[^a-z0-9\s]", " ", open("some.txt").read().lower())
# note that this normalization is subject to the rules of the language/alphabet/dialect you are using, and english ascii may not cover it

# Counter will collect all the words into a dictionary of [word]:count
words = Counter(normalized_text.split())

# create a new set of all the words in both the text and our word_list_array
print(set(my_word_list_array).intersection(words.keys()))
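Putting that approach together on inline sample data (the file read is replaced with a string literal, and the text and word list are invented, so the sketch is self-contained):

```python
import re
from collections import Counter

sample_text = "The Quercus, the Betula; and the Salix.\nQuercus again."
my_word_list_array = ["quercus", "acacia", "salix"]

# Replace anything that is not a lowercase letter, digit, or whitespace
# with a space, so punctuation does not glue adjacent words together.
normalized_text = re.sub(r"[^a-z0-9\s]", " ", sample_text.lower())

# Count every whole word in the normalized text.
words = Counter(normalized_text.split())

# Words that appear in both the text and the word list.
present = set(my_word_list_array).intersection(words.keys())
print(sorted(present))  # → ['quercus', 'salix']
```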
Upvotes: 1