Reputation: 43
queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
                #Comparison
I'm trying to use Python to iterate over two files, each with multiple lines. I want to break each line in both files into words, and then compare each word in the current line of the "query" file with each word in the current line of the "tweet" file. The above is what I have so far, but it only works for the first line in the query file and skips the rest of its lines. It does work for each line in the tweet file. Any help?
Edit for the duplicate comment: I understand that after iterating over the queries file the file handle will be positioned at EOF. But I don't get why it isn't processing the next line in the queries file and instead goes directly to EOF.
Upvotes: 2
Views: 501
Reputation: 731
The problem is that, after you iterate through every line of a file, you're at EOF. You either have to open it again, or make sure each line is fully processed (split and compared, in your example) before reading, or iterating, to the next line. In your example, since the file tweets is at EOF after the first iteration over queries, it looks as if the file queries "skipped" to EOF at the start of the second iteration, simply because there are no more tweet lines left for the nested loop to iterate through.
Also, although garbage collection handles file closing for you, it is still better practice to close each opened file explicitly.
Refer to @Smac89's answer for modification.
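To see the exhaustion described above in isolation, here is a minimal sketch. It uses io.StringIO objects as stand-ins for the two real files (an assumption made so the example is self-contained), and the sample lines are invented for illustration.

```python
import io

# io.StringIO stands in for the real files; the contents are made up.
queries = io.StringIO("red car\nblue sky\n")
tweets = io.StringIO("a red balloon\nthe sky is blue\n")

pairs_per_query = []
for query in queries:
    count = 0
    # The inner loop resumes from the current position of `tweets`,
    # not from the start of the file.
    for tweet in tweets:
        count += 1
    pairs_per_query.append(count)

# The first query sees both tweets; the second finds `tweets` already at EOF.
print(pairs_per_query)  # [2, 0]
```

The outer loop does visit every query line; it is the inner loop that has nothing left to yield after the first pass.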
Upvotes: 1
Reputation: 148880
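You want to iterate over the second file for each line of the first file. But look at what happens: the inner loop reads the second file through to its end during the first pass of the outer loop, so on every later pass it is already at EOF and has nothing left to iterate over.
So you have to rewind the second file after each iteration over the first file. There are two ways to do it:
Load the second file into memory as a list of lines with readlines and iterate over that list. Because it is a list (and not a file), each iteration starts at the first element instead of at the current file position: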
queries = open(sys.argv[1],"rU")
tweets_file = open(sys.argv[2],"rU")
tweets = tweets_file.readlines() # tweets is now a list of lines
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
                #Comparison
Explicitly rewind the file with seek:
queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
                #Comparison
    tweets.seek(0) # explicitly rewind tweets
The first solution reads the second file only once but uses more memory. It should be preferred if the second file is small (less than a few hundred MB on recent machines). The second solution uses less memory and should be preferred if the second file is huge, or if you have to save memory for any reason (embedded system, keeping the script's footprint low, ...).
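As a runnable illustration of the in-memory approach, here is a small sketch. The function name shared_words and the set-intersection comparison are hypothetical choices made for the example; the question does not say what the actual comparison should be.

```python
def shared_words(query_lines, tweet_lines):
    """Return (query_index, tweet_index, common_words) for each overlapping pair."""
    # Materialize the tweets once so they can be traversed repeatedly,
    # even if `tweet_lines` is a file object or another one-shot iterator.
    tweets = list(tweet_lines)
    matches = []
    for qi, query in enumerate(query_lines):
        query_words = set(query.split())
        for ti, tweet in enumerate(tweets):
            # Hypothetical comparison: words that appear in both lines.
            common = query_words & set(tweet.split())
            if common:
                matches.append((qi, ti, sorted(common)))
    return matches


result = shared_words(
    ["red car", "blue sky"],
    ["a red balloon", "the sky is blue"],
)
print(result)  # [(0, 0, ['red']), (1, 1, ['blue', 'sky'])]
```

Passing open file objects instead of lists should work the same way, since list() consumes the tweets file exactly once.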
Upvotes: 0
Reputation: 43078
Consider using file.seek:
with open(sys.argv[1],"rU") as queries:
    with open(sys.argv[2],"rU") as tweets:
        for query in queries:
            query_words = query.split()
            for tweet in tweets:
                tweet_words = tweet.split()
                for qword in query_words:
                    for tword in tweet_words:
                        #Comparison
            tweets.seek(0) # go back to the start of the file
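For a version that can be run end to end, here is a sketch of the same seek(0) idea against two temporary files. The helper name count_pairs, the file names, and the sample contents are assumptions for illustration. It opens the files in plain text mode, since the "U" flag was removed from open() in Python 3.11.

```python
import os
import tempfile


def count_pairs(queries_path, tweets_path):
    """Count how many (query line, tweet line) pairs the nested loops visit."""
    pairs = 0
    with open(queries_path) as queries, open(tweets_path) as tweets:
        for query in queries:
            query_words = query.split()
            for tweet in tweets:
                tweet_words = tweet.split()
                pairs += 1  # placeholder for the real word-by-word comparison
            tweets.seek(0)  # rewind so the next query sees every tweet again
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    q_path = os.path.join(tmp, "queries.txt")
    t_path = os.path.join(tmp, "tweets.txt")
    with open(q_path, "w") as f:
        f.write("red car\nblue sky\n")
    with open(t_path, "w") as f:
        f.write("a red balloon\nthe sky is blue\nhello\n")
    total = count_pairs(q_path, t_path)

print(total)  # 6 (2 queries x 3 tweets)
```

Without the seek(0) call the same function would return 3, because only the first query would see any tweets.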
Upvotes: 1
Reputation: 534
Instead of nesting for loops like that, read both files in step with file.readline():
queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
query = queries.readline()
tweet = tweets.readline()
while (query != "" and tweet != ""):
    query_words = query.split()
    tweet_words = tweet.split()
    #comparison
    query = queries.readline()
    tweet = tweets.readline()
mirosval provided an easier answer; use his.
Upvotes: 1
Reputation: 6822
Essentially what happens is that you go through all the lines in one file while looking just at the first line in the other file. You cannot go through those lines in the next iteration, because you've already read them out.
Do it like this:
queries = open(sys.argv[1],"rU").readlines()
tweets = open(sys.argv[2],"rU").readlines()
for i in range(min(len(queries), len(tweets))):
    tweet = tweets[i]
    query = queries[i]
    # comparison
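If the goal really is to compare line i of one file with line i of the other, as this answer assumes, the built-in zip expresses the same pairing without indexing and without loading either file into memory. This is a swapped-in alternative, not the code above; the io.StringIO objects and the set-intersection comparison are stand-ins chosen for the example.

```python
import io

# io.StringIO stands in for the two opened files; the contents are made up.
queries = io.StringIO("red car\nblue sky\nextra query\n")
tweets = io.StringIO("a red balloon\nthe sky is blue\n")

paired = []
# zip stops as soon as the shorter input is exhausted,
# which mirrors range(min(len(queries), len(tweets))).
for query, tweet in zip(queries, tweets):
    common = set(query.split()) & set(tweet.split())
    paired.append(sorted(common))

print(paired)  # [['red'], ['blue', 'sky']]
```

Note that this pairs lines one to one; it does not compare every query against every tweet.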
Upvotes: 2