Reputation: 85
I have two txt files. One is very large (txt file 1, about 15,000 sentences), broken down in a set format (sentence index, word, tag) per line. The other (txt file 2) contains about 500 sentences broken down into the format (sentence index, word). I want to find the sentences from "txt file 2" that are in "txt file 1", but I also need to extract the tags.
format for txt file 1:
1 Flurazepam O
2 thus O
3 appears O
4 to O
5 be O
6 an O
7 effective O
8 hypnotic O
9 drug O
10 with O
format for txt file 2:
1 More
2 importantly
3 ,
4 this
5 fusion
6 converted
7 a
8 less
9 effective
10 vaccine
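For reference, each line of the two formats above can be parsed with a plain `split()` (a minimal sketch; the variable names are my own):

```python
# Each line of txt file 1 is "index word tag"; txt file 2 drops the tag.
line1 = "7 effective O"
line2 = "9 effective"

index1, word1, tag = line1.split()   # three fields
index2, word2 = line2.split()        # two fields

print(word1 == word2)  # True: the words match even though the indices differ
```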
Initially, I just tried something silly:
txtfile1 = open("/Users/Desktop/Final.txt").read().split('\n')
with open('/Users/Desktop/sentenceineed.txt', 'r') as txtfile2:
    whatineed = []
    for line in txtfile2:
        for part in txtfile1:
            if line == part:
                whatineed.append(part)
I'm getting nothing with this attempt, literally an empty list. Any suggestions would be great.
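For what it's worth, the equality check fails for two independent reasons: lines iterated from `txtfile2` keep their trailing newline, and the matching lines of the first file carry an extra tag field. A quick demonstration:

```python
line = "7 effective\n"   # as iterated from txtfile2: the newline is kept
part = "7 effective O"   # as produced by split('\n'): no newline, extra tag

print(line == part)           # False: newline and tag both differ
print(line.strip() == part)   # False: the tag still differs
```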
Upvotes: 3
Views: 957
Reputation: 646
@Rory Daulton pointed it out correctly: since your first file may be too large to load completely into memory, you should iterate over it instead.
Here I am writing my solution to the problem. You can make necessary/desired changes for your implementation.
dict_one = {}  # Empty dictionary for the second file
textfile2 = open('textfile2', 'r')
# Read textfile2 line by line, adding index and word to the dictionary
for line in textfile2:
    values = line.split(' ')
    dict_one[values[0].strip()] = values[1].strip()
textfile2.close()

outfile = open('output', 'w')       # Open a file for the output
textfile1 = open('textfile1', 'r')  # Open the first file
# Read the first file line by line
for line in textfile1:
    values = line.split(' ')
    word = values[1].strip()  # Extract the word from the line
    # Check whether the word exists in the dictionary
    if word in dict_one.values():
        # If the word exists, write index, word and tag to the output file
        outfile.write("{} {} {}\n".format(values[0].strip(), values[1].strip(), values[2].strip()))
outfile.close()
textfile1.close()
1 Flurazepam O
2 thus O
3 appears I
4 to O
5 be O
6 an O
7 effective B
8 hypnotic B
9 drug O
10 less O
11 converted I
12 maxis O
13 fusion I
14 grave O
15 public O
16 mob I
17 havoc I
18 boss O
19 less B
20 diggy I
1 More
2 importantly
3 ,
4 this
5 fusion
6 converted
7 a
8 less
9 effective
10 vaccine
7 effective B
10 less O
11 converted I
13 fusion I
19 less B
Here, less appears twice with different tags because it occurs twice in the data file. Hope this is what you were looking for.
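As a side note on this approach: `word in dict_one.values()` has to scan every stored word for each line of the large file. Keying the dictionary by word instead makes each membership test constant-time. A minimal sketch, assuming the same two-column format (the variable names are mine):

```python
# Map word -> index, so membership tests hit the hash table directly.
word_to_index = {}
for line in ["1 More", "2 importantly", "9 effective"]:
    index, word = line.split()
    word_to_index[word] = index

# O(1) lookup per word instead of scanning all values.
print("effective" in word_to_index)  # True
print("banana" in word_to_index)     # False
```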
Upvotes: 2
Reputation: 300
You can't find the occurrences of a given sentence because you are including the index when comparing. A sentence in the second file would therefore match one in the first only if it also had the same index, like so:
#file1
3 make tag
7 split tag
#file2
4 make
6 split
You are comparing them in the following way: if line == part. But obviously 4 make is not equal to 3 make tag, because you have 3 instead of 4, and in addition the tag part also makes the condition fail.
So by simply changing the conditional you can retrieve the right sentences.
def selectSentence(string):
    """Based on the strings in the example, I assume that the
    elements are separated by a single space character and that
    the words themselves contain no spaces."""
    elements = string.split(" ")
    return elements[1].strip()

txtfile1 = open("file1.txt").read().split('\n')
with open('file2.txt', 'r') as txtfile2:
    whatineed = []
    for line in txtfile2:
        for part in txtfile1:
            if selectSentence(line) == selectSentence(part):
                whatineed.append(part)
print(whatineed)
As @Rory Daulton pointed out, your file is very big, so it is a bad idea to load it all into memory. A better idea is to iterate over it while storing the needed data from the little file (the second one).
txtfile2 = open("file2.txt").read().split('\n')
sentences_inf2 = {selectSentence(line) for line in txtfile2}  # set to remove duplicates

with open('file1.txt', 'r') as txtfile1:
    whatineed = []
    for line in txtfile1:
        if selectSentence(line) in sentences_inf2:
            whatineed.append(line.strip())
print(whatineed)  # ['7 effective O']
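Since the question also asks for the tags: each stored line here still contains its tag, so splitting the matches yields (index, word, tag) tuples directly. A small sketch with inline data (my own helper, mirroring selectSentence above):

```python
def select_sentence(line):
    # The second whitespace-separated field is the word.
    return line.split(" ")[1].strip()

file1_lines = ["7 effective B", "10 less O", "12 maxis O"]
wanted_words = {"effective", "less"}

matches = [tuple(line.split()) for line in file1_lines
           if select_sentence(line) in wanted_words]
print(matches)  # [('7', 'effective', 'B'), ('10', 'less', 'O')]
```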
Upvotes: 0
Reputation: 22544
Since your first file is much larger than your second, you want to avoid putting the first file in memory all at once. Putting the second file in memory is no problem. A dictionary would be an ideal data type for this memory, since you can quickly find if a word exists in the dictionary and can quickly retrieve its index.
So think of your problem this way--find all the words in your first text file that are also in your second text file. So here is an algorithm in pseudo-code. You do not specify how the "output" is to be done, so I just generically called it "storage." You do not state if either "index" of the word is to be in the output, so I put it there. That would be trivial to remove, if you want.
Initialize a dictionary to empty
for each line in text_file_2:
    parse the index and the word
    add the word as the key and the index as the value to the dictionary

Initialize the storage for the final result
for each line in text_file_1:
    parse the index, word, and tag
    if the word exists in the dictionary:
        retrieve the index from the dictionary
        store the word, tag, and both indices
Here is code for that algorithm. I left it "expanded" rather than using comprehensions, for ease of understanding and debugging.
dictfile2 = dict()
with open('txtfile2.txt') as txtfile2:
    for line2 in txtfile2:
        index2, word2 = line2.strip().split()
        dictfile2[word2] = index2

listresult = list()
with open('txtfile1.txt') as txtfile1:
    for line1 in txtfile1:
        index1, word1, tag1 = line1.strip().split()
        if word1 in dictfile2:
            index2 = dictfile2[word1]
            listresult.append((word1, tag1, int(index1), int(index2)))
Here is the result of that code, given your example data, for print(listresult). You may want a different format for the result.
[('effective', 'O', 7, 9)]
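If the matches should go back out in the original three-column format, the tuples above can be formatted line by line (a sketch; keeping the first file's index is my assumption about the desired output):

```python
listresult = [('effective', 'O', 7, 9)]

# Rebuild "index word tag" lines from the stored tuples.
output_lines = ["{} {} {}".format(index1, word, tag)
                for word, tag, index1, index2 in listresult]
print("\n".join(output_lines))  # 7 effective O
```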
Upvotes: 1
Reputation: 491
Assuming that the spacing in your text files remains consistent:
import re

# Open your files
text_file1 = open('txt file 1.txt', 'r')
text_file2 = open('txt file 2.txt', 'r')

# Save each line's content in a list like l = [[id, word, tag]]
text_file_1_list = [l.strip('\n') for l in text_file1.readlines()]
text_file_1_list = [" ".join(re.split(r"\s+", l, flags=re.UNICODE)).split(' ') for l in text_file_1_list]

# Similarly, save all the words of the second file in a list
text_file_2_list = [l.strip('\n') for l in text_file2.readlines()]
text_file_2_list = [" ".join(re.split(r"\s+", l, flags=re.UNICODE)).split(' ')[1] for l in text_file_2_list]
print(text_file_2_list)

# Now a simple search between these two lists
words_found = [[l[1], l[2]] for l in text_file_1_list if l[1] in text_file_2_list]
print(words_found)
# [['effective', 'O']]
I think this should work.
Upvotes: 0