StandardNerd

Reputation: 4183

Delete duplicate word combinations in a text file with Python

With the help of eumiro's answer to "Delete duplicate rows in textfile - except it contains a "{" or "}"", I could successfully remove duplicate lines in a large text file. That's a huge step: the file went from 60 MB to 3 MB.

But now I want to delete duplicate words like this:

  @INBOOK{Miller1992,
  author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
    R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland
    S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and
    Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
    Miller, Rowland S. und Mark R. Leary},
  year = {1992},
  editor = {Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun
    A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A.
    van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van
    Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and
    Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk},
  title = {Handbook of discourse analysis (Bd. 3/4)},

The result should look like this:

  @INBOOK{Miller1992,
  author = {Miller,  Rowland S. und Mark R. Leary},
  year = {1992},
  editor = {Teun A. van Dijk},
  title = {Handbook of discourse analysis (Bd. 3/4)},

The text file has 70,000 lines, and the author names can appear in multiple entries. Therefore only the duplicates between the curly brackets (which can span multiple lines) should be removed:

  author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
  R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland
  S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and
  Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
  Miller, Rowland S. und Mark R. Leary},

I tried to modify my Python script, which deletes duplicate lines, so that it deletes duplicate words between the curly brackets, but I'm stuck:

words_seen = set()  # holds words already seen
outfile = open("literatur_clean.txt", "w")
for line in open("literatur_dupl.txt", "r"):
    if '{' in line or '}' in line:
        pass  # some code to check whether the words are duplicate
outfile.close()

Upvotes: 0

Views: 223

Answers (1)

Hans Then

Reputation: 11322

Based on your current dataset, it looks like this is not so much a question of duplicate words as of the author or editor sometimes being repeated n times.

You could try splitting the field on the string " and ". Then check whether the resulting items are all the same (e.g. put all the strings in a set, or use them as keys in a dictionary). If the length of the set equals 1, you have removed all duplicates. If not, " and " was probably also part of an author or editor name, and you have to merge the pieces again.
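A minimal sketch of the split idea, assuming the text between the curly brackets has already been extracted into one string. The function name `collapse_repeats` is made up for illustration. Because the values are wrapped over several lines, it first treats every run of whitespace as a single space:

```python
def collapse_repeats(value):
    """Collapse a field value in which names are repeated, joined by ' and '.

    Returns the cleaned value and a flag that is True when only one
    distinct item was left, i.e. the field was one name repeated n times.
    """
    # Join wrapped lines: any run of whitespace becomes a single space.
    flat = " ".join(value.split())
    parts = flat.split(" and ")
    # dict.fromkeys drops repeats while keeping first-seen order.
    unique = list(dict.fromkeys(parts))
    return " and ".join(unique), len(unique) == 1


value = """Teun A. van Dijk and Teun
    A. van Dijk and Teun A. van Dijk"""
cleaned, single = collapse_repeats(value)
print(cleaned, single)
```

When the flag is False, the field either lists several genuinely different people or contains a damaged repeat, so those entries are worth checking by hand.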

If that does not work (e.g. because your dataset is not as neat as suggested), you can find the duplicates by looking for places where the start of the string is repeated:

Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary 
^                                        ^
1                                        2

Move a pointer through the text string, starting just after the beginning of the string. For each position, find the longest match between the text starting at the pointer and the beginning of the string. Save these matches.
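The pointer scan could be sketched roughly as follows. This is a naive version that compares character by character, which should be fine for short field values. The name `prefix_match_lengths` is hypothetical, and what to do with the saved matches afterwards is left open here, as it is above:

```python
def prefix_match_lengths(text):
    """For each position i > 0, record how many characters of text[i:]
    match the beginning of text. Positions with no match are skipped."""
    lengths = {}
    for i in range(1, len(text)):
        n = 0
        while i + n < len(text) and text[n] == text[i + n]:
            n += 1
        if n:
            lengths[i] = n
    return lengths


sample = ("Miller, Rowland S. und Mark R. Leary and "
          "Miller, Rowland S. und Mark R. Leary")
lengths = prefix_match_lengths(sample)
# Position with the longest repeat of the start of the string.
best = max(lengths, key=lengths.get)
print(best, lengths[best], repr(sample[best:]))
```

A long match at some position suggests the start of the string is repeated there. Very short matches, such as a single shared capital letter, are just noise and can be filtered out with a minimum length.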

Upvotes: 1
