Reputation: 189
I would like to read lines from a text file and build a distance matrix based on Wu-Palmer distance between the words. Eg:
        House  Grass  Boat  Cat
House   x      y      ..    ..
Grass   x1     y1     ..    ..
Boat    x2     y2     ..    ..
Cat     x3     y3     ..    ..
I would like to know if there are any existing functions I can use in Python to read lines from a text file and output them as the rows and columns of the distance matrix.
Upvotes: 1
Views: 186
Reputation: 4343
If your input is simply whitespace-delimited words then you can easily iterate through them like this:
words = set()
with open("input.txt", "r") as fd:
    for line in fd:
        words.update(line.split())
The use of a set ensures that each word is only ever recorded once - it sounded like this is what you were after.
If your input is running English text then things become a little harder, because you want to catch things like "I'd". You should also decide whether to treat hyphenated words (e.g. "part-time") as a single word; my example here does, but it's easy to change. Much as I'm not a fan of them, this is one place where regular expressions are actually quite useful:
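import re

# Split on runs of anything that is not a word character, hyphen or apostrophe.
non_word_re = re.compile(r"[^-\w']+")
words = set()
with open("input.txt", "r") as fd:
    for line in fd:
        # split() can yield empty strings, so test i[:1] rather than i[0].
        words.update(i for i in non_word_re.split(line) if i[:1].isalpha())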
This will create a set of words, where a word is any run of one or more characters from the set [-a-zA-Z0-9_'] whose first character is a letter.
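To make the tokenising rule concrete, here is a small sketch that applies the same pattern to a single sample sentence. The helper name extract_words and the sample text are made up for illustration; only the regular expression comes from the answer.

```python
import re

# Same pattern as in the answer: split on runs of anything that is not
# a word character, hyphen or apostrophe.
non_word_re = re.compile(r"[^-\w']+")


def extract_words(line):
    """Return the tokens in `line` whose first character is a letter.

    Hypothetical helper, not part of the original answer.
    """
    # tok[:1] is "" for an empty token, so empty strings are skipped safely.
    return [tok for tok in non_word_re.split(line) if tok[:1].isalpha()]


sample = "I'd like a part-time job, 2nd choice: -none-."
tokens = extract_words(sample)
print(tokens)
```

Tokens such as "2nd" and "-none-" are dropped because they do not start with a letter, while "I'd" and "part-time" survive as single words.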
After this, you can calculate the distance between each pair of words easily:
# calculate_distance() stands in for your Wu-Palmer distance function.
all_distances = {}
for word in words:
    all_distances[word] = {i: calculate_distance(word, i) for i in words}
There's probably a cleaner data structure than the nested dictionaries here, but it's simple enough that I think that would suffice.
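The calculate_distance function is left undefined above. As a rough sketch of what it might look like, the following computes a Wu-Palmer distance over a tiny hand-written taxonomy. The PARENT table is entirely hypothetical and only covers the four example words (in lower case); in practice you would more likely look the words up in WordNet, for example through NLTK, whose synsets offer a wup_similarity method. Distance is taken here as 1 minus the Wu-Palmer similarity, which is an assumption about what the question means by "distance".

```python
# Hypothetical toy taxonomy: each concept maps to its parent (root has None).
PARENT = {
    "entity": None,
    "object": "entity",
    "organism": "entity",
    "artifact": "object",
    "house": "artifact",
    "boat": "artifact",
    "plant": "organism",
    "animal": "organism",
    "grass": "plant",
    "cat": "animal",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def calculate_distance(word1, word2):
    """Wu-Palmer distance: 1 - 2*depth(lcs) / (depth(word1) + depth(word2))."""
    ancestors1 = set(path_to_root(word1))
    # Walking upward from word2, the first concept shared with word1
    # is the least common subsumer (lcs).
    lcs = next(c for c in path_to_root(word2) if c in ancestors1)
    similarity = 2.0 * depth(lcs) / (depth(word1) + depth(word2))
    return 1.0 - similarity


print(calculate_distance("house", "boat"))
print(calculate_distance("house", "cat"))
```

With this toy table, "house" and "boat" share the close ancestor "artifact" and so come out nearer to each other than "house" and "cat", which only meet at the root.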
Finally, you can output a tab-delimited matrix something like this:
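with open("output.txt", "w") as fd:
    # Header row: an empty corner cell followed by the sorted words.
    fd.write("\t" + "\t".join(sorted(all_distances)) + "\n")
    for word1, distances in sorted(all_distances.items()):
        # Distances are numbers, so convert them to strings before joining.
        fd.write(word1 + "\t" + "\t".join(str(d) for _, d in sorted(distances.items())) + "\n")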
If you wanted something closer to a pretty-formatted output matrix (i.e. where each column is automatically sized according to its contents) then that's still not hard per se, but it's a little fiddly and requires rather more code.
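For what it's worth, one possible way to do the auto-sized version is sketched below. The function name format_matrix, the two-decimal number format and the two-space column gap are all arbitrary choices for illustration, and the demo dictionary is made-up data in the same nested-dictionary shape as all_distances.

```python
def format_matrix(all_distances):
    """Return the nested-dict matrix as text with aligned columns."""
    words = sorted(all_distances)
    # Build every cell as a string first so that widths can be measured.
    rows = [[""] + words]
    for w1 in words:
        rows.append([w1] + ["%.2f" % all_distances[w1][w2] for w2 in words])
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        # Strip trailing padding from the last column.
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


# Made-up distances, purely to show the layout.
demo = {
    "boat": {"boat": 0.0, "cat": 0.75},
    "cat": {"boat": 0.75, "cat": 0.0},
}
print(format_matrix(demo))
```

Left-justifying every cell keeps the code short; right-justifying the numeric cells with rjust would be an equally reasonable choice.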
As an aside, if you want to read or write files in CSV format then take a look at the Python csv module; it handles tedious things like quoting for you.
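A minimal sketch of the csv route, assuming the same nested-dictionary layout as all_distances. The function name write_matrix_csv and the demo data are hypothetical, and an in-memory buffer is used here instead of a real file so the result can be inspected directly.

```python
import csv
import io


def write_matrix_csv(all_distances, fd):
    """Write the nested-dict matrix to the open text file `fd` as CSV."""
    words = sorted(all_distances)
    writer = csv.writer(fd)
    # Header row: an empty corner cell followed by the sorted words.
    writer.writerow([""] + words)
    for w1 in words:
        # csv.writer converts the numeric cells to strings itself.
        writer.writerow([w1] + [all_distances[w1][w2] for w2 in words])


# Made-up distances, purely to show the layout.
demo = {
    "boat": {"boat": 0.0, "cat": 0.75},
    "cat": {"boat": 0.75, "cat": 0.0},
}
buffer = io.StringIO()
write_matrix_csv(demo, buffer)
print(buffer.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with newline="", e.g. open("output.csv", "w", newline=""), so that line endings are not translated twice.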
Was that the sort of thing you were after?
Upvotes: 1