How to have same column and row headings using python?

Question

I would like to read lines from a text file and build a distance matrix based on the Wu-Palmer distance between the words. For example:

           House    Grass   Boat   Cat
House       x        y       ..    ..
Grass       x1       y1      ..    ..
Boat        x2       y2      ..    ..
Cat         x3       y3      ..    ..

I would like to know whether there are any existing functions in Python that I can use to read lines from a text file and output them as the rows and columns of the distance matrix.

Cartroo · Accepted Answer

If your input is simply whitespace-delimited words then you can easily iterate through them like this:

words = set()
with open("input.txt", "r") as fd:
    for line in fd:
        words.update(line.split())

The use of a set ensures that each word is only ever recorded once - it sounded like this is what you were after.

If your input is running English text then things become a little harder, because you want to catch things like "I'd". You should also decide whether to class hyphenated words (e.g. "part-time") as a single word. My example here does, but it's easy to change. Much as I'm not a fan of them, this is somewhere where regular expressions are actually quite useful:

import re
import string

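# Split on runs of anything that is not a word character, hyphen or apostrophe.
non_word_re = re.compile(r"[^-\w']+")
words = set()
with open("input.txt", "r") as fd:
    for line in fd:
        # split() can yield empty strings at the ends of a line, so skip those.
        words.update(i for i in non_word_re.split(line) if i and i[0] in string.ascii_letters)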

This builds a set of words, where a word is any run of one or more characters from the set [a-zA-Z0-9_'-] whose first character is a letter. (In Python 3, \w also matches non-ASCII letters and digits unless you compile the pattern with re.ASCII.)

After this, you can calculate the distance between each pair of words easily:

all_distances = {}
for word in words:
    # calculate_distance is a placeholder for your Wu-Palmer calculation.
    all_distances[word] = {i: calculate_distance(word, i) for i in words}

There's probably a cleaner data structure than the nested dictionaries here, but it's simple enough that I think that would suffice.
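If you'd rather avoid the nested dictionaries, one alternative is a sorted list of labels plus a square list of lists. This is only a sketch: build_matrix and length_gap are names I've made up for illustration, and length_gap is a deliberately trivial stand-in metric, not Wu-Palmer. You'd pass your real distance function in its place.

```python
def build_matrix(words, distance):
    """Return (labels, rows): sorted labels and a square list of lists."""
    labels = sorted(words)
    # rows[i][j] holds the distance between labels[i] and labels[j].
    rows = [[distance(a, b) for b in labels] for a in labels]
    return labels, rows


def length_gap(a, b):
    # Hypothetical stand-in metric, NOT Wu-Palmer: difference in word length.
    return abs(len(a) - len(b))


labels, rows = build_matrix({"House", "Grass", "Boat", "Cat"}, length_gap)
```

Because the labels are sorted once and reused for both axes, the row and column headings are guaranteed to match.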

Finally, you can output a tab-delimited matrix something like this:

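with open("output.txt", "w") as fd:
    # Header row: an empty corner cell, then the words in sorted order.
    fd.write("\t" + "\t".join(sorted(all_distances)) + "\n")
    for word1, distances in sorted(all_distances.items()):
        # Sort each row by word so its cells line up with the header.
        fd.write(word1 + "\t" + "\t".join(str(d) for _, d in sorted(distances.items())) + "\n")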

If you wanted something closer to a pretty-formatted output matrix (i.e. where each column is automatically sized according to its contents), that's still not hard per se, but it's a little fiddly and requires rather more code.
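To give an idea of what the auto-sized version involves, here's a minimal sketch. format_matrix is a hypothetical helper, it assumes you already have the sorted labels and a square list of rows, and the numbers in the example are invented placeholders rather than real Wu-Palmer values.

```python
def format_matrix(labels, rows):
    """Return the matrix as text with each column padded to its widest cell."""
    # First record is the header, with an empty corner cell.
    table = [[""] + list(labels)]
    for label, row in zip(labels, rows):
        table.append([label] + [str(value) for value in row])
    # The width of each column is the length of its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*table)]
    lines = []
    for record in table:
        padded = [cell.ljust(width) for cell, width in zip(record, widths)]
        # Strip the padding after the last cell of each line.
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)


# Placeholder values, purely for illustration.
print(format_matrix(["Boat", "Cat"], [[0.0, 0.25], [0.25, 0.0]]))
```

Left-justifying every cell keeps it simple. If your distances are all floats you may prefer to format them to a fixed number of decimal places before padding.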

As an aside, if you want to read or write files in CSV format, take a look at the Python csv module; it handles tedious things like quoting for you.
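For instance, writing the matrix with the csv module might look something like the sketch below. write_matrix_csv is a name I've made up, the values are invented placeholders, and it assumes Python 3 with the labels and rows already built. When writing to a real file rather than an in-memory buffer, open it with newline="" as the csv documentation recommends.

```python
import csv
import io


def write_matrix_csv(fd, labels, rows):
    """Write a header row of labels, then one labelled row per word."""
    writer = csv.writer(fd)
    writer.writerow([""] + list(labels))
    for label, row in zip(labels, rows):
        writer.writerow([label] + list(row))


# An in-memory buffer stands in for a real output file here.
buffer = io.StringIO()
write_matrix_csv(buffer, ["part-time", "I'd"], [[0.0, 0.5], [0.5, 0.0]])
```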

Was that the sort of thing you were after?
