user1992696

Reputation: 189

How to have same column and row headings using python?

I would like to read lines from a text file and build a distance matrix based on Wu-Palmer distance between the words. Eg:

           House    Grass   Boat   Cat
House       x        y       ..    ..
Grass       x1       y1      ..    ..
Boat        x2       y2      ..    ..
Cat         x3       y3      ..    ..

I would like to know if there are any existing functions I can use in Python to read lines from a text file and output the lines as the rows and columns of the distance matrix.

Upvotes: 1

Views: 186

Answers (1)

Cartroo

Reputation: 4343

If your input is simply whitespace-delimited words then you can easily iterate through them like this:

words = set()
with open("input.txt", "r") as fd:
    for line in fd:
        words.update(line.split())

The use of a set ensures that each word is only ever recorded once - it sounded like this is what you were after.

If your input is running English text then things become a little harder, because you want to catch things like "I'd". You should also decide whether to class hyphenated words (e.g. "part-time") as a single word; my example here does, but it's easy to change. Much as I'm not a fan of them, this is one place where regular expressions are actually quite useful:

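import re

# Split on any run of characters that are not a hyphen, a word character or an apostrophe.
non_word_re = re.compile(r"[^-\w']+")
words = set()
with open("input.txt", "r") as fd:
    for line in fd:
        # Skip the empty strings split() can produce, and keep only tokens that start with a letter.
        words.update(i for i in non_word_re.split(line) if i and i[0].isalpha())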

This will create a set of words, where a word is any run of one or more characters from the set [a-zA-Z0-9_'-] whose first character is a letter.
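To see what that filter keeps and drops, here is a small sketch that runs the same pattern over one made-up line. The helper name extract_words and the sample sentence are only for illustration. Note that on Python 3 \w also matches non-ASCII letters, so the character set is a little wider than the ASCII one described above.

```python
import re

# Same pattern as above: split on anything that is not a hyphen,
# a word character or an apostrophe.
non_word_re = re.compile(r"[^-\w']+")

def extract_words(line):
    """Return the set of tokens in `line` that start with a letter."""
    # extract_words is a hypothetical helper name, not part of any library.
    return {tok for tok in non_word_re.split(line) if tok and tok[0].isalpha()}

# "2nd" starts with a digit and "-none-" starts with a hyphen, so both are dropped.
sample_words = extract_words("I'd like a part-time job, 2nd choice: -none-")
print(sorted(sample_words))
```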

After this, you can calculate the distance between each pair of words easily:

all_distances = {}
for word in words:
    all_distances[word] = dict((i, calculate_distance(word, i)) for i in words)

There's probably a cleaner data structure than the nested dictionaries here, but it's simple enough that I think that would suffice.
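One alternative you could consider is a single dictionary keyed by word pairs. If your distance function is symmetric (an assumption you should check for your Wu-Palmer implementation), each unordered pair only needs to be calculated once. In this sketch calculate_distance is just a stand-in based on word length so the example runs on its own; swap in your real function.

```python
from itertools import combinations

def calculate_distance(a, b):
    # Hypothetical stand-in so the sketch is runnable;
    # replace this with your real Wu-Palmer distance function.
    return abs(len(a) - len(b))

def pairwise_distances(words):
    """Map (word1, word2) -> distance, computing each unordered pair once.

    Assumes the distance function is symmetric.
    """
    ordered = sorted(words)
    # Distance of each word to itself (the diagonal of the matrix).
    distances = {(w, w): calculate_distance(w, w) for w in ordered}
    for a, b in combinations(ordered, 2):
        d = calculate_distance(a, b)
        distances[(a, b)] = d
        distances[(b, a)] = d  # mirror the value instead of recomputing it
    return distances

distances = pairwise_distances({"House", "Grass", "Boat", "Cat"})
print(distances[("Boat", "Cat")])
```

A lookup is then distances[(word1, word2)] rather than all_distances[word1][word2], and for a slow distance function you save roughly half of the calls.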

Finally, you can output a tab-delimited matrix something like this:

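with open("output.txt", "w") as fd:
    # Use one sorted list of words for both the header and the rows so they line up.
    columns = sorted(all_distances)
    fd.write("\t" + "\t".join(columns) + "\n")
    for word1 in columns:
        row = all_distances[word1]
        # Distances are numbers, so convert them to strings before joining.
        fd.write(word1 + "\t" + "\t".join(str(row[word2]) for word2 in columns) + "\n")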

If you wanted something closer to a pretty-formatted output matrix (i.e. where each column is automatically sized according to its contents) then that's still not hard per se, but it's a little fiddly and requires rather more code.
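For what it's worth, a rough sketch of the auto-sized version might look like the following. It assumes the nested-dictionary layout used above with numeric distances; the function name format_matrix, the two-decimal format and the sample values are my own choices for illustration, not real Wu-Palmer scores.

```python
def format_matrix(all_distances):
    """Return the nested-dict matrix as aligned text, one row per line."""
    labels = sorted(all_distances)
    # Build every cell as a string first so the widths can be measured.
    rows = [[""] + labels]
    for row_label in labels:
        rows.append([row_label] + ["%.2f" % all_distances[row_label][c] for c in labels])
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        # Left-align the row heading, right-align the values under their column headings.
        cells = [row[0].ljust(widths[0])]
        cells += [cell.rjust(w) for cell, w in zip(row[1:], widths[1:])]
        lines.append("  ".join(cells))
    return "\n".join(lines)

# Made-up values purely to show the layout.
sample = {
    "Boat": {"Boat": 1.0, "Cat": 0.5},
    "Cat": {"Boat": 0.5, "Cat": 1.0},
}
formatted = format_matrix(sample)
print(formatted)
```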

As an aside, if you want to read or write files in CSV format then take a look at the Python csv module, which handles tedious things like quoting for you.
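As a minimal sketch of the CSV route, again assuming the nested-dictionary layout from above: the function name write_matrix_csv and the sample values are made up for the example, and it writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

def write_matrix_csv(all_distances, fd):
    """Write the nested-dict matrix to the file-like object `fd` as CSV."""
    labels = sorted(all_distances)
    writer = csv.writer(fd)
    # Header row: an empty corner cell followed by the column headings.
    writer.writerow([""] + labels)
    for row_label in labels:
        writer.writerow([row_label] + [all_distances[row_label][c] for c in labels])

# Made-up values purely to show the layout.
sample = {
    "Boat": {"Boat": 1.0, "Cat": 0.5},
    "Cat": {"Boat": 0.5, "Cat": 1.0},
}
buffer = io.StringIO()
write_matrix_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file on Python 3, open it with newline="" as the csv documentation recommends, e.g. open("output.csv", "w", newline="").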

Was that the sort of thing you were after?

Upvotes: 1
