Reputation: 1
I am new to Python and need help computing statistics such as average words per sentence, average characters per word, and total word and sentence counts. I have three text files containing large samples of text. This is what I have so far.
sampleText1 = open("textFile1.txt").read()  # read the whole file into one string
split1 = sampleText1.split(" ")
words1 = len(split1)
That's really all I have. Since I would have to reuse that code for the other two text documents, I was thinking that maybe I should create a function or something like that. I found this code posted by Inbar Rose on Stack Overflow. Should I use the following code similarly?
def clean_up(word, punctuation="!\"',;:.-?)([]<>*#\n\\"):
    return word.lower().strip(punctuation) # you don't really need ".lower()"

def average_word_length(text):
    cleaned_words = [clean_up(w) for w in (w for l in text for w in l.split())]
    return sum(map(len, cleaned_words))/len(cleaned_words) # Python2 use float
>>> average_word_length(['James Fennimore Cooper\n', 'Peter, Paul and Mary\n'])
I am thinking I need to do something like this. Could anyone help me find these averages? Also, if anyone knows of any good resources for learning Python, then please let me know. I am currently using http://learnpythonthehardway.org/book/, Khan Academy Python videos, and some videos on Python on Lynda.com.
Upvotes: 0
Views: 140
Reputation: 2511
The question as stated is asking for advice on coding rather than finding a concrete bug. But advice in this case is kind of hard to give because the structure of your code (should you have one function to read the data or more than one?) really depends on a lot of other things that you haven't specified, such as: how much text (can it easily fit in memory? do you want to avoid looping over the corpus a bunch of times or is that no big deal?), how many times you're going to be doing the calculation, what you're using it for, etc.
The more text you have, the more these questions matter and the more the right answer depends on the details.
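If memory does turn out to be a concern, one option is to stream the file line by line and keep running totals instead of reading everything at once. Here is a minimal sketch, assuming Python 3 and that words are separated by whitespace. The function name count_words_streaming is made up for illustration, and punctuation attached to a word is counted as part of it:

```python
import io

def count_words_streaming(lines):
    """Return (word_count, char_count) for an iterable of text lines."""
    word_count = 0
    char_count = 0
    for line in lines:
        # split() with no argument collapses runs of whitespace
        words = line.split()
        word_count += len(words)
        char_count += sum(len(w) for w in words)
    return word_count, char_count

# An open file is also an iterable of lines, e.g.:
#     with open("textFile1.txt") as f:
#         words, chars = count_words_streaming(f)
# Here an in-memory sample stands in for the file.
sample = io.StringIO("James Fennimore Cooper\nPeter, Paul and Mary\n")
words, chars = count_words_streaming(sample)
print("words:", words, "average characters per word:", chars / words)
```

This makes one pass over the text and never holds more than one line in memory. For files that fit comfortably in memory, reading the whole thing with read() is simpler.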
Now, somewhat related to "how to get this to work" is "what do I want this to do". As a data scientist, my advice would be to get something working first on a small sample and see if that's useful.
But if you want to compute average number of words per sentence, try this on a small sample and iterate on it until it gets you what you want:
for sentence in sampleText1.split("."):
    print(sentence)
Does this look ok? Maybe you want to worry about ellipses...or not? If it looks fine, then try looking at:
for sentence in sampleText1.split("."):
    print(sentence.split(" "))
How does this work? Do you want to worry about double spaces or not? What about hyphens? etc.? If that does look ok, then look at:
sentence_lengths = [len(sentence.split(" ")) for sentence in sampleText1.split(".")]
# lists have no .sum() method, so use the built-in sum(); 1. forces float division
the_mean = 1. * sum(sentence_lengths) / len(sentence_lengths)
print("average sentence length: %s" % the_mean)
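Once the pieces look right on a small sample, they can be wrapped in a function so the same code runs on all three files. The following is only a sketch under some assumptions: Python 3, sentences end at ".", "!" or "?", and words are whitespace-separated with surrounding punctuation stripped. The names text_stats and stats_for_file are hypothetical:

```python
import re

def text_stats(text):
    """Return simple counts and averages for a string of text."""
    # Split on runs of ., ! or ? so "..." does not create empty sentences,
    # then drop pieces that contain only whitespace.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Strip surrounding punctuation and drop tokens that were only punctuation.
    cleaned = [w.strip("!\"',;:.-?)([]<>*#") for w in text.split()]
    cleaned = [w for w in cleaned if w]
    return {
        "sentences": len(sentences),
        "words": len(cleaned),
        "words_per_sentence": len(cleaned) / len(sentences) if sentences else 0.0,
        "chars_per_word": sum(map(len, cleaned)) / len(cleaned) if cleaned else 0.0,
    }

def stats_for_file(path):
    """Read a whole file into memory and compute its statistics."""
    with open(path) as f:
        return text_stats(f.read())

# Try it on a small inline sample before pointing it at the real files.
print(text_stats("Hello world. How are you? Fine..."))
```

With this in place, calling stats_for_file("textFile1.txt") and then the same function on the other two files avoids copying the code three times. Abbreviations such as "Mr." will still be counted as sentence ends, so check the output on a sample you can verify by hand.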
Upvotes: 1