Reputation: 141
I'm working on taking a sample of the Declaration of Independence and calculating the frequency of the length of words in it.
Sample text from file:
"When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires
that they should declare the causes which impel them to the separation."
Note: The word length cannot include any punctuation e.g. anything from string.punctuation.
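For reference, this is what string.punctuation contains:
import string
print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~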
Expected Outcome (sample):
Length Count
1 16
2 267
3 267
4 169
5 140
6 112
7 99
8 68
9 61
10 56
11 35
12 13
13 9
14 7
15 2
I'm currently stuck on removing punctuation from the file that I've converted into a list.
Here is what I've tried so far:
import sys
import string

def format_text(fname):
    punc = set(string.punctuation)
    words = fname.read().split()
    return ''.join(word for word in words if word not in punc)

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

fname = open(sys.argv[1], 'r')
formatted_text = format_text(fname)
print(formatted_text)
Upvotes: 1
Views: 2516
Reputation: 39287
You can use translate to strip the punctuation:
import string
words = fname.read().translate(None, string.punctuation).split()
Best way to strip punctuation from a string in Python
py2.7:
import string
from collections import defaultdict
from collections import Counter

def s1():
    with open("myfile.txt", "r") as f:
        counts = defaultdict(int)
        for line in f:
            words = line.translate(None, string.punctuation).split()
            for length in map(len, words):
                counts[length] += 1
        return counts

def s2():
    with open("myfile.txt", "r") as f:
        counts = Counter(length for line in f for length in map(len, line.translate(None, string.punctuation).split()))
        return counts
print s1()
defaultdict(<type 'int'>, {1: 111, 2: 1169, 3: 1100, 4: 1470, 5: 1425, 6: 1318, 7: 1107, 8: 875, 9: 938, 10: 108, 11: 233, 12: 146})
print s2()
Counter({4: 1470, 5: 1425, 6: 1318, 2: 1169, 7: 1107, 3: 1100, 9: 938, 8: 875, 11: 233, 12: 146, 1: 111, 10: 108})
In Python 2.7, using Counter is slower than building up the dictionary manually because of the way Counter's update is implemented.
%timeit s1()
100 loops, best of 3: 4.42 ms per loop
%timeit s2()
100 loops, best of 3: 9.27 ms per loop
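(%timeit is an IPython magic; if you want to reproduce the timings from a plain script, the standard-library timeit module can call the functions directly. A rough sketch, assuming s1 and s2 are defined as above and myfile.txt exists:)
import timeit
print(timeit.timeit(s1, number=100))  # total seconds for 100 calls to s1
print(timeit.timeit(s2, number=100))  # total seconds for 100 calls to s2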
py3:
I think Counter was updated in Python 3.2 and is now as fast as, or faster than, building the dictionary manually.
Python 3's translate is also less verbose: you build the table once with str.maketrans and pass just the table:
import string
from collections import defaultdict
from collections import Counter

strip_punct = str.maketrans('', '', string.punctuation)

def s1():
    with open("myfile.txt", "r") as f:
        counts = defaultdict(int)
        for line in f:
            words = line.translate(strip_punct).split()
            for length in map(len, words):
                counts[length] += 1
        return counts

def s2():
    with open("myfile.txt", "r") as f:
        counts = Counter(length for line in f for length in map(len, line.translate(strip_punct).split()))
        return counts
print(s1())
defaultdict(<class 'int'>, {1: 111, 2: 1169, 3: 1100, 4: 1470, 5: 1425, 6: 1318, 7: 1107, 8: 875, 9: 938, 10: 108, 11: 233, 12: 146})
print(s2())
Counter({4: 1470, 5: 1425, 6: 1318, 2: 1169, 7: 1107, 3: 1100, 9: 938, 8: 875, 11: 233, 12: 146, 1: 111, 10: 108})
%timeit s1()
100 loops, best of 3: 11.4 ms per loop
%timeit s2()
100 loops, best of 3: 11.2 ms per loop
Upvotes: 2
Reputation: 180401
You can strip the punctuation from the words and also avoid reading the whole file into memory:
punc = string.punctuation
return ' '.join(word.strip(punc) for line in fname for word in line.split())
If you want to remove the ' from Nature's, then you will need translate:
from string import punctuation
# use ord of characters you want to replace as keys and what you want to replace them with as values
tbl = {ord(k):"" for k in punctuation}
return ' '.join(line.translate(tbl) for line in fname)
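A quick illustration of the difference on a word from the sample text: strip only touches leading and trailing characters, while translate removes the internal apostrophe too.
from string import punctuation
tbl = {ord(k): "" for k in punctuation}
print("Nature's,".strip(punctuation))   # Nature's
print("Nature's,".translate(tbl))       # Natures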
To get the frequency, use a Counter dict:
from collections import Counter
freq = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
Or depending on your approach:
freq = Counter(len(word.strip(punc)) for line in fname for word in line.split())
Using the lines in your question above as an example:
lines =""""When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires
that they should declare the causes which impel them to the separation."""
from collections import Counter
freq = Counter(len(word.strip(punctuation)) for line in lines.splitlines() for word in line.split())
print(freq.most_common())
This outputs (key, value) tuples starting with the word length seen the most and going down to the least; the key is the length and the second element is the frequency:
[(3, 15), (2, 12), (4, 9), (5, 9), (6, 9), (7, 7), (8, 5), (9, 3), (1, 1), (10, 1)]
If you want to output the frequencies in order, starting from one-letter words, without sorting:
mx = max(freq)  # longest word length seen

for i in range(1, mx + 1):
    v = freq[i]
    if v:
        print("length {} words appeared {} time/s.".format(i, v))
Output:
length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
For a missing key, a Counter dict, unlike a normal dict, will not raise a KeyError but will return a value of 0, so if v will only be True for word lengths that actually appeared in the file.
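A quick demonstration of that behaviour:
from collections import Counter
c = Counter()
print(c[5])  # 0 -- no KeyError for a missing key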
If you want to print the cleaned data, putting all the logic in functions:
import sys
import string
from collections import Counter

def clean_text(fname):
    punc = string.punctuation
    return [word.strip(punc) for line in fname for word in line.split()]

def get_freq(cleaned):
    return Counter(len(word) for word in cleaned)

def freq_output(d):
    mx = max(d)  # longest word length seen
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

fname = open(sys.argv[1], 'r')
formatted_text = clean_text(fname)
print(" ".join(formatted_text))
print()
freq = get_freq(formatted_text)
freq_output(freq)
Which, run on the snippet from your question, outputs:
~$ python test.py test.txt
When in the Course of human events it becomes necessary for one people
to dissolve the political bands which have connected them with another
and to assume among the powers of the earth the separate and equal station
to which the Laws of Nature and of Nature's God entitle them a decent
respect to the opinions of mankind requires that they should declare
the causes which impel them to the separation
length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
If you only care about the frequency output, do it all in one pass:
import sys
import string
from collections import Counter

def freq_output(fname):
    from string import punctuation
    # use ord of characters you want to replace as keys and what to replace them with as values
    tbl = {ord(k): "" for k in punctuation}
    # pick one of the two lines below (strip or translate); keeping both would exhaust the file on the first pass
    d = Counter(len(word.strip(punctuation)) for line in fname for word in line.split())
    d = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
    mx = max(d)  # longest word length seen
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

fname = open(sys.argv[1], 'r')
freq_output(fname)
Use whichever approach is correct for d.
Upvotes: 4
Reputation:
You can use regular expressions:
import re

def format_text(fname, pattern):
    words = fname.read()
    return re.sub(pattern, '', words)

p = re.compile(r'[!&:;",.]')
fh = open('C:/Projects/ExplorePy/test.txt')
text = format_text(fh, p)
Apply split() as you like, and the pattern can be refined.
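For instance, one way to refine the pattern so it covers everything in string.punctuation is to build it with re.escape, which makes characters like . and * match literally (a sketch, independent of the file path above):
import re
import string

p = re.compile('[' + re.escape(string.punctuation) + ']')
print(re.sub(p, '', "Nature's God entitle them, a decent respect").split())
# ['Natures', 'God', 'entitle', 'them', 'a', 'decent', 'respect']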
Upvotes: 0