newbie

Reputation: 11

Reading text from a file, then writing to another file with repetitions in text marked

I'm a beginner to both Python and to this forum, so please excuse any vague descriptions or mistakes.

I have a problem regarding reading/writing to a file. What I'm trying to do is to read a text from a file and then find the words that occur more than one time, mark them as repeated_word and then write the original text to another file but with the repeated words marked with star signs around them.

I find it difficult to understand how to compare just the words (without punctuation etc.) while still being able to write each word to the file in its original context.

Some people have recommended regex, but I don't know how to use it. Another approach is to iterate through the text string character by character, tokenizing and normalizing as I go, and then make some kind of object or element out of each word.

I am thankful to anyone who might have ideas on how to solve this. The main problem is not how to find which words are repeated but how to mark them and then write them to the file in their context. Some help with the coding would be much appreciated, thanks.

EDIT I have updated the code with what I've come up with so far. If there is anything you would consider "bad coding", please comment on it.

To explain the Whitelist class: the assignment has two parts, one where I am supposed to mark the words and one regarding a whitelist, which contains words that are "allowed repetitions" and shall therefore not be marked.

I have read heaps of stuff about regex but I still can't get my head around how to use it.

Upvotes: 1

Views: 133

Answers (3)

David Z

Reputation: 131640

Basically, you need to do two things: find which words are repeated, and then transform each of these words into something else (namely, the original word with some marker around it). Since there's no way to know which words are repeated without going through the entire file, you will need to make two passes.

For the first pass, all you need to do is extract the words from the text and count how many times each one occurs. In order to determine what the words are, you can use a regular expression. A good starting point might be

regex = re.compile(r"[\w']+")

The function re.compile creates a regular expression from a string. This regular expression matches any sequence of one or more word characters (\w) or apostrophes, so it will catch contractions but not punctuation, and I think in many "normal" English texts this should capture all the words.

Once you have created the regular expression object, you can use its finditer method to iterate over all matches of this regular expression in your text.

for match in regex.finditer(text):
    word = match.group()  # finditer yields match objects; group() is the matched string

You can use the Counter class to count how many times each word occurs. (I leave the implementation as an exercise. :-P The documentation should be quite helpful.)
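For reference, a minimal sketch of what the first pass might look like. The sample text is made up, and lowercasing each match is an assumption on my part (it makes "The" and "the" count as the same word); drop it if your assignment wants case-sensitive counts.

```python
import re
from collections import Counter

text = "The cat saw the other cat. Don't go, don't stay."  # made-up sample text

word_regex = re.compile(r"[\w']+")

# finditer yields match objects; group() returns the matched string.
# Lowercasing (an assumption) makes "The" and "the" count as one word.
counts = Counter(match.group().lower() for match in word_regex.finditer(text))
```

Afterwards counts behaves like a dictionary, so counts["cat"] gives the number of times "cat" was seen.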

After you've gotten a count of how many times each word occurs, you will have to pick out those whose counts are 2 or more, and come up with some way to identify them in the input text. I think a regular expression will also help you here. Specifically, you can create a regular expression object which will match any of a selected set of words, by compiling the string consisting of the words joined by |.

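regex = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')

where words is a list or set or some other iterable of strings. The \b anchors keep a short word such as "the" from matching inside a longer one such as "other", and re.escape protects any characters that have a special meaning in a regular expression. Since you're new to Python, let's not get too fancy (although one can); just code up a way to go through your Counter or whatever and create a list of all words which have a count of 2 or more, then create the regular expression as I showed you.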
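One way this step could be sketched, assuming the counts were made on lowercased words. The counts object here is a hypothetical stand-in for whatever your first pass produced, and the re.IGNORECASE flag and the longest-first ordering are my own additions, not something the assignment requires.

```python
import re
from collections import Counter

counts = Counter({"the": 2, "cat": 2, "saw": 1})  # hypothetical counts from the first pass

# Keep only the words seen at least twice.
repeated = [word for word, n in counts.items() if n >= 2]

# Longest words first, so a shorter alternative never shadows a longer one.
alternatives = sorted(repeated, key=len, reverse=True)

# re.escape guards against regex metacharacters; \b keeps "the" from
# matching inside "other". IGNORECASE lets "the" also find "The".
pattern = r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b"
repeated_regex = re.compile(pattern, re.IGNORECASE)
```

If no word is repeated, the list is empty and the resulting pattern matches empty strings, so it is worth checking for that case before using the regex.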

Once you have that, you'll probably benefit from the sub method, which takes a string and replaces all matches of the regular expression in it with some other text. In your case, the replacement text will be the original word with asterisks around it, so you can do this:

new_text = regex.sub(r'*\g<0>*', text)

In a regular expression replacement, \g<0> refers to whatever was matched by the regex. Note that sub takes the replacement first and the text to search second.
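A small sketch of the substitution on a made-up sentence. The pattern is hard-coded here purely for illustration; in your program it would be built from the repeated words.

```python
import re

text = "The cat saw the other cat."  # made-up sample text

# Hypothetical pattern standing in for one built from the repeated words.
repeated_regex = re.compile(r"\b(?:the|cat)\b", re.IGNORECASE)

# Pattern.sub takes the replacement first, then the string to search.
# \g<0> stands for the whole match, so each hit is wrapped in asterisks.
new_text = repeated_regex.sub(r"*\g<0>*", text)
```

Because the replacement reuses the matched text itself, the original capitalization and all surrounding punctuation are left as they were.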

Finally, you can write new_text to a file.

Upvotes: 1

TimGJ

Reputation: 1654

OK. I presume that this is a homework assignment, so I'm not going to give you a complete solution. But, you really need to do a number of things.

The first is to read the input file into memory. Then split it into its component words (tokenize it), probably held in a list and suitably cleaned up to remove stray punctuation. You seem to be well on your way to doing that, but I would recommend you look at the split() and strip() methods available for strings.
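As a rough illustration of split() and strip() working together (the sample string is made up, and using string.punctuation as the set of characters to trim is just one possible choice):

```python
import string

text = 'Hello, world! "Hello" again...'  # made-up sample text

# split() breaks on any run of whitespace; strip(string.punctuation)
# trims punctuation from both ends of each token, not from the middle.
tokens = [token.strip(string.punctuation) for token in text.split()]

# Drop tokens that consisted of nothing but punctuation.
tokens = [token for token in tokens if token]
```

Since strip() only touches the ends of a token, an apostrophe inside a contraction such as "don't" survives.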

You need to consider whether you want the count to be case sensitive or not, and so you might want to convert each word in the list to (say) lowercase to keep this consistent. So you could do this with a for loop and the string lower() method, but a list-comprehension is probably better.

You then need to go through the list of words and count how many times each one appears. If you check out collections.Counter you will find that it does the heavy lifting for you; alternatively, you will need to build a dictionary which has the words as keys and their counts as values. (You might also want to check out the collections.defaultdict class here as well.)
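To show the shape of it without doing the whole assignment, here is a sketch of the lowercasing and both counting options. The tokens list is a hypothetical stand-in for your cleaned-up words.

```python
from collections import Counter, defaultdict

tokens = ["The", "cat", "saw", "the", "Cat"]  # hypothetical cleaned tokens

# List comprehension that normalizes case.
words = [token.lower() for token in tokens]

# Option 1: Counter does the counting in one call.
counts = Counter(words)

# Option 2: a defaultdict(int) starts every missing key at 0.
manual_counts = defaultdict(int)
for word in words:
    manual_counts[word] += 1
```

Both end up holding the same word-to-count mapping; Counter just saves you the loop.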

Finally, you need to go through the text you've read from the file and for each word it contains which has more than one match (i.e. the count in the dictionary or counter is > 1) mark it appropriately. Regular expressions are designed to do exactly this sort of thing. So I recommend you look at the re library.
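One possible way to do the marking is to pass a function to re.sub instead of a replacement string; the function is called once per match and decides what to put back. This is only a sketch: the sample text is made up, the word pattern is an assumption, and the whitelist set is a hypothetical stand-in for the whitelist part of your assignment.

```python
import re
from collections import Counter

text = "The cat saw the other cat."  # made-up sample text
whitelist = {"the"}                   # hypothetical allowed repetitions

word_regex = re.compile(r"[\w']+")    # assumed definition of a word
counts = Counter(m.group().lower() for m in word_regex.finditer(text))

def mark(match):
    """Return the word wrapped in asterisks if it is a non-whitelisted repeat."""
    word = match.group()
    key = word.lower()
    if counts[key] > 1 and key not in whitelist:
        return f"*{word}*"
    return word

# Only the matched words are replaced; punctuation and spacing stay put.
marked = word_regex.sub(mark, text)
```

Since the regex only ever touches the words themselves, everything between them is copied through unchanged, which takes care of keeping each word in its original context.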

Having done that, you simply then write the result to a file, which is simple enough.

Finally, with respect to your file operations (reading and writing) I would recommend you consider replacing the try ... except construct with a with ... as one.
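A minimal sketch of the with ... as form. The file names are hypothetical and live in a temporary directory only so that the example runs anywhere; the upper() call is a placeholder for your real processing.

```python
import os
import tempfile

# Hypothetical file names in a temporary directory.
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, "input.txt")
out_path = os.path.join(folder, "output.txt")

# Create a small input file to read back.
with open(in_path, "w", encoding="utf-8") as f:
    f.write("The cat saw the other cat.")

# The with block closes the file when it ends, even if an exception is raised.
with open(in_path, encoding="utf-8") as infile:
    text = infile.read()

with open(out_path, "w", encoding="utf-8") as outfile:
    outfile.write(text.upper())  # placeholder for the marked text
```

Note that with takes care of closing the file, not of reporting a missing one; if you need a friendly message for a file that does not exist, you may still want to catch FileNotFoundError around the open call.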

Upvotes: 0

Nathan Villaescusa

Reputation: 17649

If you know that the text only contains alphabetic characters, it may be easier to just ignore characters that are outside of a-z than to try to remove all the punctuation.

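Here is one way to lowercase the text and remove all characters that are not a-z or whitespace (keeping newlines and tabs as well as spaces stops words on adjacent lines from running together):

file = ''.join(c for c in file.lower() if 97 <= ord(c) <= 122 or c.isspace())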

This works because ord() returns the Unicode code point of a given character, which for plain English letters is the same as the ASCII code, and 97-122 represent a-z (in lowercase).

Then you'll want to split the result into words, which you can accomplish like this:

words = file.split()

If you pass this to the Counter data structure it will count the occurrences of each word.

from collections import Counter
counter = Counter(file.split())

Then counter.items() will yield (word, count) pairs, one for each distinct word.
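For example, picking out the repeated words from those pairs might look like this (the text is a made-up string, assumed to be already reduced to lowercase letters and spaces):

```python
from collections import Counter

file = "the cat saw the other cat"  # hypothetical, already cleaned text
counter = Counter(file.split())

# Keep the words whose count is above one; sorted() just makes the order stable.
repeated = sorted(word for word, count in counter.items() if count > 1)
```

Keep in mind that the cleaned string is only good for counting; the marking still has to be done on the original text, since the punctuation is gone from this copy.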

Upvotes: 0
