Reputation: 11
I'm a beginner to both Python and to this forum, so please excuse any vague descriptions or mistakes.
I have a problem regarding reading from and writing to a file. What I'm trying to do is read a text from a file, find the words that occur more than once, mark them as repeated_word, and then write the original text to another file with the repeated words marked with star signs around them.
I find it difficult to understand how to compare just the words (without punctuation etc.) while still being able to write the words in their original context to the file.
I have been recommended to use regex by some, but I don't know how to use it. Another approach is to iterate through the text string, tokenizing and normalizing it character by character, and then make some kind of object or element out of each word.
I am thankful to anyone who might have ideas on how to solve this. The main problem is not how to find which words are repeated but how to mark them and then write them to the file in their context. Some help with the coding would be much appreciated, thanks.
EDIT I have updated the code with what I've come up with so far. If there is anything you would consider "bad coding", please comment on it.
To explain the Whitelist class: the assignment has two parts, one where I am supposed to mark the words and one regarding a whitelist, which contains words that are "allowed repetitions" and shall therefore not be marked.
I have read heaps of stuff about regex but I still can't get my head around how to use it.
Upvotes: 1
Views: 133
Reputation: 131640
Basically, you need to do two things: find which words are repeated, and then transform each of these words into something else (namely, the original word with some marker around it). Since there's no way to know which words are repeated without going through the entire file, you will need to make two passes.
For the first pass, all you need to do is extract the words from the text and count how many times each one occurs. In order to determine what the words are, you can use a regular expression. A good starting point might be
regex = re.compile(r"[\w']+")
The function re.compile creates a regular expression from a string. This regular expression matches any sequence of one or more word characters (\w) or apostrophes, so it will catch contractions but not punctuation, and I think in many "normal" English texts this should capture all the words.
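Once you have created the regular expression object, you can use its finditer method to iterate over all matches of this regular expression in your text. Each item it yields is a match object, and calling group() on it returns the matched word.
for match in regex.finditer(text):
    word = match.group()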
You can use the Counter class to count how many times each word occurs. (I leave the implementation as an exercise. :-P The documentation should be quite helpful.)
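As a rough sketch of this first pass (not a full solution), the counting could look something like the following. The sample text and the choice to lowercase every word are assumptions made for illustration.

```python
import re
from collections import Counter

# Sample text standing in for the file contents (an assumption).
text = "The cat saw the dog. The dog didn't see the cat!"

# Same pattern as above: runs of word characters or apostrophes.
word_re = re.compile(r"[\w']+")

# finditer yields match objects; group() returns the matched string.
# Lowercasing (an assumption) makes "The" and "the" count as one word.
counts = Counter(m.group().lower() for m in word_re.finditer(text))

print(counts["the"])     # 4
print(counts["didn't"])  # 1
```

Punctuation never reaches the counter here, because the pattern only matches the words themselves.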
After you've gotten a count of how many times each word occurs, you will have to pick out those whose counts are 2 or more, and come up with some way to identify them in the input text. I think a regular expression will also help you here. Specifically, you can create a regular expression object which will match any of a selected set of words, by compiling the string consisting of the words joined by |. Escaping each word with re.escape and wrapping the alternation in \b word boundaries keeps the pattern from matching inside longer words (for example, "the" inside "there").
regex = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')
where words is a list or set or some iterable. Since you're new to Python, let's not get too fancy (although one can); just code up a way to go through your Counter or whatever and create a list of all words which have a count of 2 or more, then create the regular expression as I showed you.
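One plain way to build that list, sketched with made-up counts (the numbers below are hypothetical, standing in for whatever the first pass produced):

```python
from collections import Counter

# Hypothetical counts, as the first pass might produce them.
counts = Counter({"the": 4, "cat": 2, "dog": 2, "saw": 1})

# Keep only the words that occur at least twice.
repeated = [word for word, n in counts.items() if n >= 2]

print(sorted(repeated))  # ['cat', 'dog', 'the']
```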
Once you have that, you'll probably benefit from the sub method, which takes a replacement and a string, and replaces all matches of the regular expression in the string with the replacement. In your case, the replacement text will be the original word with asterisks around it, so you can do this:
new_text = regex.sub(r'*\g<0>*', text)
In a Python regular expression replacement, \g<0> refers to whatever was matched by the regex.
Finally, you can write new_text to a file.
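Putting the two passes together, a minimal sketch might look like this. It is a variation on the approach above, not the only way to do it: instead of building an alternation of the repeated words, it reuses the word pattern and passes a function to sub. The names mark_repeated and mark_file are hypothetical, and counting case-insensitively is an assumption.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")

def mark_repeated(text):
    """Return text with every word that occurs more than once wrapped in asterisks."""
    # Pass 1: count the words, ignoring case (an assumption).
    counts = Counter(m.group().lower() for m in WORD_RE.finditer(text))
    repeated = {word for word, n in counts.items() if n > 1}

    # Pass 2: rewrite only the matched words; punctuation and spacing
    # between the matches are left exactly as they were.
    def star(match):
        word = match.group()
        return "*" + word + "*" if word.lower() in repeated else word

    return WORD_RE.sub(star, text)

def mark_file(in_path, out_path):
    # Hypothetical helper; the caller supplies both paths.
    with open(in_path, encoding="utf-8") as src:
        text = src.read()
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(mark_repeated(text))

print(mark_repeated("The cat saw the dog. A dog barked!"))
# *The* cat saw *the* *dog*. A *dog* barked!
```

Because sub only touches the matched spans, the words are compared without their punctuation but written back in their original context.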
Upvotes: 1
Reputation: 1654
OK. I presume that this is a homework assignment, so I'm not going to give you a complete solution. But, you really need to do a number of things.
The first is to read the input file into memory. Then split it into its component words (tokenize it), probably stored in a list, suitably cleaned up to remove stray punctuation. You seem to be well on your way to doing that, but I would recommend you look at the split() and strip() methods available for strings.
You need to consider whether you want the count to be case sensitive or not, and so you might want to convert each word in the list to (say) lowercase to keep this consistent. You could do this with a for loop and the string lower() method, but a list comprehension is probably better.
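For instance, a small sketch of the tokenizing and normalizing step using only those string methods (the sample line is made up for illustration):

```python
import string

line = 'Hello, world! "Hello" again.'

# split() breaks the line on whitespace; strip(string.punctuation) trims
# punctuation from both ends of each token; lower() normalizes the case.
words = [token.strip(string.punctuation).lower() for token in line.split()]

# Drop tokens that consisted of nothing but punctuation.
words = [w for w in words if w]

print(words)  # ['hello', 'world', 'hello', 'again']
```

Note that strip() only trims the ends of each token, so an apostrophe inside a word such as "don't" is kept.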
You then need to go through the list of words and count how many times each one appears. If you check out collections.Counter you will find that it does the heavy lifting for you or, alternatively, you will need to build a dictionary which has the words as keys and the counts as values. (You might also want to check out the collections.defaultdict class here as well.)
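To compare the three options side by side, here is a short sketch with a made-up word list:

```python
from collections import Counter, defaultdict

words = ["the", "cat", "the", "dog", "cat", "the"]

# Plain dictionary: look up the current count, defaulting to 0.
plain = {}
for w in words:
    plain[w] = plain.get(w, 0) + 1

# defaultdict(int) supplies the 0 automatically for unseen keys.
dd = defaultdict(int)
for w in words:
    dd[w] += 1

# Counter does the whole loop in one call.
counter = Counter(words)

print(plain == dict(dd) == dict(counter))  # True
print(counter["the"])                      # 3
```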
Finally, you need to go through the text you've read from the file and, for each word it contains which has more than one match (i.e. the count in the dictionary or counter is > 1), mark it appropriately. Regular expressions are designed to do exactly this sort of thing, so I recommend you look at the re library.
Having done that, you simply then write the result to a file, which is simple enough.
Finally, with respect to your file operations (reading and writing), I would recommend you consider replacing the try ... except construct with a with ... as one.
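A minimal sketch of the with ... as form follows. The file name is hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

# A temporary directory keeps this runnable anywhere; real code would use
# the assignment's own input and output paths instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")  # hypothetical file name

    # The file is closed automatically when the block ends, even on error.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write("some *marked* text\n")

    with open(path, encoding="utf-8") as in_file:
        contents = in_file.read()

print(contents)
```

If you still want to report a missing input file, you can wrap the with block in a try ... except for that specific error rather than dropping error handling altogether.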
Upvotes: 0
Reputation: 17649
If you know that the text only contains alphabetic characters, it may be easier to just ignore characters that are outside of a-z than to try to remove all the punctuation.
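Here is one way to remove all characters that are not a-z or whitespace. The text is lowercased first so that capital letters are not thrown away, and all whitespace is kept so that words on neighbouring lines do not run together:
file = ''.join(c for c in file.lower() if 97 <= ord(c) <= 122 or c.isspace())
This works because ord() returns the character code for a given character, and ASCII 97-122 represent a-z (in lowercase).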
Then you'll want to split the result into words, which you can do like this:
words = file.split()
If you pass this list to the Counter data structure (from the collections module), it will count the occurrences of each word.
counter = Counter(words)
Then counter is a mapping from each word to its number of occurrences, and counter.items() gives the (word, count) pairs.
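Taken together, the steps in this answer can be sketched as follows. The sample string is made up, and lowercasing the text and keeping all whitespace (not only spaces) are assumptions made so that capitals and line breaks are handled.

```python
from collections import Counter

# Sample text standing in for the file contents (an assumption).
raw = "The cat.\nThe hat!"

# Keep only a-z and whitespace, after lowercasing.
cleaned = ''.join(c for c in raw.lower() if 'a' <= c <= 'z' or c.isspace())

words = cleaned.split()
counter = Counter(words)

print(words)           # ['the', 'cat', 'the', 'hat']
print(counter["the"])  # 2
```

Keep in mind that this filter also removes apostrophes and digits, so "don't" becomes "dont".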
Upvotes: 0