Reputation: 87
I have a text file containing a large amount of text written in paragraphs.
I need to count certain punctuation symbols without regex: , and ;. I also need to count ' and -, but only under certain circumstances. Specifically:
' marks, but only when they appear as apostrophes surrounded by letters, i.e. indicating a contraction such as "shouldn't" or "won't". (The apostrophe is included as an indication of more informal writing, perhaps direct speech.)
- signs, but only when they are surrounded by letters, indicating a compound word, such as "self-esteem".
Any other punctuation or characters, e.g. digits, should be regarded as white space, so they serve to end words.
A double hyphen -- is also to be regarded as a space character.
I first created a string and stored some punctuation characters in it, for example punctuation_string = ";./'-", but that gives me the total; what I need is a count for each individual punctuation character.
Because of that, I have to change the certain_char variable a number of times.
with open("/Users/abhishekabhishek/downloads/l.txt") as f:
    text_lis = f.read().split()
punctuation_count = {}
certain_char = "/"
freq_count = 0
for word in text_lis:
    for char in word:
        if char in certain_char:
            freq_count += 1
punctuation_count[certain_char] = freq_count
I need values to be displayed like this:
; 40
. 10
/ 5
' 16
etc., but what I get is the total (71).
Upvotes: 0
Views: 1906
Reputation: 8576
You will need to create a dictionary where each entry stores the count of each of those punctuation characters.
For commas and semicolons, we can simply do a string search to count the number of occurrences in a word. But we'll need to handle ' and - slightly differently.
This should take care of all the cases:
with open("/Users/abhishekabhishek/downloads/l.txt") as f:
    text_words = f.read().split()

punctuation_count = {}
punctuation_count[','] = 0
punctuation_count[';'] = 0
punctuation_count["'"] = 0
punctuation_count['-'] = 0

def search_for_single_quotes(word):
    single_quote = "'"
    search_char_index = word.find(single_quote)
    search_char_count = word.count(single_quote)
    # Skip words that contain no quote, or more than one quote.
    if search_char_index == -1 or search_char_count != 1:
        return
    index_before = search_char_index - 1
    index_after = search_char_index + 1
    # Check that the characters before and after the quote are letters,
    # and that the letter after the quote is the last character of the word.
    # Will detect `won't`, `shouldn't`, but not `ab'cd`, `y'ess`
    if index_before >= 0 and word[index_before].isalpha() and \
            index_after == len(word) - 1 and word[index_after].isalpha():
        punctuation_count[single_quote] += 1

def search_for_hyphens(word):
    hyphen = "-"
    # Only the first hyphen in the word is examined.
    search_char_index = word.find(hyphen)
    if search_char_index == -1:
        return
    index_before = search_char_index - 1
    index_after = search_char_index + 1
    # Check that the characters before and after the hyphen are letters.
    # You can also change it to check for digits as well as letters,
    # depending on your use case.
    if index_before >= 0 and word[index_before].isalpha() and \
            index_after < len(word) and word[index_after].isalpha():
        punctuation_count[hyphen] += 1

for word in text_words:
    # Commas and semicolons are counted wherever they appear.
    for search_char in [',', ';']:
        search_char_count = word.count(search_char)
        punctuation_count[search_char] += search_char_count
    search_for_single_quotes(word)
    search_for_hyphens(word)

print(punctuation_count)
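As a possible variation on the code above, here is a minimal sketch that scans character by character instead of word by word, so every ' or - with a letter on both sides is counted (for example, both hyphens in "mother-in-law"). The function name count_punctuation and the sample string are made up for illustration, and the rule "a letter on each side" is my reading of the question rather than something the answer above implements.

```python
def count_punctuation(text):
    """Count , and ; everywhere, and ' and - only between two letters."""
    counts = {",": 0, ";": 0, "'": 0, "-": 0}
    for i, ch in enumerate(text):
        if ch in ",;":
            counts[ch] += 1
        elif ch in "'-":
            # Require a letter on both sides; the edges of the text never qualify.
            before = text[i - 1] if i > 0 else " "
            after = text[i + 1] if i + 1 < len(text) else " "
            if before.isalpha() and after.isalpha():
                counts[ch] += 1
    return counts


# Hypothetical sample text, not taken from the asker's file.
sample = "I won't go; my mother-in-law, a self-made cook -- said 'no'."
result = count_punctuation(sample)
for symbol, count in result.items():
    print(symbol, count)
```

Because a double hyphen never has letters on both sides of either character, it is ignored without any special handling, and no imports are needed.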
Upvotes: 1
Reputation: 106
Because you don't want to import anything this will be slow and will take some time, but it should work:
file = open("/Users/abhishekabhishek/downloads/l.txt")  # replace with your own file path
lines = file.readlines()  # read all lines of the document
file.close()
search_values = {',': 0, ';': 0, "'": 0, '-': 0}  # a dictionary saves the number of occurrences
whitespaces = [' ', '--', '1', '2', ...]  # you can add to this list whatever you need
for line in lines:
    for ch_index in range(len(line)):
        ch = line[ch_index]
        if ch == ',' or ch == ';':
            search_values[ch] += 1
        elif ch == "'" or ch == '-':
            # Count only if there is a neighbour on both sides and neither one
            # is in the whitespaces list. Note that any character missing from
            # that list (e.g. other punctuation) is treated as a valid neighbour.
            if 0 < ch_index < len(line) - 1 \
                    and line[ch_index - 1] not in whitespaces \
                    and line[ch_index + 1] not in whitespaces:
                search_values[ch] += 1
for key in search_values:
    print(key + ': ' + str(search_values[key]))
This is obviously not optimal and it is better to use regex here, but it should work.
Feel free to ask if any questions should arise.
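For comparison, this is roughly what the regex variant could look like. It is only a sketch: the function name count_with_regex and the sample sentence are invented for illustration, and I am assuming "letters" means ASCII A-Z and a-z. It relies on lookbehind and lookahead assertions from the standard re module, so a ' or - is matched only when a letter sits directly on each side.

```python
import re


def count_with_regex(text):
    """Count , ; everywhere and ' - only when directly between two letters."""
    letter = "[A-Za-z]"  # assumption: only ASCII letters count as letters
    return {
        ",": len(re.findall(",", text)),
        ";": len(re.findall(";", text)),
        # (?<=...) and (?=...) check the neighbours without consuming them.
        "'": len(re.findall(f"(?<={letter})'(?={letter})", text)),
        "-": len(re.findall(f"(?<={letter})-(?={letter})", text)),
    }


# Hypothetical sample text.
regex_counts = count_with_regex("She said: don't, won't; a well-known - odd -- case.")
print(regex_counts)
```

This does use an import, so it only applies if that restriction is lifted.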
Upvotes: 0
Reputation: 2636
The following should work:
text = open("/Users/abhishekabhishek/downloads/l.txt").read()
text = text.replace("--", " ")
for symbol in "-'":
    text = text.replace(symbol + " ", "")
    text = text.replace(" " + symbol, "")
for symbol in ".,/'-":
    print(symbol, text.count(symbol))
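To try the same replace-then-count idea without a file, here is a small sketch that works on a string. The function name count_by_replacing, the symbols parameter and the sample text are my own additions. I have also assumed that collapsing all whitespace (including newlines) to single spaces and padding both ends is acceptable, so that a ' or - at a line break or at the very start or end of the text is stripped as well.

```python
def count_by_replacing(text, symbols=",;'-"):
    """Strip ' and - that touch a space, then count each symbol separately."""
    # Assumption: newlines and tabs behave like spaces; pad both ends too.
    text = " " + " ".join(text.split()) + " "
    text = text.replace("--", " ")
    for symbol in "-'":
        text = text.replace(symbol + " ", "")
        text = text.replace(" " + symbol, "")
    # One str.count call per symbol gives individual totals, not one sum.
    return {symbol: text.count(symbol) for symbol in symbols}


# Hypothetical sample text.
replaced_counts = count_by_replacing(
    "It's a self-made plan; 'quoted' words -- and a dash - here, ok"
)
for symbol, count in replaced_counts.items():
    print(symbol, count)
```

One limitation to keep in mind: a quote or hyphen that touches other punctuation rather than a space, such as the closing quote in 'no'. , is not stripped and will still be counted.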
Upvotes: 0