Riya

Reputation: 11

Remove special characters from the start and end of a word while counting the words in a file

I need to count words in a huge text file but before that, I have to clean the file of special characters in a specific way.

For example:

;xyz    -->  xyz
xyz:    -->  xyz
xyz!)   -->  xyz!

I am using flatMap() to split each line into words on whitespace. I then try to remove the special characters from each word, but that step is not working. Please help!

The characters to remove are: : ; ! ? ( ) .

Here is the code I am using:

   >>> input = sc.textFile("file:///home/<...>/Downloads/file.txt")
   >>> input2 = input.flatMap(lambda x: x.split())
   >>> def remove(x):
   ...     if x.endsWith(':'):
   ...         x.replace(':','')
   ...         return x
   ...     elif x.endsWith('.'):
   ...         x.replace('.','')
   ...         return x

. .

   >>> input3 = input2.map(lambda x: remove(x))

Upvotes: 0

Views: 1908

Answers (4)

Riya

Reputation: 11

This is the code that worked for me:

def removefromstart(x):
    # Drop a single leading special character, if there is one
    for i in [':','!','?','.',')','(',';',',']:
        if x.startswith(i):
            # Slice instead of x.replace(i,''), which would also remove i inside the word
            return x[1:]
    return x

def removefromend(x):
    # Drop a single trailing special character, if there is one
    for i in [':','!','?','.',')','(',';',',']:
        if x.endswith(i):
            return x[:-1]
    return x
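
For context, the attempt in the question likely failed for two reasons: Python strings have endswith() (lowercase), not endsWith(), and str.replace() returns a new string rather than modifying x in place. Below is a minimal sketch of how the cleaning step could feed the word count. It uses plain Python as a stand-in for the RDD pipeline; the names SPECIAL, clean, count_words, and the sample lines are made up for illustration.

```python
from collections import Counter

# Characters listed in the question (assumed to be the full set)
SPECIAL = (':', ';', '!', '?', '(', ')', '.')

def clean(word):
    """Remove at most one special character from each end of word."""
    if word.startswith(SPECIAL):  # startswith accepts a tuple of prefixes
        word = word[1:]
    if word.endswith(SPECIAL):
        word = word[:-1]
    return word

def count_words(lines):
    """Count cleaned words; a plain-Python stand-in for the RDD steps."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = clean(word)
            if cleaned:  # skip tokens that were only punctuation
                counts[cleaned] += 1
    return counts

sample = [";xyz xyz: xyz!)", "abc. (abc"]  # hypothetical sample lines
counts = count_words(sample)
```

In PySpark, the same clean function should be usable as input2.map(clean).filter(bool).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b), though that chain is not tested here.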

Upvotes: 0

Saleem

Reputation: 8978

Try using a regex:

import re

with open('input.txt','r') as fp:
    # Matches runs of ';', ':' or ')' anywhere in the line
    rx = r"[;:)]+"
    for line in fp:
        data = re.sub(rx, "", line.strip())
        print(data)

The code above reads the file line by line and prints the sanitized content. For the sample input in the question, it prints:

xyz
xyz
xyz!
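
Note that the pattern above removes the listed characters wherever they occur in a line. If only the edges of each word should be touched, one option is to anchor the character class with ^ and $ and apply it per word. This is a sketch under that assumption; EDGE and strip_edges are hypothetical names, and the character set is taken from the question.

```python
import re

# One special character at the very start, or one at the very end, of a word
EDGE = re.compile(r"^[:;!?().]|[:;!?().]$")

def strip_edges(word):
    """Remove at most one special character from each end of word."""
    return EDGE.sub("", word)

words = ";xyz xyz: xyz!) a.b".split()
cleaned = [strip_edges(w) for w in words]
```

Here "xyz!)" becomes "xyz!" because only the final ")" sits directly before the end of the word, and "a.b" is left unchanged because its "." is not at an edge.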

Upvotes: 0

zondo

Reputation: 20336

You can write a function that checks whether a character is valid, then use filter():

def is_valid(char):
    return char.isalpha() or char in "!,." # Whatever extras you want to include

new_string = ''.join(filter(is_valid, old_string)) # No need to ''.join() in Python 2
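
As a quick illustration of how this filter behaves on the question's samples (Python 3; keep_valid and the sample list are hypothetical names added here):

```python
def is_valid(char):
    # Letters are kept, plus the extras chosen in the answer above
    return char.isalpha() or char in "!,."

def keep_valid(text):
    """Keep only the characters that is_valid accepts."""
    return ''.join(filter(is_valid, text))

samples = [";xyz", "xyz:", "xyz!)", "don't", "abc123"]
cleaned = [keep_valid(s) for s in samples]
```

Because the filter looks at every character, it also drops digits and punctuation inside a word ("don't" becomes "dont", "abc123" becomes "abc"), so the set of extras would need adjusting to match the characters the question wants kept.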

Upvotes: 0

Avinash Raj

Reputation: 174706

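Use re.sub to strip runs of non-word characters at the start or end of each whitespace-delimited token (f is the open file object):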

re.sub(r'(?<!\S)[^\s\w]+|[^\s\w]+(?!\S)', '', f.read())
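
A small sketch of what this pattern does on an in-memory string instead of f.read() (the sample text and variable names are made up for illustration):

```python
import re

# Punctuation runs that start a token, or that end a token
pattern = r'(?<!\S)[^\s\w]+|[^\s\w]+(?!\S)'

text = ";xyz xyz: xyz!) a-b"  # hypothetical sample input
cleaned = re.sub(pattern, '', text)
word_count = len(cleaned.split())
```

The lookarounds leave the hyphen in "a-b" alone because it has non-space characters on both sides. Unlike the example in the question, the whole trailing run is removed, so "xyz!)" becomes "xyz" rather than "xyz!".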


Upvotes: 1
