Reputation: 11
I need to count words in a huge text file, but before that I have to clean the file of special characters in a specific way.
For example:
;xyz --> xyz
xyz: --> xyz
xyz!) --> xyz!
I am using flatMap() to split all the words on whitespace. Then I try to remove the special characters, but that step is not working. Please help!
The characters to remove are: : ; ! ? ( ) .
Here is the code I am using:
>>> input = sc.textFile("file:///home/<...>/Downloads/file.txt")
>>> input2 = input.flatMap(lambda x: x.split())
>>> def remove(x):
...     if x.endsWith(':'):
...         x.replace(':','')
...         return x
...     elif x.endsWith('.'):
...         x.replace('.','')
...         return x
. .
>>> input3 = input2.map(lambda x: remove(x))
Upvotes: 0
Views: 1908
Reputation: 11
This is the code that worked for me:
>>> def removefromstart(x):
...     for i in [':','!','?','.',')','(',';',',']:
...         if x.startswith(i):
...             token = x.replace(i,'')
...             return token
...     return x
...
>>> def removefromend(x):
...     for i in [':','!','?','.',')','(',';',',']:
...         if x.endswith(i):
...             token = x.replace(i,'')
...             return token
...     return x
...
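One caveat with the functions above: str.replace() removes every occurrence of the character, not only the one at the edge, so a token such as "a.b." would come back as "ab". A minimal sketch of an alternative that removes at most one special character from each side, using slicing, is shown here. The names SPECIAL and strip_one_each_side are made up for illustration, and the character set is the list from the question plus the comma used above.

```python
# Characters from the question, plus ',' as used in the answer above.
# SPECIAL and strip_one_each_side are illustrative names, not part of any library.
SPECIAL = set(":;!?().,")

def strip_one_each_side(token):
    """Remove at most one special character from the start and one from the end."""
    if token and token[0] in SPECIAL:
        token = token[1:]
    if token and token[-1] in SPECIAL:
        token = token[:-1]
    return token

samples = [";xyz", "xyz:", "xyz!)", "a.b."]
cleaned = [strip_one_each_side(t) for t in samples]
print(cleaned)  # ['xyz', 'xyz', 'xyz!', 'a.b']
```

In the PySpark shell from the question this could presumably be applied as input2.map(strip_one_each_side). The original remove() fails for two reasons that this sketch avoids: the Python string method is endswith (lowercase), and replace() returns a new string instead of modifying x in place.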
Upvotes: 0
Reputation: 8978
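Try using a regex:
import re
rx = r"[;:\)]+"  # raw string, so the backslash is passed to the regex engine as-is
with open('input.txt', 'r') as fp:
    for line in fp:
        # Remove every run of ; : ) found anywhere in the line
        data = re.sub(rx, "", line.strip())
        print(data)
The code above reads the file line by line and prints the sanitized content. Note that it removes these characters wherever they occur in a line, not only at the edges of a word. For the sample input in the question it prints: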
xyz
xyz
xyz!
Upvotes: 0
Reputation: 20336
You can write a function that checks whether a character is valid, then use filter():
def is_valid(char):
    return char.isalpha() or char in "!,." # Whatever extras you want to include
new_string = ''.join(filter(is_valid, old_string)) # No need to ''.join() in Python 2
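As a rough illustration of how this behaves on the sample tokens from the question, here is a small sketch. It applies the filter per word instead of per line, because whitespace is not alphabetic and would be dropped too. The helper name keep_valid is invented for this example.

```python
def is_valid(char):
    # Letters are kept, plus a few extra characters chosen by the caller.
    return char.isalpha() or char in "!,."

def keep_valid(word):
    # keep_valid is an illustrative helper name. In Python 3, filter()
    # returns an iterator, so the characters are joined back into a string.
    return "".join(filter(is_valid, word))

words = ";xyz xyz: xyz!)".split()
result = [keep_valid(w) for w in words]
print(result)  # ['xyz', 'xyz', 'xyz!']
```

Keep in mind that this drops invalid characters anywhere in the word, and that digits are not alphabetic, so "abc123" becomes "abc" unless isalnum() is used instead.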
Upvotes: 0
Reputation: 174706
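Use re.sub with lookarounds, so that only punctuation at the start or end of a whitespace-delimited token is removed (f is assumed to be an open file object):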
re.sub(r'(?<!\S)[^\s\w]+|[^\s\w]+(?!\S)', '', f.read())
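A short sketch of the same pattern applied to an in-memory string, followed by the word count the question is after. The sample text and the names EDGE_PUNCT, cleaned, and counts are made up for illustration. Note that this pattern strips the whole run of punctuation at a token edge, so "xyz!)" becomes "xyz" here, not "xyz!" as in the question's example.

```python
import re
from collections import Counter

# Same pattern as above: punctuation runs at the start or end of a token.
EDGE_PUNCT = re.compile(r'(?<!\S)[^\s\w]+|[^\s\w]+(?!\S)')

text = ";xyz xyz: xyz!) don't stop."
cleaned = EDGE_PUNCT.sub('', text)
print(cleaned)  # xyz xyz xyz don't stop

# Count the cleaned, whitespace-separated words.
counts = Counter(cleaned.split())
print(counts["xyz"])  # 3
```

The apostrophe in "don't" survives because it has non-space characters on both sides, so neither lookaround matches.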
Upvotes: 1