A. Domni

Reputation: 21

How to remove all non-alphabetic characters from a string?

I have been working on a program that takes a hex file and, if the file name starts with "CID", removes the first 104 characters. After that point there are a few words. I also want to remove everything after those words, but the part I want to isolate varies in length.

My code is currently like this:

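import os

files = os.listdir(".")

# Collect every .uexp file in the current directory
filenames = []
for names in files:
    if names.endswith(".uexp"):
        filenames.append(names)
        print(len(filenames))
print(filenames)

# Check every collected file, including the first one
for filename in filenames:
    filenamestart = filename[0:3]
    print(filenamestart)
    if filenamestart == "CID":
        # The with block closes the file once it has been read
        with open(filename, 'r') as openFile:
            fileContents = openFile.read()
        # Drop the first 104 characters
        ItemName = fileContents[104:]
        print(ItemName)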

Input Example file (pulled from HxD):

.........................ýÿÿÿ................E.................!...1AC9816A4D34966936605BB7EFBC0841.....Sun Tan Specialist.................9.................!...9658361F4EFF6B98FF153898E58C9D52.....Outfit.................D.................!...F37BE72345271144C16FECAFE6A46F2A.....Don't get burned............................................................................................................................Áƒ*ž

I have got it working to remove the first 104 characters, but I would also like to remove the characters after 'Sun Tan Specialist', which will differ in length, so I am left with only that part.

I appreciate any help that anyone can give me.

Upvotes: 2

Views: 16824

Answers (3)

Anonymous

Reputation: 754

You can use filter. Note that this example keeps digits as well as ASCII letters:

import string
print(''.join(filter(lambda character: character in string.ascii_letters + string.digits, '(ABC), DEF!'))) # => ABCDEF
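If digits should be dropped too, a variant based on str.isalpha may be simpler. This is only a sketch: the helper name letters_only is made up for illustration, and str.isalpha also accepts non-ASCII letters such as "é", which may or may not be what you want.

```python
def letters_only(text):
    """Return text with every non-alphabetic character removed."""
    # str.isalpha is True for letters only, so digits, spaces and
    # punctuation are all dropped
    return "".join(ch for ch in text if ch.isalpha())


print(letters_only("(ABC), DEF! 123"))  # ABCDEF
```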

Upvotes: 1

Cédric Van Rompay

Reputation: 2979

One way to remove non-alphabetic characters in a string is to use regular expressions [1].

>>> import re
>>> re.sub(r'[^a-z]', '', "lol123\t")
'lol'

EDIT

The first argument, r'[^a-z]', is the pattern that matches what will be removed (here, by replacing it with the empty string ''). The square brackets denote a character class (the pattern matches any single character in the class), the ^ negates the class, and a-z denotes all lowercase ASCII letters. More information here:

https://docs.python.org/3/library/re.html#regular-expression-syntax

So, for instance, to also keep capital letters and spaces, it would be:

>>> re.sub(r'[^a-zA-Z ]', '', 'Lol !this *is* a3 -test\t12378')
'Lol this is a test'

However, from the data in your question, the exact process you need seems to be a bit more complicated than just "getting rid of non-alphabetic characters".
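As a rough sketch of one possible approach, assuming the name always begins right after the 104-character prefix and consists only of ASCII letters, spaces and apostrophes, re.match can grab just the leading run of such characters. The helper name leading_phrase and the shortened sample string are made up for illustration:

```python
import re


def leading_phrase(text):
    """Return the leading run of ASCII letters, spaces and apostrophes."""
    # re.match only matches at the start of the string, so the result
    # ends at the first character outside the class
    match = re.match(r"[A-Za-z' ]+", text)
    return match.group(0).strip() if match else ""


# Shortened stand-in for fileContents[104:] from the question
sample = "Sun Tan Specialist.................9.................!...9658361F"
print(leading_phrase(sample))  # Sun Tan Specialist
```

If the text after the prefix does not start with a letter, this returns an empty string, so the offset of 104 would still need to be right for every file.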

Upvotes: 5

Bill S.

Reputation: 96

You mentioned in a comment that you got the string down to Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned

Essentially your goal at this point is to remove any uppercase letter that isn't immediately followed by a lowercase letter because Upper Lower indicates the start of a phrase. You can use a for loop to do this.

h = "Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned"

output = ""
for i in range(len(h)):
    # Keep spaces
    if h[i] == " ":
        output += h[i]
    # Start of a phrase found (uppercase followed by lowercase): store the
    # character, adding a separating space unless one is already there
    elif h[i].isupper() and i + 1 < len(h) and h[i + 1].islower():
        if output and not output.endswith(" "):
            output += " "
        output += h[i]
    # We want all lowercase characters
    elif h[i].islower():
        output += h[i]

print(output)
# If you don't care about "Outfit" since it is always there, remove it
print(output.replace(" Outfit", ""))

Output:

Sun Tan Specialist Outfit Dont get burned

Sun Tan Specialist Dont get burned
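The same rule (a phrase starts with an uppercase letter followed by a lowercase one) can also be written as a single regular expression. This is a minimal sketch, assuming phrases contain only ASCII letters separated by single spaces; the function name extract_phrases is made up for illustration:

```python
import re


def extract_phrases(text):
    """Return each phrase that starts with an uppercase-then-lowercase pair."""
    # [A-Z][a-z]+            a capitalised word opens the phrase
    # (?: [A-Za-z][a-z]*)*   further words, each preceded by one space
    return re.findall(r"[A-Z][a-z]+(?: [A-Za-z][a-z]*)*", text)


h = "Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned"
print(extract_phrases(h))
# ['Sun Tan Specialist', 'Outfit', 'Dont get burned']
```

Returning a list keeps the phrases separate, so the first element can be taken directly as the item name.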

Upvotes: 0
