TheAmazingHAzza

Reputation: 116

Remove unwanted characters from a set of strings in Python

I am trying to clean a set of strings to remove unwanted characters.

Input

Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .
Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5
Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .
One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5
Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30
Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14

Wanted Output

Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods 
Case Key

I have tried this:

re.findall(r'([a-zA-Z ]*)\d*.*', final_df.loc[index, 'Horse'])

This removes everything after a number but it leaves the t on the first entry. I was wondering if there is a better way?

Upvotes: 0

Views: 66

Answers (3)

fsimonjetz

Reputation: 5802

I'd use re.split instead:

import re

for d in data.splitlines():
    print(re.split(r'\s+t?[0-9]\+?', d)[0])

Result

Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods
Case Key

Explanation: re.split cuts the string wherever the pattern matches (whitespace, an optional t, one digit, and an optional +), and [0] keeps the part before the first match. You will probably need to tweak the pattern if other prefixes can appear before the first number.
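
One possible tweak, as a rough sketch: instead of listing prefixes such as t, split at the first whitespace-separated token that contains any digit. This assumes the name never contains a digit itself; the helper name horse_name and the rows list are only illustrative.

```python
import re

# Assumption: the name ends right before the first token containing a digit.
# \s+ eats the separating whitespace, \S*\d reaches the first digit in that token.
FIRST_DIGIT_TOKEN = re.compile(r'\s+\S*\d')

def horse_name(line):
    # maxsplit=1: only the first cut matters, the rest of the line is discarded
    return FIRST_DIGIT_TOKEN.split(line, maxsplit=1)[0]

rows = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
]

for row in rows:
    print(horse_name(row))
```

If a line contains no digit at all, the whole line comes back unchanged, so you may want to check for that case separately.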

In Pandas

I just noticed you seem to be using Pandas – assuming your df looks like this:

                                               Horse
0  Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...
1  Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...
2  Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
3  Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...
4  One Night Stand 0 0 D 34 W Jarvis . Silvestre ...
5  Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...
6  Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...

You can do

from operator import itemgetter

df["name"] = df.Horse.str.split(r'\s+t?[0-9]\+?').map(itemgetter(0))

to get this:

                                               Horse               name
0  Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...       Lethal Lunch
1  Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...            Muscika
2  Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .        Typhoon Ten
3  Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...    Wentworth Falls
4  One Night Stand 0 0 D 34 W Jarvis . Silvestre ...    One Night Stand
5  Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...  Dancinginthewoods
6  Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...           Case Key

Upvotes: 1

C K

Reputation: 16

Would something like this suffice?

import re

# named "lines" rather than "input" to avoid shadowing the built-in
lines = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5",
    "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
    "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
    "Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30",
    "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
]

for line in lines:
    print(re.findall(r'\b[a-zA-Z ]+\b', line)[0])

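The pattern takes the first run of letters and spaces that ends on a word boundary, so it stops before the first word containing a number or other symbol. Each result keeps one trailing space, which you can remove with .strip(). The output: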

Lethal Lunch 
Muscika 
Typhoon Ten 
Wentworth Falls 
One Night Stand 
Dancinginthewoods 
Case Key 
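
To make the trailing space visible, here is a small sketch that prints the match with repr before and after stripping. The helper name first_letter_run is made up for illustration, and it uses re.search because only the first match is needed.

```python
import re

LETTER_RUN = re.compile(r'\b[a-zA-Z ]+\b')

def first_letter_run(line):
    # re.search returns the first match only, or None if there is none
    match = LETTER_RUN.search(line)
    return match.group(0) if match else ""

line = "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 ."
raw = first_letter_run(line)

print(repr(raw))          # 'Lethal Lunch '  (note the trailing space)
print(repr(raw.strip()))  # 'Lethal Lunch'
```

The match backs off from the t in t5+ because there is no word boundary between t and 5, which is why it ends on the space instead.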

Upvotes: 0

Stefan Schulz

Reputation: 529

Something like this should work:

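filtered_text = []

for line in text:  # text is an iterable of input strings
    part = ""
    for word in line.split(" "):  # split the current line, not the whole input
        # Stop at the first short token. Note that this also cuts
        # name words of three letters or fewer, such as "Ten" or "One".
        if len(word) <= 3:
            break
        part = part + " " + word

    part = part[1:]  # skip first space
    filtered_text.append(part)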
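
If short name words like "Ten" or "One" need to survive, a variation on the same word-by-word idea is to stop at the first word that contains a digit rather than at the first short word. This is only a sketch under the assumption that names never contain digits; the function names are hypothetical.

```python
from itertools import takewhile

def has_digit(token):
    # True for tokens such as "t5+", "1" or "cp5+"
    return any(ch.isdigit() for ch in token)

def name_before_first_digit_token(line):
    # Keep leading words until the first one that contains a digit
    words = line.split()
    return " ".join(takewhile(lambda w: not has_digit(w), words))

text = [
    "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
    "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
]

filtered_text = [name_before_first_digit_token(line) for line in text]
print(filtered_text)
```

Splitting on whitespace and re-joining with a single space also normalises any repeated spaces inside the name.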

Upvotes: 0
