Reputation: 89
The text input looks something like this:
West Team 4, Eastern 3\n
-------Update--------
The input is a txt file containing team names and scores, like football results. Each line holds two names with their scores, so the whole file looks something like this:
West Team 4, Eastern 5
Nott Team 2, Eastern 3
West wood 1, Eathan 2
West Team 4, Eas 5
I am using with open to read the file line by line, so there will be a \n at the end of each line.
I would like to extract each line of text into something like:
['West Team', 'Eastern']
What I currently have in mind is to use a regex:
result = re.sub("[\n^\s$\d]", "", text).split(",")
This code results in:
['WestTeam','Eastern']
I'm sure my regex is not correct. I want to remove the '\n' and every number, including the space in front of it, but not the space in the middle of a name.
I'm open to any suggestion that achieves this result; it doesn't necessarily have to use regex.
Upvotes: 1
Views: 66
Reputation: 627507
You can use a non-regex approach to keep any letters/spaces after splitting with a comma:
text = "West Team 4, Eastern 3\n"
print( ["".join(c for c in x if c.isalpha() or c.isspace()).strip() for x in text.split(',')] )
# => ['West Team', 'Eastern']
Or a regex approach that removes any chars other than ASCII letters and whitespace, matched with the [^a-zA-Z\s]+ pattern:
import re
rx = re.compile(r'[^a-zA-Z\s]+')
print( [rx.sub("", x).strip() for x in text.split(',')] )
# => ['West Team', 'Eastern']
Another similar solution extracts chunks of one or more non-digit chars that follow an optional comma and whitespace:
print(re.findall(r',?\s*(\D*[^\d\s])', text))
In case there are consecutive non-letter chunks you can use
import re
text = "West Team 4, Eastern 3\n, test 23 99 test"
rx = re.compile(r'[^\W\d_]+')
print( [" ".join(rx.findall(x)) for x in text.split(',')] )
This yields ['West Team', 'Eastern', 'test test']. The [^\W\d_]+ pattern matches one or more Unicode letters.
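To illustrate the difference between the ASCII-only pattern and the Unicode-aware one, here is a small sketch. The accented team names are made up for illustration and do not come from the question; they are written with escapes so the letters are single code points.

```python
import re

# Hypothetical input (not from the question): 'São Paulo 2, Köln 1'
text = "S\u00e3o Paulo 2, K\u00f6ln 1\n"

ascii_rx = re.compile(r'[^a-zA-Z\s]+')  # removes everything except ASCII letters/whitespace
unicode_rx = re.compile(r'[^\W\d_]+')   # matches runs of Unicode letters

ascii_names = [ascii_rx.sub("", x).strip() for x in text.split(',')]
unicode_names = [" ".join(unicode_rx.findall(x)) for x in text.split(',')]

print(ascii_names)    # the accented letters are dropped
print(unicode_names)  # the accented letters are kept
```

If the team names are guaranteed to be plain ASCII, either pattern should behave the same.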
Upvotes: 1
Reputation: 75990
There are many ways this can be done, but looking at your data you could use rstrip() quite nicely:
s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip('\n 0123456789') for x in s.split(', ')]
print(lst)
Or, perhaps better, take the digits from the string module:
from string import digits
s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip(digits+'\n ') for x in s.split(', ')]
print(lst)
Both options print:
['West Team', 'Eastern']
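Since the question reads a whole file line by line, here is a rough sketch of applying the same rstrip() idea to every line. The helper name team_names is hypothetical, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io
from string import digits


def team_names(line):
    """Return the team names from one 'Name score, Name score' line."""
    # Trim whitespace, then trailing digits, then the space before them.
    return [part.strip().rstrip(digits).rstrip() for part in line.split(',')]


# io.StringIO stands in for the file; with a file on disk you would
# iterate over the handle from `with open(...)` in the same way.
sample = io.StringIO(
    "West Team 4, Eastern 5\n"
    "Nott Team 2, Eastern 3\n"
    "West wood 1, Eathan 2\n"
    "West Team 4, Eas 5\n"
)

# Skip blank lines so an empty trailing line does not produce [''].
names_per_line = [team_names(line) for line in sample if line.strip()]
for names in names_per_line:
    print(names)
```

Note that this assumes team names never end in a digit, because any trailing digits are stripped along with the score.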
Upvotes: 1
Reputation: 27404
You haven't clearly defined the rules for getting the required output from your sample input. However, this will give what you've asked for but may not cover all eventualities:
in_string = 'West Team 4, Eastern 3\n'
result = [' '.join(t.split()[:-1]) for t in in_string.split(',')]
print(result)
Output:
['West Team', 'Eastern']
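As one example of an eventuality where the rules matter, consider a team name that itself ends in a number. The line below is made up for illustration; the sketch compares dropping the last token with stripping all trailing digits.

```python
from string import digits

# Hypothetical line where a team name ends in a number.
line = 'Schalke 04 2, Bayern 3\n'

# Drop only the last whitespace-separated token (the score).
drop_last_token = [' '.join(t.split()[:-1]) for t in line.split(',')]

# Strip every trailing digit and space instead.
strip_digits = [t.strip().rstrip(digits + ' ') for t in line.split(',')]

print(drop_last_token)  # keeps the '04' that belongs to the name
print(strip_digits)     # loses the '04' along with the score

# The flip side: if a score is missing, dropping the last token
# removes part of the name.
no_score = ' '.join('West Team'.split()[:-1])
print(no_score)
```

Which behaviour is right depends on what the real data guarantees, so this is only a comparison, not a recommendation.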
Upvotes: 0
Reputation: 163632
You can remove the digits and replace any run of two or more spaces with a single space.
Then split on commas, drop empty values, and trim each item:
import re
s = "West Team 4 , Eastern 3, test 23 99 test\n,"
res = [
    m.strip()
    for m in re.sub(r"[^\S\n]{2,}", " ", re.sub(r"\d+", "", s)).split(",")
    if m
]
print(res)
Output
['West Team', 'Eastern', 'test test']
Upvotes: 0
Reputation: 9418
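You want to remove the numbers together with their leading spaces, remove surrounding whitespace like \n, and split on the comma.
Functions to use: str.replace(), re.sub(), and str.strip() (to remove leading and trailing whitespace like \n).
import re
text = "West Team 4, Eastern 3\n"
cleaned = re.sub(r'\s+\d+', '', text)  # remove numbers of any length with their leading spaces
cleaned = cleaned.strip()  # remove surrounding whitespace like \n
print(cleaned)
output = [name.strip() for name in cleaned.split(",")]  # split, then trim the space after the comma
print(output)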
Prints:
West Team, Eastern
['West Team', 'Eastern']
Upvotes: 0
Reputation: 49
import re
text = 'West Team 4, Eastern 3\n'
result = re.sub(r"[\n\d]", "", text).split(",")
# REMOVE THE LEADING AND TRAILING SPACES:
result = [x.strip() for x in result]
print(result)
# result: ['West Team', 'Eastern']
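For reference, a short sketch of why the pattern from the question glues the names together: \s inside a character class matches every whitespace character, and ^ and $ are ordinary characters there (unless ^ comes first), so they are not needed. The variable names are only for illustration.

```python
import re

text = 'West Team 4, Eastern 3\n'

# \s inside the class also matches the space inside "West Team",
# so the two words end up joined.
with_s = re.sub(r"[\n^\s$\d]", "", text).split(",")
print(with_s)

# Without \s only newlines and digits are removed; strip() then trims
# the leftover spaces at the edges of each name.
without_s = [x.strip() for x in re.sub(r"[\n\d]", "", text).split(",")]
print(without_s)

# ^ and $ are literal inside [...]: here they are simply deleted.
literal_demo = re.sub(r"[\n^$\d]", "", "a^b$c1")
print(literal_demo)
```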
Upvotes: 0
Reputation: 522751
Actually, re.findall might work well here:
import re
inp = "West Team 4, Eastern 3\n"
matches = re.findall(r'(\w+(?: \w+)*) \d+', inp)
print(matches) # ['West Team', 'Eastern']
The split version, using re.split:
inp = "West Team 4, Eastern 3\n"
matches = [x for x in re.split(r'\s+\d+\s*,?\s*', inp) if x != '']
print(matches) # ['West Team', 'Eastern']
Upvotes: 0