Andrew Tulip

Reputation: 161

How to delimit letters and numbers from a string in an array in Python?

I need to separate the letters (names) from the numbers (counts) in each string of a Python list.

My array is this one:

kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3', 'Betty 2', 'Archie 1', 'Phil 1', 'Luke 1']

And I want to make two arrays like these:

names = ['Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Phil', 'Luke']
counts = [1, 1, 3, 2, 1, 1, 1]

My approach is this one (inspired by other questions):

import re

kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3', 'Betty 2', 'Archie 1', 'Phil 1', 'Luke 1']
names = []
count = []

for element in kid_count_list:
    name = " ".join(re.split("[^a-zA-Z]*", element))
    occurence = int(element.match('/\d+/g').join(""))
    names.append(name)
    counts.append(occurence)
    

How to make this work? Thanks a lot!!!

Upvotes: 1

Views: 89

Answers (3)

Mustafa Aydın

Reputation: 18306

What about:

import re
names, counts = zip(*[re.fullmatch(r"(\D+)\s(\d+)", s).groups() for s in kid_count_list])

to get the names as

('Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Phil', 'Luke')

and counts as

('1', '1', '3', '2', '1', '1', '1')

These are tuples, not lists, but they can easily be converted:

names = list(names)
counts = list(map(int, counts))  # convert the counts to int, too

to get

>>> names
['Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Phil', 'Luke']

>>> counts
[1, 1, 3, 2, 1, 1, 1]

We build a regex that matches one or more non-digits (\D+), then a single whitespace character (\s), then one or more digits at the end (\d+). Using re.fullmatch requires the whole string to match, from beginning to end (the same as if the pattern had ^ and $ anchors). We then take the captured groups for each string. At this point we have:

[('Sofia', '1'), ('Claire Ann', '1'), ('Joe', '3'), ('Betty', '2'), ('Archie', '1'), ('Phil', '1'), ('Luke', '1')]

To get two separate sequences out of this list of pairs, we use the zip(*...) construct.
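The two steps described above can be spelled out separately, which may make the zip(*...) transposition easier to follow. This is only a sketch on a shortened input list; the intermediate name pairs is introduced here for illustration and is not part of the one-liner.

```python
import re

kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3']

# Step 1: one (name, count) tuple per input string.
# fullmatch returns None if a string does not fit the pattern,
# so this assumes every item ends in " <digits>".
pairs = [re.fullmatch(r"(\D+)\s(\d+)", s).groups() for s in kid_count_list]

# Step 2: zip(*pairs) turns the rows into columns.
names, counts = zip(*pairs)

names = list(names)
counts = [int(c) for c in counts]

print(pairs)
print(names)
print(counts)
```

Unpacking with * passes each tuple as a separate argument to zip, which then groups the first items together and the second items together.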

Upvotes: 3

Aswin A

Reputation: 99

You can use the code below:

import re

kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3', 'Betty 2', 'Archie 1', 'Phil 1', 'Luke 1']
names, counts = [], []

split_list = [re.split(r'\s+(?=\d+$)', item) for item in kid_count_list]

for item in split_list:
    names.append(item[0])
    counts.append(int(item[1]))
print(names)
print(counts)

The output will be

['Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Phil', 'Luke']
[1, 1, 3, 2, 1, 1, 1]

Upvotes: 0

Wiktor Stribiżew

Reputation: 626893

If you are not using Pandas, you can use

import re
kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3', 'Betty 2', 'Archie 1', 'Phil 1', 'Luke 1']
rx = re.compile(r'\s+(?=\d+$)')
l = [rx.split(x) for x in kid_count_list]
names, counts = zip(*l)
print(list(names))  # => ['Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Phil', 'Luke']
print(list(counts)) # => ['1', '1', '3', '2', '1', '1', '1']

See the Python demo and the regex demo.

Here, re.split(r'\s+(?=\d+$)', x) splits each string at a run of one or more whitespace characters that is followed by one or more digits at the end of the string.

Details:

  • \s+ - one or more whitespace characters
  • (?=\d+$) - a positive lookahead that ensures there are one or more digits, followed by the end of the string, immediately to the right of the current location.
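As a rough illustration of how the lookahead behaves, the sketch below wraps the split in a small helper. The function name split_name_count and the ValueError for strings without a trailing number are assumptions made for this example, not part of the answer above.

```python
import re

rx = re.compile(r'\s+(?=\d+$)')

def split_name_count(text):
    """Hypothetical helper: split 'Name 3' into ('Name', 3)."""
    parts = rx.split(text)
    # Without whitespace followed by trailing digits, nothing is split
    # and parts has a single element.
    if len(parts) != 2:
        raise ValueError(f"no trailing count in {text!r}")
    name, count = parts
    return name, int(count)

print(split_name_count('Claire Ann 1'))
print(split_name_count('Joe  12'))
```

Only the whitespace directly before the final digits is used as the split point, so the space inside 'Claire Ann' is left alone, and a wider gap such as two spaces is consumed by \s+ as a whole.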

Since your question originally referred to Pandas, here is a Pandas version:

import re
import pandas as pd

kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3', 'Betty 2', 'Archie 1', 'Phil 1', 'Luke 1']
cols = [re.split(r'\s+(?=\d+$)', x) for x in kid_count_list]
df = pd.DataFrame(cols, columns=['names', 'counts'])
## >>> df
##         names counts
## 0       Sofia      1
## 1  Claire Ann      1
## 2         Joe      3
## 3       Betty      2
## 4      Archie      1
## 5        Phil      1
## 6        Luke      1

Alternatively, you can use a no-regex solution:

kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3', 'Betty 2', 'Archie 1', 'Phil 1', 'Luke 1']
df = pd.DataFrame({'data':kid_count_list})
df[['names', 'counts']] = df.pop('data').str.rsplit(r' ', n=1, expand=True)

Here, the dataframe is initialized with the kid_count_list values. The .pop('data') call removes that initial column from the dataframe and returns it for processing, and .str.rsplit(' ', n=1, expand=True) then splits each value only once, starting from the right, at a space.
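The same right-split-once idea also works without Pandas, using the built-in str.rsplit. The following is a minimal sketch on a shortened list, assuming every item ends with a single space followed by the count.

```python
kid_count_list = ['Sofia 1', 'Claire Ann 1', 'Joe 3']

names, counts = [], []
for item in kid_count_list:
    # rsplit(' ', 1) splits at the last space only, so a name that
    # itself contains spaces stays in one piece.
    name, count = item.rsplit(' ', 1)
    names.append(name)
    counts.append(int(count))

print(names)
print(counts)
```

Unlike the regex versions, this does not check that the last part is numeric; int() raises a ValueError if it is not.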

Upvotes: 1
