Reputation: 326

split string into number and text with pandas

The Setup

I have a pandas dataframe that contains a column 'iso' containing chemical isotope symbols, such as '4He', '16O', '197Au'. I want to label many (but not all) isotopes on a plot using the annotate() function in matplotlib. The label format should have the atomic mass in superscript. I can do this with the LaTeX style formatting:

axis.annotate('$^{4}$He', xy=(x, y), xycoords='data')

I could write dozens of annotate() statements like the one above for each isotope I want to label, but I'd rather automate.

The Question

How can I extract the isotope number and name from my iso column?

With those pieces extracted, I can make the labels. Let's say we store them in the variables Num and Sym. Now I can loop over my isotopes and do something like this:

for i in list_of_isotopes:
  (Num, Sym) = df[df.iso==i].iso.str.MISSING_STRING_METHOD(???)
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')

Presumably, there is a pandas string method that I can drop into the above, but I'm having trouble coming up with a solution. I've been trying split() and extract() with a few different patterns, but can't get the desired effect.

Upvotes: 5

Answers (5)

Simon

Reputation: 552

The accepted answer pointed me in the right direction, but I think the right pandas function to use is extract. This way only the matched groups are returned, eliminating the need to slice afterwards.

import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# Raw string so that \d reaches the regex engine unchanged.
df[['num', 'element']] = df['iso'].str.extract(r'(\d+)([A-Za-z]+)', expand=True)
print(df)

gives

     iso  num element
0    4He    4      He
1    16O   16       O
2  197Au  197      Au
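
From there the annotation labels can be built for the whole column at once. A minimal sketch, assuming the same three example isotopes; the column name label is only an illustration, and the coordinates for annotate() would still have to come from your own data:

```python
import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# Split each symbol into its mass number and element symbol.
df[['num', 'element']] = df['iso'].str.extract(r'(\d+)([A-Za-z]+)', expand=True)

# Build a mathtext label such as '$^{4}$He' for every row.
df['label'] = '$^{' + df['num'] + '}$' + df['element']

labels = df['label'].tolist()
print(labels)
```

Each entry of labels can then be passed as the first argument to axis.annotate().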

Upvotes: 1

Romain

Reputation: 21958

This is my answer using split. The regexp used can probably be improved; I'm very bad at that sort of thing :-)

(\d+) matches the digits, and ([A-Za-z]+) matches the letters.

import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
result = df['iso'].str.split(r'(\d+)([A-Za-z]+)', expand=True)
# Columns 0 and 3 are the empty strings around the match; keep the two groups.
result = result.loc[:, [1, 2]]
result = result.rename(columns={1: 'x', 2: 'y'})
print(result)

Produces

     x   y
0    4  He
1   16   O
2  197  Au
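
Why columns 1 and 2 are the ones kept: when the pattern contains capturing groups, a regex split also returns the captured text, with the (here empty) strings before and after the match on either side. A minimal sketch using the standard library's re.split, which I am assuming behaves the same way as the pandas call for these inputs:

```python
import re

pattern = r'(\d+)([A-Za-z]+)'

# The whole string matches, so the pieces around the match are empty.
parts = re.split(pattern, '197Au')
print(parts)  # ['', '197', 'Au', '']

# Keeping positions 1 and 2 drops the empty strings.
num, sym = parts[1:3]
print(num, sym)
```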

Upvotes: 12

Fei Yuan

Reputation: 82

Did you try strip()? You could consider something like this:

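import string

for i in list_of_isotopes:
  iso = df[df.iso==i].iso
  # .str.strip() returns a Series, so take the single matching value with .iloc[0].
  Num = iso.str.strip(string.ascii_letters).iloc[0]
  Sym = iso.str.strip(string.digits).iloc[0]
  # Braces keep multi-digit mass numbers entirely in the superscript.
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')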
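
To see what strip() does here, it may help to try it on plain strings first. A small sketch, where split_isotope is a hypothetical helper name; it assumes every symbol has its digits at the front and its letters at the end, because strip() only removes characters from the two ends:

```python
import string

def split_isotope(iso):
    # Stripping letters from both ends leaves the leading digits;
    # stripping digits from both ends leaves the trailing letters.
    num = iso.strip(string.ascii_letters)
    sym = iso.strip(string.digits)
    return num, sym

pairs = [split_isotope(s) for s in ['4He', '16O', '197Au']]
print(pairs)
```

A string with digits in the middle would not be split cleanly this way.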

Upvotes: 0

albert

Reputation: 8623

To extract the number and the element of an isotope symbol, you can use a regular expression (regex) with Python's re module. The regex matches the leading digits and then the letters that follow; both parts are captured in named groups, so they can be retrieved by name. If the regex matches, you can extract the data and .format() the desired annotation string:

#!/usr/bin/env python3
# coding: utf-8

import re

iso_num = '16O'

# + (rather than *) requires at least one digit and one letter.
preg = re.compile(r'^(?P<num>[0-9]+)(?P<element>[A-Za-z]+)$')
m = preg.match(iso_num)

if m:
    num = m.group('num')
    element = m.group('element')

    # Doubled braces are literal braces in str.format, giving '$^{16}$O'.
    note = '$^{{{}}}${}'.format(num, element)

    # axis.annotate(note, xy=(x, y), xycoords='data')
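
To label many isotopes, the match can be wrapped in a small function and applied to a list. A minimal sketch, assuming a hypothetical helper named isotope_label and a pattern that uses + so that empty numbers or symbols are rejected:

```python
import re

# Named groups for the mass number and the element symbol.
ISO_PATTERN = re.compile(r'^(?P<num>[0-9]+)(?P<element>[A-Za-z]+)$')

def isotope_label(iso):
    """Return a mathtext label like '$^{16}$O', or None if iso does not match."""
    m = ISO_PATTERN.match(iso)
    if m is None:
        return None
    # Doubled braces produce the literal braces around the superscript.
    return '$^{{{num}}}${element}'.format(**m.groupdict())

labels = [isotope_label(s) for s in ['4He', '16O', '197Au', 'oops']]
print(labels)
```

Returning None for a non-matching string lets the caller skip that isotope instead of annotating with a malformed label.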

Upvotes: 0

taesu

Reputation: 4580

I'd use simple string manipulation, without the hassle of regex.

isotopes = ['4He', '16O', '197Au']
def get_num(isotope):
    # filter() returns an iterator in Python 3, so join the digits back into a string.
    return ''.join(filter(str.isdigit, isotope))

def get_sym(isotope):
    return isotope.replace(get_num(isotope), '')

def get_num_sym(isotope):
    return (get_num(isotope), get_sym(isotope))


for isotope in isotopes:
    num, sym = get_num_sym(isotope)
    print(num, sym)

Upvotes: 0
