Pandas: how to split or extract multiple values from column

Question

This is my first Stack post and I'm new in the worlds of Python and Pandas. I'm trying to get 3 separate values from a column. I tried both .str.split and .str.extract but can't get the result. The df I have:

     wow
0    1000100011-DT000111-1111 Hellostreet 45  Town
1    1000100012  DT000122-1222-Hellostrasse 56 Place
2    1000100013-DT000111-1133
3    1000106789 DT000111-1144 Street 45
4    DT000111-1441 Hellostreet 100

The result I want:

     number1      number2          street3
0    1000100011   DT000111-1111    Hellostreet 45  Town
1    1000100012   DT000122-1222    Hellostrasse 56 Place
2    1000100013   DT000111-1133    None
3    1000106789   DT000111-1144    Street 45
4    None         DT000111-1441    Hellostreet 100

[number1] - numbers are random but always 10 digits [number2] - numbers are random but always 6 first '-' and then 4 digits. Characters are always 'DT'[street3] can consist of a street name, number and/or town. Those do not have to be separated.

Thank you in advance!

Сергей Кох · Accepted Answer

To solve it, you can use the .str.extract method, splitting the string into three groups using regular expressions.

import pandas as pd


df = pd.DataFrame({'wow': ['1000100011-DT000111-1111 Hellostreet 45  Town',
                           '1000100012  DT000122-1222-Hellostrasse 56 Place',
                           '1000100013-DT000111-1133',
                           '1000106789 DT000111-1144 Street 45',
                           'DT000111-1441 Hellostreet 100']})

df = df['wow'].str.extract(
    r"(?P\d{0,10}).*?(?PDT\d{6}-\d{4}).*?(?P\w.*|$)")
df.replace('', None, inplace=True)
print(df)

      number1        number2                street3
0  1000100011  DT000111-1111   Hellostreet 45  Town
1  1000100012  DT000122-1222  Hellostrasse 56 Place
2  1000100013  DT000111-1133                   None
3  1000106789  DT000111-1144              Street 45
4        None  DT000111-1441        Hellostreet 100

"
(?P\d{0,10}).*?(?PDT\d{6}-\d{4}).*?(?P\w.*|$)
"
g
Named Capture Group number1 (?P\d{0,10})
\d matches a digit (equivalent to [0-9])
{0,10} matches the previous token between 0 and 10 times, as many times as possible, giving back as needed (greedy)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Named Capture Group number2 (?PDT\d{6}-\d{4})
DT matches the characters DT literally (case sensitive)
\d matches a digit (equivalent to [0-9])
{6} matches the previous token exactly 6 times
- matches the character - with index 4510 (2D16 or 558) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
{4} matches the previous token exactly 4 times
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Named Capture Group street3 (?P\w.*|$)
1st Alternative \w.*
\w matches any word character (equivalent to [a-zA-Z0-9_])
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative $
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
Global pattern flags 
g modifier: global. All matches (don't return after first match)

Pandas: how to split or extract multiple values from column

Answers (1)

Related Questions