Reputation: 123
This is my first Stack post and I'm new in the worlds of Python and Pandas. I'm trying to get 3 separate values from a column. I tried both .str.split and .str.extract but can't get the result. The df I have:
wow
0 1000100011-DT000111-1111 Hellostreet 45 Town
1 1000100012 DT000122-1222-Hellostrasse 56 Place
2 1000100013-DT000111-1133
3 1000106789 DT000111-1144 Street 45
4 DT000111-1441 Hellostreet 100
The result I want:
number1 number2 street3
0 1000100011 DT000111-1111 Hellostreet 45 Town
1 1000100012 DT000122-1222 Hellostrasse 56 Place
2 1000100013 DT000111-1133 None
3 1000106789 DT000111-1144 Street 45
4 None DT000111-1441 Hellostreet 100
[number1] - numbers are random but always 10 digits [number2] - numbers are random but always 6 first '-' and then 4 digits. Characters are always 'DT'[street3] can consist of a street name, number and/or town. Those do not have to be separated.
Thank you in advance!
Upvotes: 1
Views: 384
Reputation: 1723
To solve it, you can use the .str.extract method, splitting the string into three groups using regular expressions.
import pandas as pd
df = pd.DataFrame({'wow': ['1000100011-DT000111-1111 Hellostreet 45 Town',
'1000100012 DT000122-1222-Hellostrasse 56 Place',
'1000100013-DT000111-1133',
'1000106789 DT000111-1144 Street 45',
'DT000111-1441 Hellostreet 100']})
df = df['wow'].str.extract(
r"(?P<number1>\d{0,10}).*?(?P<number2>DT\d{6}-\d{4}).*?(?P<street3>\w.*|$)")
df.replace('', None, inplace=True)
print(df)
number1 number2 street3
0 1000100011 DT000111-1111 Hellostreet 45 Town
1 1000100012 DT000122-1222 Hellostrasse 56 Place
2 1000100013 DT000111-1133 None
3 1000106789 DT000111-1144 Street 45
4 None DT000111-1441 Hellostreet 100
"
(?P<number1>\d{0,10}).*?(?P<number2>DT\d{6}-\d{4}).*?(?P<street3>\w.*|$)
"
g
Named Capture Group number1 (?P<number1>\d{0,10})
\d matches a digit (equivalent to [0-9])
{0,10} matches the previous token between 0 and 10 times, as many times as possible, giving back as needed (greedy)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Named Capture Group number2 (?P<number2>DT\d{6}-\d{4})
DT matches the characters DT literally (case sensitive)
\d matches a digit (equivalent to [0-9])
{6} matches the previous token exactly 6 times
- matches the character - with index 4510 (2D16 or 558) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
{4} matches the previous token exactly 4 times
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Named Capture Group street3 (?P<street3>\w.*|$)
1st Alternative \w.*
\w matches any word character (equivalent to [a-zA-Z0-9_])
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative $
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
Global pattern flags
g modifier: global. All matches (don't return after first match)
Upvotes: 2