JAG2024
JAG2024

Reputation: 4317

Separate string from numeric in single Pandas Dataframe column and create two new columns

I'm shocked that no one has asked this on SO before.. since it seems like a simple enough of a problem.

I have a single column in a pandas Dataframe that looks like this:

df = pd.DataFrame(data=[['APPLEGATE WINERY    455.292049'],['AMAND FARM  849.827192'],['COBB FARM ST    1039.49357'],['DIRIGIA 2048.947284']], columns = ['Col1'])

    Col1
0   APPLEGATE WINERY 455.292049
1   AMAND FARM 849.827192
2   COBB FARM ST 1039.49357
3   DIRIGIA 2048.947284

And I just want to separate the string characters from the numeric, so the result should look like this

Name                Area
APPLEGATE WINERY    455.292049
AMAND FARM          849.827192
COBB FARM ST        1039.49357
DIRIGIA             2048.947284

I know I can use Regular Expressions in python, but this seems like overkill since a) it's just a separation of data types and b) the strings have different lengths and the numerics have different numbers of digits.

So one result would start to look like this:

df['Name'] = df.Col1.str.extract('([A-Z]\w{0,})', expand=True)
df['Area'] = df.Col1.str.extract('(\d)', expand=True)

But is there a nice, clean solution out there to solve this problem without going through the hassle of using RegEx and instead separating strings from numerics into two columns?

Upvotes: 5

Views: 2857

Answers (4)

Quang Hoang
Quang Hoang

Reputation: 150735

Try this regex:

df.Col1.str.extract('(.*\S)\s+([\d\.]+)')

Output:

                  0            1
0  APPLEGATE WINERY   455.292049
1        AMAND FARM   849.827192
2      COBB FARM ST   1039.49357
3           DIRIGIA  2048.947284

Upvotes: 0

jose_bacoy
jose_bacoy

Reputation: 12684

You can use rsplit. It will split the string starting from the right.

pd.DataFrame(df.Col1.str.rsplit(' ',1).tolist(), columns = ['Name','Area'])

Result:
    Name                Area
0   APPLEGATE WINERY    455.292049
1   AMAND FARM          849.827192
2   COBB FARM ST       1039.49357
3   DIRIGIA            2048.947284

Upvotes: 1

BENY
BENY

Reputation: 323226

Feel like you can just do str.rsplit

df.Col1.str.rsplit(' ',1,expand=True).apply(lambda x : x.str.strip(),1)
Out[314]: 
                  0            1
0  APPLEGATE WINERY   455.292049
1        AMAND FARM   849.827192
2      COBB FARM ST   1039.49357
3           DIRIGIA  2048.947284

Upvotes: 2

cs95
cs95

Reputation: 402293

Use a single extract call. You'll also want to strip trailing whitespaces from the result if you use this regex.

df2 = (df['Col1'].str.extract(r'(?P<Name>.*?)(?P<Area>\d+(?:\.\d+)?)')
                 .applymap(str.strip))
df2
               Name         Area
0  APPLEGATE WINERY   455.292049
1        AMAND FARM   849.827192
2      COBB FARM ST   1039.49357
3           DIRIGIA  2048.947284

Regex Breakdown

(?P<Name>   # first named capture group - "Name"
    .*?     # match anything (non-greedy)
)
(?P<Area>   # second named group - "Area"
    \d+     # match one or more digits,
    (?:     
       \.   # decimal
       \d+  # trailing digits
    )?      # the `?` indicates floating point is optional
)

PS, to convert the "Area" column to numeric, use pd.to_numeric.

Upvotes: 7

Related Questions