Reputation: 3963
I have a data frame with one (string) column and I'd like to split it into two (string) columns, with one column header as 'fips' and the other as 'row'. My dataframe df looks like this:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas?
This is what I'd like to end up with:
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Upvotes: 396
Views: 933217
Reputation: 15177
For the simple case described in the question, the simplest solution is:
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.
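A minimal sketch of that behavior (the column name C and its values are made up for illustration):
>>> import pandas as pd
>>> uneven = pd.DataFrame({'C': ['x-y', 'x-y-z']})
>>> uneven['C'].str.split('-', expand=True)
   0  1     2
0  x  y  None
1  x  y     z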
Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method. But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough¹. It operates on a column (Series) of strings, and returns a column (Series) of lists:
>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df
AB
0 A1-B1
1 A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df
AB AB_split
0 A1-B1 [A1, B1]
1 A2-B2 [A2, B2]
¹ If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
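For a quick illustration with plain Python (the string here is made up): the first parameter is the separator and the second is the maximum number of splits.
>>> 'A1-B1-C1'.split('-')      # sep only: split on every dash
['A1', 'B1', 'C1']
>>> 'A1-B1-C1'.split('-', 1)   # maxsplit=1: at most one split
['A1', 'B1-C1']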
But how do you go from that single column of lists to two separate columns?
Well, we need to take a closer look at the .str attribute of a column. It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:
>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df
U
0 A
1 B
2 C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df
U L
0 A a
1 B b
2 C c
But it also has an "indexing" interface for getting each element of a string by its index:
>>> df['AB'].str[0]
0 A
1 A
Name: AB, dtype: object
>>> df['AB'].str[1]
0 1
1 2
Name: AB, dtype: object
Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:
>>> df['AB'].str.split('-', n=1).str[0]
0 A1
1 A2
Name: AB, dtype: object
>>> df['AB'].str.split('-', n=1).str[1]
0 B1
1 B2
Name: AB, dtype: object
Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do
>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df
AB AB_split A B
0 A1-B1 [A1, B1] A1 B1
1 A2-B2 [A2, B2] A2 B2
Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:
>>> df['AB'].str.split('-', n=1, expand=True)
0 1
0 A1 B1
1 A2 B2
So, another way of accomplishing what we wanted is to do:
>>> df = df[['AB']]
>>> df
AB
0 A1-B1
1 A2-B2
>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
AB A B
0 A1-B1 A1 B1
1 A2-B2 A2 B2
The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
AB
0 A1-B1
1 A2-B2
2 A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>>
But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":
>>> df.join(
... df['AB'].str.split('-', expand=True).rename(
... columns={0:'A', 1:'B', 2:'C'}
... )
... )
AB A B C
0 A1-B1 A1 B1 None
1 A2-B2 A2 B2 None
2 A3-B3-C3 A3 B3 C3
Upvotes: 821
Reputation: 558
FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
Just a small update to the first answer. If you encounter the warning message above (I am currently using pd.__version__ == 1.5.3), which in future versions will become an error, you can avoid it by passing the arguments by keyword. So, the "simplest solution" that LeoRochael posted:
# Will raise a warning or error
df[['A', 'B']] = df['AB'].str.split(' ', 1, expand=True)
should be:
# For newer pandas versions
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
Other examples also posted here will need the name of the argument as well.
# For instance, if you want to split and also remove the original column
df[['A','B']] = df.pop('AB').str.split(n=1, expand=True)
I hope this can help.
Upvotes: 0
Reputation: 6455
Use df.assign to create a new df. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop('name', axis=1, inplace=True)
Or in method chain form:
df_split = (df_selected
.assign(list_col=lambda df: df['name'].str.split(',', n=1, expand=False),
first_name=lambda df: df.list_col.str[0],
last_name=lambda df: df.list_col.str[1])
.drop(columns=['list_col']))
Upvotes: 14
Reputation: 80346
There might be a better way, but here's one approach:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Upvotes: 187
Reputation: 1942
I saw that no one had used the slice method, so here are my 2 cents.
df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)
Assign each of these to a new column to create the two columns.
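For example, on the question's data, and assuming the code is always five characters followed by a single space, the assignments could look like this sketch:
df['fips'] = df['row'].str.slice(stop=5)   # first five characters
df['row'] = df['row'].str.slice(start=6)   # everything after the space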
Upvotes: 3
Reputation: 402263
Surprised I haven't seen this one yet. If you only need two splits, I highly recommend... Series.str.partition. partition performs one split on the separator, and is generally quite performant.
df['row'].str.partition(' ')[[0, 2]]
0 2
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you need to rename the columns,
df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you need to join this back to the original, use join or concat:
df.join(df['row'].str.partition(' ')[[0, 2]])
pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)
row 0 2
0 00000 UNITED STATES 00000 UNITED STATES
1 01000 ALABAMA 01000 ALABAMA
2 01001 Autauga County, AL 01001 Autauga County, AL
3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 01005 Barbour County, AL 01005 Barbour County, AL
Upvotes: 12
Reputation: 863
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
Upvotes: 56
Reputation: 75
I prefer extracting the corresponding pandas Series (i.e. the columns I need), using the apply function to split the column content into multiple Series, and then joining the generated columns to the existing DataFrame. Of course, the source column should be removed.
e.g.
col1 = df["<col_name>"].apply(<function>)
col2 = ...
df = df.join(col1.to_frame(name="<name1>"))
df = df.join(col2.to_frame(name="<name2>"))
df = df.drop(["<col_name>"], axis=1)
To split two-word strings, the function should be something like this:
lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
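As a concrete sketch on the question's data (the column and target names are my own choice, and I pass maxsplit=1 so that multi-word names stay intact):
col1 = df["row"].apply(lambda x: x.split(" ", 1)[0])   # the five-digit code
col2 = df["row"].apply(lambda x: x.split(" ", 1)[-1])  # everything after the first space
df = df.join(col1.to_frame(name="fips"))
df = df.join(col2.to_frame(name="name"))
df = df.drop(["row"], axis=1)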
Upvotes: 2
Reputation: 862471
You can use str.split by whitespace (the default separator) with the parameter expand=True, which returns a DataFrame that you can assign to new columns:
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL']})
print (df)
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
row a b
0 00000 UNITED STATES 00000 UNITED STATES
1 01000 ALABAMA 01000 ALABAMA
2 01001 Autauga County, AL 01001 Autauga County, AL
3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 01005 Barbour County, AL 01005 Barbour County, AL
A modification, if you need to remove the original column, uses DataFrame.pop:
df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
a b
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Which is the same as:
df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)
a b
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you get the error:
# n=1 removed, so this splits on every whitespace
df[['a','b']] = df['row'].str.split(expand=True)
ValueError: Columns must be same length as key
You can check that it returns a DataFrame with 4 columns, not only 2:
print (df['row'].str.split(expand=True))
0 1 2 3
0 00000 UNITED STATES None
1 01000 ALABAMA None None
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Then the solution is to append the new DataFrame with join:
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL'],
'a':range(5)})
print (df)
a row
0 0 00000 UNITED STATES
1 1 01000 ALABAMA
2 2 01001 Autauga County, AL
3 3 01003 Baldwin County, AL
4 4 01005 Barbour County, AL
df = df.join(df['row'].str.split(expand=True))
print (df)
a row 0 1 2 3
0 0 00000 UNITED STATES 00000 UNITED STATES None
1 1 01000 ALABAMA 01000 ALABAMA None None
2 2 01001 Autauga County, AL 01001 Autauga County, AL
3 3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 4 01005 Barbour County, AL 01005 Barbour County, AL
To also remove the original column (if there are other columns):
df = df.join(df.pop('row').str.split(expand=True))
print (df)
a 0 1 2 3
0 0 00000 UNITED STATES None
1 1 01000 ALABAMA None None
2 2 01001 Autauga County, AL
3 3 01003 Baldwin County, AL
4 4 01005 Barbour County, AL
Upvotes: 31
Reputation: 1311
If you want to split a string into more than two columns based on a delimiter, you can omit the 'maximum splits' (n) parameter.
You can use:
df['column_name'].str.split('/', expand=True)
This will automatically create as many columns as the maximum number of fields included in any of your initial strings.
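A small sketch with a made-up column of slash-separated values of varying length:
>>> import pandas as pd
>>> paths = pd.DataFrame({'path': ['a/b', 'a/b/c', 'a']})
>>> paths['path'].str.split('/', expand=True)
   0     1     2
0  a     b  None
1  a     b     c
2  a  None  None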
Upvotes: 12
Reputation: 375395
You can extract the different parts out quite neatly using a regex pattern:
In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]:
fips 1 state county state_code
0 00000 UNITED STATES UNITED STATES NaN NaN
1 01000 ALABAMA ALABAMA NaN NaN
2 01001 Autauga County, AL NaN Autauga County AL
3 01003 Baldwin County, AL NaN Baldwin County AL
4 01005 Barbour County, AL NaN Barbour County AL
[5 rows x 5 columns]
To explain the somewhat long regex:

(?P<fips>\d{5})

matches the five digits (\d{5}) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

does either (|) one of two things:

(?P<state>[A-Z ]*$)

matches any number (*) of capital letters or spaces ([A-Z ]) and names this "state", before the end of the string ($),

or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$)

matches anything else (.*?), then a comma and a space, then matches the two-letter state_code before the end of the string ($).

In the example above:
Note that the first two rows hit the "state" (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).
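If you want these new columns alongside the original, one option (a sketch reusing the same pattern; the raw-string prefix just avoids escape-sequence warnings for \d in recent Python) is to join the extracted frame back onto df:
extracted = df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
df = df.join(extracted)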
Upvotes: 80
Reputation: 536
If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:
df["flips"], df["row_name"] = zip(*df["row"].str.split().tolist())
del df["row"]
Upvotes: 26