Reputation: 1342
Very new to pandas so any explanation with a solution is appreciated.
I have a dataframe such as
Company Zip State City
1 *CBRE San Diego, CA 92101
4 1908 Brands Boulder, CO 80301
7 1st Infantry Division Headquarters Fort Riley, KS
10 21st Century Healthcare, Inc. Tempe 85282
15 AAA Jefferson City, MO 65101-9564
I want to split the Zip State city column in my data into 3 different columns. Using the answer from this post Pandas DataFrame, how do i split a column into two I could accomplish this task if I didn't have my first column. Writing a regex to captures all companies just leads to me capturing everything in my data.
I also tried
foo = lambda x: pandas.Series([i for i in reversed(x.split())])
data_pretty = data['Zip State City'].apply(foo)
but this causes me to loose the company column and splits the names of the cities that are more than one word into separate columns.
How can I split my last column while keeping the company column data?
Upvotes: 3
Views: 882
Reputation: 210842
you can use extract() method:
In [110]: df
Out[110]:
Company Zip State City
1 *CBRE San Diego, CA 92101
4 1908 Brands Boulder, CO 80301
7 1st Infantry Division Headquarters Fort Riley, KS
10 21st Century Healthcare, Inc. Tempe 85282
15 AAA Jefferson City, MO 65101-9564
In [112]: df[['City','State','ZIP']] = df['Zip State City'].str.extract(r'([^,\d]+)?[,]*\s*([A-Z]{2})?\s*([\d\-]{4,11})?', expand=True)
In [113]: df
Out[113]:
Company Zip State City City State ZIP
1 *CBRE San Diego, CA 92101 San Diego CA 92101
4 1908 Brands Boulder, CO 80301 Boulder CO 80301
7 1st Infantry Division Headquarters Fort Riley, KS Fort Riley KS NaN
10 21st Century Healthcare, Inc. Tempe 85282 Tempe NaN 85282
15 AAA Jefferson City, MO 65101-9564 Jefferson City MO 65101-9564
From docs:
Series.str.extract(pat, flags=0, expand=None)
For each subject string in the Series, extract groups from the first match of regular expression pat.
New in version 0.13.0.
Parameters:
pat : string
Regular expression pattern with capturing groups
flags : int, default 0 (no flags)
re module flags, e.g. re.IGNORECASE .. versionadded:: 0.18.0
expand : bool, default False
If True, return DataFrame.
If False, return Series/Index/DataFrame.
Returns: DataFrame with one row for each subject string, and one column for each group. Any capture group names in regular expression pat will be used for column names; otherwise capture group numbers will be used. The dtype of each result column is always object, even when no match is found. If expand=True and pat has only one capture group, then return a Series (if subject is a Series) or Index (if subject is an Index).
Upvotes: 8