Reputation: 21
My question is about the methodology/syntax described in a previous post, which covers different approaches to the same objective: splitting string values into lists and assigning each list item to a new column. Here's the post: Pandas DataFrame, how do i split a column into two
df:
GDP
Date
Mar 31, 2017 19.03 trillion
Dec 31, 2016 18.87 trillion
script 1 + output:
>>> df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1).str
>>> print(df)
GDP Units
Date
Mar 31, 2017 19.03 trillion
Dec 31, 2016 18.87 trillion
script 2 + output:
>>> df[['GDP', 'Units']] = df['GDP'].str.split(' ', 1, expand=True)
>>> print(df)
GDP Units
Date
Mar 31, 2017 19.03 trillion
Dec 31, 2016 18.87 trillion
script 3 + output:
>>> df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1, expand=True)
>>> print(df)
GDP Units
Date
Mar 31, 2017 0 1
Dec 31, 2016 0 1
Can anyone explain what is going on? Why does script 3 produce these values in the output?
Upvotes: 2
Views: 3305
Reputation: 294586
Let's start by looking at this
df['GDP'].str.split(' ', 1)
0 [19.03, trillion]
1 [18.87, trillion]
Name: GDP, dtype: object
It produces a series of lists. However, pd.Series.str, aka the string accessor, lets us access the first, second, ... parts of these embedded lists via intuitive Python list indexing.
df['GDP'].str.split(' ', 1).str[0]
Date
Mar 31, 2017 19.03
Dec 31, 2016 18.87
Name: GDP, dtype: object
Or
df['GDP'].str.split(' ', 1).str[1]
Date
Mar 31, 2017 trillion
Dec 31, 2016 trillion
Name: GDP, dtype: object
So, if we split into two-element lists with split(' ', 1), we can treat the object returned by an additional str as an iterable
a, b = df['GDP'].str.split(' ', 1).str
a
Date
Mar 31, 2017 19.03
Dec 31, 2016 18.87
Name: GDP, dtype: object
And
b
Date
Mar 31, 2017 trillion
Dec 31, 2016 trillion
Name: GDP, dtype: object
Ok, we can short-cut the creation of two new columns by leveraging this iterable unpacking
df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1).str
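Unpacking the str accessor directly depends on the pandas version, and it may not work on recent releases. A minimal sketch of the same idea with explicit list indexing, assuming the frame from the question is rebuilt by hand and that `n` is passed as a keyword (the name `parts` is mine, for illustration only):

```python
import pandas as pd

# Rebuild the example frame from the question (values copied from the post).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

# Split once on the first space; each row becomes a two-element list.
parts = df["GDP"].str.split(" ", n=1)

# Index into the lists explicitly instead of unpacking the accessor.
# The right-hand side is fully evaluated before either column is assigned.
df["GDP"], df["Units"] = parts.str[0], parts.str[1]

print(df)
```

This gives the same two columns as script 1 without relying on iteration over the accessor.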
Alternatively, we can pass expand=True to expand our new lists into new dataframe columns
df['GDP'].str.split(' ', 1, expand=True)
0 1
Date
Mar 31, 2017 19.03 trillion
Dec 31, 2016 18.87 trillion
Now we can assign a dataframe to new columns of another dataframe like so
df[['GDP', 'Units']] = df['GDP'].str.split(' ', 1, expand=True)
However, when we do
df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1, expand=True)
The return value of df['GDP'].str.split(' ', 1, expand=True) is a dataframe, and unpacking a dataframe iterates over its column labels, not its column values. As the output just above shows, those labels are 0 and 1. So in this case, the scalar 0 is assigned to the column df['GDP'] and the scalar 1 is assigned to the column df['Units'], and each scalar is broadcast to every row.
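To see why script 3 fills both columns with constants, it helps to check what iterating over the expanded dataframe actually yields. A minimal sketch, assuming the frame from the question is rebuilt by hand (the names `expanded`, `labels`, `broken` and `fixed` are mine, for illustration only):

```python
import pandas as pd

# Rebuild the example frame from the question (values copied from the post).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

expanded = df["GDP"].str.split(" ", n=1, expand=True)

# Iterating over a DataFrame yields its column labels,
# much like iterating over a dict yields its keys.
labels = list(expanded)

# Script 3 therefore assigns the scalars 0 and 1,
# which pandas broadcasts to every row.
broken = df.copy()
broken["GDP"], broken["Units"] = expanded

# To unpack the columns themselves, iterate over the column Series explicitly.
fixed = df.copy()
fixed["GDP"], fixed["Units"] = (expanded[col] for col in expanded.columns)

print(labels)
print(broken)
print(fixed)
```

If tuple unpacking is the preferred style, unpacking the column Series as in `fixed` keeps it; otherwise the df[['GDP', 'Units']] = ... form from script 2 is the more direct choice.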
Upvotes: 5