
Shorter notation for columns in pandas DataFrame

Take a random DataFrame:

df = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])

Pandas allows defining new columns in two ways:

df['c'] = df.a + df.b
df['c'] = df['a'] + df['b']

As the DataFrame name gets longer, this notation becomes less readable.

And then there's the query function:

df.query('a > b')

It returns the rows of df that match the condition.
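For comparison, here is a small sketch of what the query string stands for. The fixed values are hypothetical and replace the random frame only so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical fixed values instead of np.random.rand
df = pd.DataFrame({'a': [0.9, 0.1, 0.5], 'b': [0.2, 0.8, 0.4]})

by_query = df.query('a > b')       # column names are resolved inside the string
by_mask = df[df['a'] > df['b']]    # the same filter written out in full

print(by_query.index.tolist())     # [0, 2]
print(by_query.equals(by_mask))    # True
```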

Is there a way to run something like DataFrame.query() but for operations on the frame?

Upvotes: 1

Answers (3)

Anton Tarasenko

Reputation: 8455

Function DataFrame.eval() does exactly this:

df.eval('c = a + b')

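To assign in place, and to avoid the warning that older pandas versions emit when inplace is not given (newer versions return a modified copy by default):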

df.eval('c = a + b', inplace=True)
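A minimal sketch of the difference between the two calls, assuming a recent pandas version in which inplace defaults to False. The frame values are hypothetical and fixed so the output is predictable.

```python
import pandas as pd

# Hypothetical fixed values instead of random data
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [0.5, 0.5, 0.5]})

# Without inplace=True, eval returns a new frame; df itself is untouched.
out = df.eval('c = a + b')
print('c' in df.columns)    # False
print('c' in out.columns)   # True

# With inplace=True, the column is added to df and the call returns None.
df.eval('c = a + b', inplace=True)
print(df['c'].tolist())     # [1.5, 2.5, 3.5]
```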

More generally, from the pandas.eval() documentation:

The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators.

The pandas docs say that eval supports only Python expressions (e.g., a == b), not statements, but it also silently accepts function calls such as abs(a - b), and maybe a few others. Everything else throws an error. For example:

df.eval('del(a)')

raises NotImplementedError: 'Delete' nodes are not implemented.
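A short sketch of both behaviours on a hypothetical frame. The exception is caught broadly here because the exact type may differ between pandas versions; NotImplementedError is what this answer observed.

```python
import pandas as pd

# Hypothetical fixed values
df = pd.DataFrame({'a': [1.0, 5.0, 2.0], 'b': [4.0, 3.0, 2.0]})

# A function call such as abs() is accepted inside the expression.
diff = df.eval('abs(a - b)')
print(diff.tolist())  # [3.0, 2.0, 0.0]

# A statement such as del is rejected; report whatever is raised.
try:
    df.eval('del(a)')
except Exception as err:
    print(type(err).__name__, err)
```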

Upvotes: 2

piRSquared

Reputation: 294478

Consider the dataframe named my_obnoxiously_long_dataframe_name:

import numpy as np
import pandas as pd

np.random.seed([3,1415])
my_obnoxiously_long_dataframe_name = pd.DataFrame(
    np.random.randint(10, size=(10, 10)),
    columns=list('ABCDEFGHIJ')
)

my_obnoxiously_long_dataframe_name

   A  B  C  D  E  F  G  H  I  J
0  0  2  7  3  8  7  0  6  8  6
1  0  2  0  4  9  7  3  2  4  3
2  3  6  7  7  4  5  3  7  5  9
3  8  7  6  4  7  6  2  6  6  5
4  2  8  7  5  8  4  7  6  1  5
5  2  8  2  4  7  6  9  4  2  4
6  6  3  8  3  9  8  0  4  3  0
7  4  1  5  8  6  0  8  7  4  6
8  3  5  8  5  1  5  1  4  3  9
9  5  5  7  0  3  2  5  8  8  9

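If you want cleaner code, bind the dataframe to a shorter temporary name. Both names refer to the same object, so columns added through the short name show up in the original, and deleting the short name afterwards leaves the dataframe intact: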

d_ = my_obnoxiously_long_dataframe_name

d_['K'] = abs(d_.J - d_.D)
d_['L'] = d_.A + d_.B

del d_

my_obnoxiously_long_dataframe_name

   A  B  C  D  E  F  G  H  I  J  K   L
0  0  2  7  3  8  7  0  6  8  6  3   2
1  0  2  0  4  9  7  3  2  4  3  1   2
2  3  6  7  7  4  5  3  7  5  9  2   9
3  8  7  6  4  7  6  2  6  6  5  1  15
4  2  8  7  5  8  4  7  6  1  5  0  10
5  2  8  2  4  7  6  9  4  2  4  0  10
6  6  3  8  3  9  8  0  4  3  0  3   9
7  4  1  5  8  6  0  8  7  4  6  2   5
8  3  5  8  5  1  5  1  4  3  9  4   8
9  5  5  7  0  3  2  5  8  8  9  9  10
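The alias works because assignment in Python binds a second name to the same object rather than copying it. A small sketch with a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical data, much smaller than the frame above
long_name_frame = pd.DataFrame({'A': [0, 3], 'B': [2, 6]})

d_ = long_name_frame            # no copy: both names refer to one object
print(d_ is long_name_frame)    # True

d_['L'] = d_.A + d_.B           # so this adds the column to the original too
del d_                          # removes only the short name

print(long_name_frame['L'].tolist())  # [2, 9]
```

If an independent copy is wanted instead, long_name_frame.copy() would be the call to use, at the cost of having to assign the result back.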

Upvotes: 1

Scott Boston

Reputation: 153500

Here's a way using assign and add:

df.assign(c=df.a.add(df.b))

          a         b         c
0  0.086468  0.978044  1.064512
1  0.270727  0.789762  1.060489
2  0.150097  0.662430  0.812527

Note: assign returns a new dataframe with the extra column and leaves the original untouched. To keep the result, assign it to a new variable or back to df.
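A sketch of the copy behaviour on hypothetical fixed values. The last call is an addition to the answer above: assign also accepts a callable, which receives the frame being assigned to, so the frame's name does not have to be repeated.

```python
import pandas as pd

# Hypothetical fixed values
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [0.5, 0.25]})

# assign returns a new frame; df keeps its original columns.
new_df = df.assign(c=df.a.add(df.b))
print(list(df.columns))       # ['a', 'b']
print(new_df['c'].tolist())   # [1.5, 2.25]

# Callable form: d is the frame that assign was called on.
same = df.assign(c=lambda d: d.a + d.b)
print(same.equals(new_df))    # True
```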

Upvotes: 1
