Reputation: 55
![enter image description here][1]
I have a weird requirement to add a column similar to the below oracle query to apply on python dataframe. A pandas dataframe column to pass as an argument of row level function to apply each value of the column to its respective row.
select employee_id, first_name, last_name, substr(first_name, employee_id/employee_id, 3) test1 from hr.EMPLOYEES
The result has been attached in the below csv file. I want to add a new column. So I wrote my pandas statement as below
EMPLOYEE.assign(test1=EMPLOYEE.FIRST_NAME.str.slice(EMPLOYEE.EMPLOYEE_ID.apply(lambda x: x/x),4))
In this code I am using EMPLOYEE.EMPLOYEE_ID inside the slice function of EMPLOYEE.FIRST_NAME column.
my intention is if we think this as a variable
FIRST_NAME="Steven"
EMPLOYEE_ID=100
FIRST_NAME[int(EMPLOYEE_ID/EMPLOYEE_ID):4]
the output of this is
'tev'
I am not getting the substring, instead I am getting NaN. My Python output is as below
https://i.sstatic.net/uZN8x.png
The below is the output from SQL query.
EMPLOYEE_ID FIRST_NAME LAST_NAME TEST1
0 100 Steven King Ste
1 101 Neena Kochhar Nee
2 102 Lex De Haan Lex
3 103 Alexander Hunold Ale
4 104 Bruce Ernst Bru
5 105 David Austin Dav
6 106 Valli Pataballa Val
7 107 Diana Lorentz Dia
8 108 Nancy Greenberg Nan
9 109 Daniel Faviet Dan
10 110 John Chen Joh
11 111 Ismael Sciarra Ism
12 112 Jose Manuel Urman Jos
13 113 Luis Popp Lui
We can do it by creating a new column, but, as in SQL we can do it directly. I am expecting the same through pandas. Kindly help me on this.
My intended output is the same as above. But, I am writing a generic script that converts SQL query into pandas dataframe. I am stuck in achieving such scenarios where there is a column is being passed as a parameter of a row level function.
Upvotes: 1
Views: 676
Reputation: 55
df[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME']].apply(lambda i: i.FIRST_NAME[int(i.EMPLOYEE_ID/i.EMPLOYEE_ID):4], axis=1)
The above is giving me the answer. But it is slower than the other one
%timeit EMPLOYEE.apply(lambda i: i.FIRST_NAME[int(i.EMPLOYEE_ID/i.EMPLOYEE_ID):3], axis=1)
1.38 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit EMPLOYEE[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME']].assign(pre=EMPLOYEE['FIRST_NAME'].str[1:4])
948 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 0
Reputation: 61910
Try:
res = df[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME']].assign(pre=df['FIRST_NAME'].str[:3])
print(res)
Output
EMPLOYEE_ID FIRST_NAME LAST_NAME pre
0 100 Steven King Ste
1 101 Neena Kochhar Nee
2 102 Lex De Haan Lex
3 103 Alexander Hunold Ale
4 104 Bruce Ernst Bru
5 105 David Austin Dav
6 106 Valli Pataballa Val
7 107 Diana Lorentz Dia
8 108 Nancy Greenberg Nan
9 109 Daniel Faviet Dan
10 110 John Chen Joh
11 111 Ismael Sciarra Ism
12 112 Jose Manuel Urman Jos
13 113 Luis Popp Lui
To use it with str.slice do:
res = df[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME']].assign(pre=df['FIRST_NAME'].str.slice(stop=3))
where, stop is:
stop int, optional Stop position for slice operation.
basically iterate one character at a time and stop in position 3 (non inclusive).
Upvotes: 1