Reputation: 1833
I used pd.get_dummies to one-hot encode my categorical variables for use as predictors. Some of my columns had many unique values, so I now have many new columns, and I am trying to find a fast way to create interaction terms for them. (I only want interactions for a subset of my columns, so PolynomialFeatures() won't work... or will it?)
Here is what I am trying to do:
Step 1: Create lists of column names for each of the subset I want to multiply:
channel = [col for col in df if col.startswith('channel')]
quote = [col for col in df if col.startswith('quote')]
print(channel[:2])
Out: ['channel_A', 'channel_B']
Step 2: for loop:
cols = 'channel quote'.split()
for col in cols:
    for i in col:
        colname = 'value_X_'+i
        df[colname] = df['value_days']*df[i]+0
The problem is that the inner loop does not treat col as the list with that name: it treats it as a string and iterates over its characters (error = 'c'), as evidenced by:
for col in cols:
    for i in col:
        print(i)
Out[1]:
c
h
.
.
.
o
t
e
Goal: My desired outcome is a new column that is named after the two columns that were multiplied and that holds their product.
For example, the first element in channel is channel_A, so I want a new column named value_X_channel_A whose values are the product value_days * channel_A.
value_days | channel_A | value_X_channel_A
-----------|-----------|------------------
5          | 5         | 25
This works perfectly fine if I run only the inner loop and replace col with channel.
How can I get this to work?
Thanks in advance.
Upvotes: 1
Views: 715
Reputation: 1518
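Oh I see. In your loop, col is the string 'channel', not the list stored in the variable channel, so the inner loop iterates over the characters of that string. To loop over the values in the channel variable, you first need to look the variable up by its name with the vars() function.
Example: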
channel=['channel_A','channel_B']
quote=['quote_A','quote_B']
cols = 'channel quote'.split()
for col in cols:
    var=vars()[col]
    for ele in var:
        print(ele)
Output:
channel_A
channel_B
quote_A
quote_B
For your loop, change it to:
cols = 'channel quote'.split()
for col in cols:
    for i in vars()[col]:
        colname = 'value_X_'+i
        df[colname] = df['value_days']*df[i]+0
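To check the fix end to end, here is a minimal sketch on a toy frame. The data values and the use of dtype=int in pd.get_dummies are assumptions made for illustration, not taken from your real df.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the real df.
raw = pd.DataFrame({
    "value_days": [5, 3, 2],
    "channel": ["A", "B", "A"],
    "quote": ["A", "A", "B"],
})
# One-hot encode; dtype=int keeps the dummies as 0/1 integers.
df = pd.get_dummies(raw, columns=["channel", "quote"], dtype=int)

# Lists of dummy column names, built before any new columns are added.
channel = [col for col in df if col.startswith("channel")]
quote = [col for col in df if col.startswith("quote")]

for name in "channel quote".split():
    # vars()[name] fetches the list bound to the variable called `name`.
    for col in vars()[name]:
        df["value_X_" + col] = df["value_days"] * df[col]

print(df.filter(like="value_X_"))
```

Note that vars() with no argument behaves like locals(), so the lists have to be defined in the same scope as the loop for the lookup to succeed.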
Feel free to ask if you are still not clear.
Upvotes: 1
Reputation: 5759
Your question is worded in a way that is hard to understand (for me at least). If I understand correctly, you want to multiply each column whose name starts with "channel" or "quote" by the "value_days" column in your df, and store each product in a new column named value_X_{i}, where {i} is the name of the column that was multiplied. You're close, but your code is awkward. Use another data structure (a dictionary) to make the code straightforward and readable:
d = {
    'quote' : [col for col in df if col.startswith('quote')],
    'channel' : [col for col in df if col.startswith('channel')]
}
for columns_string, columns in d.items():
    for col_string in columns:
        colname = 'value_X_'+col_string
        # multiply by the current dummy column (col_string)
        df[colname] = df['value_days'] * df[col_string] + 0
Explanation:
d = ...
- Creates a dictionary with two key/value pairs. The keys are 'quote' and 'channel', and each value is the list of the matching column names.
for columns_string, columns in d.items():
- .items() returns the dictionary's key/value pairs. The loop goes through them, binding each key to 'columns_string' and each list of column names to 'columns'.
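Since you asked for a fast way, the dictionary loop can also be written without the inner loop. The following is only a sketch on a toy frame with made-up values; the names groups, looped, selected, interactions and vectorized are mine. It uses DataFrame.mul with axis=0 to multiply all selected columns by value_days at once, and add_prefix to name the results.

```python
import pandas as pd

# Toy one-hot frame (hypothetical values) standing in for the real df.
df = pd.DataFrame({
    "value_days": [5, 3, 2],
    "channel_A": [1, 0, 1],
    "channel_B": [0, 1, 0],
    "quote_A": [1, 1, 0],
    "quote_B": [0, 0, 1],
})

# Same dictionary idea as above: prefix -> list of matching column names.
groups = {
    "quote": [col for col in df if col.startswith("quote")],
    "channel": [col for col in df if col.startswith("channel")],
}

# Loop version: one new column per dummy column.
looped = df.copy()
for prefix, columns in groups.items():
    for col in columns:
        looped["value_X_" + col] = looped["value_days"] * looped[col]

# Vectorized variant: multiply every selected column by value_days row-wise,
# then prefix the resulting column names.
selected = [col for columns in groups.values() for col in columns]
interactions = df[selected].mul(df["value_days"], axis=0).add_prefix("value_X_")
vectorized = pd.concat([df, interactions], axis=1)

print(vectorized.filter(like="value_X_"))
```

Both versions should produce the same columns. The vectorized one avoids inserting columns one at a time, which tends to matter when the dummies number in the hundreds.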
You can quickly see that something is wrong with your code by noticing that you create the variables channel and quote and set them to their corresponding values, but you never actually use either of those lists in your loop.
Upvotes: 1