Reputation: 79
I have the following dataframe with multiple values for a certain column:
my column
0 - ["A", "B"]
1 - ["B", "C", "D"]
2 - ["B", "D"]
How Can I get a dataframe like this : (where every column takes the name of the values in "my column")
"A" "B" "C" "D"
0 - 1 1 0 0
1 - 0 1 1 1
2 - 0 1 0 1
Upvotes: 2
Views: 1744
Reputation: 184
You can use CountVectorizer
, It is specialy designed for this purpose. It takes corpus of text and do One-Hot Encoding
for it.
Note : I am using 'Cat', 'Dog', 'Cow', 'Tiger' Instead of 'A', 'B', 'C', 'D'
Code :
Imports :
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
Method to convert elemrnts of list to string :
def get_string(listt):
return ' '.join(listt)
Creating DataFrame from List :
my_column = pd.Series([['Cat','Dog'],['Dog','Cow','Tiger'],['Dog','Tiger']])
df = pd.DataFrame(my_column, columns=['my_column'])
print(df)
df['text_data'] = df.my_column.apply(get_string)
print(df)
Performing Text Vectorization : tf_vectorizer = CountVectorizer( stop_words=None) vectorized_data = tf_vectorizer.fit_transform(df.text_data)
Preparing final DataFrame :
final_df = pd.DataFrame(vectorized_data.toarray(),columns=tf_vectorizer.get_feature_names())
print(final_df)
Out Put :
Our DataFrame :
my_column
0 [Cat, Dog]
1 [Dog, Cow, Tiger]
2 [Dog, Tiger]
DataFrame with text column:
my_column text_data
0 [Cat, Dog] Cat Dog
1 [Dog, Cow, Tiger] Dog Cow Tiger
2 [Dog, Tiger] Dog Tiger
Expected Result :
cat cow dog tiger
0 1 0 1 0
1 0 1 1 1
2 0 0 1 1
Upvotes: 1
Reputation: 1299
Just for fun here's a naive implementation of dummies:
import pandas as pd
my_column = pd.Series([['A','B'],['B','C','D'],['B','D']])
frameA = pd.DataFrame(my_column, columns=['my_column'])
#extract all new headers from the DataFrame rows, in order:
headers = sorted(list(set([x for y in frame['my_column'] for x in y])))
#make a list of the DataFrame rows (stored as lists):
rows = [y for x in range(len(frame)) for y in frame.loc[x]]
builder = {} #construct a dictionary to build a new DataFrame from
for header in headers:
column = []
for row in rows:
if header in row:
column.append(1)
else:
column.append(0)
builder.update({header:column})
frameB = pd.DataFrame(builder)
print(frameB)
results in:
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
Upvotes: 1
Reputation: 862761
If there are lists in column use Series.str.join
with Series.str.get_dummies
:
df = df['my column'].str.join('|').str.get_dummies()
print (df)
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['my column']),columns=mlb.classes_)
print (df)
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
If there are strings use Series.str.strip
with str.get_dummies
and last if necessary remove "
from columns names:
df = (df['my column'].str.strip('[]')
.str.get_dummies(', ')
.rename(columns=lambda x: x.strip('"')))
print (df)
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
Upvotes: 3
Reputation: 840
I think what you're looking for is theget_dummies()
function in pandas which you can find the documentation for here
From the documentation:
s = pd.Series(list('abca'))
pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Upvotes: 0