Reputation: 183
I have data in the following format:
data = [["a", "b", "c"],
["b", "c"],
["d", "e", "f", "c"]]
and I would like to build a DataFrame with all the unique strings as columns and binary occurrence values, like this:
a b c d e f
0 1 1 1 0 0 0
1 0 1 1 0 0 0
2 0 0 1 1 1 1
I have working code using list comprehensions, but it's pretty slow for large data.
# vocab_list contains all the unique keys, which is obtained when reading in data from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list] for entry in data])
Is there any way to optimise this task? Thanks.
EDIT (a small sample of actual data):
[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]
Upvotes: 2
Views: 710
Reputation: 2094
Just out of curiosity, I timed the suggested solutions:
from string import ascii_letters
from random import choice, randint
from datetime import datetime
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
data = []
for _ in range(10000):
    data.append([choice(ascii_letters) for _ in range(randint(25, 65))])
print("Time using 'pd.Series' and 'str.get_dummies()':")
startTime = datetime.now()
df = pd.Series(['|'.join(x) for x in data]).str.get_dummies()
print(datetime.now() - startTime)
print("Time using 'pd.get_dummies()':")
startTime = datetime.now()
df2 = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(datetime.now() - startTime)
print("Time using 'MultiLabelBinarizer()':")
startTime = datetime.now()
mlb = MultiLabelBinarizer()
df3 = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print(datetime.now() - startTime)
The exact numbers will differ from run to run, since the list lengths are random, but the relative differences stay more or less the same:
Time using 'pd.Series' and 'str.get_dummies()':
0:00:00.450311
Time using 'pd.get_dummies()':
0:00:00.498003
Time using 'MultiLabelBinarizer()':
0:00:00.083955
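(For steadier numbers, timeit.repeat could be used instead of one-off datetime deltas; a minimal sketch, timing just the MultiLabelBinarizer variant and reusing the data and mlb objects defined above:)
from timeit import repeat
best = min(repeat(
    "pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)",
    globals=globals(),  # reuses data and mlb from the script above
    number=5, repeat=3))
print("best of 3 repeats (5 calls each): %.3fs" % best)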
So, indeed, with sklearn you can get much faster results.
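And if the real vocabulary is much larger than 52 letters, MultiLabelBinarizer also accepts sparse_output=True; the SciPy sparse matrix it returns can then be wrapped by recent pandas versions without densifying it. A minimal sketch under the same setup:
mlb_sparse = MultiLabelBinarizer(sparse_output=True)
mat = mlb_sparse.fit_transform(data)  # scipy.sparse matrix of 0/1 flags
df4 = pd.DataFrame.sparse.from_spmatrix(mat, columns=mlb_sparse.classes_)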
Upvotes: 1
Reputation: 862681
For better performance use MultiLabelBinarizer:
data = [["a", "b", "c"],
["b", "c"],
["d", "e", "f", "c"]]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print (df)
a b c d e f
0 1 1 1 0 0 0
1 0 1 1 0 0 0
2 0 0 1 1 1 1
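A side note: if the same encoding later needs to be applied to new rows, fit once and reuse transform, so the columns stay aligned with mlb.classes_. A sketch (new_rows is made up for illustration):
mlb = MultiLabelBinarizer()
mlb.fit(data)                   # learn the vocabulary once
new_rows = [["a", "f"], ["b"]]  # hypothetical new data with known labels
df_new = pd.DataFrame(mlb.transform(new_rows), columns=mlb.classes_)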
EDIT:
data = [['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print (df)
a abiding about absurd again age aggravated aggressively all \
0 1 0 1 1 1 0 0 0 0
1 1 1 0 0 0 1 1 1 1
almost ... also although an associates writes wrote x york you \
0 0 ... 0 0 1 1 1 1 1 1 1
1 1 ... 1 1 0 0 0 0 0 0 0
your
0 1
1 0
[2 rows x 22 columns]
A pure pandas solution is also possible, but I guess it should be slower:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)
a b d c e f
0 1 1 0 1 0 0
1 0 1 0 1 0 0
2 0 0 1 1 1 1
And for the sample of actual data:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)
a abiding about absurd age again aggravated aggressively an all \
0 1 0 1 1 0 1 0 0 1 0
1 1 1 0 0 1 0 1 1 0 1
... writes alone wrote already x also york although you your
0 ... 1 0 1 0 1 0 1 0 1 1
1 ... 0 1 0 1 0 1 0 1 0 0
[2 rows x 22 columns]
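Note that the level= keyword of DataFrame.max is deprecated in later pandas releases; on newer versions the same duplicate-column reduction can be written, as far as I know, with a groupby over the column level:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').groupby(level=0, axis=1).max()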
Upvotes: 6
Reputation: 18647
You could use join in a list comprehension, and str.get_dummies:
df = pd.Series(['|'.join(x) for x in data]).str.get_dummies()
[out]
a b c d e f
0 1 1 1 0 0 0
1 0 1 1 0 0 0
2 0 0 1 1 1 1
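One caveat: the join/split round trip assumes '|' never appears inside a token. If it might, pass a separator that cannot occur in the data via the sep parameter (the character below is just an assumption):
sep = '\x1f'  # ASCII unit separator, assumed absent from all tokens
df = pd.Series([sep.join(x) for x in data]).str.get_dummies(sep=sep)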
Upvotes: 1