Reputation: 20107
I have a set of sentences, and I want to group them such that all the rows in a group share one particular word. However, a sentence can belong to many groups because it contains many words.
So for the example below, there should be groups like this:
import pandas as pd
# An example data set
df = pd.DataFrame({"sentences": [
"two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
"the temperature at which a liquid boils",
"a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
"a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
"a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})
# Create a new column holding the list of words in each sentence
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))
# Try to group by this new column
df.groupby('words').count()
# TypeError: unhashable type: 'list'
However, my code throws the error shown in the last comment above.
Since my task is a bit complicated I know it probably involves more than just calling groupby(). Can someone help me to make word groups with pandas?
Edit: after solving the error by returning tuple(sentence.split()) (thanks ethan-furman), I tried printing the result, but it doesn't seem to have done anything. I think it probably just put each row in its own group:
print(df.groupby('words').count())
# sentences 5
# dtype: int64
Upvotes: 4
Views: 1418
Reputation: 20107
My current solution uses pandas' MultiIndex feature. I'm sure it can be improved with some more efficient use of numpy, but I believe this will perform significantly better than the other python-only answer:
import pandas as pd
import numpy as np
# An example data set
df = pd.DataFrame({"sentences": [
"two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
"the temperature at which a liquid boils",
"a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
"a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
"a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})
# Create a new column holding the list of words in each sentence
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))
# This is all the words in the dataset. Each word will be its own index (level of the MultiIndex)
names = np.unique(df['words'].sum())
# Create an array of tuples, one tuple for each row of data
# Each tuple contains True if the row has that word in it, and False if it does not
values = df['words'].map(
    lambda words: np.vectorize(
        lambda word: word in words)(names)
)
# Make a MultiIndex with one boolean level per word
index = pd.MultiIndex.from_tuples(values, names=names)
# Add the MultiIndex without creating a new data frame
df.set_index(index, inplace=True)
# Find all the rows that have the word 'temperature'
xs = df.xs(True, level='temperature')
print(xs.to_string(index=False))
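For comparison, the same membership test can be sketched as a plain boolean table with one column per word instead of a MultiIndex. This is only an illustration on a shortened, made-up data set; the names vocabulary and indicators are assumptions of this sketch, and it makes no claim about performance.

```python
import pandas as pd

# Shortened, made-up data set (assumption: not the data from the question)
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "water freezes and boils",
    "a unit for measuring temperature",
]})

df["words"] = df["sentences"].str.split(" ")

# Every distinct word in the data set, in a stable order
vocabulary = sorted({word for words in df["words"] for word in words})

# One boolean column per word: True where the sentence contains that word
indicators = pd.DataFrame({
    word: df["words"].apply(lambda words, w=word: w in words)
    for word in vocabulary
})

# Find all the rows that have the word 'temperature'
print(df.loc[indicators["temperature"], "sentences"].to_string(index=False))
```

Selecting a group is then ordinary boolean indexing, so combining words (for example indicators["temperature"] & indicators["boils"]) needs no extra machinery.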
Upvotes: 1
Reputation: 109528
You can use a set so that each word is unique. First, we need to collect all the words in all of the sentences. To do this, we initialize words to an empty set, then use a list comprehension to add each lower-case word in each sentence (after splitting the sentence).
Next, we use a dictionary comprehension to build a dictionary keyed by each word in the word set. The value is the dataframe of the sentences that contain that word. These are obtained by grouping on a boolean condition, groupby(df.sentences.str.contains(word, case=False)), and then getting the group where this condition is True.
words = set()
_ = [words.add(word.lower()) for sentence in df.sentences for word in sentence.split()]
word_dict = {word: df.groupby(df.sentences.str.contains(word, case=False)).get_group(True)
             for word in words}
>>> word_dict['temperature']
sentences
0 two long pieces of metal fixed together, each ...
1 the temperature at which a liquid boils
2 a system for measuring temperature that is par...
3 a unit for measuring temperature. Measurements...
4 a system for measuring temperature in which wa...
>>> word_dict['freezes']
sentences
2 a system for measuring temperature that is par...
4 a system for measuring temperature in which wa...
>>> words
{'0',
'100',
'212\xc2\xba',
'32\xc2\xba',
'a',
'amount',
'and',
'are',
'as',
'at',
'bends',
...
To get a dictionary of index values for each word:
>>> {word: word_dict[word].index.tolist() for word in word_dict}
{'0': [2],
'100': [2],
'212\xc2\xba': [4],
'32\xc2\xba': [4],
'a': [0, 1, 2, 3, 4],
'amount': [0],
'and': [2, 4],
'are': [0, 3],
'as': [2, 3, 4],
'at': [0, 1, 2, 3, 4],
'bends': [0],
'boils': [1, 2, 4],
'both': [0],
'by': [3],
'degrees': [2],
'different': [0],
'each': [0],
'expressed': [3],
'fixed': [0],
'followed': [3],
'for': [2, 3, 4],
'freezes': [2, 4],
...
Or a matrix of boolean indicators.
>>> [df.sentences.str.contains(word, case=False).tolist() for word in word_dict]
[[False, False, True, False, True],
[False, False, False, True, False],
[True, False, False, False, False],
[False, False, True, False, False],
...
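One caveat: str.contains treats the word as a regular expression and matches substrings, which is why 'a' and 'at' list every row in the output above ('at' occurs inside 'temperature'). If whole-word groups are wanted, a minimal sketch could look like the following. It assumes pandas 0.25 or later for explode, uses a shortened, made-up data set, and leaves punctuation attached to words just as split() does; pairs and word_to_rows are names invented for this sketch.

```python
import pandas as pd

# Shortened, made-up data set (assumption: not the data from the question)
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a system for measuring temperature in which water freezes",
    "a unit for measuring temperature",
]})

# One row per (row label, word); explode() repeats the original row label
pairs = (
    df["sentences"].str.lower().str.split()
    .explode()
    .rename_axis("row")
    .reset_index(name="word")
)

# Whole-word groups: word -> sorted row labels of the sentences containing it
word_to_rows = {
    word: sorted(rows.tolist())
    for word, rows in pairs.groupby("word")["row"].unique().items()
}

# Substring matching, for comparison: "at" also matches "temperature" and "water"
substring_rows = df.index[df["sentences"].str.contains("at", case=False)].tolist()

print(word_to_rows["at"])   # rows where "at" is a word of its own
print(substring_rows)       # rows where "at" appears anywhere
```

The dataframe for a single word can then be recovered with df.loc[word_to_rows[word]].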
Upvotes: 1
Reputation: 69031
To fix your TypeError, you can change your lambda to
lambda sentence: tuple(sentence.split())
which will return a tuple instead of a list (and tuples are hashable).
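A small sketch of why this removes the error, using a made-up three-sentence data set and a hypothetical helper is_hashable. It also shows that each whole tuple is a distinct key, so grouping on that column would give one group per distinct sentence rather than one group per word.

```python
import pandas as pd

# Made-up data set (assumption: not the data from the question)
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a unit for measuring temperature",
    "water freezes and boils",
]})

def is_hashable(value):
    """Return True if value can be used as a dict key or group key."""
    try:
        hash(value)
        return True
    except TypeError:
        return False

list_ok = is_hashable("a b".split())          # lists are mutable, so not hashable
tuple_ok = is_hashable(tuple("a b".split()))  # tuples of strings are hashable

df["words"] = df["sentences"].apply(lambda sentence: tuple(sentence.split()))

# Every row has a different tuple, so there are as many keys as rows
distinct_keys = len(set(df["words"]))
print(list_ok, tuple_ok, distinct_keys)
```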
Upvotes: 0