Reputation: 28437
I have a dataframe that consists of the columns node, precedingWord, and comp. node contains many identical values (alphabetically sorted), comp contains many identical values as well but scrambled, and precedingWord can be all kinds of words, some identical as well.
What I want to do now is create some sort of cross-section/frequency list that shows the frequency of the component and the preceding word, linked to the node.
Let's say this is my df:
node     precedingWord  comp
banana   the            lel
banana   a              lel
banana   a              lal
coconut  some           lal
coconut  few            lil
coconut  the            lel
I am expecting a frequency list that shows each unique node and the number of times each value is found in the other columns, given matching criteria, e.g.
det1 = a
det2 = the
comp1 = lel
comp2 = lil
comp3 = lal
expected output:
node     det1  det2  unspecified  comp1  comp2  comp3
banana   2     1     0            2      0      1
coconut  0     1     2            1      1      1
I already did it for one variable, but I don't know how to get the comp columns in place:
det1 = ["a"]
det2 = ["the"]
df.loc[df.precedingWord.isin(det1), "determiner"] = "det1"
df.loc[df.precedingWord.isin(det2), "determiner"] = "det2"
df.loc[~df.precedingWord.isin(det1 + det2), "determiner"] = "unspecified"
# Create crosstab of the node and determiner
freqDf = pd.crosstab(df.node, df.determiner)
I got this answer from here. If anyone could explain what loc does, that would be a lot of help as well.
With Andy's answer in mind, I tried the following. Note that "precedingWord" has been replaced by "gender", which holds only the values neuter, non_neuter, and unspecified.
import numpy as np
import pandas as pd

def frequency_list():
    # Define the content of the gender classes
    neuter = ["het"]
    non_neuter = ["de"]
    # Add a `gender` column to df
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[~df.precedingWord.isin(neuter + non_neuter), "gender"] = "unspecified"
    g = df.groupby("node")
    # Create a crosstab of the node with gender and component
    freqDf = pd.concat([g["component"].value_counts().unstack(1),
                        g["gender"].value_counts().unstack(1)])
    # Reset the index, starting from 1, not the default 0!
    # Crosstabs don't come with an index, so we first set one with
    # `reset_index` and then alter it.
    freqDf.reset_index(inplace=True)
    freqDf.index = np.arange(1, len(freqDf) + 1)
    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
The output is close to what I want, but not exactly: each node appears twice (first with the counts for component, then the same nodes are displayed again for gender), and the counts come out as floats (.0). What I would want, then, is a single table with one row per node, as in the expected output above.
Do note that I'm looking for the most efficient answer. I'm actually working with loads and loads of data, so each second per loop counts!
Upvotes: 4
Views: 2786
Reputation: 13116
One can split your question into at least three: how to count the frequencies, what loc is doing (see the end of this answer), and how to make it fast. Pandas provides a speedup for some operations, so try the library implementation before resorting to loops (see below).
1. With plain pandas:
df = pd.DataFrame({"det":["a","the","a","a","a", "the"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
"you will need a dummy variable:"
df["counts"] = 1
"you probably need to reset the index"
df_counts = df.groupby(["det","word"]).agg("count").reset_index()
# det word counts
#0 a cat 2
#1 a pet 3
#2 the pet 1
"and pivot it"
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det a the
#word
#cat 2 0
#pet 3 1
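As an aside (a sketch of mine, not part of the answer above): the dummy counts column can be avoided entirely with groupby(...).size(), which should produce the same table:
# same counts without the dummy column
df_counts = df.groupby(["det", "word"]).size().reset_index(name="counts")
df_counts.pivot(index="word", columns="det", values="counts").fillna(0)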
Example with two columns:
df = pd.DataFrame([['idee', 'het', 'lel', 1],
                   ['idee', 'het', 'lel', 1],
                   ['idee', 'de', 'lal', 1],
                   ['functie', 'de', 'lal', 1],
                   ['functie', 'de', 'lal', 1],
                   ['functie', 'en', 'lil', 1],
                   ['functie', 'de', 'lel', 1],
                   ['functie', 'de', 'lel', 1]],
                  columns=['node', 'precedingWord', 'comp', 'counts'])
df["counts"] = 1
df_counts = df.groupby(["node","precedingWord", "comp"]).agg("count").reset_index()
df_counts
# node precedingWord comp counts
#0 functie de lal 2
#1 functie de lel 2
#2 functie en lil 1
#3 idee de lal 1
#4 idee het lel 2
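To spread these two-key counts into the wide, one-row-per-node layout from the question, one option (my sketch, not from the original answer) is to pivot twice, summing out the other key each time, and concatenate:
# per-node counts of precedingWord values and of comp values, side by side
wide = pd.concat([
    df_counts.pivot_table(index="node", columns="precedingWord",
                          values="counts", aggfunc="sum", fill_value=0),
    df_counts.pivot_table(index="node", columns="comp",
                          values="counts", aggfunc="sum", fill_value=0),
], axis=1)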
2. Using Counter
df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
acounter = Counter( (tuple(x) for x in df.as_matrix()) )
#Counter({('a', 'cat'): 2, ('a', 'pet'): 2, ('the', 'pet'): 2})
df_counts = pd.DataFrame(list(zip([y[0] for y in acounter.keys()], [y[1] for y in acounter.keys()], acounter.values())), columns=["det", "word", "counts"])
# det word counts
#0 a cat 2
#1 the pet 1
#2 a pet 3
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det a the
#word
#cat 2 0
#pet 3 1
This one is a bit faster in my case than pure pandas (52.6 µs vs 92.9 µs per loop for grouping; not counting pivoting).
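As a side note (my simplification, not in the original answer), the double pass over acounter.keys() can be avoided by unpacking the tuple keys directly:
# unpack the (det, word) keys and their counts in one pass
df_counts = pd.DataFrame([(det, word, n) for (det, word), n in acounter.items()],
                         columns=["det", "word", "counts"])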
3. As I understand it, this is a natural language processing problem. You can try combining all your data into a string and using CountVectorizer from sklearn with ngram_range=(2, 2). Something like:
df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
from sklearn.feature_extraction.text import CountVectorizer
listofpairs = []
for _, row in df.iterrows():
listofpairs.append(" ".join(row))
countvect = CountVectorizer(ngram_range=(2,2), min_df = 0.0, token_pattern='(?u)\\b\\w+\\b')
sparse_counts = countvect.fit_transform(listofpairs)
print("* input list:\n",listofpairs)
print("* array of counts:\n",sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n",countvect.vocabulary_)
counter_keys = [x[1:] for x in sorted(tuple([v] + k.split(" "))
                                      for k, v in countvect.vocabulary_.items())]
counter_values = np.sum(sparse_counts.toarray(), 0)
df_counts = pd.DataFrame([(x[0], x[1], y) for x, y in zip(counter_keys, counter_values)],
                         columns=["det", "word", "counts"])
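Running this should yield the same counts as before (my expected output, assuming sklearn assigns vocabulary indices alphabetically):
# det word counts
#0 a cat 2
#1 a pet 3
#2 the pet 1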
Two options for gluing the two count frames together:
1. concat:
df1 = df1.set_index("word")
df2 = df2.set_index("word")
dfout = pd.concat([df1, df2], axis=1)
2. merge:
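A minimal sketch of the merge variant (assuming df1 and df2 still carry word as a regular column, i.e. before any set_index):
# an outer join keeps words that appear in only one of the two frames
dfout = df1.merge(df2, on="word", how="outer", suffixes=("_1", "_2"))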
What is loc doing? It indexes rows (with one argument) or row, column (with two arguments). It uses row/column names or boolean indexing (as in your case for the rows).
If you have only one article per gender, you can use a direct comparison instead of the isin operation, which might bring some speedup: replace
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
with
indices_neutral = df["precedingWord"] == "het"
df.loc[indices_neutral, "gender"] = "neuter"
or, shorter but less readable:
df.loc[df["precedingWord"] == "het", "gender"] = "neuter"
Upvotes: 1
Reputation: 375415
Update: This is a crosstab:
In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])
In [12]: df1
Out[12]:
precedingWord a few some the
node
banana 2 0 0 1
coconut 0 1 1 1
In [13]: df2 = pd.crosstab(df['node'], df['comp'])
which is clearly cleaner (and a more efficient algorithm for large data).
Then glue them together with a concat along axis=1 (i.e. add them as more columns rather than as more rows).
In [14]: pd.concat([df1, df2], axis=1, keys=['precedingWord', 'comp'])
Out[14]:
precedingWord comp
a few some the lal lel lil
node
banana 2 0 0 1 1 2 0
coconut 0 1 1 1 1 1 1
I would probably leave it like this (as a MultiIndex); if you want it flattened, just don't pass the keys (though there may be an issue with duplicated words):
In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
a few some the lal lel lil
node
banana 2 0 0 1 1 2 0
coconut 0 1 1 1 1 1 1
Aside: it would be nice if concat didn't require that the column names be passed in explicitly (as the keys kwarg) when they exist...
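If the duplicated-words issue does bite, one workaround (a sketch of mine, not from the answer) is to prefix each crosstab's columns before concatenating, so the flattened names cannot collide:
pd.concat([df1.add_prefix("precedingWord_"), df2.add_prefix("comp_")], axis=1)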
You can use a groupby with value_counts:
In [21]: g = df.groupby("node")
In [22]: g["comp"].value_counts()
Out[22]:
node comp
banana lel 2
lal 1
coconut lal 1
lel 1
lil 1
dtype: int64
In [23]: g["precedingWord"].value_counts()
Out[23]:
node precedingWord
banana a 2
the 1
coconut few 1
some 1
the 1
dtype: int64
Putting this in a single frame is a little trickier:
In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
a few lal lel lil some the
node
banana NaN NaN 1 2 NaN NaN NaN
coconut NaN NaN 1 1 1 NaN NaN
banana 2 NaN NaN NaN NaN NaN 1
coconut NaN 1 NaN NaN NaN 1 1
In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
a few lal lel lil some the
node
banana 0 0 1 2 0 0 0
coconut 0 0 1 1 1 0 0
banana 2 0 0 0 0 0 1
coconut 0 1 0 0 0 1 1
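To get one row per node rather than the stacked blocks above, the same pieces can be concatenated along the columns instead (a sketch; this reproduces the crosstab result from the update):
pd.concat([g["comp"].value_counts().unstack(1),
           g["precedingWord"].value_counts().unstack(1)], axis=1).fillna(0)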
You can map the columns to det1, det2, etc. before doing the concat, for example if you had the mapping as a dictionary.
In [31]: res = g["comp"].value_counts().unstack(1)
In [32]: res
Out[32]:
comp lal lel lil
node
banana 1 2 NaN
coconut 1 1 1
In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)
In [34]: res
Out[34]:
det1 det2 det3
node
banana 1 2 NaN
coconut 1 1 1
Alternatively you could use a list comprehension (if you didn't have the dict or have particular labels in mind):
In [41]: res = g["comp"].value_counts().unstack(1)
In [42]: res.columns = ['det%d' % (i + 1) for i in range(len(res.columns))]
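For completeness, here is a sketch of mine (combining the steps above; the label mappings are taken from the question) that produces the exact layout asked for:
det_map = {"a": "det1", "the": "det2"}
comp_map = {"lel": "comp1", "lil": "comp2", "lal": "comp3"}
# words not covered by det_map fall into "unspecified"
dets = pd.crosstab(df["node"], df["precedingWord"].map(det_map).fillna("unspecified"))
comps = pd.crosstab(df["node"], df["comp"].map(comp_map))
res = pd.concat([dets, comps], axis=1)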
Upvotes: 3