Reputation: 1367
I have a very large dataframe and I want to generate a count of the unique values in each column. This is just a sample; there are over 20 columns in total.
CRASH_DT    CRASH_MO_NO    CRASH_DAY_NO
1/1/2013    01             01
1/1/2013    01             01
1/5/2013    03             05
My desired output is like so:
<variable = "CRASH_DT">
<code>1/1/2013</code>
<count>2</count>
<code>1/5/2013</code>
<count>1</count>
</variable>
<variable = "CRASH_MO_NO">
<code>01</code>
<count>2</count>
<code>03</code>
<count>1</count>
</variable>
<variable = "CRASH_DAY_NO">
<code>01</code>
<count>2</count>
<code>05</code>
<count>1</count>
</variable>
I have been trying to use the .sum() and .unique() functions, as suggested by many other questions on this topic that I have already looked at.
None of them seem to apply to this problem: they all say that to get the unique values from every column, you should either use a groupby or select individual columns. I have a very large number of columns (over 20), so it doesn't really make sense to write them all out by hand, e.g. df.unique['col1','col2'...'col20'].
I have tried .unique(), .value_counts(), and .count(), but I can't figure out how to apply any of them across multiple columns, as opposed to using a groupby or the approaches suggested in the links above.
My question is: how can I generate a count of the unique values in each column of a truly massive dataframe, preferably by looping over the columns themselves? (I apologize if this is a duplicate; I have looked through a lot of questions on this topic, and while they seem like they should work for my problem as well, I can't figure out exactly how to tweak them to make them work for me.)
This is my code so far:
import pyodbc
import pandas.io.sql
conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb')
sql_crash = "SELECT * FROM CRASH"
df_crash = pandas.io.sql.read_sql(sql_crash, conn)
df_c_head = df_crash.head()
df_c_desc = df_c_head.describe()
for k in df_c_desc:
    df_c_unique = df_c_desc[k].unique()
    print(df_c_unique.value_counts())  # Generates the error "'numpy.ndarray' object has no attribute 'value_counts'"
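    # (.unique() returns a plain numpy.ndarray, which has no value_counts() method, hence the error.)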
Upvotes: 2
Views: 3697
Reputation: 21888
Here is an answer inspired by the answer to this question, but I don't know whether it will scale well enough in your case.
import pandas as pd

df = pd.DataFrame({'CRASH_DAY_NO': [1, 1, 5, 2, 2],
                   'CRASH_DT': ['10/2/2014 5:00:08 PM',
                                '5/28/2014 1:29:28 PM',
                                '5/28/2014 1:29:28 PM',
                                '7/14/2014 5:42:03 PM',
                                '6/3/2014 10:33:22 AM'],
                   'CRASH_ID': [1486150, 1486152, 1486224, 1486225, 1486226],
                   'SEG_PT_LRS_MEAS': [79.940226960000004,
                                       297.80989999000002,
                                       140.56460290999999,
                                       759.43600000000004,
                                       102.566036],
                   'SER_NO': [1, 3, 4, 5, 6]})
df = df.apply(lambda x: x.value_counts(sort=False))
df.index = df.index.astype(str)

# Transforming to XML by hand ...
def func(row):
    xml = ['<variable = "{0}">'.format(row.name)]
    for field in row.index:
        if not pd.isnull(row[field]):
            xml.append(' <code>{0}</code>'.format(field))
            xml.append(' <count>{0}</count>'.format(row[field]))
    xml.append('</variable>')
    return '\n'.join(xml)

print('\n'.join(df.apply(func, axis=0)))
<variable = "CRASH_DAY_NO">
<code>1</code>
<count>2.0</count>
<code>2</code>
<count>2.0</count>
<code>5</code>
<count>1.0</count>
</variable>
<variable = "CRASH_DT">
<code>5/28/2014 1:29:28 PM</code>
<count>2.0</count>
<code>7/14/2014 5:42:03 PM</code>
<count>1.0</count>
<code>10/2/2014 5:00:08 PM</code>
<count>1.0</count>
<code>6/3/2014 10:33:22 AM</code>
<count>1.0</count>
</variable>
....
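Note that the counts come out as floats (2.0, 1.0): value_counts() leaves NaN for values that never occur in a given column, which forces the combined frame to a float dtype. If you want plain integer counts in the XML, one small tweak (not shown above) is to cast inside func:
xml.append(' <count>{0}</count>'.format(int(row[field])))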
Upvotes: 2
Reputation: 353059
I would loop over value_counts().items() per column:
>>> df["CRASH_DAY_NO"].value_counts()
01 2
05 1
dtype: int64
>>> df["CRASH_DAY_NO"].value_counts().items()
<zip object at 0x7fabf49f05c8>
>>> for value, count in df["CRASH_DAY_NO"].value_counts().items():
... print(value, count)
...
01 2
05 1
So something like
def vc_xml(df):
    for col in df:
        yield '<variable = "{}">'.format(col)
        for k, v in df[col].value_counts().items():
            yield " <code>{}</code>".format(k)
            yield " <count>{}</count>".format(v)
        yield '</variable>'

with open("out.xml", "w") as fp:
    for line in vc_xml(df):
        fp.write(line + "\n")
gives me
<variable = "CRASH_DAY_NO">
<code>01</code>
<count>2</count>
<code>05</code>
<count>1</count>
</variable>
<variable = "CRASH_DT">
<code>1/1/2013</code>
<count>2</count>
<code>1/5/2013</code>
<count>1</count>
</variable>
<variable = "CRASH_MO_NO">
<code>01</code>
<count>2</count>
<code>03</code>
<count>1</count>
</variable>
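One caveat: if your values can contain characters like &, < or >, the raw format() calls will produce invalid XML. A minimal variant of the generator using the standard library's xml.sax.saxutils.escape would look like this (vc_xml_escaped is just an illustrative name):
from xml.sax.saxutils import escape

def vc_xml_escaped(df):
    # Same idea as vc_xml above, but escapes &, < and > in the emitted values.
    for col in df:
        yield '<variable = "{}">'.format(escape(str(col)))
        for k, v in df[col].value_counts().items():
            yield " <code>{}</code>".format(escape(str(k)))
            yield " <count>{}</count>".format(v)
        yield '</variable>'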
Upvotes: 5