Reputation: 1367
I have a very large dataframe and I want to generate a count of the unique values in each column. This is just a sample; there are over 20 columns in total.
CRASH_DT    CRASH_MO_NO    CRASH_DAY_NO
1/1/2013    01             01
1/1/2013    01             01
1/5/2013    03             05
My desired output is like so:
<variable = "CRASH_DT">
<code>1/1/2013</code>
<count>2</count>
<code>1/5/2013</code>
<count>1</count>
</variable>
<variable = "CRASH_MO_NO">
<code>01</code>
<count>2</count>
<code>03</code>
<count>1</count>
</variable>
<variable = "CRASH_DAY_NO">
<code>01</code>
<count>2</count>
<code>05</code>
<count>1</count>
</variable>
I have been trying to use the .sum() and .unique() functions, as suggested by many other questions on this topic that I have already looked at.
None of them seem to apply to this problem: they all say that to get the unique values from every column, you should either use a groupby or select individual columns. I have a very large number of columns (over 20), so it doesn't really make sense to write them all out by hand, e.g. df.unique['col1','col2'...'col20'].
I have tried .unique(), .value_counts(), and .count(), but I can't figure out how to apply any of them across multiple columns, as opposed to using a groupby or the approaches suggested in the links above.
My question is: how can I generate a count of the unique values in each column of a truly massive dataframe, preferably by looping over the columns themselves? (I apologize if this is a duplicate; I have looked through a lot of questions on this topic, and while they seem like they should work for my problem as well, I can't figure out exactly how to tweak them to make them work for me.)
This is my code so far:
import pyodbc
import pandas.io.sql
conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb')
sql_crash = "SELECT * FROM CRASH"
df_crash = pandas.io.sql.read_sql(sql_crash, conn)
df_c_head = df_crash.head()
df_c_desc = df_c_head.describe()
for k in df_c_desc:
    df_c_unique = df_c_desc[k].unique()
    print(df_c_unique.value_counts())  # Generates the error "'numpy.ndarray' object has no attribute 'value_counts'"
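    # (.unique() returns a plain numpy.ndarray, which has no value_counts() method, hence the error.)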
Upvotes: 2
Views: 3697
Reputation: 21888
Here is an answer inspired by the answer to this question, but I don't know whether it will scale well enough in your case.
import pandas as pd

df = pd.DataFrame({'CRASH_DAY_NO': [1, 1, 5, 2, 2],
                   'CRASH_DT': ['10/2/2014 5:00:08 PM',
                                '5/28/2014 1:29:28 PM',
                                '5/28/2014 1:29:28 PM',
                                '7/14/2014 5:42:03 PM',
                                '6/3/2014 10:33:22 AM'],
                   'CRASH_ID': [1486150, 1486152, 1486224, 1486225, 1486226],
                   'SEG_PT_LRS_MEAS': [79.940226960000004,
                                       297.80989999000002,
                                       140.56460290999999,
                                       759.43600000000004,
                                       102.566036],
                   'SER_NO': [1, 3, 4, 5, 6]})
df = df.apply(lambda x: x.value_counts(sort=False))
df.index = df.index.astype(str)

# Transforming to XML by hand ...
def func(row):
    xml = ['<variable = "{0}">'.format(row.name)]
    for field in row.index:
        if not pd.isnull(row[field]):
            xml.append(' <code>{0}</code>'.format(field))
            xml.append(' <count>{0}</count>'.format(row[field]))
    xml.append('</variable>')
    return '\n'.join(xml)

print('\n'.join(df.apply(func, axis=0)))
<variable = "CRASH_DAY_NO">
<code>1</code>
<count>2.0</count>
<code>2</code>
<count>2.0</count>
<code>5</code>
<count>1.0</count>
</variable>
<variable = "CRASH_DT">
<code>5/28/2014 1:29:28 PM</code>
<count>2.0</count>
<code>7/14/2014 5:42:03 PM</code>
<count>1.0</count>
<code>10/2/2014 5:00:08 PM</code>
<count>1.0</count>
<code>6/3/2014 10:33:22 AM</code>
<count>1.0</count>
</variable>
....
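Note that the counts come out as floats (2.0, 1.0): value_counts() leaves NaN for values that never occur in a given column, which forces the combined frame to a float dtype. If you want plain integer counts in the XML, one small tweak (not shown above) is to cast inside func:
xml.append(' <count>{0}</count>'.format(int(row[field])))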
Upvotes: 2
Reputation: 353059
I would loop over value_counts().items() per column:
>>> df["CRASH_DAY_NO"].value_counts()
01 2
05 1
dtype: int64
>>> df["CRASH_DAY_NO"].value_counts().items()
<zip object at 0x7fabf49f05c8>
>>> for value, count in df["CRASH_DAY_NO"].value_counts().items():
... print(value, count)
...
01 2
05 1
So something like
def vc_xml(df):
    for col in df:
        yield '<variable = "{}">'.format(col)
        for k, v in df[col].value_counts().items():
            yield " <code>{}</code>".format(k)
            yield " <count>{}</count>".format(v)
        yield '</variable>'

with open("out.xml", "w") as fp:
    for line in vc_xml(df):
        fp.write(line + "\n")
gives me
<variable = "CRASH_DAY_NO">
<code>01</code>
<count>2</count>
<code>05</code>
<count>1</count>
</variable>
<variable = "CRASH_DT">
<code>1/1/2013</code>
<count>2</count>
<code>1/5/2013</code>
<count>1</count>
</variable>
<variable = "CRASH_MO_NO">
<code>01</code>
<count>2</count>
<code>03</code>
<count>1</count>
</variable>
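One caveat: if your values can contain characters like &, < or >, the raw format() calls will produce invalid XML. A minimal variant of the generator using the standard library's xml.sax.saxutils.escape would look like this (vc_xml_escaped is just an illustrative name):
from xml.sax.saxutils import escape

def vc_xml_escaped(df):
    # Same idea as vc_xml above, but escapes &, < and > in the emitted values.
    for col in df:
        yield '<variable = "{}">'.format(escape(str(col)))
        for k, v in df[col].value_counts().items():
            yield " <code>{}</code>".format(escape(str(k)))
            yield " <count>{}</count>".format(v)
        yield '</variable>'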
Upvotes: 5