racekiller
racekiller

Reputation: 105

Get List of Unique String values per column in a dataframe using python

here I go with another question

I have a large dataframe about 20 columns by 400.000 rows. In this dataset I can not have string since the software that will process the data only accepts numeric and nulls.

So they way I am thinking it might work is following. 1. go thru each column 2. Get List of unique strings 3. Replace each string with a value from 0 to X 4. repeat the process for the next column 5. Repeat for the next dataframe

This is how the dataframe looks like

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 NORMAL      NORMAL      1050
7-Feb-15    0:01:00 NORMAL      NORMAL      1050
7-Feb-15    0:02:00 NORMAL      HIGH        1050
7-Feb-15    0:03:00 HIGH        NORMAL      1050
7-Feb-15    0:04:00 LOW         NORMAL      1050
7-Feb-15    0:05:00 NORMAL      LOW         1050

This is the result expected

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 0           0           1050
7-Feb-15    0:01:00 0           0           1050
7-Feb-15    0:02:00 0           1           1050
7-Feb-15    0:03:00 1           0           1050
7-Feb-15    0:04:00 2           0           1050
7-Feb-15    0:05:00 0           2           1050

enter image description here

I am using python 3.5 and the latest version of Pandas

Thanks in advance

JV

Upvotes: 1

Views: 1381

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210872

Solution:

# try to convert all columns to numbers...
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
df[cols] = st['cat'].unstack()

del st

Demo:

In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050

first we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

now we can factorize stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2

assign unstacked result back to original DF (to object columns):

In [241]: df[cols] = st['cat'].unstack()

In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050

Explanation:

In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
  FNRHP306H FNRHP306HC
0    NORMAL     NORMAL
1    NORMAL     NORMAL
2    NORMAL       HIGH
3      HIGH     NORMAL
4       LOW     NORMAL
5    NORMAL        LOW

Upvotes: 1

Related Questions