Ivo

Reputation: 4200

cast data frame to long, dropping rows, expanding others

I have a data frame to pivot to longer:

import pandas as pd
import io

_1 = pd.read_csv(io.StringIO(
    """date;        origin; val_one; val_two; aaa; bbb; ccc; ddd; eee; fff
    10/11/2009;        aaa;       1;       0;   0;   0;   0;   0;   1;   0
    10/11/2009;        bbb;       0;       1;   1;   0;   0;   0;   0;   1
    10/11/2009;        ccc;       0;       1;   0;   0;   0;   0;   0;   0
    10/11/2009;        ddd;       0;       2;   0;   1;   1;   1;   0;   0"""),
    sep=";").set_axis(['date', 'origin', 'val_one', 'val_two',
            'aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'], axis=1)

I now want to create origin-target dyads (the targets are the columns aaa:fff). Rows where no target exists (i.e. none of the target columns is 1, e.g. the third row, index 2) should be dropped; rows where more than one target exists (e.g. the second row, index 1, where both aaa and fff are 1) should be turned into two rows. The expected output is:

_2_targ = pd.read_csv(io.StringIO(
"""date;        origin; val_one; val_two; target
10/11/2009;        aaa;       1;       0;    eee
10/11/2009;        bbb;       0;       1;    aaa
10/11/2009;        bbb;       0;       1;    fff
10/11/2009;        ddd;       0;       2;    bbb
10/11/2009;        ddd;       0;       2;    ccc
10/11/2009;        ddd;       0;       2;    ddd"""),
sep=';').set_axis(['date', 'origin', 'val_one', 'val_two', 'target'], axis=1)

I have tried using pd.melt to no avail (see below): it creates all dyads, even the ones I don't want to keep (those where the melted target value is 0).

_2 = pd.melt(_1,
             id_vars=['date', 'origin', 'val_one', 'val_two'],
             value_vars=['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'],
             var_name='target', value_name='tmp')

What am I missing?

Upvotes: 0

Views: 30

Answers (1)

mitoRibo

Reputation: 4548

I believe you can melt exactly as you are doing, and then drop the rows from your long_df that have a value of 0 for tmp.

import pandas as pd
import io

#nice reproducible input!
df1 = pd.read_csv(io.StringIO(
    """date;        origin; val_one; val_two; aaa; bbb; ccc; ddd; eee; fff
    10/11/2009;        aaa;       1;       0;   0;   0;   0;   0;   1;   0
    10/11/2009;        bbb;       0;       1;   1;   0;   0;   0;   0;   1
    10/11/2009;        ccc;       0;       1;   0;   0;   0;   0;   0;   0
    10/11/2009;        ddd;       0;       2;   0;   1;   1;   1;   0;   0"""),
    sep=";").set_axis(['date', 'origin', 'val_one', 'val_two',
            'aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'], axis=1)

#Melt exactly the same as you did
long_df = pd.melt(
    df1,
    id_vars=['date', 'origin', 'val_one', 'val_two'],
    value_vars=['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'],
    var_name='target', value_name='tmp'
)

#Filter out rows where tmp is 0, sort to match your output, and drop the tmp column
long_df = (long_df[long_df['tmp'].gt(0)]
           .sort_values('origin')
           .drop(columns='tmp')
           .reset_index(drop=True))
long_df

The output is nearly as you show above, but I think you're forgetting the bbb:fff row:

             date       origin  val_one  val_two target
0      10/11/2009          aaa        1        0    eee
1      10/11/2009          bbb        0        1    aaa
2      10/11/2009          bbb        0        1    fff
3      10/11/2009          ddd        0        2    bbb
4      10/11/2009          ddd        0        2    ccc
5      10/11/2009          ddd        0        2    ddd

I've also renamed your variables, but that's obviously not important.
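
For comparison, here is a minimal alternative sketch of the same filter-after-reshape idea, using set_index plus stack instead of a temporary tmp column. The names targets, stacked, hit and alt are just illustrative, and it assumes the same df1 as above.

#Alternative sketch: stack the target columns, keep only the 1s,
#and drop the helper column that holds the stacked values
targets = ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff']

stacked = (df1.set_index(['date', 'origin', 'val_one', 'val_two'])[targets]
              .rename_axis(columns='target')  #name the stacked level 'target'
              .stack())

alt = (stacked[stacked.gt(0)]
       .reset_index(name='hit')  #'hit' holds the stacked 0/1 values
       .drop(columns='hit')
       .sort_values('origin')
       .reset_index(drop=True))

Either way, the key step is the boolean filter on the reshaped values, not the reshape itself.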

Upvotes: 1
