ej_f
ej_f

Reputation: 460

Dealing with missing data in an one column text file

I have a data set like:

IM,XX
IS,YY
SG,3
OTPL,90
TTPL,90
IM,AA
IS,BB
SG,3
TTPL,50
IM,ZZ
IS,CC
OTPL,10

Each line contain a key,value pair and i need to convert this into a tabular format in order to perform some analysis. The IM variable represent the index of a row and the below parameters are the columns. The tricky part for me is to take account the possible missing values. The expected result is:

IM  IS  OTPL    SG  TTPL  
XX  YY    90     3    90    
AA  BB  null     3    50    
ZZ  CC    10  null  null   

"note the null values".

I have a solution but is not so efficient, when the data set is quite big it is't an appropriate way. I use the following strategy:

  1. With awk, add an extra index for each register (row). It creates a counter n and increment it when IM appears:

    $ awk -F, 'BEGIN{n = 0}{ if($1 == "IM"){n += 1} print n","$0}' inputdata.txt
    1,IM,XX
    1,IS,YY
    1,SG,3
    1,OTPL,90
    1,TTPL,90
    2,IM,AA
    2,IS,BB
    2,SG,3
    2,TTPL,50
    3,IM,ZZ
    3,IS,CC
    3,OTPL,10
    
  2. Next, read the previous result using pandas, apply groupby by the above indices and creates a new table applying concat to pivot subtables:

    In[1]:import pandas as pd
          gb = pd.read_csv("outdata.txt", names = ["id","key","value"]).groupby("id")
          res = pd.concat([df.pivot(index="id", columns='key', values='value') for g, df in gb])
          res
    Out[1]: 
        IM  IS OTPL   SG TTPL
    id                       
    1   XX  YY   90    3   90
    2   AA  BB  NaN    3   50
    3   ZZ  CC   10  NaN  NaN
    

The last step is very expensive.

Has anyone had a similar problem? Would be nice solve this only with the command line.

Thanks in advance!

Upvotes: 2

Views: 189

Answers (3)

ej_f
ej_f

Reputation: 460

Many thanks to @Alexander and @MaxU for your comments.

The awk pure solution had a slightly better performance than pandas. The below result was obtained with a 35500 rows dataset:

# initial solution (pandas + awk)
In [2]: %timeit ej_f_pandas()
1 loops, best of 3: 1min 14s per loop

# maxu's solution (pandas + awk)
In [3]: %timeit maxu_pandas()
1 loops, best of 3: 697 ms per loop

# alexander's solution (pandas)
In [4]: %timeit alexander_pandas()
1 loops, best of 3: 518 ms per loop

# maxu's solution (awk)
In [5]: %timeit maxu_awk()
1 loops, best of 3: 499 ms per loop

Upvotes: 1

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210932

[UPDATE] pure GAWK solution:

BEGIN {
    FS=OFS=","
    n = 0
}
{
    if($1 == "IM") {
        n++
    }
    keys[$1]++
    vals[n,$1]=$2
}
END {
    l=asorti(keys, copy)
    printf "id"
    for (i=1; i<=l; i++) {
        printf "%s%s", FS, copy[i]
    }
    print ""

    for (i=1; i<=n; i++) {
        printf "%s", i
        for (k=1; k<=l; k++) {
            printf "%s%s", FS, vals[i,copy[k]] 
        }
        print ""
    }
}

Output:

{ .data }  » awk -f prg.awk data.csv
id,IM,IS,OTPL,SG,TTPL
1,XX,YY,90,3,90
2,AA,BB,,3,50
3,ZZ,CC,10,,

[OLD] Pandas solution:

i think you can just use pivot_table() instead of groupby() + concat() :

In [105]: df
Out[105]:
    id   key val
0    1    IM  XX
1    1    IS  YY
2    1    SG   3
3    1  OTPL  90
4    1  TTPL  90
5    2    IM  AA
6    2    IS  BB
7    2    SG   3
8    2  TTPL  50
9    3    IM  ZZ
10   3    IS  CC
11   3  OTPL  10

In [106]: df.pivot_table(index='id', columns='key', values='val', aggfunc='sum', fill_value=np.nan)
Out[106]:
key  IM  IS OTPL   SG TTPL
id
1    XX  YY   90    3   90
2    AA  BB  NaN    3   50
3    ZZ  CC   10  NaN  NaN

or pivot() if you don't have duplicates (like in your sample data set):

In [109]: df.pivot(index='id', columns='key', values='val')
Out[109]:
key  IM  IS  OTPL    SG  TTPL
id
1    XX  YY    90     3    90
2    AA  BB  None     3    50
3    ZZ  CC    10  None  None

the same with NaNs instead of Nones:

In [110]: df.pivot(index='id', columns='key', values='val').fillna(np.nan)
Out[110]:
key  IM  IS OTPL   SG TTPL
id
1    XX  YY   90    3   90
2    AA  BB  NaN    3   50
3    ZZ  CC   10  NaN  NaN

Upvotes: 2

Alexander
Alexander

Reputation: 109686

def my_transform(infile, outfile):
    df = pd.read_csv(infile, header=None, sep=",", names=['id', None])
    df = df.groupby([(df.id == 'IM').cumsum(), 'id']).first().unstack()
    df.columns = df.columns.droplevel()
    df.to_csv(outfile, index=None)

infile = "..."
outfile = "..."
my_transform(infile, outfile)

>>> !cat "..."  # outfile
IM,IS,OTPL,SG,TTPL
XX,YY,90,3,90
AA,BB,,3,50
ZZ,CC,10,,

The key to this groupby is grouping on (df.id == 'IM').cumsum(), which means that the occurrence of 'IM' in the first column delineates a new group. The my_transform function takes an input file, transforms it into the desired output, and then saves the result back to a file.

df['group'] = (df.id == 'IM').cumsum()
>>> df
      id NaN  group
0     IM  XX      0
1     IS  YY      0
2     SG   3      0
3   OTPL  90      0
4   TTPL  90      0
5     IM  AA      1
6     IS  BB      1
7     SG   3      1
8   TTPL  50      1
9     IM  ZZ      2
10    IS  CC      2
11  OTPL  10      2

Upvotes: 1

Related Questions