Alexander
Alexander

Reputation: 4645

R's split&apply&combine routine in Python

Hi I have been long time R user and slowly shifting to Python.

I have some split&apply&combine routine in R but I have a difficulty to find python equivalent of these functions like grepl, paste, select etc. What I am trying to do is in step by step

  1. read files with the extension .txt in the folder
  2. Delete the rows with 'TRIAL' string
  3. select some particular columns and name this columns
  4. make sequence of groups in a new column 'No' when the 2.nd column value 0.1 reappear again
  5. multiply this second column with 2 and make a new column
  6. add file name as a new column for each file

and finally make a data.frame output.

         tt      m1      m2      m3      m4      m5      m6  No  tt2 file_name
1       0.10 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709   1  0.2   sum_W_1
2       0.20 -0.0121  0.0002 -0.9898 -0.0364 -0.0027 -0.9925   1  0.4   sum_W_1
3       0.30  0.0193 -0.0068 -0.9884  0.0040  0.0139 -0.9782   1  0.6   sum_W_1
4       0.40 -0.0157  0.0183 -0.9879 -0.0315 -0.0311 -0.9908   1  0.8   sum_W_1
5       0.50 -0.0402  0.0300 -0.9832 -0.0093  0.0269 -0.9781   1  1.0   sum_W_1

here is the example [file][2]

head(sum_data)

TRIAL :            1        3331        9091
  TRIAL :            2  1384786531   278055555
    2     0.10     0.000E+00 -0.0047 -0.0168 -0.9938    -0.0087 -0.0105 -0.9709     0.0035  0.0079 -0.9754     0.0081  0.0023  0.9997      -0.135324E-09    0.278754E-01
    2     0.20     0.000E+00 -0.0121  0.0002 -0.9898    -0.0364 -0.0027 -0.9925    -0.0242 -0.0050 -0.9929     0.0029 -0.0023  0.9998      -0.133521E-09    0.425567E-01
    2     0.30     0.000E+00  0.0193 -0.0068 -0.9884     0.0040  0.0139 -0.9782    -0.0158  0.0150 -0.9814     0.0054 -0.0008  0.9997      -0.134103E-09    0.255356E-01
    2     0.40     0.000E+00 -0.0157  0.0183 -0.9879    -0.0315 -0.0311 -0.9908    -0.0314 -0.0160 -0.9929     0.0040  0.0010  0.9998      -0.134819E-09    0.257300E-01
    2     0.50     0.000E+00 -0.0402  0.0300 -0.9832    -0.0093  0.0269 -0.9781    -0.0326  0.0247 -0.9802     0.0044 -0.0010  0.9997      -0.131515E-09    0.440350E-01

What I have tried,

import os
import glob # damla, topak
import pandas as pd
import numpy

filelist=glob.glob('*.txt')
print(filelist)

names_cols=['tt','m1','m2','m3','m4','m5','m6','m7']

for file in filelist:
    df=pd.read_table(file, header=None,skiprows=7,skipfooter=0,names=names_cols,usecols=[1,5,6,7,8,9,10,11])
    df_del=df[df.V1.str.contains('TRIAL')==False]
    concatdf=pd.concat(df_del,axis=0)

CParserError: Too many columns specified: expected 7 and found 1

Even this simple stage I stuck!

please help to finish this script!

Upvotes: 0

Views: 110

Answers (1)

Parfait
Parfait

Reputation: 107697

Consider the following translation. And because R's lapply saves dataframes to a list you will need to do the counterpart in Python appending dataframes to an initialized list:

names_cols = ['tt','m1','m2','m3','m4','m5','m6','m7']
v1 = ["W_1","B_1"]

dfs = []
for file in filelist:
    df = pd.read_table(file, header=None, skiprows=7, skipfooter=0, sep="\s+",
                       names=names_cols, usecols=[1,5,6,7,8,9,10,11])

    df = df[~df['tt'].str.contains('TRIAL|:')]        # KEEP ROWS WITHOUT TRIAL AND COLON
    df['tt'] = df['tt'].astype(float)                 # CONVERT TO FLOAT COLUMN
    df['tt2'] = df['m2'] * 2                          # MULTIPLY BY 2 (DOES NOT CHANGE SIGN)
    df['No'] = (df['tt'].cumsum()==0.1).astype(int)   # BOOLEAN OF A SERIES CUMSUM()        
    df['file_name'] = file[0:3]+'_'+ v1[0]            # EXTRACT FIRST THREE LETTERS
    dfs.append(df)                                    # APPEND TO LIST

print(dfs[0].head())
#     tt      m1      m2      m3      m4      m5      m6      m7     tt2  No file_name
# 0  0.6 -0.9872  0.0119 -0.0119 -0.9883  0.0306 -0.0259 -0.9903  0.0238   0   Spl_W_1
# 1  0.7 -0.9877 -0.0382 -0.0227 -0.9803 -0.0293 -0.0252 -0.9864 -0.0764   0   Spl_W_1
# 2  0.8 -0.9859 -0.0256  0.0218 -0.9829 -0.0323 -0.0098 -0.9870 -0.0512   0   Spl_W_1
# 3  0.9 -0.9838 -0.0030 -0.0032 -0.9844  0.0048 -0.0206 -0.9866 -0.0060   0   Spl_W_1
# 4  1.0 -0.9885 -0.0346 -0.0061 -0.9865 -0.0259 -0.0105 -0.9887 -0.0692   0   Spl_W_1

And for shorter lines, pandas' assign can best serve as counterpart to R dplyr's mutate with multiple column assignment but do be aware multiple assigned columns are ordered alphabetically:

for file in filelist:
    df = pd.read_table(file, header=None, skiprows=7, skipfooter=0, sep="\s+",
                       names=names_cols, usecols=[1,5,6,7,8,9,10,11])

    df = df[~df['tt'].str.contains('TRIAL|:')].assign(tt2 = df['m2'] * 2, file_name = file[0:3]+'_'+ v1[0])
    df = df.assign(tt = df['tt'].astype(float), No = (df['tt'].astype(float).cumsum()==0.1).astype(int))
    dfs.append(df)

By the way, why slowly shift from R to Python? Use both awesome languages!

Upvotes: 1

Related Questions