R's split&apply&combine routine in Python

Question

Hi I have been long time R user and slowly shifting to Python.

I have some split&apply&combine routine in R but I have a difficulty to find python equivalent of these functions like grepl, paste, select etc. What I am trying to do is in step by step

read files with the extension .txt in the folder
Delete the rows with 'TRIAL' string
select some particular columns and name this columns
make sequence of groups in a new column 'No' when the 2.nd column value 0.1 reappear again
multiply this second column with 2 and make a new column
add file name as a new column for each file

and finally make a data.frame output.

         tt      m1      m2      m3      m4      m5      m6  No  tt2 file_name
1       0.10 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709   1  0.2   sum_W_1
2       0.20 -0.0121  0.0002 -0.9898 -0.0364 -0.0027 -0.9925   1  0.4   sum_W_1
3       0.30  0.0193 -0.0068 -0.9884  0.0040  0.0139 -0.9782   1  0.6   sum_W_1
4       0.40 -0.0157  0.0183 -0.9879 -0.0315 -0.0311 -0.9908   1  0.8   sum_W_1
5       0.50 -0.0402  0.0300 -0.9832 -0.0093  0.0269 -0.9781   1  1.0   sum_W_1

here is the example [file][2]

head(sum_data)

TRIAL :            1        3331        9091
  TRIAL :            2  1384786531   278055555
    2     0.10     0.000E+00 -0.0047 -0.0168 -0.9938    -0.0087 -0.0105 -0.9709     0.0035  0.0079 -0.9754     0.0081  0.0023  0.9997      -0.135324E-09    0.278754E-01
    2     0.20     0.000E+00 -0.0121  0.0002 -0.9898    -0.0364 -0.0027 -0.9925    -0.0242 -0.0050 -0.9929     0.0029 -0.0023  0.9998      -0.133521E-09    0.425567E-01
    2     0.30     0.000E+00  0.0193 -0.0068 -0.9884     0.0040  0.0139 -0.9782    -0.0158  0.0150 -0.9814     0.0054 -0.0008  0.9997      -0.134103E-09    0.255356E-01
    2     0.40     0.000E+00 -0.0157  0.0183 -0.9879    -0.0315 -0.0311 -0.9908    -0.0314 -0.0160 -0.9929     0.0040  0.0010  0.9998      -0.134819E-09    0.257300E-01
    2     0.50     0.000E+00 -0.0402  0.0300 -0.9832    -0.0093  0.0269 -0.9781    -0.0326  0.0247 -0.9802     0.0044 -0.0010  0.9997      -0.131515E-09    0.440350E-01

What I have tried,

import os
import glob # damla, topak
import pandas as pd
import numpy

filelist=glob.glob('*.txt')
print(filelist)

names_cols=['tt','m1','m2','m3','m4','m5','m6','m7']

for file in filelist:
    df=pd.read_table(file, header=None,skiprows=7,skipfooter=0,names=names_cols,usecols=[1,5,6,7,8,9,10,11])
    df_del=df[df.V1.str.contains('TRIAL')==False]
    concatdf=pd.concat(df_del,axis=0)

CParserError: Too many columns specified: expected 7 and found 1

Even this simple stage I stuck!

please help to finish this script!

Parfait · Accepted Answer

Consider the following translation. And because R's lapply saves dataframes to a list you will need to do the counterpart in Python appending dataframes to an initialized list:

names_cols = ['tt','m1','m2','m3','m4','m5','m6','m7']
v1 = ["W_1","B_1"]

dfs = []
for file in filelist:
    df = pd.read_table(file, header=None, skiprows=7, skipfooter=0, sep="\s+",
                       names=names_cols, usecols=[1,5,6,7,8,9,10,11])

    df = df[~df['tt'].str.contains('TRIAL|:')]        # KEEP ROWS WITHOUT TRIAL AND COLON
    df['tt'] = df['tt'].astype(float)                 # CONVERT TO FLOAT COLUMN
    df['tt2'] = df['m2'] * 2                          # MULTIPLY BY 2 (DOES NOT CHANGE SIGN)
    df['No'] = (df['tt'].cumsum()==0.1).astype(int)   # BOOLEAN OF A SERIES CUMSUM()        
    df['file_name'] = file[0:3]+'_'+ v1[0]            # EXTRACT FIRST THREE LETTERS
    dfs.append(df)                                    # APPEND TO LIST

print(dfs[0].head())
#     tt      m1      m2      m3      m4      m5      m6      m7     tt2  No file_name
# 0  0.6 -0.9872  0.0119 -0.0119 -0.9883  0.0306 -0.0259 -0.9903  0.0238   0   Spl_W_1
# 1  0.7 -0.9877 -0.0382 -0.0227 -0.9803 -0.0293 -0.0252 -0.9864 -0.0764   0   Spl_W_1
# 2  0.8 -0.9859 -0.0256  0.0218 -0.9829 -0.0323 -0.0098 -0.9870 -0.0512   0   Spl_W_1
# 3  0.9 -0.9838 -0.0030 -0.0032 -0.9844  0.0048 -0.0206 -0.9866 -0.0060   0   Spl_W_1
# 4  1.0 -0.9885 -0.0346 -0.0061 -0.9865 -0.0259 -0.0105 -0.9887 -0.0692   0   Spl_W_1

And for shorter lines, pandas' assign can best serve as counterpart to R dplyr's mutate with multiple column assignment but do be aware multiple assigned columns are ordered alphabetically:

for file in filelist:
    df = pd.read_table(file, header=None, skiprows=7, skipfooter=0, sep="\s+",
                       names=names_cols, usecols=[1,5,6,7,8,9,10,11])

    df = df[~df['tt'].str.contains('TRIAL|:')].assign(tt2 = df['m2'] * 2, file_name = file[0:3]+'_'+ v1[0])
    df = df.assign(tt = df['tt'].astype(float), No = (df['tt'].astype(float).cumsum()==0.1).astype(int))
    dfs.append(df)

By the way, why slowly shift from R to Python? Use both awesome languages!

R's split&apply&combine routine in Python

Answers (1)

Related Questions

R&#39;s split&amp;apply&amp;combine routine in Python

Answers (1)

Related Questions

R's split&apply&combine routine in Python