Reputation: 679

Duplicate values of columns in a tab delimited file

I would like to duplicate the values that are separated by |, transforming them into new columns.

The following subset example file would be:

1_A 2_A 3_A
1|0 0|0 0|0
0|0 0|0 1|1
1|1 1|0 1|0
0|1 1|1 0|0
0|1 1|0 0|0
0|0 0|0 0|0
0|1 1|1 0|1
0|0 0|0 1|0
1|1 1|0 0|1
0|0 0|0 0|0

The desired format would be:

1_A 1_B 2_A 2_B 3_A 3_B
1|1 0|0 0|0 0|0 0|0 0|0
0|0 0|0 0|0 0|0 1|1 1|1
1|1 1|1 1|1 0|0 1|1 0|0
0|0 1|1 1|1 1|1 0|0 0|0
0|0 1|1 1|1 0|0 0|0 0|0
0|0 0|0 0|0 0|0 0|0 0|0
0|0 1|1 1|1 1|1 0|0 1|1
0|0 0|0 0|0 0|0 1|1 0|0
1|1 1|1 1|1 0|0 0|0 1|1
0|0 0|0 0|0 0|0 0|0 0|0

On the example above, the 1_B sample is the duplicate values to the right of the pipe character from the 1_A sample.

Thanks for any lead.

Upvotes: 0

Answers (4)

James Brown

Reputation: 37464

Yet another in awk:

$ awk '{
    for(i=1;i<=NF;i++)
        if(NR==1)
            sub(/A/,"& " i "_B",$i)
        else {
            t=$i
            sub(/\|/," ",t)
            sub(/\|/,"|"t"|",$i)
        }
}1' file

head -3 of output:

1_A 1_B 2_A 2_B 3_A 3_B
1|1 0|0 0|0 0|0 0|0 0|0
0|0 0|0 0|0 0|0 1|1 1|1
...

And one in sed:

$ sed '1s/\(._\)A/\1A \1B/g; s/\([01]\)|\([01]\)/\1|\1 \2|\2/g' file

Output:

1_A 1_B 2_A 2_B 3_A 3_B
1|1 0|0 0|0 0|0 0|0 0|0
0|0 0|0 0|0 0|0 1|1 1|1
...

Upvotes: 3

karakfa

Reputation: 67567

awk to the rescue!

$ awk 'NR==1 {for(i=1;i<=NF;i++) 
                {t=$i; 
                 sub("A","B",$i); 
                 printf "%s %s ",t,$i} 
              print ""; next}
             {for(i=1;i<=NF;i++) 
                {split($i,a,"|"); 
                 printf "%s %s ",a[1]"|"a[1],a[2]"|"a[2]} 
              print ""}' file

1_A 1_B 2_A 2_B 3_A 3_B
1|1 0|0 0|0 0|0 0|0 0|0 
0|0 0|0 0|0 0|0 1|1 1|1 
1|1 1|1 1|1 0|0 1|1 0|0 
0|0 1|1 1|1 1|1 0|0 0|0 
0|0 1|1 1|1 0|0 0|0 0|0 
0|0 0|0 0|0 0|0 0|0 0|0 
0|0 1|1 1|1 1|1 0|0 1|1 
0|0 0|0 0|0 0|0 1|1 0|0 
1|1 1|1 1|1 0|0 0|0 1|1 
0|0 0|0 0|0 0|0 0|0 0|0

Upvotes: 3

RavinderSingh13

Reputation: 133780

Could you please try following.

awk -F"[ |]" '
FNR==1{
  print $1,"1_B",$2,"2_B",$3,"3_B"
  next
}
FNR>1{
  for(i=1;i<=NF;i++){
     $i=$i "|" $i
  }
}
1
'  Input_file

Explanation: Adding explanation for above code.

awk -F"[ |]" '                         ##Setting field separator eiter space or pipe here.
FNR==1{                                ##Checking condition if this is first line.
  print $1,"1_B",$2,"2_B",$3,"3_B"     ##Printing headers as per OP.
  next                                 ##Using next will skip all further statements from here.
}                                      ##Closing FNR==1 condition BLOCK here.
FNR>1{                                 ##Checking condition if FNR>1 then do following.
  for(i=1;i<=NF;i++){                  ##Starting a for loop from i=1 to till value of NF(number of fields in current line).
     $i=$i "|" $i                      ##Setting value of current field to current field |(pipe) current field value here.
  }                                    ##Closing BLOCK for for loop here.
}                                      ##Closing BLOCK for FNR>1 condition here.
1                                      ##Mentioning 1 will print edited/non-edited current line here.
'  Input_file                          ##Mentioning Input_file(which we need to process here).

Output will be as follows.

1_A 1_B 2_A 2_B 3_A 3_B
1|1 0|0 0|0 0|0 0|0 0|0
0|0 0|0 0|0 0|0 1|1 1|1
1|1 1|1 1|1 0|0 1|1 0|0
0|0 1|1 1|1 1|1 0|0 0|0
0|0 1|1 1|1 0|0 0|0 0|0
0|0 0|0 0|0 0|0 0|0 0|0
0|0 1|1 1|1 1|1 0|0 1|1
0|0 0|0 0|0 0|0 1|1 0|0
1|1 1|1 1|1 0|0 0|0 1|1
0|0 0|0 0|0 0|0 0|0 0|0

Upvotes: 4

Ian

Reputation: 3908

This should work:

def split_values(s):
    # split the string over '|', and format each to the expected output
    return [*map(lambda _: f"{_}|{_}", s.split('|'))]

# list of dataframes which we will later concat over
tmp_df_lst = []
for col in df.columns:
    # apply split_values over each column
    tmp_df = pd.DataFrame(df[col].apply(split_values).values.tolist(), 
                 columns=[f"{col[:-1]}A", f"{col[:-1]}B"])
    tmp_df_lst.append(tmp_df)

# result
pd.concat(tmp_df_lst, axis=1)

Output:

    1_A 1_B 2_A 2_B 3_A 3_B
0   1|1 0|0 0|0 0|0 0|0 0|0
1   0|0 0|0 0|0 0|0 1|1 1|1
2   1|1 1|1 1|1 0|0 1|1 0|0
3   0|0 1|1 1|1 1|1 0|0 0|0
4   0|0 1|1 1|1 0|0 0|0 0|0
5   0|0 0|0 0|0 0|0 0|0 0|0
6   0|0 1|1 1|1 1|1 0|0 1|1
7   0|0 0|0 0|0 0|0 1|1 0|0
8   1|1 1|1 1|1 0|0 0|0 1|1
9   0|0 0|0 0|0 0|0 0|0 0|0

Upvotes: 1

Duplicate values of columns in a tab delimited file

Answers (4)

Related Questions