5r9n
5r9n

Reputation: 187

How can I extract parts of one column and append them to other columns?

I have a large .csv file that I need to extract information from and add this information to another column. My csv looks something like this:

file_name,#,Date,Time,Temp (°C) ,Intensity
    trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
    trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
    trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
    trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
    trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
    trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
    trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
    trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
    trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
    trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,

I want to create a two new columns that contains the data from the "file_name" column. I want to extract the one to two numbers after the text "trap" and I want to extract the c or the u and create new columns with this data. Data should look like something like this after processing:

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
    trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
  trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
  trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
  trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
  trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
  trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
 trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

I suspect the way to do this is with awk and a regular expression, but I'm not sure how to implement the regular expression. How can I extract parts of one column and append them to other columns?

Upvotes: 0

Views: 94

Answers (4)

natersoz
natersoz

Reputation: 1752

Using python pandas reader because python is awesome for numerical analysis:

First: I had to modify the data header row so that the columns were consistent by appending 3 commas:

file_name,#,Date,Time,Temp (°C) ,Intensity,,, There is probably a way to tell pandas to ignore the column differences - but I am yet a noob.

Python code to read your data into columns and create 2 new columns named 'cu_int' and 'cu_char' which contain the parsed elements of the filenames:

import pandas

def main():
    df = pandas.read_csv("file.csv")

    df['cu_int'] = 0                                    # Add the new columns to the data frame.

    df['cu_char'] = ' '

    for index, df_row in df.iterrows():
        file_name = df['file_name'][index].strip()

        trap_string = file_name.split("_")[0]           # Get the file_name string prior to the underscore
        numeric_offset_beg = len("trap")                # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1       # Leave off the 'c' or 'u' char.

        numeric_value = trap_string[numeric_offset_beg : numeric_offset_end]
        cu_value = trap_string[len(trap_string) - 1]

        df['cu_int'] = int(numeric_value)
        df['cu_char'] = cu_value

    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print df


if __name__ == "__main__":
    main()

The printed output (note there are inconsistencies in the data set posted - see row 1 as an example):

    $ python read_csv.py 
                         file_name  #      Date        Time  Temp (°C)     Intensity  Unnamed: 6  Unnamed: 7  Unnamed: 8  cu_int cu_char   
0      trap12u_10733862_150809.txt   1  05/28/15  06:00:00.0      20.424        215.3         NaN         NaN         NaN      12       c   
1      trap12u_10733862_150809.txt   2  05/28/15  07:00:00.0      21.091          1.0       130.2         NaN         NaN      12       c   
2      trap12u_10733862_150809.txt   3  05/28/15  08:00:00.0      26.195          3.0       100.0         NaN         NaN      12       c   
3      trap11u_10733862_150809.txt   4  05/28/15  09:00:00.0      25.222          3.0       444.5         NaN         NaN      12       c   
4      trap11u_10733862_150809.txt   5  05/28/15  10:00:00.0      26.195          3.0       100.0         NaN         NaN      12       c   
5      trap11u_10733862_150809.txt   6  05/28/15  11:00:00.0      25.902          2.0       927.8         NaN         NaN      12       c   
6      trap11u_10733862_150809.txt   7  05/28/15  12:00:00.0      25.708          2.0       325.0         NaN         NaN      12       c   
7      trap12c_10733862_150809.txt   8  05/28/15  13:00:00.0      26.292          3.0       100.0         NaN         NaN      12       c   
8      trap12c_10733862_150809.txt   9  05/28/15  14:00:00.0      26.390          2.0        66.7         NaN         NaN      12       c   
9      trap12c_10733862_150809.txt  10  05/28/15  15:00:00.0      26.097          1.0       463.9         NaN         NaN      12       c   

Upvotes: 0

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

gawk approach:

awk -F, 'NR==1{ print $0,"can_und,trap_no" }
         NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file

The output:

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

  • NR==1{ print $0,"can_und,trap_no" } - print the header line

  • match($1,/^trap([0-9]+)([a-z])/,a) - matches the number following trap word and the next following suffix letter

Upvotes: 1

anubhava
anubhava

Reputation: 784998

Using sed you can do this:

sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no    
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

Upvotes: 3

Michael
Michael

Reputation: 5335

With use of sed, this will be like:

sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file

Use sed -i ... to replace it in file.

Upvotes: 1

Related Questions