awk function to modify several columns with regex in a csv

Goal:

I need to modify URLs so that only the number in them (latitude/longitude/id) is left. In a .csv file I have one "certain Titel in Header" that I need to find. In the column under that found Titel I need to delete the beginning and the end of each URL, so that only the number that was part of the URL remains. I need to do this on differently structured CSVs with several columns, different Titels and different URL patterns. Is there a way to write a function in bash with awk?

I tried the following - it's not working because I'm missing a lot of knowledge:

#!/bin/bash
CSVFILE=$(find ./aufzubereiten -type f ! -name ".DS_Store") #only one file in this folder.
FILENAME=$(basename "$CSVFILE")

function modify_col() {
    COL=  # how to find the right column in the csv?
    awk -F',' OFS="," -v pat='"$PAT"' '{sub(/pat/,X,$${COL})} 1' "$CSVFILE" > "$CSVFILE".tmp1 && mv "$CSVFILE".tmp1 "$CSVFILE"
}

COLTITEL="certain Titel in Header"
PAT='/Text1234Text[0-9]{5,8}Text1.html'
PATNEW=''
modify_col

COLTITEL="certain Titel2 in Header"
PAT='/Text2234Text[0-9]{5,8}Text2.html'
PATNEW=''
modify_col

COLTITEL="certain Titel3 in Header"
PAT='/Text3234Text[0-9]{5,8}Text3.html'
PATNEW=''
modify_col

Example file:

header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
,,/Text2234Text7846642Text.html,/Text2234Text8974342Text2.html,/Text2234Text823242Text3.html
,,/Text2234Text7846643Text.html,/Text2234Text8974343Text2.html,/Text2234Text823243Text3.html

Result should be:

header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243

Thank you for your ideas :)

Upvotes: 1

Views: 176

Answers (4)

rvbarreto

Reputation: 691

I know OP asked if there is a way to do it using awk, but from the context provided I understand that any solution that runs in a bash script would solve OP's problem.

For such a case, I believe sed is a more elegant solution:

sed 's/[^,]\+[^0-9]\([0-9][0-9]\+\)[^,]\+/\1/g' data.csv

It keeps the number with two or more digits that sits closest to the end of each field. The extended-regex version of sed may help to visualize it better:

sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' data.csv

output:

rvb@ubuntu:~$ sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' data.csv
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243

Upvotes: 1

tripleee

Reputation: 189958

Here's a generalized solution to look for numbers with five or more digits, and delete everything else.

awk -F , 'BEGIN { OFS=FS }
  FNR>1{
    for(i=1;i<=NF;++i) {
        gsub(/(^|[^0-9])[0-9]{1,4}([^0-9]|$)/, "", $i);
        gsub(/[^0-9]+/, "", $i);
    }
  } 1' filename

If you only have a single filename, there is probably no reason to use find. If you don't know the name of the file but there is only one file in the current directory, * will expand to that file name.
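The glob approach can be sketched like this; the folder name `./aufzubereiten` is taken from the question's script, and the sketch assumes it really contains exactly one file:

```shell
# Minimal sketch: with exactly one file in the folder, a glob replaces find.
set -- ./aufzubereiten/*    # positional parameters now hold the match(es)
CSVFILE=$1                  # the single file name
echo "$CSVFILE"
```

If the folder could ever hold more than one file, `$#` can be checked first to fail early instead of silently processing only the first match.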

This is slightly brittle in that it will not do the right thing if two numbers within a field are separated by a single non-number character. It's not hard to fix this but I'm lazy and your requirements are slightly vague.
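For completeness, the header lookup that the original script left open ("how to find the right column?") can be sketched as follows. This is a minimal sketch: the title string and the file name `data.csv` are assumptions taken from the question's sample, and interval expressions like `{5,}` are spelled out so the sketch also runs under awks that lack them.

```shell
# Sketch: find the column whose header cell matches a given title,
# then keep only the first run of five or more digits in that column.
awk -F',' -v OFS=',' -v title='certain Titel in Header' '
FNR == 1 {
    for (i = 1; i <= NF; i++) {
        h = $i
        gsub(/^ +| +$/, "", h)      # trim spaces around the header cell
        if (h == title) col = i     # remember the matching column number
    }
    print; next                     # header row passes through unchanged
}
col {
    # keep the first run of 5+ digits (the id), drop the rest of the URL
    if (match($col, /[0-9][0-9][0-9][0-9][0-9]+/))
        $col = substr($col, RSTART, RLENGTH)
}
1' data.csv
```

Wrapping this in a bash function and passing the title via `-v` gives the per-column control the question asked for.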

Upvotes: 1

markp-fuso

Reputation: 35556

Assumption:

  • data looks exactly like the sample in the question, i.e. the literal Text shows up in three locations within each html file name

Sample data:

$ cat text.dat
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
,,/Text2234Text7846642Text.html,/Text2234Text8974342Text2.html,/Text2234Text823242Text3.html
,,/Text2234Text7846643Text.html,/Text2234Text8974343Text2.html,/Text2234Text823243Text3.html

One awk solution:

$ awk -F"Text" '
BEGIN  { OFS="," }
FNR==1 { print ; next }
       { print ",,"$3,$6,$9 }
' text.dat

Where:

  • -F"Text" - use Text as our input field separator
  • OFS="," - set output field separator
  • FNR==1 {print ; next} - for row #1 (header row) print the entire row and skip to next row in the file
  • print ",,"$3,$6,$9 - print 2 commas and then fields 3, 6 and 9 (separated by OFS=",")

Result:

header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243

Upvotes: 2

RavinderSingh13

Reputation: 133780

Could you please try the following; it is written and tested with the shown samples.

awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print
  next
}
{
  for(i=1;i<=NF;i++){
    sub(/^\/Text[0-9]+Text/,"",$i)
    sub(/Text.*/,"",$i)
  }
}
1
'  Input_file

Explanation: a detailed explanation of the above code.

awk '
BEGIN{                                 ##Starting BEGIN section of code here.
  FS=OFS=","                           ##Setting FS and OFS to comma here.
}
FNR==1{                                ##Checking condition if FNR==1 then do following.
  print                                ##Printing the current line here.
  next                                 ##next will skip all further statements from here.
}
{
  for(i=1;i<=NF;i++){                  ##Starting a for loop to traverse into all fields here.
    sub(/^\/Text[0-9]+Text/,"",$i)     ##Substituting from starting Text digits Text with NULL in current field.
    sub(/Text.*/,"",$i)                ##Substituting everything from Text to till last of field value with NULL in current field.
  }
}
1                                      ##1 will print edited/non-edited line here.
'  Input_file                          ##Mentioning Input_file name here.
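
The two-`sub` idea above can also be wrapped into the reusable bash function the question asked for. This is only a sketch: the function name, parameter names and the pattern arguments are illustrative, not part of the answer, and the leading/trailing patterns are passed in as dynamic awk regexes.

```shell
# Sketch: strip a leading and a trailing URL pattern from every field,
# leaving the header row untouched; edits the file in place via a temp file.
modify_cols() {
    local file=$1 pre=$2 post=$3
    awk -F',' -v OFS=',' -v pre="$pre" -v post="$post" '
    FNR == 1 { print; next }        # keep the header row as-is
    {
        for (i = 1; i <= NF; i++) {
            sub("^" pre, "", $i)    # drop the leading part of the URL
            sub(post ".*$", "", $i) # drop the trailing part of the URL
        }
    }
    1' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

modify_cols data.csv '/Text[0-9]+Text' 'Text'
```

Note that characters in the patterns are interpreted as regex metacharacters here, so a literal `.` or `/` in a real URL pattern would need escaping.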

Upvotes: 2
