Reputation: 71
Goal:
I need to modify URLs so that only the number in them (latitude/longitude/id) remains. In a .csv file I have one "certain Titel in Header" that I need to find. In the column under that header I need to delete the beginning and the end of each URL, so that only the number that was part of the URL is left. I need to do this on differently structured CSVs with several columns, different header titles and different URL patterns. Is there a way to write a function for this in bash with awk?
I tried the following, but it's not working because I'm missing a lot of knowledge:
#!/bin/bash
CSVFILE=$(find ./aufzubereiten -type f ! -name ".DS_Store") #only one file in this folder.
FILENAME=$(basename "$CSVFILE")
function modify_col() {
    COL= # how to find the right column in the csv?
    awk -F',' OFS="," -v pat='"$PAT"' '{sub(/pat/,X,$${COL})} 1' "$CSVFILE" > "$CSVFILE".tmp1 && mv "$CSVFILE".tmp1 "$CSVFILE"
}
COLTITEL="certain Titel in Header"
PAT='/Text1234Text[0-9]{5,8}Text1.html'
PATNEW=''
modify_col
COLTITEL="certain Titel2 in Header"
PAT='/Text2234Text[0-9]{5,8}Text2.html'
PATNEW=''
modify_col
COLTITEL="certain Titel3 in Header"
PAT='/Text3234Text[0-9]{5,8}Text3.html'
PATNEW=''
modify_col
Example file:
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
,,/Text2234Text7846642Text.html,/Text2234Text8974342Text2.html,/Text2234Text823242Text3.html
,,/Text2234Text7846643Text.html,/Text2234Text8974343Text2.html,/Text2234Text823243Text3.html
Result should be:
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243
Thank you for your ideas :)
Upvotes: 1
Views: 176
Reputation: 691
I know the OP asked if there is a way to do it using awk, but from the context provided I understand that any solution that runs in a bash script would solve the OP's problem.
For such a case, I believe sed is a more elegant solution:
sed 's/[^,]\+[^0-9]\([0-9][0-9]\+\)[^,]\+/\1/g' data.csv
It keeps the number with 2 or more digits that is closest to the end of each field and deletes everything else. The extended-regex (-E) version of sed may be easier to read:
sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' data.csv
output:
rvb@ubuntu:~$ sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' data.csv
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243
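If the cleaned output should replace the original file, as in the script from the question, one possible way to wire it up (a sketch only; the aufzubereiten path and the temp-file step mirror the OP's script, and with GNU sed the -i option could edit in place instead):
#!/bin/bash
# Sketch: run the sed substitution and overwrite the CSV via a temp file,
# following the pattern from the question. Assumes exactly one CSV in ./aufzubereiten.
CSVFILE=$(find ./aufzubereiten -type f ! -name ".DS_Store")
sed -E 's/[^,]+[^0-9]([0-9][0-9]+)[^,]+/\1/g' "$CSVFILE" > "$CSVFILE".tmp \
  && mv "$CSVFILE".tmp "$CSVFILE"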
Upvotes: 1
Reputation: 189958
Here's a generalized solution to look for numbers with five or more digits, and delete everything else.
awk -F , 'BEGIN { OFS=FS }
FNR>1{
  for(i=1;i<=NF;++i) {
    gsub(/(^|[^0-9])[0-9]{1,4}([^0-9]|$)/, "", $i);  # remove digit runs of 1-4 digits, together with their non-digit neighbours
    gsub(/[^0-9]+/, "", $i);                         # then remove all remaining non-digit characters
  }
} 1' filename
If you only have a single filename, there is probably no reason to use find. If you don't know the name of the file but there is only one file in the current directory, * will expand to that file name.
This is slightly brittle in that it will not do the right thing if two numbers within a field are separated by a single non-number character. It's not hard to fix this but I'm lazy and your requirements are slightly vague.
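If that edge case matters, a hedged way to harden it (a sketch, not necessarily the fix the answer had in mind; it relies on the same interval-expression support that the {1,4} above already assumes) is to extract the first run of five or more digits instead of deleting around it:
awk -F , 'BEGIN { OFS=FS }
FNR>1{
  for(i=1;i<=NF;++i)
    # keep only the first run of 5+ digits in the field, or empty the field if there is none
    $i = (match($i, /[0-9]{5,}/) ? substr($i, RSTART, RLENGTH) : "")
} 1' filename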
Upvotes: 1
Reputation: 35556
Assumption: Text shows up in 3 locations within each html file name.
Sample data:
$ cat text.dat
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,/Text2234Text7846641Text.html,/Text2234Text8974341Text2.html,/Text2234Text823241Text3.html
,,/Text2234Text7846642Text.html,/Text2234Text8974342Text2.html,/Text2234Text823242Text3.html
,,/Text2234Text7846643Text.html,/Text2234Text8974343Text2.html,/Text2234Text823243Text3.html
One awk solution:
$ awk -F"Text" '
BEGIN { OFS="," }
FNR==1 { print ; next }
{ print ",,"$3,$6,$9 }
' text.dat
Where:
-F"Text" - use Text as our input field separator
OFS="," - set the output field separator
FNR==1 {print ; next} - for row #1 (the header row) print the entire row and skip to the next row in the file
print ",,"$3,$6,$9 - print 2 commas and then fields 3, 6 and 9 (separated by OFS=",")
Result:
header1, header2, certain Titel in Header, certain Titel2 in Header, certain Titel3 in Header
,,7846641,8974341,823241
,,7846642,8974342,823242
,,7846643,8974343,823243
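To see why it is fields 3, 6 and 9 that hold the numbers, a quick throwaway check (not part of the solution itself) is to print the Text-separated fields of one data row:
awk -F"Text" 'FNR == 2 { for (i = 1; i <= NF; i++) print i, $i }' text.dat
With the sample data this lists ten fields per data row, and fields 3, 6 and 9 are the bare numbers that the print statement picks out.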
Upvotes: 2
Reputation: 133780
Could you please try the following; it is written and tested with the shown samples.
awk '
BEGIN{
FS=OFS=","
}
FNR==1{
print
next
}
{
for(i=1;i<=NF;i++){
sub(/^\/Text[0-9]+Text/,"",$i)
sub(/Text.*/,"",$i)
}
}
1
' Input_file
Explanation: Adding a detailed explanation of the above code.
awk '
BEGIN{ ##Starting BEGIN section of code here.
FS=OFS="," ##Setting FS and OFS to comma here.
}
FNR==1{ ##Checking condition if FNR==1 then do following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop to traverse into all fields here.
sub(/^\/Text[0-9]+Text/,"",$i) ##Substituting from starting Text digits Text with NULL in current field.
sub(/Text.*/,"",$i) ##Substituting everything from Text to till last of field value with NULL in current field.
}
}
1 ##1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
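Since the question also asks how to find the right column by its header title and to reuse the same logic with different patterns, here is a hedged sketch of how the substitution idea above could be wrapped in a bash function; the strip_col name, the header-trimming loop and the combined pattern are illustrative assumptions, not part of the answer itself:
#!/bin/bash
# Sketch: keep only the number in the column whose header matches a given title.
# strip_col is a hypothetical helper; pat is passed in as a dynamic regex.
strip_col() {
  local file=$1 title=$2 pat=$3
  awk -F',' -v OFS=',' -v title="$title" -v pat="$pat" '
    FNR == 1 {                      # locate the column by its header title
      for (i = 1; i <= NF; i++) {
        h = $i
        gsub(/^ +| +$/, "", h)      # ignore spaces around the header text
        if (h == title) col = i
      }
      print
      next
    }
    col { gsub(pat, "", $col) }     # strip everything matching pat from that column
    1
  ' "$file" > "$file".tmp && mv "$file".tmp "$file"
}

# Example: remove the leading /Text...Text part and the trailing Text...html part.
strip_col data.csv "certain Titel in Header"  '^/Text[0-9]+Text|Text[0-9]*[.]html$'
strip_col data.csv "certain Titel2 in Header" '^/Text[0-9]+Text|Text[0-9]*[.]html$'
strip_col data.csv "certain Titel3 in Header" '^/Text[0-9]+Text|Text[0-9]*[.]html$'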
Upvotes: 2