Piloumpicou

Reputation: 67

Insert a date in a column using awk

I'm trying to format a date in a column of a csv.

The input is something like: 28 April 1966

And I'd like this output: 1966-04-28

which can be obtained with this code:

date -d "28 April 1966" +%F

So now I thought of mixing awk and this code to format the entire column but I can't find out how.

Edit:

Example of input (the "|" separators are in fact tabs):

1 | 28 April 1966
2 | null
3 | null
4 | 30 June 1987 

Expected output:

1 | 1966-04-28
2 | null
3 | null
4 | 1987-06-30

Upvotes: 2

Views: 1310

Answers (5)

Camusensei

Reputation: 1553

Only one date call is needed, and no code injection is possible; see the following.

This script extracts the dates into a temporary file (using awk), processes them with a single date call, and merges the results back (using awk).

Code

awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' input > temp.$$
date --file=temp.$$ +%F > dates.$$
awk -F '\t' -v OFS='\t' 'BEGIN {
                           while ( getline < "'"dates.$$"'" > 0 )
                           {
                              f1_counter++
                              if ($0 == "0000-01-01") {$0 = "null"}
                              date[f1_counter] = $0
                           }
                         }
                         {$3 = date[NR]}
                         1' input

One-liner using bash process substitution (no temporary files):

inputfile=/path/to/input
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'<(date -f <(awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$inputfile") +%F)'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$inputfile"

Details

Here is how it can be used:

# configuration
input=/path/to/input
temp1=temp.$$
temp2=dates.$$
output=output.$$
# create the sample file (optional)
#printf "\t%s\n" $'1\t28 April 1966' $'2\tnull' $'3\tnull'  $'4\t30 June 1987' > "$input"
# Extract all dates
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$input" > "$temp1"
# transform the dates
date --file="$temp1" +%F > "$temp2"
# merge csv with transformed date
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'"$temp2"'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$input" > "$output"
# print the output
cat "$output"
# cleanup
rm "$temp1" "$temp2" "$output"
#rm "$input"

Caveats

  • Using "0000-01-01" as a temporary placeholder for invalid (null) dates (see the illustration after this list)
  • The code should be faster than methods that call date once per line, but it reads the input file twice.
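
For the sample input from the question, the intermediate files would look roughly like this (a sketch, assuming date round-trips the 0000-01-01 placeholder, which the script relies on):

# temp.$$ (dates extracted by the first awk pass, placeholder substituted for null):
28 April 1966
0000-01-01
0000-01-01
30 June 1987

# dates.$$ (the same lines after the single date call):
1966-04-28
0000-01-01
0000-01-01
1987-06-30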

Upvotes: 0

Wintermute

Reputation: 44023

A simple way is

awk -F '\\| ' -v OFS='| ' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename

That is:

{
  cmd = "date -d \"" $3 "\" +%F 2> /dev/null"  # build shell command
  cmd | getline $3                             # run, capture output
  close(cmd)                                   # close pipe
}
1                                              # print

This works because date doesn't print anything to its stdout if the date is invalid, so the getline fails and $3 is not changed.
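
A quick way to confirm that behaviour at the shell (assuming GNU date, as used above):

date -d "null" +%F 2> /dev/null    # prints nothing to stdout and exits non-zero
date -d "28 April 1966" +%F        # prints 1966-04-28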

Caveats to consider:

  1. For very large files, this spawns a shell and a date process for every line, which can become a noticeable performance drag.
  2. Be wary of code injection. If the CSV file comes from an untrustworthy source, this approach is difficult to defend against an attacker, and you're probably better off going the long way around, parsing the date manually with gawk's mktime and strftime (a sketch follows this list).
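
A minimal sketch of that gawk-only route (not code from this answer; it assumes GNU awk for mktime/strftime and keeps the $3 convention used above):

gawk -F '\t' -v OFS='\t' '
  BEGIN {
    # map English month names to numbers for mktime
    n = split("January February March April May June July August September October November December", names, " ")
    for (i = 1; i <= n; i++) m[names[i]] = i
  }
  $3 !~ /null/ {
    split($3, a, " ")                                # a[1] = day, a[2] = month name, a[3] = year
    ts = mktime(a[3] " " m[a[2]] " " a[1] " 0 0 0")  # returns -1 for an invalid date
    if (ts != -1) $3 = strftime("%Y-%m-%d", ts)
  }
  1' filename.csv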

EDIT re: comment: To use tabs as delimiters, the command can be changed to

awk -F '\t' -v OFS='\t' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename

EDIT re: comment 2: If performance is a worry, as it appears to be, spawning processes for every line is not a good approach. In that case, you'll have to do the parsing manually. For example:

BEGIN {
  OFS = FS

  m["January"  ] =  1
  m["February" ] =  2
  m["March"    ] =  3
  m["April"    ] =  4
  m["May"      ] =  5
  m["June"     ] =  6
  m["July"     ] =  7
  m["August"   ] =  8
  m["September"] =  9
  m["October"  ] = 10
  m["November" ] = 11
  m["December" ] = 12
}

$3 !~ /null/ {
  split($3, a, " ")                                   # a[1] = day, a[2] = month name, a[3] = year
  $3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
}
1

Put that in a file, say foo.awk, and run awk -F '\t' -f foo.awk filename.csv.

Upvotes: 3

Tom Fenech

Reputation: 74595

I would suggest using a language that supports parsing dates, like perl:

$ cat file
1       28 April 1966
2       null
3       null
4       30 June 1987
$ perl -F'\t' -MTime::Piece -lane 'print "$F[0]\t", 
  $F[1] eq "null" ? $F[1] : Time::Piece->strptime($F[1], "%d %B %Y")->strftime("%F")' file
1       1966-04-28
2       null
3       null
4       1987-06-30

The Time::Piece core module allows you to parse and format dates, using the standard format specifiers of strftime. This solution splits the input on a tab character and modifies the format if the second field is not "null".

This approach will be much faster than using system calls or invoking subprocesses, as everything is done in native perl.

Upvotes: 1

anubhava

Reputation: 785058

Here is how you can do this in pure BASH and avoid calling system or getline from awk:

while IFS=$'\t' read -ra arr; do 
   [[ ${arr[1]} != "null" ]] && arr[1]=$(date -d "${arr[1]}" +%F)
   printf "%s\t%s\n" "${arr[0]}" "${arr[1]}"
done < file

1       1966-04-28
2       null
3       null
4       1987-06-30

Upvotes: 0

user4453924

Reputation:

This should work with your given input:

awk -F'\\|' -vOFS="|" '!/null/{cmd="date -d \""$3"\" +%F";cmd | getline $3;close(cmd)}1' file

Output

| 1 |1966-04-28
| 2 | null
| 3 | null
| 4 |1987-06-30

Upvotes: 1
