MikasaAckerman

Reputation: 529

How to break a large CSV into smaller CSV batches?

I have a CSV string of 50k values (it's one single string) in the format 23445, 23446, 24567, ..., etc. I want to create a wrapper script that breaks it into batches of 500 and passes each batch to a script that accepts it as input.

input.csv (50k comma-separated values)

The script takes a batch of 500, throttles for 60 seconds, and then takes the next 500 values.

#!/bin/bash
input.csv | sed -n 1'p' | tr ',' '\n' | while read word; do
script_accpts_batch_of_500=$word
done

Upvotes: 0

Views: 90

Answers (3)

sjnarv

Reputation: 2374

Not knowing much about the setting in which you'll launch this sort of thing, here's a fairly self-contained script. The shell function splitcsv uses awk to split a very long CSV line into shorter CSV lines; the functions around it generate test input and simulate processing.

This use of awk leaves the record separator (RS) alone and sets the field separator (FS) instead, via awk's -F option. If splitcsv is given several long CSV lines, each one is processed in turn: it emits as many 500-field lines as possible, then one short line (fewer than 500 fields) for whatever remains, before moving on to the next long line.

But you only asked for one long line to be processed, so I'm stopping here.

#!/usr/bin/env bash

stepdown_csv() {
  local n=500
  [[ $# -eq 1 ]] && n="$1"

  generate50000 |
  splitcsv "$n" |
  while IFS= read -r line; do
    process_csv_line "$line"
  done
}

process_csv_line() {
  local unsep=$(sed 's/,/ /g' <<< "$1")

  if [[ "$unsep" != '' ]]; then
    set -- $unsep
    echo "Got a CSV line with $# fields"
    # sleep 60
  fi
}

splitcsv() {
  awk -F , -v flds="$1" '{
    for (n=1; n<=NF; n++) {
      printf "%s%s", $n, n % flds == 0 || n == NF ? "\n" : ","
    }
  }'
}

generate50000() {
  for n in {1..50000}; do
    echo -n $RANDOM
    if [[ $n -lt 50000 ]]; then
      echo -n ,
    else
      echo
    fi
  done
}

stepdown_csv "$@"
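
To make the multi-line behaviour described above concrete, here is a small standalone sketch. It repeats the same awk logic as splitcsv (with parentheses added around the ternary for readability) and uses a batch size of 2 instead of 500; the two-line input is made up for illustration.

```shell
#!/usr/bin/env bash

# Same logic as splitcsv above: emit at most "flds" fields per output line.
splitcsv() {
  awk -F , -v flds="$1" '{
    for (n = 1; n <= NF; n++) {
      # End the line after every flds-th field or at the last field.
      printf "%s%s", $n, (n % flds == 0 || n == NF) ? "\n" : ","
    }
  }'
}

# Two "long" lines: 5 fields, then 3 fields.
printf '1,2,3,4,5\n6,7,8\n' | splitcsv 2
# prints:
# 1,2
# 3,4
# 5
# 6,7
# 8
```

Each input line ends with its own short line (5 and 8 here) before the next input line starts.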

Upvotes: 1

karakfa

Reputation: 67467

Another awk solution:

$ awk -v RS=, '{ORS=NR%500?RS:"\n"}1' file
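
A small sketch of how this one-liner behaves, using a batch size of 3 so the output is easy to read. The function name batch_lines is made up for illustration, and the tr step is an addition: it strips the file's trailing newline, which would otherwise stay attached to the last value because RS is a comma.

```shell
#!/usr/bin/env bash

# batch_lines SIZE: read comma-separated values on stdin and print
# SIZE values per line. Each value is one awk record (RS=","), and ORS
# switches to a newline on every SIZE-th record.
batch_lines() {
  tr -d '\n' | awk -v RS=, -v size="$1" '{ ORS = (NR % size) ? RS : "\n" } 1'
}

printf '1,2,3,4,5,6\n' | batch_lines 3
# prints:
# 1,2,3
# 4,5,6
```

Note that when the number of values is not a multiple of the batch size, the last line ends with a comma and no newline, so this form fits best when the count divides evenly, as 50000 does by 500.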

Upvotes: 1

Walter A

Reputation: 19982

You can combine several commands:

tr ',' '\n' < input.csv | paste -d, $(yes -- "- " | head -n 500)

You can also use a single command:

awk 'BEGIN {RS=","} {if (NR%500==0) print $0; else printf "%s%s", $0, RS; }' input.csv
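
Neither command above does the throttling, so here is a minimal sketch of a wrapper that ties batching to the 60-second pause from the question. The names run_batches and process_batch are hypothetical; process_batch is only a stand-in for the real script that accepts a batch, and the sketch assumes the values contain no spaces of their own (spaces after commas are stripped).

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the real consumer: it only counts the values.
# Replace the body with a call to the script that accepts a batch.
process_batch() {
  local count
  count=$(printf '%s\n' "$1" | awk -F, '{ print NF }')
  echo "processing $count values"
}

# run_batches FILE [BATCH_SIZE] [DELAY_SECONDS]
run_batches() {
  local file="$1"
  local size="${2:-500}"
  local delay="${3:-60}"
  local first=1
  local batch

  # Drop the spaces after the commas, then emit one batch per line.
  tr -d ' ' < "$file" |
  awk -F, -v size="$size" '{
    for (i = 1; i <= NF; i++)
      printf "%s%s", $i, (i % size == 0 || i == NF) ? "\n" : ","
  }' |
  while IFS= read -r batch; do
    [ -n "$batch" ] || continue
    # Pause before every batch except the first one.
    [ "$first" -eq 1 ] || sleep "$delay"
    first=0
    process_batch "$batch"
  done
}
```

Calling run_batches input.csv would use the defaults of 500 values and a 60-second pause; run_batches input.csv 3 0 is handy for a quick dry run on a small file.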

Upvotes: 1
