user9467855

BASH - Split file into several files based on conditions

I have a file (input.txt) with the following structure:

>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
...

I would like to split this file into multiple files (day.txt, month.txt, ...). Each new text file should contain all the "header" lines for its category (the ones starting with >) together with their content (the lines between two header lines).

day.txt would therefore be:

>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA

and month.txt:

>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS

I cannot use split -l in this case because the number of lines is not the same for each category (day, month, etc.). However, each sub-category has the same number of content lines (3).

Upvotes: 2

Views: 2409

Answers (4)

RavinderSingh13

Reputation: 133518

EDIT: Following the OP's comment, adding one more solution that derives the file name from the header itself.

awk -F'[>_]' '/^>/{file=$2".txt"} {print > file}'  Input_file

Explanation:

awk -F'[>_]' '        ##Set the field separator to either > or _.
/^>/{ file=$2".txt" } ##On a line starting with >, set the variable file to the 2nd field followed by ".txt".
    { print > file  } ##Print every line to the file whose name is currently stored in the variable file.
'  Input_file         ##Name of the input file.

If the categories are known in advance, the following awk also works:

awk '/^>day/{file="day.txt"} /^>month/{file="month.txt"} {print > file}' Input_file
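A quick way to try the generic version, as a minimal sketch. The scratch directory and the stray first line are assumptions added for illustration; the guard `file != ""` is also an addition, so that lines appearing before the first header are skipped instead of causing a redirection error.

```shell
# Work in a throwaway directory (assumption, not part of the answer).
workdir=$(mktemp -d)
cd "$workdir"

# Small sample; the first line is a hypothetical stray line before any header.
printf '%s\n' '# stray line' '>day_1' 'ABC' '>month_1' 'BCD' '>day_2' 'JKL' > Input_file

# With -F'[>_]' a header such as ">day_1" splits into "", "day" and "1",
# so $2 is the category name. Lines seen before the first header are
# skipped because the variable file is still empty.
awk -F'[>_]' '
    /^>/       { file = $2 ".txt" }
    file != "" { print > file }
' Input_file
```

Within one awk run, `>` truncates a file only the first time it is opened, so a category that shows up again later keeps adding to the same file.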

Upvotes: 1

Aaron

Reputation: 24802

Since each subcategory is composed of the same number of lines, you can use grep's -A / --after-context flag to print that many lines after each matching header.

So if you know the list of categories in advance, you just have to grep the headers of their subcategories and redirect them, with their content, to the correct file:

lines_by_subcategory=3 # number of lines *after* a subcategory's header
for category in "month" "day"; do
    # anchor on "^>" and the trailing "_" so that e.g. "day" cannot match ">daytime_1"
    grep -A "$lines_by_subcategory" "^>${category}_" input.txt >> "$category.txt"
done

Note that this isn't the most efficient solution, as it must read the whole input once for each category. Other solutions could instead read the input once and redirect each subcategory to its respective file in a single pass.
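For comparison, a single-pass version can be sketched in plain bash. This is only an illustration, not a benchmarked alternative: the scratch directory and the variable names `out` and `name` are assumptions, and because it appends with `>>`, the output files must not exist before the run.

```shell
# Work in a throwaway directory (assumption) and recreate the sample input.
workdir=$(mktemp -d)
cd "$workdir"
cat > input.txt <<'EOF'
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
EOF

out=""   # current output file; empty until the first header is seen
while IFS= read -r line; do
    if [[ $line == '>'* ]]; then
        name=${line:1}      # drop the leading ">"
        name=${name%%_*}    # drop everything from the first "_"
        out="$name.txt"
    fi
    if [[ -n $out ]]; then
        printf '%s\n' "$line" >> "$out"
    fi
done < input.txt
```

The input is read once and no category list is needed, although for large files a shell loop like this will usually be slower than awk.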

Upvotes: 0

Sundeep

Reputation: 23667

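Here's a generic solution for the >name_number format. It assumes all records of a given name are contiguous in the input, as in the question's sample.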

$ awk 'match($0, /^>[^_]+_/){k = substr($0, RSTART+1, RLENGTH-2);
         if(!(k in a)){close(op); a[k]; op=k".txt"}}
       {print > op}' ip.txt
  • match($0, /^>[^_]+_/) if line matches >name_ at start of line
    • k = substr($0, RSTART+1, RLENGTH-2) save the name portion
    • if(!(k in a)) if the key is not found in array
    • a[k] add key to array
    • op=k".txt" output file name
    • close(op) in case there are too many files to write
  • print > op print input record to filename saved in op
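In the command above, op only changes the first time a name is seen, so if a name comes back after another one (day, month, then day again), the later records stay in the previous file. A possible variant for such interleaved input is sketched below; the sample data, the scratch directory and the names `target` and `seen` are assumptions made for illustration.

```shell
# Work in a throwaway directory (assumption) with interleaved sample data.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' '>day_1' 'ABC' '>month_1' 'BCD' '>day_2' 'JKL' > ip.txt

awk '
    match($0, /^>[^_]+_/) {
        k = substr($0, RSTART + 1, RLENGTH - 2)   # name between ">" and "_"
        target = k ".txt"
        if (!(target in seen)) {                   # first sighting: start from an empty file
            printf "" > target
            close(target)
            seen[target]
        }
        if (target != op) {                        # switch files, keeping only one open
            if (op != "") close(op)
            op = target
        }
    }
    op != "" { print >> op }                       # append, since the file may be reopened
' ip.txt
```

Appending with `>>` matters here because a file that was closed and is later reopened with `>` would be truncated.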

Upvotes: 0

jas

Reputation: 10865

You can set the record separator to >, so that each header and its content form one record, and then set the file name based on the category given by $1.

$ awk -v RS=">" 'NF {f=$1; sub(/_.*$/, ".txt", f); printf ">%s", $0 > f}' input.txt

$ cat day.txt
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA

$ cat month.txt
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
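To see how awk cuts the input into records with RS=">", a small sketch on hypothetical two-record data (the variable name `result` is only for illustration):

```shell
# Each record is a header (minus its ">") plus the lines that follow it.
# The text before the first ">" forms an empty record, which NF filters out.
result=$(printf '>day_1\nABC\n>month_1\nBCD\n' | awk -v RS='>' 'NF { print NR ": " $1 }')
printf '%s\n' "$result"
```

The record numbers start at 2 because record 1 is the empty string before the first >. That is also why the command above prints the > back explicitly with printf ">%s".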

Upvotes: 1
