big_smile
big_smile

Reputation: 1523

Split CSV file according to value in a column, but only keep 1 column and save as TXT

I have a two column spreadsheet (saved in CSV format) that is like this:

COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies

The full spreadsheet is 6000 rows (saved in CSV format, with commas used to separate the two columns). It has more than the Column 2 categories that are listed here.

As shown, some of the column 1 entries consists of two- or three- words separated by a space. They also make use of apostrophes and accented characters (these appear in multiple categories and not just the category titled 3-Companies).

You can download a sample of the full spreadsheet here.

I would like to split the CSV file up in to separate TXT files by the values in column 2. The separate files will no longer be a spreadsheet table, but just a list of words.

E.g. after split

In file 3-Entrepreneurship.txt

innovation
old ideas
new income streams
good publicity guru

In file 2-Bathroom.txt

toilet handle
bath room

In file 2-Police.txt

innocent
crime

In file 2-Sense of Smell.txt

odorless
musty smell

In file 3-Companies.txt

née dresses
Zoë’s food store
móvíl résumés
Señor

This is only a sample. The full file has more than 5 categories (in column 2), so there will be more than 5 post-split files.


Here is my code so far:

awk -F "," '{print $0 >> ($2 ".txt"); close($2 ".txt")}' test.csv

It works, but there are a few issues:

  1. The file names are printed out with question marks at the end. E.g. 2-Police?.txt 3-Entrepreneurship?.txt

  2. The files print out with two columns.

For example inside of 3-Entrepreneurship.txt it has the following data:

innovation  3-Entrepreneurship
ideas       3-Entrepreneurship
income      3-Entrepreneurship
publicity   3-Entrepreneurship

Upvotes: 0

Views: 166

Answers (2)

Ed Morton
Ed Morton

Reputation: 204731

tail -n +2 file |
sort -t',' -k2 |
awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'

We sort the data first for efficiency so that awk doesn't have to keep opening/closing files line by line but instead only does that once per unique $2 value.

Here's the script working:

$ ls
file

$ cat file
COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies

.

$ tail -n +2 file | sort -t',' -k2 | awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'

.

$ ls
 2-Bathroom.txt   2-Police.txt  '2-Sense of Smell.txt'   3-Companies.txt   3-Entrepreneurship.txt   file

$ head -n 50 *.txt
==> 2-Bathroom.txt <==
bath room
toilet handle

==> 2-Police.txt <==
crime
innocent

==> 2-Sense of Smell.txt <==
musty smell
odorless

==> 3-Companies.txt <==
móvíl résumés
née dresses
Señor
Zoë’s food store

==> 3-Entrepreneurship.txt <==
good publicity guru
innovation
new income streams
old ideas

Upvotes: 3

dawg
dawg

Reputation: 104111

You are very close.

First, it seems that the words without a value in column 2 should not be written to a file. If so, add a test to your action. The test can be just $2 which will skip the action inside the { ... } if $2 is blank.

Second, if you want only the word, write only col 1 to the file -- not the entire line in $0

With your example, which is not , delimited, you can do:

awk '$2 {print $1 >> ($2 ".txt"); close($2 ".txt")}' file

Be sure to delete the .txt files or run in an empty directory since this appends the files if they already exist.

Produces:

2-Bathroom.txt:
toilet
bath

2-Police.txt:
innocent
crime

2-Sense.txt:
odorless
musty

3-Entrepreneurship.txt:
innovation
ideas
income
publicity

Upvotes: 2

Related Questions