Reputation: 1523
I have a two column spreadsheet (saved in CSV format) that is like this:
COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies
The full spreadsheet is 6000 rows (saved in CSV format, with commas used to separate the two columns). It has more than the Column 2 categories that are listed here.
As shown, some of the column 1 entries consists of two- or three- words separated by a space. They also make use of apostrophes and accented characters (these appear in multiple categories and not just the category titled 3-Companies
).
You can download a sample of the full spreadsheet here.
I would like to split the CSV file up in to separate TXT files by the values in column 2. The separate files will no longer be a spreadsheet table, but just a list of words.
E.g. after split
In file 3-Entrepreneurship.txt
innovation
old ideas
new income streams
good publicity guru
In file 2-Bathroom.txt
toilet handle
bath room
In file 2-Police.txt
innocent
crime
In file 2-Sense of Smell.txt
odorless
musty smell
In file 3-Companies.txt
née dresses
Zoë’s food store
móvíl résumés
Señor
This is only a sample. The full file has more than 5 categories (in column 2), so there will be more than 5 post-split files.
awk -F "," '{print $0 >> ($2 ".txt"); close($2 ".txt")}' test.csv
It works, but there are a few issues:
The file names are printed out with question marks at the end.
E.g. 2-Police?.txt
3-Entrepreneurship?.txt
The files print out with two columns.
For example inside of 3-Entrepreneurship.txt it has the following data:
innovation 3-Entrepreneurship
ideas 3-Entrepreneurship
income 3-Entrepreneurship
publicity 3-Entrepreneurship
Upvotes: 0
Views: 166
Reputation: 204731
tail -n +2 file |
sort -t',' -k2 |
awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'
We sort the data first for efficiency so that awk doesn't have to keep opening/closing files line by line but instead only does that once per unique $2 value.
Here's the script working:
$ ls
file
$ cat file
COLUMN 1,COLUMN 2
innovation,3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas,3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime,2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies
.
$ tail -n +2 file | sort -t',' -k2 | awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'
.
$ ls
2-Bathroom.txt 2-Police.txt '2-Sense of Smell.txt' 3-Companies.txt 3-Entrepreneurship.txt file
$ head -n 50 *.txt
==> 2-Bathroom.txt <==
bath room
toilet handle
==> 2-Police.txt <==
crime
innocent
==> 2-Sense of Smell.txt <==
musty smell
odorless
==> 3-Companies.txt <==
móvíl résumés
née dresses
Señor
Zoë’s food store
==> 3-Entrepreneurship.txt <==
good publicity guru
innovation
new income streams
old ideas
Upvotes: 3
Reputation: 104111
You are very close.
First, it seems that the words without a value in column 2 should not be written to a file. If so, add a test to your action. The test can be just $2
which will skip the action inside the { ... }
if $2
is blank.
Second, if you want only the word, write only col 1 to the file -- not the entire line in $0
With your example, which is not ,
delimited, you can do:
awk '$2 {print $1 >> ($2 ".txt"); close($2 ".txt")}' file
Be sure to delete the .txt
files or run in an empty directory since this appends the files if they already exist.
Produces:
2-Bathroom.txt:
toilet
bath
2-Police.txt:
innocent
crime
2-Sense.txt:
odorless
musty
3-Entrepreneurship.txt:
innovation
ideas
income
publicity
Upvotes: 2