koders

Reputation: 5794

Create CSV file from a text file with header tokens using shell scripting

I want to create a CSV file from a bunch of text files in a directory with the following structure to import them into a database later.

Title:
Article title

Word Count:
100

Summary:
Article summary.

Can consist of multiple lines.
    
Keywords:
keyword1, keyword2, keyword3


Article Body:
The rest of the article body.

Till the end of the file.

So the desired result is a CSV file with the section names as headers and each article's contents as a row, as follows.

Title          | Word Count | Summary                  | Keywords                     | Article Body |
Article title  | 100        | Article summary.\nCan... | keyword1, keyword2, keyword3 | ...          |
Article2 title | 110        | Article summary.\nCan... | keyword1, keyword2, keyword3 | ...          |

I've tried a few approaches with awk and shell scripting but couldn't succeed so far. Any ideas?

Upvotes: 1

Views: 692

Answers (2)

Fravadona

Reputation: 16980

According to the documentation of COPY, PostgreSQL fully supports the CSV format, as well as a Text format that is compatible with the lossless TSV format.

Because I'm using awk, I chose to generate a TSV. The reason is that the data contains newlines, and POSIX awk doesn't allow literal newlines in a user-defined variable. A TSV doesn't have this problem, because the literal newlines have to be replaced with their C-style notation \n.
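
To make the escaping concrete, here's a minimal sketch (my illustration, not part of the answer's script): the multi-line Summary value collapses into a single physical line once backslashes, carriage returns and tabs are escaped and the line breaks are rewritten as \n:

```shell
# Sketch: escape a multi-line value into a single TSV-safe line.
printf 'Article summary.\nCan consist of multiple lines.\n' |
awk '
    {
        gsub(/\\/, "&&")   # double any backslash first, so the escapes below stay unambiguous
        gsub(/\r/, "\\r")
        gsub(/\t/, "\\t")
        out = (NR == 1 ? $0 : out "\\n" $0)   # join the lines with a literal \n
    }
    END { print out }
'
# → Article summary.\nCan consist of multiple lines.
```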

Also, I changed the input format to make it easier to parse. The new rule is that one or more empty lines delimit the records, which means that you can't have empty lines inside the content of Summary or Article Body; the workaround is to add a single space character on those lines, as I did in the example.


Input example:

Title:
Article title

Word Count:
100

Summary:
Article summary.
 
Can consist of multiple lines.

Keywords:
keyword1, keyword2, keyword3


Article Body:
The rest of the article body.
 
Till the end of the file.
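
As a side note on the mechanics (my illustration, with a shortened input): the command below relies on awk's "paragraph mode". With RS='', one or more empty lines separate the records, and FS='^$' keeps each record as a single unsplit field:

```shell
# Paragraph mode: RS='' turns runs of empty lines into record separators,
# so each "Header:" block arrives as one multi-line record.
printf 'Title:\nArticle title\n\nWord Count:\n100\n' |
awk -v RS='' -v FS='^$' '{ print NR ": " substr($0, 1, index($0, ":") - 1) }'
# → 1: Title
# → 2: Word Count
```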

And here's the awk command, which accepts multiple files as arguments:

edit: added TSV escaping for the header / added basic comments / reduced code size

awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { ++onr } # the current file number is our "output record number"
    /^[^:\n]+:/ {
        # lossless TSV escaping ("&&" doubles each matched backslash;
        # a "\\\\" replacement would emit only a single one)
        gsub(/\\/,"&&")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")

        # get the current field name
        id = substr($0,1,index($0,":")-1)

        # strip the first line (NOTE: the newline character is escaped)
        sub(/^(\\[^n]|[^\\])*\\n/,"")

        # save the data
        fields[id]           # keep track of the field names that we came across
        records[0,id] = id   # for the header line
        records[onr,id] = $0 # for the output record
    }
    END {
        # print the header (onr == 0) and the records (onr >= 1)
        for (i = 0; i <= onr; i++) {
            out = sep = ""
            for (id in fields) {
                out = out sep records[i,id]
                sep = OFS
            }
            print out
        }
    }
' *.txt

Then the output (I replaced all the literal tabs with | for better legibility):

Summary | Article Body | Word Count | Title | Keywords
Article summary.\n \nCan consist of multiple lines. | The rest of the article body.\n \nTill the end of the file. | 100 | Article title | keyword1, keyword2, keyword3

Postscript: Once you have a valid TSV file, you can use a tool like mlr to convert it to CSV, JSON, etc., but for the purpose of importing the data into PostgreSQL, that isn't required.
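
For example, with Miller the conversion should look something like `mlr --itsv --ocsv cat file.tsv`. If you only need basic CSV quoting, here's a plain-awk sketch (my addition, shown on a hypothetical two-line input; it keeps the \n notation verbatim inside the fields):

```shell
# Naive TSV-to-CSV sketch: quote every field and double embedded quotes.
printf 'Title\tSummary\nArticle title\tline1\\nline2\n' |
awk -F '\t' -v OFS=',' '
    {
        for (i = 1; i <= NF; i++) {
            gsub(/"/, "\"\"", $i)   # CSV rule: double any embedded quote
            $i = "\"" $i "\""       # wrap the field in quotes
        }
        print                       # assigning to $i rebuilt $0 with OFS
    }
'
# → "Title","Summary"
# → "Article title","line1\nline2"
```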

The SQL statement will be this (untested):

COPY table_name FROM '/path/file.tsv' WITH HEADER;

remark: You don't need to specify FORMAT or DELIMITER, because the defaults are already text and \t. Note that HEADER with the text format is only accepted from PostgreSQL 15 on; older versions allow it only with csv.

Upvotes: 1

koders

Reputation: 5794

I made some changes to @Fravadona's script and generated an INSERT statement instead. That seemed more practical to me, and it works. The answer above was really helpful; I'm just adding this here as a reference, as it might be useful for other people.

awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { fn++ }
    /^[^:]+:/ {
        fieldName = substr($0,1,index($0,":")-1)

        sub("^[^:]+:[^\n]*\n","") # strip the "Header:" line
        gsub(/\\/,"&&")           # "&&" doubles each backslash; "\\\\" would emit a single one
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")

        header[fieldName]
        record[fn,fieldName] = $0
    }
    END {
        ORS=""
        # NOTE: the column list below must match the (unspecified) iteration order of for-in over header
        print "insert into article(summary, content, word_count, title, keywords) values(E\047"

        for (i = 1; i <= fn; i++) {
            sep = "\047,\047"
            out = ""
            for (fieldName in header) {
                out = out record[i,fieldName] sep # record[i,...]: record[fn,...] only worked for a single input file
            }
            print substr(out,1,length(out)-2)")" # drop the trailing ",\047"; substr with a start of 0 is implementation-defined
        }
    }
' x.txt

Result:

insert into article(summary, content, word_count, title, keywords) values(E'Article summary.\n \nCan consist of multiple lines.','The rest of the article body.\n \nTill the end of the file.','100','Article title','keyword1, keyword2, keyword3')

Upvotes: 0
