Reputation: 5794
I want to create a CSV file from a bunch of text files in a directory with the following structure to import them into a database later.
Title:
Article title
Word Count:
100
Summary:
Article summary.
Can consist of multiple lines.
Keywords:
keyword1, keyword2, keyword3
Article Body:
The rest of the article body.
Till the end of the file.
So the desired result is to have them in a CSV file, with the sections as headers and their contents as rows:
Title | Word Count | Summary | Keywords | Article Body |
Article title | 100 | Article summary.\nCan... | keyword1, keyword2, keyword3 | ... |
Article2 title | 110 | Article summary.\nCan... | keyword1, keyword2, keyword3 | ... |
I've tried a few approaches with awk and shell scripting but couldn't succeed so far. Any ideas?
Upvotes: 1
Views: 692
Reputation: 16980
According to the documentation of COPY, PostgreSQL fully supports the CSV format, as well as a Text format that is compatible with the lossless TSV format.
Because I'm using awk, I chose to generate a TSV. The reason is that there are newlines in the data and POSIX awk doesn't allow literal newlines in a user-defined variable; a TSV doesn't have this problem because the literal newlines are replaced with their C-style notation \n.
Also, I changed the input format to make it easier to parse. The new rule is that one or more empty lines delimit the records, which means that you can't have empty lines in the content of Summary or Article Body; the work-around is to put a single space character on those lines, as I did in the example.
Input example:
Title:
Article title
Word Count:
100
Summary:
Article summary.
Can consist of multiple lines.
Keywords:
keyword1, keyword2, keyword3
Article Body:
The rest of the article body.
Till the end of the file.
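The record rule above is easy to check in isolation (a stand-alone snippet, not part of the answer's script): with RS='' awk runs in paragraph mode, where one or more blank lines delimit records, so the two blocks below count as exactly two records.

```shell
# paragraph mode: blank line(s) separate records, so NR ends up at 2
printf 'Title:\nA\n\n\nWord Count:\n1\n' | awk -v RS='' 'END { print NR }'
# prints: 2
```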
And here's the awk command, which accepts multiple files as arguments (edit: added TSV escaping for the header, added basic comments, reduced the code size):
awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { ++onr }   # the current file number is our "output record number"
    /^[^:\n]+:/ {
        # lossless TSV escaping
        gsub(/\\/,"\\\\")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")
        # get the current field name
        id = substr($0,1,index($0,":")-1)
        # strip the first line (NOTE: the newline character is escaped)
        sub(/^(\\[^n]|[^\\])*\\n/,"")
        # save the data
        fields[id]             # keep track of the field names that we came across
        records[0,id] = id     # for the header line
        records[onr,id] = $0   # for the output record
    }
    END {
        # print the header (onr == 0) and the records (onr >= 1)
        for (i = 0; i <= onr; i++) {
            out = sep = ""
            for (id in fields) {
                out = out sep records[i,id]
                sep = OFS
            }
            print out
        }
    }
' *.txt
Then the output (I replaced all the literal tabs with | for better legibility):
Summary | Article Body | Word Count | Title | Keywords
Article summary.\n \nCan consist of multiple lines. | The rest of the article body.\n \nTill the end of the file. | 100 | Article title | keyword1, keyword2, keyword3
Postscript: once you have a valid TSV file, you can use a tool like mlr to convert it to CSV, JSON, etc., but for the purpose of importing the data into PostgreSQL, that isn't required.
The SQL statement will be this (untested):
COPY table_name FROM '/path/file.tsv' WITH HEADER;
Remark: you don't need to specify the FORMAT and the DELIMITER, because the defaults are already text and \t.
Upvotes: 1
Reputation: 5794
I made some changes to @Fravadona's script and created an insert statement instead, which seemed more practical to me, and it works. The answer was really helpful; I'm just adding this here as a reference, as it might be useful for other people.
awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { fn++ }                 # one output record per input file
    /^[^:]+:/ {
        fieldName = substr($0,1,index($0,":")-1)
        sub("^[^:]+:[^\n]*\n","")     # drop the "Field:" line
        gsub(/\\/,"\\\\")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")
        header[fieldName]
        record[fn,fieldName] = $0
    }
    END {
        ORS = ""
        print "insert into article(summary, content, word_count, title, keywords) values(E\047"
        for (i = 1; i <= fn; i++) {
            sep = "\047,\047"
            out = ""
            for (fieldName in header) {
                out = out record[i,fieldName] sep
            }
            # strip the trailing ",\047" and close the values list
            print substr(out,1,length(out)-2) ")"
        }
    }
' x.txt
Result:
insert into article(summary, content, word_count, title, keywords) values(E'Article summary.\n \nCan consist of multiple lines.','The rest of the article body.\n \nTill the end of the file.','100','Article title','keyword1, keyword2, keyword3')
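Note that the statement prefix is printed only once, so the script above is really meant for a single input file. A minimal variation (a sketch, untested against a real database, and still relying on the arbitrary for-in order of the field names matching the hardcoded column list) prints one complete statement per record by moving the prefix into the loop and prefixing every value with E:

```shell
awk -v RS='' -v FS='^$' '
    FNR == 1 { fn++ }                 # one record per input file
    /^[^:]+:/ {
        fieldName = substr($0,1,index($0,":")-1)
        sub("^[^:]+:[^\n]*\n","")     # drop the "Field:" line
        gsub(/\\/,"\\\\"); gsub(/\n/,"\\n"); gsub(/\r/,"\\r"); gsub(/\t/,"\\t")
        header[fieldName]
        record[fn,fieldName] = $0
    }
    END {
        for (i = 1; i <= fn; i++) {
            out = ""
            for (fieldName in header)
                out = out (out == "" ? "" : "\047,E\047") record[i,fieldName]
            printf "insert into article(summary, content, word_count, title, keywords) values(E\047%s\047);\n", out
        }
    }
' *.txt
```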
Upvotes: 0