Reputation: 115
I've come to realize that using a SQL database (Postgres) is one of the most efficient ways to port my relational data (40+ GB across 24 CSV files) into Elasticsearch with a nested structure. However, I'm still having a couple of issues with the formatting of the JSON output from Postgres: 1) undesired line feeds (\n), 2) an undesired header line and 3) an undesired date format. Here is a basic example to demonstrate:
file1
id,age,gender,wave
1,49,M,1
2,72,F,0
file2
id,time,event1
1,2095-04-20 12:28:55,V39
1,2095-04-21 2:27:45,T21
2,2094-05-17 18:17:25,V39
file3
id,time,event2
1,2095-04-22 3:48:53,P90
2,2094-05-18 1:28:23,RT4
2,2094-05-18 4:23:53,W3
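For completeness, the tables were created and loaded along these lines (a sketch; the column types are guesses from the sample data and the file paths are placeholders):

CREATE SCHEMA forum;

-- types inferred from the sample rows above
CREATE TABLE forum.file1 (id int, age int, gender text, wave int);
CREATE TABLE forum.file2 (id int, time timestamp, event1 text);
CREATE TABLE forum.file3 (id int, time timestamp, event2 text);

-- client-side load from psql
\copy forum.file1 FROM 'file1.csv' CSV HEADER
\copy forum.file2 FROM 'file2.csv' CSV HEADER
\copy forum.file3 FROM 'file3.csv' CSV HEADER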
After loading these CSVs into the forum schema and running this SQL code:
with f_1 as(
SELECT id, json_agg(file1.*) AS tag
FROM forum.file1
GROUP BY id
), f_2 as (
SELECT id, json_agg(file2.*) AS tag
FROM forum.file2
GROUP BY id
), f_3 as (
SELECT id, json_agg(file3.*) AS tag
FROM forum.file3
GROUP BY id
)
SELECT ('{"id":' || a.id), ('"file1":' || a.tag), ('"file2":' || b.tag), ('"file3":' || c.tag ||'}')
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id;
I get this output (pgAdminIII - Export to file - no quoting):
?column?,?column?,?column?,?column?
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20T12:28:55","event1":"V39"},
{"id":1,"time":"2095-04-21T02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22T03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17T18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18T01:28:23","event2":"RT4"},
{"id":2,"time":"2094-05-18T04:23:53","event2":"W3"}]}
You can see that for a given id the data spans multiple lines. I need all of the data for a given id to be on one line (i.e. no \n's). A couple of other minor issues, which I haven't spent much time on but would like to change: the first row isn't needed, and I'd like to get rid of the ?column?,?column?,?column?,?column? header without having to open the file after it's done processing. Ideally I'd also prefer that there was no T in the date output. I should be able to accommodate the T in Elasticsearch but thus far haven't gotten it to accept it. This is the output I desire from Postgres, which works for input into Elasticsearch (using stream2es and a nested mapping structure):
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"},{"id":1,"time":"2095-04-21 02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"},{"id":2,"time":"2094-05-18 04:23:53","event2":"W3"}]}
Adding to_json does fix the undesired line feeds, but since it is now applied to a plain text value it escapes the quotes, producing \" in place of ", which the stream2es parser doesn't like:
SELECT to_json('{"id":' || a.id), to_json('"file1":' || a.tag::json), to_json('"file2":' || b.tag::json), to_json('"file3":' || c.tag::json ||'}')
"{\"id\":1","\"file1\":[{\"id\":1,\"age\":49,\"gender\":\"M\",\"wave\":1}]"
...
stream2es exception:
Exception in thread "stream dispatcher" java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IPersistentMap
Upvotes: 2
Views: 779
Reputation: 121624
Select everything as a single column (instead of four); the format() function will help you write it down more clearly. Use regexp_replace(str, '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g') to correct the date format and replace(str, e' \n ', '') to strip the newline characters.
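For example, applied to strings shaped like the output above:

SELECT regexp_replace('"time":"2095-04-20T12:28:55"', '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g');
-- "time":"2095-04-20 12:28:55"

SELECT replace(e'[{"x":1}, \n {"x":2}]', e' \n ', '');
-- [{"x":1},{"x":2}]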
Use the COPY command to simplify things:
COPY (
with f_1 as(
SELECT id, json_agg(file1.*) AS tag
FROM forum.file1
GROUP BY id
), f_2 as (
SELECT id, json_agg(file2.*) AS tag
FROM forum.file2
GROUP BY id
), f_3 as (
SELECT id, json_agg(file3.*) AS tag
FROM forum.file3
GROUP BY id
)
SELECT
replace(
regexp_replace(
format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
a.id, a.tag, b.tag, c.tag),
'(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
e' \n ', '')
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
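Note that COPY ... TO writes the file on the database server and requires the appropriate server-side privileges; if you need the file on the client machine, run the same query through psql's \copy instead (keeping the whole command on a single line).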
To prepend each line of data with a command line, you can use a trick: a function that returns two rows. While we're at it, part of the formatting can be moved into the function.
create or replace function format_data_line(command text, data_str text)
returns setof text language plpgsql as $$
begin
  -- first output row: the command line to prepend
  return next command;
  -- second output row: the data, with the date format fixed and newlines stripped
  return next
    replace(
      regexp_replace(data_str,
        '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
      e' \n ', '');
end $$;
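A quick sanity check, with the sample string standing in for json_agg output:

SELECT format_data_line('my command', e'{"id":1, \n "x":2}');
-- my command
-- {"id":1,"x":2}

The full export then becomes: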
COPY (
with f_1 as(
SELECT id, json_agg(file1.*) AS tag
FROM forum.file1
GROUP BY id
), f_2 as (
SELECT id, json_agg(file2.*) AS tag
FROM forum.file2
GROUP BY id
), f_3 as (
SELECT id, json_agg(file3.*) AS tag
FROM forum.file3
GROUP BY id
)
SELECT
format_data_line(
'my command',
format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
a.id, a.tag, b.tag, c.tag))
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
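With the question's data this produces alternating command and data lines ('my command' standing in for whatever header line each document needs, e.g. a bulk-style action line):

my command
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"},{"id":1,"time":"2095-04-21 02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}
my command
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"},{"id":2,"time":"2094-05-18 04:23:53","event2":"W3"}]}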
Upvotes: 1