Gou7haM

Reputation: 1

Combining TSV files to create a new TSV for Apache Arrow table

I have two TSV files, header.tsv and data.tsv. header.tsv holds 1000+ column names, and data.tsv holds ~50K records (some column values are NULL). I would like to create a new TSV file (say, combined.tsv) by appending data.tsv to header.tsv, so that a single file holds both the column names and the data. My hope is that this avoids errors when creating an Apache Arrow table.

**header.tsv**
field1 field2 field3 field4 ... field1000 

**data.tsv**
eng-en    1er2p  NULL  ert,yu1  ...  2020-09-16
frnch-fr  2er3p  NULL  ert,yu2  ...  2020-09-16
.
.
.
ltn-lt    50Ker  NULL  ert,yu50K ... 2020-09-16

Required TSV

**combined.tsv**
field1    field2   field3   field4    ...   field1000
eng-en    1er2p    NULL     ert,yu1   ...   2020-09-16
frnch-fr  2er3p    NULL     ert,yu2   ...   2020-09-16
.
.
.
ltn-lt    50Ker    NULL     ert,yu50K ...   2020-09-16

I used the shell command

paste header.tsv data.tsv > combined.tsv

and then tried to create a pyarrow table.

import pyarrow as pa
import pyarrow.csv as csv
combined = csv.read_csv('combined.tsv',parse_options=csv.ParseOptions(delimiter="\t"))

I get the error below when executing the above:

ArrowInvalid: CSV parse error: Expected 2010 columns, got 1006

header.tsv has exactly 1005 columns. It can be parsed on its own to create a pyarrow table, but data.tsv cannot.

import pyarrow as pa
import pyarrow.csv as csv
header = csv.read_csv('header.tsv', parse_options=csv.ParseOptions(delimiter="\t"))
head_show=header.to_pandas()
head_show.head()

I also tried pyarrow's concat_tables function, as follows:

import pyarrow as pa
final_combined = pa.concat_tables(header,data)

Error

TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Table

Please correct me if my approach is wrong.

Upvotes: 0

Views: 703

Answers (1)

dash-o

Reputation: 14452

To create the combined TSV, concatenate the header and the data vertically:

cat header.tsv data.tsv > combined.tsv

Using "paste" performs "horizontal" concatenation: it merges the 1st, 2nd, 3rd, ... lines of each file side by side, producing long lines.
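
To make the difference concrete, here is a minimal sketch in plain Python that mimics what `cat` and `paste` do to the lines of two files and counts the tab-separated columns on each resulting line. The helper names (`cat_lines`, `paste_lines`, `column_counts`) and the tiny 3-column sample rows are made up for illustration; they are not part of pyarrow or of the asker's files.

```python
from itertools import zip_longest


def cat_lines(*files):
    """Vertical concatenation: all lines of the first file, then the next, and so on."""
    return [line for lines in files for line in lines]


def paste_lines(*files):
    """Horizontal concatenation: join the i-th line of every file with a tab.

    When one file runs out of lines, an empty field is used in its place,
    which is how paste treats a shorter input.
    """
    return ["\t".join(parts) for parts in zip_longest(*files, fillvalue="")]


def column_counts(lines):
    """Number of tab-separated fields on each line."""
    return [len(line.split("\t")) for line in lines]


# Hypothetical stand-ins for header.tsv (one line) and data.tsv (two lines).
header = ["field1\tfield2\tfield3"]
data = ["eng-en\t1er2p\tNULL", "frnch-fr\t2er3p\tNULL"]

combined_cat = cat_lines(header, data)
combined_paste = paste_lines(header, data)

print(column_counts(combined_cat))    # [3, 3, 3]
print(column_counts(combined_paste))  # [6, 4]
```

With `paste`, the first line holds the header and the first record side by side (2 x 3 = 6 fields), and every later line holds an empty field plus one record (3 + 1 = 4 fields). Scaled up to 1005 columns, that pattern would give 2010 fields on the first line and 1006 on the next, which seems consistent with the "Expected 2010 columns, got 1006" error in the question.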

Upvotes: 3
