Reputation: 1
I have two TSV files, header.tsv and data.tsv. header.tsv holds 1000+ column names and data.tsv holds ~50K records (some column values are NULL). I would like to create a new TSV file (say, combined.tsv) by appending data.tsv to header.tsv. The goal is a single TSV file that holds both the column names and the data, which I hope will avoid errors when creating an Apache Arrow table.
**header.tsv**
field1 field2 field3 field4 ... field1000
**data.tsv**
eng-en 1er2p NULL ert,yu1 ... 2020-09-16
frnch-fr 2er3p NULL ert,yu2 ... 2020-09-16
.
.
.
ltn-lt 50Ker NULL ert,yu50K ... 2020-09-16
Required TSV
**combined.tsv**
field1 field2 field3 field4 ... field1000
eng-en 1er2p NULL ert,yu1 ... 2020-09-16
frnch-fr 2er3p NULL ert,yu2 ... 2020-09-16
.
.
.
ltn-lt 50Ker NULL ert,yu50K ... 2020-09-16
I used a shell command like
paste header.tsv data.tsv > combined.tsv
and then tried to create a pyarrow table.
import pyarrow as pa
import pyarrow.csv as csv
combined = csv.read_csv('combined.tsv',parse_options=csv.ParseOptions(delimiter="\t"))
I get the following error when executing the above:
ArrowInvalid: CSV parse error: Expected 2010 columns, got 1006
header.tsv has exactly 1005 columns. On its own it can be parsed into a pyarrow table, but data.tsv cannot.
import pyarrow as pa
import pyarrow.csv as csv
header = csv.read_csv('header.tsv',parse_options=csv.ParseOptions(delimiter="\t"))
head_show=header.to_pandas()
head_show.head()
I also tried pyarrow's concat_tables method, as below:
import pyarrow as pa
final_combined = pa.concat_tables(header,data)
Error
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Table
Please correct me if my approach is wrong.
Upvotes: 0
Views: 703
Reputation: 14452
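To create the combined TSV, you want to concatenate the header and data vertically:
cat header.tsv data.tsv > combined.tsv
Using "paste" performs "horizontal" concatenation: it merges the 1st, 2nd, 3rd, ... lines of each file side by side, forming long lines. Assuming header.tsv is a single line, that explains the error: the first line of the pasted file has 1005 header fields plus 1005 data fields (2010 columns), while every later line has one empty field from the exhausted header.tsv plus 1005 data fields (1006 columns).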
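If you would rather do the vertical concatenation from Python, here is a minimal sketch using only the standard library. The function name `concat_vertically` and the `chunk_size` parameter are my own, not part of any library. It also adds a newline after a file that does not end with one, which plain `cat` does not do; I am assuming that is what you want, so the last header field is never glued to the first data field.

```python
def concat_vertically(inputs, output, chunk_size=1 << 20):
    """Write the input files one after another, like `cat a b > out`.

    Hypothetical helper for illustration; file names are supplied by the caller.
    """
    with open(output, "wb") as out:
        for path in inputs:
            last = b""
            with open(path, "rb") as src:
                # Copy in chunks so a large data file is never fully in memory.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    last = chunk[-1:]
            # Unlike plain `cat`, terminate a non-empty file that lacks a
            # trailing newline so the next file starts on its own line.
            if last and last != b"\n":
                out.write(b"\n")
```

Calling `concat_vertically(["header.tsv", "data.tsv"], "combined.tsv")` should give a file whose first line is the header and whose remaining lines are the data rows, which you can then pass to the `csv.read_csv` call from the question.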
Upvotes: 3