Malyk
Malyk

Reputation: 193

How to properly convert an XLSX file to a TSV file in Python?

I'm converting large XLSX files (over 60 columns and 3000 rows) to TSV format. Some fields contain texts with 2-3 paragraphs (multi line breaks). I want support UTF-8 and I want each row to be displayed on a line in the resulting TSV

I did the following:

import xlrd
import csv

# open the tsv file(output) in unicode format
with open('outTSV.tsv', 'w', encoding='utf-8') as TSVfile:
    wr = csv.writer(TSVfile, delimiter="\t")

    # open the xlsx file 
    xlfile = xlrd.open_workbook('inXLSX.xlsx')
    # retrieve sheet
    sheet = xlfile.sheet_by_index(0)

    # write rows into TSVfile
    for row in range(sheet.nrows):
        wr.writerow(sheet.row_values(row))

I expected each row in the XLSX file to transform to a row in the TSV file. However, because some cells have paragraphs, it translates them as newlines. Therefore I get deformed TSV files.

XLSX file

XLSX file

Deformed TSV file Deformed TSV file

Upvotes: 1

Views: 4897

Answers (1)

Malyk
Malyk

Reputation: 193

I was able to solve the problem using pandas dataframe.

import pandas as pd

#Read excel file into a dataframe
data_xlsx = pd.read_excel('excel.xlsx', 'Sheet1', index_col=None)

#Replace all columns having spaces with underscores
data_xlsx.columns = [c.replace(' ', '_') for c in data_xlsx.columns]

#Replace all fields having line breaks with space
df = data_xlsx.replace('\n', ' ',regex=True)

#Write dataframe into csv
df.to_csv('fileTSV.csv', sep='\t', encoding='utf-8',  index=False, line_terminator='\r\n')

Upvotes: 3

Related Questions