Reputation: 193
I'm converting large XLSX files (over 60 columns and 3000 rows) to TSV format. Some fields contain texts with 2-3 paragraphs (multi line breaks). I want support UTF-8 and I want each row to be displayed on a line in the resulting TSV
I did the following:
import xlrd
import csv
# open the tsv file(output) in unicode format
with open('outTSV.tsv', 'w', encoding='utf-8') as TSVfile:
wr = csv.writer(TSVfile, delimiter="\t")
# open the xlsx file
xlfile = xlrd.open_workbook('inXLSX.xlsx')
# retrieve sheet
sheet = xlfile.sheet_by_index(0)
# write rows into TSVfile
for row in range(sheet.nrows):
wr.writerow(sheet.row_values(row))
I expected each row in the XLSX file to transform to a row in the TSV file. However, because some cells have paragraphs, it translates them as newlines. Therefore I get deformed TSV files.
XLSX file
Upvotes: 1
Views: 4897
Reputation: 193
I was able to solve the problem using pandas dataframe.
import pandas as pd
#Read excel file into a dataframe
data_xlsx = pd.read_excel('excel.xlsx', 'Sheet1', index_col=None)
#Replace all columns having spaces with underscores
data_xlsx.columns = [c.replace(' ', '_') for c in data_xlsx.columns]
#Replace all fields having line breaks with space
df = data_xlsx.replace('\n', ' ',regex=True)
#Write dataframe into csv
df.to_csv('fileTSV.csv', sep='\t', encoding='utf-8', index=False, line_terminator='\r\n')
Upvotes: 3