Hannibal
Hannibal

Reputation: 105

How to read space-separated values (csv) file made using shell-like syntax into DataFrame

I'm looking for a fast way to read a csv file into dataframe. the csv file is a space separated, however, columns names are placed between double quotation if it contains more than one word & contains multiple spaces. pd.read_csv with sep=" " does not work, because column names spaces.

I currently solved this by applying shlex.split on every line of the file, changing it to comma separated. However, it is taking too long: ~6 seconds for a file with 15K lines. Below is an example of my file ('template.csv') and code snippet on how it is solved through shlex.split.

Appreciate the help in advance!

a b c "d e  " "f g  " "h k  "
1 2 3 4 5 6
2 2 3 4 5 6
3 2 3 4 5 6
4 2 3 4 5 6
5 2 3 4 5 6
6 2 3 4 5 6

This is the code and desired dataframe output:

import pandas as pd
import shlex
data = []
df = pd.DataFrame()
for line in open(r'template.csv'):
    line = shlex.split(line)
    data.append(line)
df = pd.DataFrame(data[1:], columns=[colName for colName in data[0]])


   a  b  c d e f g h k
0  1  2  3   4   5   6
1  2  2  3   4   5   6
2  3  2  3   4   5   6
3  4  2  3   4   5   6
4  5  2  3   4   5   6
5  6  2  3   4   5   6 

Upvotes: 1

Views: 436

Answers (3)

Chathura Abeywickrama
Chathura Abeywickrama

Reputation: 122

You can use the csv module with a custom delimiter to efficiently read the space-separated file with quoted column names.

import pandas as pd
import csv

with open('template.csv', 'r') as file:
    # Read the header line
    header = next(file)
    # Use csv.reader with a space as the delimiter and a double quote as the quote character
    reader = csv.reader(file, delimiter=' ', quotechar='"')
    # Extract the data
    data = list(reader)

# Create a DataFrame with the header and data
df = pd.DataFrame(data, columns=[colName.strip('"') for colName in header.split()])
print(df)

Upvotes: 0

juanpa.arrivillaga
juanpa.arrivillaga

Reputation: 96172

In this case, it looks like you can just use the delim_whitespace=True, (which is equivalent to sep="\s+" although only one of those should be used), i.e. any whitespace is the separator, and by default, " is a quotechar, So:

df = pd.read_csv('template.csv', delim_whitespacebool=True)

Upvotes: 0

Corralien
Corralien

Reputation: 120479

You can set the separator:

# or suggested by @AndrejKesely, sep=r'\s+'
df = pd.read_csv('template.csv', sep=' ')
print(df)

# Output
   a  b  c  d e  f g  h k
0  1  2  3    4    5    6
1  2  2  3    4    5    6
2  3  2  3    4    5    6
3  4  2  3    4    5    6
4  5  2  3    4    5    6
5  6  2  3    4    5    6

Alternative with shlex but use it only on the first row:

with open('template.csv') as fp:
    headers = shlex.split(fp.readline())
    df = pd.read_csv(fp, header=None, names=headers, sep=' ')

Upvotes: 2

Related Questions