Windtalker
Windtalker

Reputation: 796

Import text file with uneven column number and complicated delimiter

Say I have text file like below:

apple  pear  banana,  peach orange grape

dog  cat  white horse

salmon

tiger  lion  eagle hawk  monkey

Looking for output like:

"apple", "pear", "banana", "peach orange grape"

"dog", "cat", "white horse"

"salmon"

"tiger", "lion", "eagle hawk", "monkey"

Two problems,

  1. each row I only want separate them by double space ' '
  2. column number of each row could be random, from 1 to 100

How can I load them into a pandas dataframe?

In fact I am wondering if it is possible to complete this without reading line by line, because I initial solution is:

  1. read each line, use REX

    re.split(r'\s{2,}', line)

to split by double space

  1. after split by double space, insert each row into DF

however, coz the column number is random, I can't simply generate a DF by that. Adding names=[] in pd.read_csv() will handle uneven columns, but this requires pre define column names and number.

Any suggestion?

Thank you!

Upvotes: 1

Views: 2546

Answers (2)

Jan
Jan

Reputation: 43169

To provide another example in addition to the one provided by @JD Long, you could use a regular expression plus a list comprehension:

import re, pandas as pd

string = """
apple  pear  banana  peach orange grape

dog  cat  white horse

salmon

tiger  lion  eagle hawk  monkey
"""

rx = re.compile(r'''[ ]{2,}''')

items = [(rx.split(line)) for line in string.split("\n") if line]

df = pd.DataFrame.from_records(items)
print(df)

... which yields:

        0     1            2                   3
0   apple  pear       banana  peach orange grape
1     dog   cat  white horse                None
2  salmon  None         None                None
3   tiger  lion   eagle hawk              monkey

Upvotes: 1

JD Long
JD Long

Reputation: 60746

read_table() is your friend here...

df = pd.read_table('./test.txt', sep="  ", header=None)
df.to_csv('outfile.csv')

any missing columns will be filled with NaN.

The text file test.txt above is the following:

test  2  1
t
t2  1
t3
t4  3  4

and df looks like this:

      0    1    2
0  test  2.0  1.0
1     t  NaN  NaN
2    t2  1.0  NaN
3    t3  NaN  NaN
4    t4  3.0  4.0

To get the quoting in the output you may need to use the quoting option from csv:

import csv
import pandas as pd

df = pd.read_table('./test.txt', sep="  ", header=None)
df.to_csv(quoting=csv.QUOTE_NONNUMERIC)

Upvotes: 2

Related Questions