user411103

Pandas: ignore new lines as separators in read_csv

I have an input string whose fields are delimited by $$$Field$$$. The string spans several lines. I need to return a list of all the items in the string, split on $$$Field$$$ only.

In the example below I should get ['Food', 'Fried\nChicken', 'Banana'] as output. However, read_csv seems to be treating the newlines as separators as well, so instead of a list I get a table. How can I make it ignore those newlines so that I just get a list back?

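import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas versions

temp = u"""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana"""
# sep is treated as a regular expression here, so the dollar signs are escaped
df = pd.read_csv(StringIO(temp), sep=r'\$\$\$Field\$\$\$', engine='python')
print(df)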

The only reason I am using pandas is that this string is actually a huge .csv file. I cannot load all of it into memory at once, but stream processing would be acceptable.

Upvotes: 2

Views: 2849

Answers (2)

victorlin

Reputation: 704

Since you are not looking to store your information in a tabular format, I don't think a DataFrame is necessary. Instead, read your input in chunks and yield the buffered text each time the delimiter '$$$Field$$$' is encountered.

Adapted from https://stackoverflow.com/a/16260159/4410590:

def myreadlines(f, newline):
    buf = ""
    while True:
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            yield buf
            break
        buf += chunk

Then call the function:

>>> for x in myreadlines(StringIO(temp), '$$$Field$$$'):
...     print(repr(x))
...
'Food'
'Fried\nChicken'
'Banana'
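
For a file on disk, the same idea can be written a little more compactly with str.split, keeping only the unfinished tail in the buffer. This is a minimal sketch rather than a drop-in replacement: split_stream and chunk_size are names made up for illustration, and the StringIO object stands in for a file opened in text mode. The deliberately tiny chunk_size forces the delimiter to straddle chunk boundaries, which is the case the buffer exists to handle.

```python
import io


def split_stream(f, delimiter, chunk_size=4096):
    """Lazily yield the pieces of f's contents separated by delimiter."""
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            # End of input: whatever is left is the final field.
            yield buf
            return
        buf += chunk
        # Everything before the last delimiter is complete; the tail may
        # still be a partial field (or a partial delimiter), so keep it.
        *complete, buf = buf.split(delimiter)
        yield from complete


text = "Food$$$Field$$$Fried\nChicken$$$Field$$$Banana"
# With a real file, pass open(path, newline="") instead of io.StringIO(text).
fields = list(split_stream(io.StringIO(text), "$$$Field$$$", chunk_size=5))
print(fields)  # ['Food', 'Fried\nChicken', 'Banana']
```

Memory use is bounded by the chunk size plus the longest single field, not by the size of the file, provided you iterate over the generator instead of collecting it with list().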

Upvotes: 1

parsethis

Reputation: 8078

This should do what you want; you just need to scale it to multiple lines:

df = pd.DataFrame("""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana""".split("$$$Field$$$")).T

print(df)

Depending on where (and how) your text is stored, you can do the splitting in a list comprehension:

df = pd.DataFrame([line.split("$$$Field$$$") for line in lines]).T
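
If the text does fit in memory, plain str.split already returns the list asked for in the question, and the DataFrame step is optional. A small sketch, assuming the data arrives as a list of multi-line records; DELIM, records and the second sample record are made up for illustration and are not from the question:

```python
DELIM = "$$$Field$$$"

# Each record may contain embedded newlines; only DELIM separates fields.
records = [
    "Food$$$Field$$$Fried\nChicken$$$Field$$$Banana",
    "Drink$$$Field$$$Iced\nTea$$$Field$$$Water",  # hypothetical extra record
]

# One list of fields per record; newlines inside a field are left untouched.
rows = [record.split(DELIM) for record in records]

for row in rows:
    print(row)
```

The resulting list of lists can be passed to pd.DataFrame(rows) if a table is wanted after all.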

Upvotes: 0
