Reputation:
I have an input string that uses the delimiter $$$Field$$$. The string spans several lines. I need to return a list of all the items in the string, separated by $$$Field$$$ only.
In the example below I should receive ['Food', 'Fried\nChicken', 'Banana'] as output. However, it seems that the newlines are being interpreted as separators as well, so instead of a list I am getting a table. How can I ignore those newlines so that I just get a list back?
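import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in recent pandas versions

temp = u"""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana"""
# Raw string so the escaped dollar signs reach the regex engine unchanged
df = pd.read_csv(StringIO(temp), sep=r'\$\$\$Field\$\$\$', engine='python')
print(df)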
The only reason I am using pandas is that this string is actually a huge .csv file. I cannot read it all into memory at once, but a streaming approach would be acceptable.
Upvotes: 2
Views: 2849
Reputation: 704
Since you are not looking to store your information in a tabular format, I don't think a DataFrame is necessary. Instead, read your string in chunks and yield the buffer every time you encounter '$$$Field$$$'.
Adapted from https://stackoverflow.com/a/16260159/4410590:
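def myreadlines(f, newline):
    buf = ""
    while True:
        # Emit every complete item currently held in the buffer
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # End of input: whatever is left is the last item
            yield buf
            break
        buf += chunk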
Then call the function:
>>> for x in myreadlines(StringIO(temp), '$$$Field$$$'):
...     print(repr(x))
...
'Food'
'Fried\nChicken'
'Banana'
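To apply the same idea to the actual file instead of an in-memory string, something along these lines might work. This is only a sketch: the name `iter_fields`, the `chunk_size` and `encoding` parameters, and the temporary demo file are my own assumptions for illustration, not part of the linked answer. It keeps only the unfinished tail of the text in memory, so a delimiter that straddles two chunks is still detected.

```python
import os
import tempfile


def iter_fields(path, delimiter, chunk_size=4096, encoding="utf-8"):
    """Yield delimiter-separated fields from a file, reading chunk_size characters at a time."""
    buf = ""
    # newline="" turns off newline translation, so embedded line breaks stay as they are
    with open(path, "r", encoding=encoding, newline="") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            parts = buf.split(delimiter)
            # The last piece may be incomplete (or end with half a delimiter): keep it buffered
            buf = parts.pop()
            for part in parts:
                yield part
    # Whatever remains after the last delimiter is the final field
    yield buf


# Demo with a throwaway file holding the sample data from the question
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write("Food$$$Field$$$Fried\nChicken$$$Field$$$Banana")
    demo_path = tmp.name

# A tiny chunk size forces delimiters to be split across reads
fields = list(iter_fields(demo_path, "$$$Field$$$", chunk_size=5))
os.remove(demo_path)
print(fields)  # ['Food', 'Fried\nChicken', 'Banana']
```

In real use you would iterate over `iter_fields(...)` directly rather than wrapping it in `list(...)`, so that only one field at a time is held in memory.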
Upvotes: 1
Reputation: 8078
This should do what you want; just scale it to multiple lines:
df = pd.DataFrame("""Food$$$Field$$$Fried
Chicken$$$Field$$$Banana""".split("$$$Field$$$")).T
print(df)
Depending on where (and how) your text is stored, you can do the splitting in a list comprehension:
df = pd.DataFrame([line.split("$$$Field$$$") for line in lines]).T
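If the text does fit in memory and only a flat list is needed, the DataFrame step can probably be skipped altogether. A minimal sketch using the sample string from the question (the variable names `text` and `items` are just for illustration):

```python
# Sample data from the question: the second item contains a newline
text = "Food$$$Field$$$Fried\nChicken$$$Field$$$Banana"

# str.split treats the delimiter literally, so no regex escaping is needed
# and newlines inside an item are left untouched
items = text.split("$$$Field$$$")
print(items)  # ['Food', 'Fried\nChicken', 'Banana']
```

This does not help with a file too large to load at once; it only covers the small-input case.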
Upvotes: 0