billyc59

Reputation: 91

pandas read csv ignore newline

I have a dataset (for the compbio people out there, it's a FASTA file) that is littered with newlines that don't act as delimiters of the data.

Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?

sample data:

>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT

Every entry is delimited by ">", and the sequence data within an entry is split across lines (nominally at 80 characters per line, although that limit is not respected everywhere).

Upvotes: 1

Views: 17342

Answers (5)

Kevin Danikowski

Reputation: 5186

This should work simply by setting skip_blank_lines=True.

skip_blank_lines : bool, default True

If True, skip over blank lines rather than interpreting as NaN values.

However, I found that I had to set this to False to work with my data that has new lines in it. Very strange, unless I'm misunderstanding.

Docs
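
To make the quoted parameter concrete, here is a minimal sketch of what skip_blank_lines does, assuming a small made-up two-column CSV (the column names and values are hypothetical). It only affects completely empty lines; it does not join a record that has been wrapped across several lines, which is the situation in the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text with one empty line between the two records.
text = "a,b\n1,2\n\n3,4\n"

# Default behaviour: the empty line is dropped.
skipped = pd.read_csv(StringIO(text), skip_blank_lines=True)

# With the option off, the empty line is kept as a row of NaN values.
kept = pd.read_csv(StringIO(text), skip_blank_lines=False)

print(len(skipped))  # number of rows when empty lines are skipped
print(len(kept))     # number of rows when the empty line is kept
```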

Upvotes: 1

hosseinRhamatpour

Reputation: 1

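After pd.read_csv(), you can call .str.split() on a string column (a DataFrame itself has no split() method).

 import pandas as pd

 data = pd.read_csv("test.csv")
 # Assumes the first column holds strings; splits each value on whitespace
 data.iloc[:, 0].str.split()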

Upvotes: 0

C8H10N4O2

Reputation: 18995

Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?

Yes, just look at the doc for pd.read_table()

You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first as a column delimiter with str.split(maxsplit=1), and ignore subsequent newlines with str.replace (until the next terminator):

#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174 
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------


#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,           # Your file goes here
    engine = 'c',           # C parser must be used to allow custom lineterminator, see doc
    lineterminator = '>',   # New lines begin with ">"
    skiprows =1,            # File begins with line terminator ">", so output skips first line 
    names = ['raw'],        # A single column which we will split into two
    comment = ';'           # comment character in FASTA format
)

# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))

# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))

print(df[['col0','col1']])

# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])

Returns:

                 col0                                               col1
0  ERR899297.10000174  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1  ERR123456.12345678  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...

Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT

Upvotes: 0

billyc59

Reputation: 91

There is no good way to do this. Biopython on its own seems to be sufficient, and is preferable to a hybrid solution that iterates through a Biopython object and inserts each record into a dataframe.
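
As a rough illustration of parsing the file outside pandas, here is a minimal sketch in plain Python. It is not Biopython's implementation (Bio.SeqIO.parse handles FASTA far more robustly); read_fasta is a hypothetical helper, and it assumes that headers start with ">" and that lines starting with ";" are comments.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip empty lines and ';' comment lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line: newlines are dropped by joining
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


# First record is the sample from the question; the second is made up.
example = StringIO(
    ">ERR899297.10000174\n"
    "TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC\n"
    "TATCAAGATCAGCCGATTCT\n"
    ">ERR123456.12345678\n"
    "ACGT\n"
)
records = list(read_fasta(example))
print(records[0][0])
```

If a dataframe is still wanted afterwards, the list of tuples can be passed to pd.DataFrame(records, columns=["id", "sequence"]), where the column names are placeholders.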

Upvotes: 0

Romain Jouin

Reputation: 4838

You need another character that tells pandas where a record (row) actually ends.

Here, for example, I create a file where the end of each record is marked by a pipe (|):

csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line

de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
    f.write(csv)  # write the whole string in one call

Then read it with the C engine and specify the pipe as the lineterminator:

import pandas as pd
pd.read_csv("test.csv",lineterminator="|", engine="c")

which gives me: [screenshot of the resulting DataFrame]

Upvotes: 2
