user2418898

Reputation: 51

Python Pandas read_csv to dataframe without separator

I'm new to the Pandas library.
I have shared code that works off of a dataframe.

Is there a way to read a gzip file line by line without any delimiter (using the full line, which can include commas and other characters) so that each line becomes a single row in the dataframe? It seems that you have to provide a delimiter, and when I provide "\n" it is able to read the file, but error_bad_lines complains with something like "Skipping line xxx: expected 22 fields but got 23", since each line is different.

I want it to treat each line as a single row in the dataframe. How can this be achieved? Any tips would be appreciated.

Upvotes: 4

Views: 6563

Answers (1)

Chris Doyle

Reputation: 11992

If you just want each line to be one row in a single column, then don't use read_csv. Just read the file line by line and build the data frame from it.

You could do this manually by creating an empty data frame with a single column header, then iterating over each line in the file and appending it to the data frame.

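# Explicitly iterate over each line in the file, appending it to the df.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
import pandas as pd
with open("query4.txt") as myfile:
    df = pd.DataFrame([], columns=['line'])
    for line in myfile:
        # Wrap the line in a one-row frame and concatenate it onto df.
        row = pd.DataFrame({'line': [line]})
        df = pd.concat([df, row], ignore_index=True)
    print(df)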

This reads only one line of the file at a time, so apart from the data frame itself nothing extra is held in memory. It is not efficient, though: the data frame is reassigned on every iteration and all existing rows are copied each time, so it gets slow on large files. It does work.

However, we can do this more cleanly, since the pandas DataFrame constructor can take an iterable as the input for data.

#create a list to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = [line for line in myfile]
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)

Here we read all the lines of the file into a list and then pass the list to pandas to create the data frame from. However, the downside is that if the file is very large we would essentially have two copies of it in memory: one in the list and one in the data frame.

Since we know pandas will accept an iterable for the data, we can use a generator expression to feed each line of the file to the data frame. The data frame is then built from lines read one at a time from the file, without us creating a list first. (pandas may still collect the generator into a list internally, so the memory saving is not guaranteed.)

#create a generator to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)
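
For the gzip file mentioned in the question, the same generator pattern should carry over by swapping open for gzip.open in text mode. A minimal sketch, assuming a UTF-8 encoded file; the sample file name and its contents are made up here so that the example is self-contained:

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample file, written here only so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "query4.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("a,b,c\nplain line\nx;y|z, with, commas\n")

# "rt" opens the archive in text mode, so iterating yields decoded str lines.
with gzip.open(path, "rt", encoding="utf-8") as myfile:
    gz_df = pd.DataFrame((line for line in myfile), columns=["line"])

print(gz_df)
```

Commas and other separator-like characters stay inside the single 'line' column because nothing here tries to split the line.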

In all three cases there is no need to use read_csv, since the data you want to load isn't a CSV. Each solution produces the same data frame output:

SOURCE DATA

this is some data
this is other data
data is fun
data is weird
this is the 5th line

DATA FRAME

                   line
0   this is some data\n
1  this is other data\n
2         data is fun\n
3       data is weird\n
4  this is the 5th line
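
Note that the rows in the output above keep their trailing newline. If that is not wanted, one option is to strip it after loading with the Series.str.rstrip accessor. A small sketch using a few of the sample rows (the variable name stripped is just for illustration):

```python
import pandas as pd

# A few rows as they come out of the file, trailing newlines included.
stripped = pd.DataFrame(
    ["this is some data\n", "data is fun\n", "this is the 5th line"],
    columns=["line"],
)

# Remove only trailing newline characters; other whitespace is left alone.
stripped["line"] = stripped["line"].str.rstrip("\n")

print(stripped)
```

The same effect could be had while reading, by calling line.rstrip("\n") inside the list comprehension or generator expression.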

Upvotes: 5
