Leukonoe

Reputation: 649

How can I insert data from a CSV file into a dataframe using pandas.read_csv?

I have a csv file like:

"B/G/213","B/C/208","WW_cis",,
"B/U/215","B/A/206","WW_cis",,
"B/C/214","B/G/207","WW_cis",,
"B/G/217","B/C/204","WW_cis",,
"B/A/216","B/U/205","WW_cis",,
"B/C/219","B/G/202","WW_cis",,
"B/U/218","B/A/203","WW_cis",,
"B/G/201","B/C/220","WW_cis",,
"B/A/203","B/U/218","WW_cis",,

and I want to read it into something like an array or dataframe, so that I can compare elements from one column with selected elements from other columns. At first I read it straight into an array using numpy.genfromtxt, but I got strings like '"B/A/203"' with extra quotes " everywhere. I read somewhere that pandas can strip the extra " from strings, so I tried:

class StructureReader(object):
    def __init__(self, filename):
        self.filename=filename
    def read(self):
        self.data=pd.read_csv(StringIO(str("RNA/"+self.filename)), header=None, sep = ",")
        self.data

but I get something like so:

<class 'pandas.core.frame.DataFrame'> 0 0 RNA/4v6p.csv

How can I get my CSV file into some kind of a data type that would allow me to search through columns and rows?

Upvotes: 3

Views: 1876

Answers (3)

tmthydvnprt

Reputation: 10748

Data Insert

You are putting the filename string itself into your DataFrame, i.e. RNA/4v6p.csv is your data at row 0, col 0. You need to read the file and store its contents instead. This can be done by removing StringIO(str(...)) in your class and passing the path directly to pd.read_csv:

class StructureReader(object):
    def __init__(self, filename):
        self.filename = filename
    def read(self):
        self.data = pd.read_csv("RNA/" + self.filename, header=None, sep=",")
        return self.data
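
To see why the original call produced a one-cell frame, here is a minimal sketch that uses only in-memory strings, so no file on disk is assumed. The names `wrong`, `right` and `csv_text` are just illustrative:

```python
import io

import pandas as pd

# StringIO turns the *text of the path* into a file-like object,
# so pandas parses the path string itself as CSV content.
wrong = pd.read_csv(io.StringIO("RNA/4v6p.csv"), header=None, sep=",")
print(wrong.shape)       # a single row and a single column
print(wrong.iloc[0, 0])  # the path string, not the file contents

# StringIO is only appropriate when the string already holds CSV text.
csv_text = '"B/G/213","B/C/208","WW_cis",,\n"B/U/215","B/A/206","WW_cis",,\n'
right = pd.read_csv(io.StringIO(csv_text), header=None, sep=",")
print(right.iloc[0, 0])  # quotes are stripped by the default quotechar
```

The two trailing commas on each line show up as two extra, empty columns in `right`.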

Code structure critique

I would also recommend removing the parent directory from being hardcoded by

  1. Always passing in a full file path

    class StructureReader(object):
        def __init__(self, filepath):
            self.filepath = filepath
        def read(self):
            self.data = pd.read_csv(self.filepath, header=None, sep=",")
            return self.data
    
  2. Making the directory an __init__() argument

    class StructureReader(object):
        def __init__(self, directory, filename):
            self.directory = directory
            self.filename = filename
        def read(self):
            self.data = pd.read_csv(self.directory + "/" + self.filename, header=None, sep=",")
            # or, with import os: self.data = pd.read_csv(os.path.join(self.directory, self.filename), header=None, sep=",")
            return self.data
    
  3. Making the directory a constant attribute

    class StructureReader(object):
        def __init__(self, filename):
            self.directory = "RNA"
            self.filename = filename
        def read(self):
            self.data = pd.read_csv(self.directory + "/" + self.filename, header=None, sep=",")
            # or, with import os: self.data = pd.read_csv(os.path.join(self.directory, self.filename), header=None, sep=",")
            return self.data
    

This has nothing to do with reading your data; it is just best-practice commentary on structuring your code (just my $0.02).
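
Put together, the second option might look roughly like the sketch below. It assumes os.path.join for building the path and has read() return the frame; the temporary directory and the sample.csv name exist only so the example runs on its own:

```python
import os
import tempfile

import pandas as pd


class StructureReader(object):
    def __init__(self, directory, filename):
        # Build the path once, without hardcoding a separator.
        self.filepath = os.path.join(directory, filename)
        self.data = None

    def read(self):
        # Pass the path itself to read_csv; no StringIO involved.
        self.data = pd.read_csv(self.filepath, header=None, sep=",")
        return self.data


# Demo with a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, "sample.csv"), "w") as fh:
        fh.write('"B/G/213","B/C/208","WW_cis",,\n')
    reader = StructureReader(tmpdir, "sample.csv")
    frame = reader.read()

print(frame)
```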

Upvotes: 3

MaxU - stand with Ukraine

Reputation: 210842

I think you've mixed up StringIO with the file name. You either have your data as a string and then you use StringIO or you simply specify a file name (not using StringIO):

In [189]: data="""\
   .....: "B/G/213","B/C/208","WW_cis",,
   .....: "B/U/215","B/A/206","WW_cis",,
   .....: "B/C/214","B/G/207","WW_cis",,
   .....: "B/G/217","B/C/204","WW_cis",,
   .....: "B/A/216","B/U/205","WW_cis",,
   .....: "B/C/219","B/G/202","WW_cis",,
   .....: "B/U/218","B/A/203","WW_cis",,
   .....: "B/G/201","B/C/220","WW_cis",,
   .....: "B/A/203","B/U/218","WW_cis",,
   .....: """

In [190]: df = pd.read_csv(io.StringIO(data), sep=',', header=None, usecols=[0,1,2])

In [191]: df
Out[191]:
         0        1       2
0  B/G/213  B/C/208  WW_cis
1  B/U/215  B/A/206  WW_cis
2  B/C/214  B/G/207  WW_cis
3  B/G/217  B/C/204  WW_cis
4  B/A/216  B/U/205  WW_cis
5  B/C/219  B/G/202  WW_cis
6  B/U/218  B/A/203  WW_cis
7  B/G/201  B/C/220  WW_cis
8  B/A/203  B/U/218  WW_cis

PS: you can choose which columns to parse (i.e. which ones end up in your data frame) with the usecols parameter.

Or, using a file name:

import os

df = pd.read_csv(os.path.join('RNA', self.filename), sep=',', header=None, usecols=[0,1,2])
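
If integer column labels are awkward to work with, the columns kept by usecols can be renamed after reading. This is only a sketch: the names residue_a, residue_b and interaction are my own guesses at what the columns mean, not something taken from your file.

```python
import io

import pandas as pd

data = '''"B/G/213","B/C/208","WW_cis",,
"B/U/215","B/A/206","WW_cis",,
"B/C/214","B/G/207","WW_cis",,
'''

# Keep only the first three fields; the two empty trailing ones are skipped.
df = pd.read_csv(io.StringIO(data), sep=",", header=None, usecols=[0, 1, 2])

# Hypothetical, descriptive labels in place of 0, 1, 2.
df.columns = ["residue_a", "residue_b", "interaction"]

print(df)
```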

Upvotes: 1

Fabio Lamanna

Reputation: 21552

IIUC, you can just read it with:

df = pd.read_csv('yourfile.csv', header=None)

which for me returns:

         0        1       2   3   4
0  B/G/213  B/C/208  WW_cis NaN NaN
1  B/U/215  B/A/206  WW_cis NaN NaN
2  B/C/214  B/G/207  WW_cis NaN NaN
3  B/G/217  B/C/204  WW_cis NaN NaN
4  B/A/216  B/U/205  WW_cis NaN NaN
5  B/C/219  B/G/202  WW_cis NaN NaN
6  B/U/218  B/A/203  WW_cis NaN NaN
7  B/G/201  B/C/220  WW_cis NaN NaN
8  B/A/203  B/U/218  WW_cis NaN NaN

you can then select only the columns you want with:

df = df[[0,1,2]]

and operate as usual with dataframes.
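
For the column-to-column comparison mentioned in the question, a small sketch of what "as usual" could mean, assuming the goal is to find entries of column 0 that also occur in column 1. The three rows are copied from the sample data; `in_both` and `partner` are illustrative names:

```python
import io

import pandas as pd

csv_text = '''"B/G/213","B/C/208","WW_cis",,
"B/U/218","B/A/203","WW_cis",,
"B/A/203","B/U/218","WW_cis",,
'''

df = pd.read_csv(io.StringIO(csv_text), header=None)
df = df[[0, 1, 2]]

# Rows whose column-0 value also appears somewhere in column 1.
in_both = df[df[0].isin(df[1])]

# Column-1 partner(s) for one selected element of column 0.
partner = df.loc[df[0] == "B/U/218", 1].tolist()

print(in_both)
print(partner)
```

Boolean masks like these can be combined with & and | if several conditions are needed.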

Upvotes: 2
