Reputation: 649
I have a csv file like:
"B/G/213","B/C/208","WW_cis",,
"B/U/215","B/A/206","WW_cis",,
"B/C/214","B/G/207","WW_cis",,
"B/G/217","B/C/204","WW_cis",,
"B/A/216","B/U/205","WW_cis",,
"B/C/219","B/G/202","WW_cis",,
"B/U/218","B/A/203","WW_cis",,
"B/G/201","B/C/220","WW_cis",,
"B/A/203","B/U/218","WW_cis",,
and I want to read it into something like an array or DataFrame, so that I can compare elements from one column with selected elements from other columns. At first I read it straight into an array using numpy.genfromtxt, but I got strings like '"B/A/203"' with extra " quotes everywhere. I read somewhere that pandas can strip the extra " from strings, so I tried:
class StructureReader(object):
    def __init__(self, filename):
        self.filename=filename
    def read(self):
        self.data=pd.read_csv(StringIO(str("RNA/"+self.filename)), header=None, sep = ",")
        self.data
but I get something like so:
<class 'pandas.core.frame.DataFrame'> 0
0 RNA/4v6p.csv
How can I get my CSV file into some kind of a data type that would allow me to search through columns and rows?
Upvotes: 3
Views: 1876
Reputation: 10748
You are putting the filename string itself into your DataFrame, i.e. RNA/4v6p.csv is your data at row 0, column 0. You need to read in the file and store the data. This can be done by removing StringIO(str(...)) in your class:
class StructureReader(object):
    def __init__(self, filename):
        self.filename = filename
    def read(self):
        # Pass the path directly; read_csv opens the file itself.
        self.data = pd.read_csv("RNA/" + self.filename, header=None, sep=",")
        return self.data
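To see why the StringIO version produces a one-cell frame, here is a minimal sketch comparing the two cases. The path RNA/4v6p.csv comes from the question's output; the inline sample row is only a stand-in for the file's content, used here so the snippet runs without the real file.

```python
import io
import pandas as pd

# StringIO wraps the *text* it is given, so read_csv parses the
# path string itself as if it were the CSV content.
wrong = pd.read_csv(io.StringIO("RNA/4v6p.csv"), header=None)
print(wrong.shape)       # (1, 1)
print(wrong.iloc[0, 0])  # RNA/4v6p.csv

# Real CSV text (or a real file path) gives the expected columns.
# The surrounding double quotes are stripped by read_csv's default quoting.
sample = '"B/G/213","B/C/208","WW_cis",,\n'
right = pd.read_csv(io.StringIO(sample), header=None)
print(right.shape)       # (1, 5)
print(right.iloc[0, 0])  # B/G/213
```

The two trailing commas in each row are why five columns appear; the last two are empty.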
I would also recommend not hardcoding the parent directory. Some options:
Always passing in a full file path:
class StructureReader(object):
    def __init__(self, filepath):
        self.filepath = filepath
    def read(self):
        self.data = pd.read_csv(self.filepath, header=None, sep=",")
        return self.data
Making the directory an __init__() argument:
class StructureReader(object):
    def __init__(self, directory, filename):
        self.directory = directory
        self.filename = filename
    def read(self):
        self.data = pd.read_csv(self.directory + "/" + self.filename, header=None, sep=",")
        # or, with `import os`:
        # self.data = pd.read_csv(os.path.join(self.directory, self.filename), header=None, sep=",")
        return self.data
Making the directory a constant attribute:
class StructureReader(object):
    def __init__(self, filename):
        self.directory = "RNA"
        self.filename = filename
    def read(self):
        self.data = pd.read_csv(self.directory + "/" + self.filename, header=None, sep=",")
        # or, with `import os`:
        # self.data = pd.read_csv(os.path.join(self.directory, self.filename), header=None, sep=",")
        return self.data
This has nothing to do with reading your data; it is just best-practice commentary on structuring your code (just my $0.02).
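For completeness, a runnable sketch of the directory-as-argument variant using os.path.join. The temporary directory and the file name demo.csv are hypothetical, created only so the example can run end to end; the two rows are copied from the question.

```python
import os
import tempfile
import pandas as pd

class StructureReader(object):
    def __init__(self, directory, filename):
        self.directory = directory
        self.filename = filename

    def read(self):
        # os.path.join uses the correct path separator for the platform.
        path = os.path.join(self.directory, self.filename)
        self.data = pd.read_csv(path, header=None, sep=",")
        return self.data

# Hypothetical demo file: two rows from the question in a temp directory.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "demo.csv"), "w") as fh:
    fh.write('"B/G/213","B/C/208","WW_cis",,\n'
             '"B/U/215","B/A/206","WW_cis",,\n')

reader = StructureReader(tmp_dir, "demo.csv")
df = reader.read()
print(df.shape)  # (2, 5)
```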
Upvotes: 3
Reputation: 210842
I think you've mixed up StringIO with the file name. You either have your data as a string and then you use StringIO or you simply specify a file name (not using StringIO):
In [189]: data="""\
.....: "B/G/213","B/C/208","WW_cis",,
.....: "B/U/215","B/A/206","WW_cis",,
.....: "B/C/214","B/G/207","WW_cis",,
.....: "B/G/217","B/C/204","WW_cis",,
.....: "B/A/216","B/U/205","WW_cis",,
.....: "B/C/219","B/G/202","WW_cis",,
.....: "B/U/218","B/A/203","WW_cis",,
.....: "B/G/201","B/C/220","WW_cis",,
.....: "B/A/203","B/U/218","WW_cis",,
.....: """
In [190]: df = pd.read_csv(io.StringIO(data), sep=',', header=None, usecols=[0,1,2])
In [191]: df
Out[191]:
         0        1       2
0  B/G/213  B/C/208  WW_cis
1  B/U/215  B/A/206  WW_cis
2  B/C/214  B/G/207  WW_cis
3  B/G/217  B/C/204  WW_cis
4  B/A/216  B/U/205  WW_cis
5  B/C/219  B/G/202  WW_cis
6  B/U/218  B/A/203  WW_cis
7  B/G/201  B/C/220  WW_cis
8  B/A/203  B/U/218  WW_cis
PS: you can choose which columns to parse (i.e. which to keep in your data frame) with the usecols parameter.
Or, using a file name:
import os
df = pd.read_csv(os.path.join('RNA', self.filename), sep=',', header=None, usecols=[0,1,2])
Upvotes: 1
Reputation: 21552
IIUC, you can just read it with:
df = pd.read_csv('yourfile.csv', header=None)
which for me returns:
         0        1       2   3   4
0  B/G/213  B/C/208  WW_cis NaN NaN
1  B/U/215  B/A/206  WW_cis NaN NaN
2  B/C/214  B/G/207  WW_cis NaN NaN
3  B/G/217  B/C/204  WW_cis NaN NaN
4  B/A/216  B/U/205  WW_cis NaN NaN
5  B/C/219  B/G/202  WW_cis NaN NaN
6  B/U/218  B/A/203  WW_cis NaN NaN
7  B/G/201  B/C/220  WW_cis NaN NaN
8  B/A/203  B/U/218  WW_cis NaN NaN
you can then select only the columns you want with:
df = df[[0,1,2]]
and operate as usual with dataframes.
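As an illustration of the kind of column comparison the question asks about, here is a small sketch on three rows from the sample data. The "mirrored pair" check is an assumed use case chosen for illustration, not something the question specifies.

```python
import io
import pandas as pd

data = '''"B/U/218","B/A/203","WW_cis",,
"B/G/201","B/C/220","WW_cis",,
"B/A/203","B/U/218","WW_cis",,
'''
df = pd.read_csv(io.StringIO(data), header=None, usecols=[0, 1, 2])

# Boolean mask: which values in column 0 also occur anywhere in column 1?
in_other = df[0].isin(df[1])
print(df[in_other])

# Rows whose (column 0, column 1) pair also appears reversed in the data.
pairs = set(zip(df[0], df[1]))
mirrored = df[[(b, a) in pairs for a, b in zip(df[0], df[1])]]
print(mirrored.index.tolist())  # [0, 2]
```

With header=None the column labels are the integers 0, 1, 2, so df[0] selects the first column.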
Upvotes: 2