tblznbits

Reputation: 6778

How to use square brackets as a quote character in Pandas.read_csv

Let's say I have a text file that looks like this:

Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]

What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:

import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I've tried setting quotechar to "[", but that just eats up the lines until the next open bracket, and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
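For reference, a minimal sketch of the two failing attempts described above, using the same hypothetical path (pandas requires quotechar to be a single one-character string):

import pandas as pd

# quotechar='[' treats '[' as both the opening and closing quote,
# so each quoted field runs on until the next '[' in the stream
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str, quotechar="[")

# quotechar='[]' is rejected outright, because quotechar
# must be exactly one character long
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str, quotechar="[]")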

Update

There were three primary solutions offered: 1) give the data frame a long range of column names so that all the data is read in, then post-process it; 2) find values in square brackets and put quotes around them; or 3) replace the first n commas with semicolons.

Overall, I don't think option 3 is a viable solution in general (although it is just fine for my data), because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds compared to 3.02 seconds for solution 2. The tests were run on a text file containing 18 columns and more than 208,000 rows.
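To illustrate concern a), consider a hypothetical row with a quoted value that contains a comma; blindly replacing the first 3 commas with semicolons splits inside the quoted field and leaves a real separator untouched:

# hypothetical row with a quoted value containing a comma
line = '1,"Jan 1, 2016",13:41,[45.2344:-78.25453]'

# the second replaced comma is the one inside the quotes,
# so the third column separator (after 13:41) stays a comma
print(line.replace(',', ';', 3))
# 1;"Jan 1; 2016";13:41,[45.2344:-78.25453]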

Upvotes: 8

Views: 5226

Answers (3)

Igor Raush

Reputation: 15240

I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a pretty simple preprocessing step:

import pandas as pd
import io
import re

# regular expression to capture bracketed tokens, including the brackets
location_regex = re.compile(r'(\[[^\[\]]+\])')

with open('path/to/file.txt', 'r') as fi:
    # replace brackets with double quotes, pipe into a file-like object
    fo = io.StringIO()
    fo.writelines(re.sub(location_regex, r'"\1"', line) for line in fi)

    # rewind file to the beginning
    fo.seek(0)

# read transformed CSV into a data frame, combining Date and Time
df = pd.read_csv(fo, parse_dates=[['Date', 'Time']])
print(df)

This gives you a result like

            Date_Time  Item                             Location
0 2016-01-01 13:41:00     1                  [45.2344:-78.25453]
1 2016-01-03 19:11:00     2  [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00     3                  [51.2344:-86.24432]

Edit: If memory is not an issue, you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.

# regular expression to capture bracketed tokens, including the brackets
location_regex = re.compile(r'(\[[^\[\]]+\])')

with open('path/to/file.csv', 'r') as fi:
    data = re.sub(location_regex, r'"\1"', fi.read())

df = pd.read_csv(io.StringIO(data))

If you know ahead of time that the only brackets in the document are those surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary):

with open('/path/to/file.csv', 'r') as fi:
    data = fi.read().replace('[', '"[').replace(']', ']"')

df = pd.read_csv(io.StringIO(data))

Below are the timing results I got with a 200k-row by 3-column dataset. Each time is averaged over 10 trials.

  • data frame post-processing (jezrael's solution): 2.19s
  • line by line regex: 1.36s
  • bulk regex: 0.39s
  • bulk string replace: 0.14s
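
For reference, a minimal sketch of how such a benchmark might be reproduced; the in-memory dataset, the row count, and the two variants timed here are illustrative assumptions rather than the exact setup behind the numbers above:

import io
import re
import timeit

import pandas as pd

# build a test dataset in memory by repeating one sample row (assumption)
header = 'Item,Date,Time,Location\n'
row = '2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]\n'
raw = header + row * 200000

location_regex = re.compile(r'(\[[^\[\]]+\])')

def bulk_regex():
    data = re.sub(location_regex, r'"\1"', raw)
    return pd.read_csv(io.StringIO(data))

def bulk_replace():
    data = raw.replace('[', '"[').replace(']', ']"')
    return pd.read_csv(io.StringIO(data))

# average each variant over 10 trials, as in the figures above
for fn in (bulk_regex, bulk_replace):
    print(fn.__name__, timeit.timeit(fn, number=10) / 10)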

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210852

We can use a simple trick: quote the balanced square brackets with double quotes:

import io
import re
import pandas as pd


data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""

print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(io.StringIO(data))
print('{0:-^70}'.format('data frame'))

pd.set_option('display.expand_frame_repr', False)
print(df)

Output:

----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
   Item        Date   Time                                           Location        junk
0     1  01/01/2016  13:41                                [45.2344:-78.25453]  [aaaa,bbb]
1     2  01/03/2016  19:11                 [43.3423:-79.23423,41.2342:-81242]   [0,1,2,3]
2     3  01/10/2016  01:27                                [51.2344:-86.24432]     [12,13]
3     4  01/30/2016  05:55  [51.2344:-86.24432,41.2342:-81242,55.5555:-81242]  [45,55,65]

UPDATE: if you are sure that all square brackets are balanced, we don't have to use regular expressions at all:

import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)

Upvotes: 2

jezrael

Reputation: 862791

I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=';' in read_csv:

import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()
    fo = io.StringIO()
    fo.writelines(u"" + line.replace(',',';', 3) for line in lines)
    fo.seek(0)    

df = pd.read_csv(fo, sep=';')
print df
   Item        Date   Time                            Location
0     1  01/01/2016  13:41                 [45.2344:-78.25453]
1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2     3  01/10/2016  01:27                 [51.2344:-86.24432]

Or you can try this more complicated approach. The main problem is that the separator , between values in the lists is the same as the separator between the other column values.

So you need post-processing:

import pandas as pd
import io

temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
# after testing, replace io.StringIO(temp) with the filename
# names=range(10) is an estimated maximum number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print(df)
      0           1      2                    3               4  \
0  Item        Date   Time             Location             NaN   
1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN   
2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242   
3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN   

                 5   6   7   8   9  
0              NaN NaN NaN NaN NaN  
1              NaN NaN NaN NaN NaN  
2  41.2342:-81242] NaN NaN NaN NaN  
3              NaN NaN NaN NaN NaN  
# remove columns that contain only NaN
df = df.dropna(how='all', axis=1)
# use the first row as the column names
df.columns = df.iloc[0, :]
# remove the first row
df = df[1:]
# remove the columns' name attribute
df.columns.name = None

# get the position of the Location column
print(df.columns.get_loc('Location'))
3
# df1 holds the Location values spread across columns
df1 = df.iloc[:, df.columns.get_loc('Location'):]
print(df1)
              Location             NaN              NaN
1  [45.2344:-78.25453]             NaN              NaN
2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]             NaN              NaN

# combine the values into one column
df['Location'] = df1.apply(lambda x: ', '.join([e for e in x if isinstance(e, str)]), axis=1)

# subset of desired columns
print(df[['Item', 'Date', 'Time', 'Location']])
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]

Upvotes: 1
