Adam_G

Reputation: 7879

Breaking Up CSV By Regex in Python

I have a file formatted as below:

S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1

I'd like to break it up at each "S_A_" header and compute the correlation coefficient of the data below it. So far, I have:

import re
import pandas as pd

test = pd.read_csv("predict.csv",sep=('S\d+A\d+'))

print test

but that only gives me:

  Unnamed: 0     ,
0  0.01,0.01  None
1  0.02,0.02  None
2  0.03,0.03  None
3        NaN     ,
4  0.05,0.06  None
5  0.07,0.08  None
6        NaN     ,
7  1000,0.04  None
8  2000,0.08  None
9   3000,0.1  None

[10 rows x 2 columns]

Ideally, I'd like to keep the regex delimiter and get something like:

S1A23: 1.0
S25A123: 0.86
S3034A1: 0.75

Is this possible?

EDIT
When running this on large files (~250k lines), I receive the following error. It is not a problem with the data: when I break the ~250k lines into smaller chunks, every piece runs fine.

Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
    print(result)
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
    return self.__bytes__()
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
    result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

My exact code is:

import numpy as np
import pandas as pd
import csv
pd.options.display.max_rows = None
fileName = 'keyStrokeFourgram/TESTING1'

df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match('^S\d+_A\d+')
df['S/A'] = (df['pause']
              .where(mask, np.nan)
              .fillna(method='ffill'))
df = df.loc[~mask]

result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)

Upvotes: 2

Views: 247

Answers (1)

unutbu

Reputation: 879661

The sep parameter specifies the pattern that separates values on the same line. It cannot be used to split the rows of a csv into separate DataFrames.
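To see what a regex sep actually does, here is a minimal sketch (the data and delimiters are made up for illustration; a regex sep requires the python parsing engine):

```python
import io
import pandas as pd

# Each line holds three values, separated by either ';' or '|'.
csv_text = "1;2|3\n4;5|6\n"

# sep may be a regex: it splits fields *within* each line, never across rows.
df = pd.read_csv(io.StringIO(csv_text), sep=r"[;|]", header=None, engine="python")
print(df.shape)  # (2, 3) -- two rows, three columns
```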

Edit: There is a way to read the csv into a DataFrame using read_csv. This is preferable to the Python loop in my original answer, since read_csv should be faster, which can matter for large csv files.

import numpy as np
import pandas as pd
df = pd.read_csv("data", names=['x', 'y'])
mask = df['x'].str.match(r'^S\d+A\d+')        # 1
df['type'] = (df['x']
              .where(mask, np.nan)            # 2
              .fillna(method='ffill'))        # 3
df = df.loc[~mask]                            # 4

result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

yields

type
S1A23      1.000000
S25A123    1.000000
S3034A1    0.981981
dtype: float64

  1. The mask is True on the rows that have a "type" in the 'x' column.

    In [139]: mask
    Out[139]: 
    0      True
    1     False
    2     False
    3     False
    4      True
    5     False
    6     False
    7      True
    8     False
    9     False
    10    False
    Name: x, dtype: bool
    
  2. df['x'].where(mask, np.nan) returns a Series, equal to df['x'] where the mask is True, and np.nan otherwise.
  3. Forward-fill the NaNs with the "type" labels

    In [141]: df['x'].where(mask, np.nan).fillna(method='ffill')
    Out[141]: 
    0       S1A23
    1       S1A23
    2       S1A23
    3       S1A23
    4     S25A123
    5     S25A123
    6     S25A123
    7     S3034A1
    8     S3034A1
    9     S3034A1
    10    S3034A1
    Name: x, dtype: object
    
  4. Select only those rows where the mask is False
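The mask/where/ffill pattern generalizes beyond this file; a self-contained toy sketch (the `H1`/`H2` header names are invented for illustration):

```python
import pandas as pd

s = pd.Series(["H1", "a", "b", "H2", "c"])
mask = s.str.match(r"^H\d+$")   # True on header rows

# Headers stay, data rows become NaN, then forward-fill the labels down.
labels = s.where(mask).ffill()

# Keep only the data rows, now tagged with their header.
out = labels[~mask].tolist()
print(out)  # ['H1', 'H1', 'H2']
```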

Original answer:

Unfortunately, I don't see a way to read your data file directly into an appropriate DataFrame. You'll need to do some massaging of the rows to get it into the right form, using a Python loop.

import pandas as pd
import csv

def to_columns(f):
    val = None
    for row in csv.reader(f):
        if len(row) == 1:
            val = row[0]
        else:
            yield [val] + row

with open('data') as f:
    df = pd.DataFrame.from_records(to_columns(f), columns=['type', 'x', 'y'])

print(df)
result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

Upvotes: 2
