PTTHomps
PTTHomps

Reputation: 1499

How do I create a pandas dataframe in python from a csv with additional delimiters?

I have a large csv (on the order of 400k rows) which I wish to turn into a dataframe in python. The original file has two columns: a text column, followed by an int (or NAN) column.

Example:

...
P-X1-6030-07-A01    368963
P-X1-6030-08-A01    368964
P-X1-6030-09-A01    368965
P-A-1-1011-14-G-01  368967
P-A-1-1014-01-G-05  368968
P-A-1-1017-02-D-01  368969
...

I wish to additionally split the text column into a series of columns following the pattern of the last three lines of the example text (P A 1 1017 02 D 01 368969, for example)

Noting that the text field can have varying formatting (P-X1 vs P-X-1), how might this best be accomplished?

Upvotes: 3

Views: 777

Answers (1)

Aaron Hall
Aaron Hall

Reputation: 395943

First Attempt

The spec for read_csv indicates that it takes a regular expression, but this appears to be incorrect. After inspecting the source, it appears to merely takes a series of characters that it may use to populate a set of chars followed by +, so the below arguments to sep would be used to create a regex like

`[- ]+`. 

Import necessary libs to recreate:

import pandas as pd
import StringIO

You can use aset of characters as delimiters, parsing the mismatched rows isn't possible with pd.read_csv, but if you want to parse them separately:

pd.read_csv(StringIO.StringIO('''P-X1-6030-07-A01    368963
P-X1-6030-08-A01    368964
P-X1-6030-09-A01    368965'''), sep=r'- ') # sep arg becomes regex, i.e. `[- ]+`

and

pd.read_csv(StringIO.StringIO('''P-A-1-1011-14-G-01  368967
P-A-1-1014-01-G-05  368968
P-A-1-1017-02-D-01  368969'''), sep=r'- ')

But read_csv is apparently unable to use real regular expressions for the separator.


Final Solution

That means we'll need a custom solution:

import re
import StringIO
import pandas as pd

txt = '''P-X1-6030-07-A01    368963
P-X1-6030-08-A01    368964
P-X1-6030-09-A01    368965
P-A-1-1011-14-G-01  368967
P-A-1-1014-01-G-05  368968
P-A-1-1017-02-D-01  368969'''

fileobj = StringIO.StringIO(txt)

def df_from_file(fileobj):
    '''
    takes a file object, returns DataFrame with columns grouped by 
    contiguous runs of either letters or numbers (but not both together)
    '''
    # unfortunately, we must materialize the data before putting it in the DataFrame
    gen_records = [re.findall(r'(\d+|[A-Z]+)', line) for line in fileobj]
    return pd.DataFrame.from_records(gen_records)

df = df_from_file(fileobj)

and now df returns:

   0  1  2     3   4  5   6       7
0  P  X  1  6030  07  A  01  368963
1  P  X  1  6030  08  A  01  368964
2  P  X  1  6030  09  A  01  368965
3  P  A  1  1011  14  G  01  368967
4  P  A  1  1014  01  G  05  368968
5  P  A  1  1017  02  D  01  368969

Upvotes: 4

Related Questions