Reputation: 1499
I have a large csv (on the order of 400k rows) which I wish to turn into a dataframe in python. The original file has two columns: a text column, followed by an int (or NAN) column.
Example:
...
P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965
P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969
...
I wish to additionally split the text column into a series of columns following the pattern of the last three lines of the example text (P A 1 1017 02 D 01 368969
, for example)
Noting that the text field can have varying formatting (P-X1
vs P-X-1
), how might this best be accomplished?
Upvotes: 3
Views: 777
Reputation: 395943
First Attempt
The spec for read_csv
indicates that it takes a regular expression, but this appears to be incorrect. After inspecting the source, it appears to merely takes a series of characters that it may use to populate a set of chars followed by +
, so the below arguments to sep would be used to create a regex like
`[- ]+`.
Import necessary libs to recreate:
import pandas as pd
import StringIO
You can use aset of characters as delimiters, parsing the mismatched rows isn't possible with pd.read_csv
, but if you want to parse them separately:
pd.read_csv(StringIO.StringIO('''P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965'''), sep=r'- ') # sep arg becomes regex, i.e. `[- ]+`
and
pd.read_csv(StringIO.StringIO('''P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969'''), sep=r'- ')
But read_csv is apparently unable to use real regular expressions for the separator.
Final Solution
That means we'll need a custom solution:
import re
import StringIO
import pandas as pd
txt = '''P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965
P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969'''
fileobj = StringIO.StringIO(txt)
def df_from_file(fileobj):
'''
takes a file object, returns DataFrame with columns grouped by
contiguous runs of either letters or numbers (but not both together)
'''
# unfortunately, we must materialize the data before putting it in the DataFrame
gen_records = [re.findall(r'(\d+|[A-Z]+)', line) for line in fileobj]
return pd.DataFrame.from_records(gen_records)
df = df_from_file(fileobj)
and now df returns:
0 1 2 3 4 5 6 7
0 P X 1 6030 07 A 01 368963
1 P X 1 6030 08 A 01 368964
2 P X 1 6030 09 A 01 368965
3 P A 1 1011 14 G 01 368967
4 P A 1 1014 01 G 05 368968
5 P A 1 1017 02 D 01 368969
Upvotes: 4