Wesley Bowman

Reputation: 1396

Read CSV with two different delimiters

I have an example file of the type of file I am working with. Sometimes the delimiter is | and other times it's *.

The closest I have come to successfully reading this is:

import pandas as pd

data = pd.read_csv('filename',
                   skiprows=244, sep=r'\|',
                   header=None,
                   names=['A', 'B', 'C', 'D', 'E'])
new = data['E']
df = new.str.split(expand=True)                # one column per whitespace-separated token
df = df.apply(pd.to_numeric, errors='coerce')  # convert_objects was removed from pandas

But the problem with this is that I end up with None rows where my delimiter was a *.

Is there a way to separate this file based on two different delimiters? I have thought about using a regular expression to match either | or * but

data = pd.read_csv('filename', 
                   skiprows=244, nrows=5, sep=r'(\| | \*)',
                   header=None,
                   names=['A','B','C','D'])

doesn't seem to work either. I am not the greatest at reg expressions though.

EDIT: I have also tried using sep=r'\s*', but because the *'s move around in the file, some rows end up shifted and the columns no longer align.
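For the record, a regex character class with no spaces and no capture group does work as sep when the python parser is used. A minimal sketch on made-up data (the two-row string below is invented, not the asker's file); one likely problem with r'(\| | \*)' is the capture group, which re.split-style splitting keeps as extra fields, plus the literal spaces inside the pattern:

```python
import io

import pandas as pd

# Made-up miniature file where the delimiter is sometimes | and sometimes *.
text = "1|2*3|4\n5*6|7*8\n"

# [|*] matches either delimiter (neither needs escaping inside a character
# class); a regex sep requires the python engine.
df = pd.read_csv(io.StringIO(text), sep=r'[|*]', engine='python',
                 header=None, names=['A', 'B', 'C', 'D'])
```

If the real file pads its delimiters with spaces, sep=r'\s*[|*]\s*' absorbs the padding as well.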

Upvotes: 1

Views: 1071

Answers (3)

LondonRob

Reputation: 78733

Those asterisks everywhere are really problematic because, as you mention, sometimes they replace the | delimiter and other times they're in the middle of a field seemingly at random.

Luckily your file seems to be fixed width, which pandas knows how to handle with read_fwf (fixed-width file).

Let's take a little subset of this nasty data file which shows both asterisk positions:

 |                   *                   |                   |  0.1108      0.0085 ( 1.883%)     0.1066 ( 0.504%)
 |                   |                   |                   |  0.1112      0.0001 (20.851%)     0.1066 ( 0.504%)
 |                   |                   |                   |  0.1116      0.0005 ( 7.536%)     0.1072 ( 0.502%)
 |                *  |                   |                   |  0.1120      0.0059 ( 2.266%)     0.1130 ( 0.488%)
 |                   |                   |                   |  0.1124      0.0001 (16.439%)     0.1131 ( 0.487%)

We can read this with some careful (but boring) counting of character positions:

In [9]: colspecs = [(64, 71), (76, 83), (84, 90), (97, 104), (105, 111)]

In [11]: pd.read_fwf('my_file.csv', colspecs=colspecs, header=None)
Out[11]: 
        0       1       2       3      4
0  0.1108  0.0085   1.883  0.1066  0.504
1  0.1112  0.0001  20.851  0.1066  0.504
2  0.1116  0.0005   7.536  0.1072  0.502
3  0.1120  0.0059   2.266  0.1130  0.488
4  0.1124  0.0001  16.439  0.1131  0.487

Upvotes: 1

ivan7707

Reputation: 1156

with open('C:/example_file.txt') as f:
    content = [x.strip('\n') for x in f]
content.pop(0)  # drop the header line
cleancontent = [line[62:] for line in content]  # keep only the numeric tail

Then turn this into the dataframe.
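A sketch of that last step, assuming cleancontent holds the numeric tail of each line: strip the ( ) % decoration, split on whitespace, and coerce to floats (the two sample rows below are copied from the data in the accepted answer):

```python
import pandas as pd

# Two lines in the shape of the file's tail from character 62 onward.
cleancontent = [
    "0.1108      0.0085 ( 1.883%)     0.1066 ( 0.504%)",
    "0.1112      0.0001 (20.851%)     0.1066 ( 0.504%)",
]

# Replace the parentheses and percent signs with spaces so a plain
# whitespace split yields five clean tokens per row.
stripped = [line.replace('(', ' ').replace(')', ' ').replace('%', ' ')
            for line in cleancontent]
df = pd.DataFrame([s.split() for s in stripped],
                  columns=['A', 'B', 'C', 'D', 'E']).astype(float)
```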

Upvotes: 0

mdurant

Reputation: 28683

You can go the old fashioned way, and build a list:

rows = []
for line in open('filename'):
    var1 = float(line[62:70])  # the fields are floats, so int() would raise
    ....
    rows.append([var1, ...])

data = pd.DataFrame(data=rows, columns=['A','B','C','D'])
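Filled in for a made-up two-field layout (the slice positions below are illustrative, not the asker's real offsets), the approach looks like:

```python
import pandas as pd

# Invented fixed-width lines: field one at chars 0-5, field two at chars 8-13.
lines = [
    "0.1108  0.0085",
    "0.1112  0.0001",
]

rows = []
for line in lines:
    var1 = float(line[0:6])    # first field
    var2 = float(line[8:14])   # second field
    rows.append([var1, var2])

data = pd.DataFrame(data=rows, columns=['A', 'B'])
```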

Upvotes: 0
