JAdel
JAdel

Reputation: 1616

How can python handle a messy data format in pandas?

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about this topic.

I will post the data sample from the link here just in case the question will be deleted.

      itm Date                  Amount 
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN

As you can see this is not the right way to post reproducible data.

What can Python code do to make that table usable?

Upvotes: 2

Views: 281

Answers (2)

JAdel
JAdel

Reputation: 1616

I'm not a fan of answering my own questions, but I've found a good approach using a combination of StringIO and Regex and would like to share it. First, I copy the table without the header. Then I remove the whitespaces in regex and replace them with commas. Then I read the string into a StringIO object. Finally I read the StringIO object with pandas and rename the columns.

import pandas as pd
import os,sys,io, re

temp = """
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN"""

temp2 = re.sub(r" +", r",", temp)
temp2 = io.StringIO(temp2) 
df = pd.read_csv(temp2, sep=",",header=None)
df = df.rename(columns={0:"id",1:"itm",2:"Date",3:"Time",4:"Amount"})

print(df)

    id  itm        Date      Time    Amount
0   67  420  2012-09-30  00:00:00   65211.0
1   68  421  2012-09-09  00:00:00   29424.0
2   69  421  2012-09-16  00:00:00   29877.0
3   70  421  2012-09-23  00:00:00   30990.0
4   71  421  2012-09-30  00:00:00   61303.0
5   72  485  2012-09-09  00:00:00   71781.0
6   73  485  2012-09-16  00:00:00       NaN
7   74  485  2012-09-23  00:00:00   11072.0
8   75  485  2012-09-30  00:00:00  113702.0
9   76  489  2012-09-09  00:00:00   64731.0
10  77  489  2012-09-16  00:00:00       NaN

Upvotes: 0

Laurent
Laurent

Reputation: 13518

The way I proceed when I have to deal with this kind of data, which is indeed frequent in SO posts, is to:

  • first, copy (ctrl+c or right click) the data except the header row ( itm Date Amount in your example);

  • then, run the following code:

import pandas as pd

df = pd.read_clipboard()

# save data which was misinterpreted as column header
first_row = df.columns.to_list()

# rename column headers as 0, 1, 2, ...
df.columns = range(df.shape[1])

# Insert first row, sort dataframe, then get rid of first column while saving to csv
df = (
    pd.concat([pd.DataFrame(first_row).T, df], ignore_index=True)
    .set_index(0)
    .to_csv("temp.csv", index=False)
)

So that:

import pandas as pd

df = pd.read_csv("temp.csv")
print(df)
# Output
      1           2         3         4
0   420  2012-09-30  00:00:00   65211.0
1   421  2012-09-09  00:00:00   29424.0
2   421  2012-09-16  00:00:00   29877.0
3   421  2012-09-23  00:00:00   30990.0
4   421  2012-09-30  00:00:00   61303.0
5   485  2012-09-09  00:00:00   71781.0
6   485  2012-09-16  00:00:00       NaN
7   485  2012-09-23  00:00:00   11072.0
8   485  2012-09-30  00:00:00  113702.0
9   489  2012-09-09  00:00:00   64731.0
10  489  2012-09-16  00:00:00       NaN

From there, you can do some additional processing, depending on the data you are dealing with.

In your example:

df = df.drop(columns="3").rename(columns={"1": "itm", "2": "Date", "4": "Amount"})

print(df)
# Output
    itm        Date    Amount
0   420  2012-09-30   65211.0
1   421  2012-09-09   29424.0
2   421  2012-09-16   29877.0
3   421  2012-09-23   30990.0
4   421  2012-09-30   61303.0
5   485  2012-09-09   71781.0
6   485  2012-09-16       NaN
7   485  2012-09-23   11072.0
8   485  2012-09-30  113702.0
9   489  2012-09-09   64731.0
10  489  2012-09-16       NaN

Upvotes: 2

Related Questions