Reputation: 1616
Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about this topic.
I will post the data sample from the link here just in case the question will be deleted.
itm Date Amount
67 420 2012-09-30 00:00:00 65211
68 421 2012-09-09 00:00:00 29424
69 421 2012-09-16 00:00:00 29877
70 421 2012-09-23 00:00:00 30990
71 421 2012-09-30 00:00:00 61303
72 485 2012-09-09 00:00:00 71781
73 485 2012-09-16 00:00:00 NaN
74 485 2012-09-23 00:00:00 11072
75 485 2012-09-30 00:00:00 113702
76 489 2012-09-09 00:00:00 64731
77 489 2012-09-16 00:00:00 NaN
As you can see this is not the right way to post reproducible data.
What can Python code do to make that table usable?
Upvotes: 2
Views: 281
Reputation: 1616
I'm not a fan of answering my own questions, but I've found a good approach using a combination of StringIO
and Regex
and would like to share it. First, I copy the table without the header. Then I remove the whitespaces in regex and replace them with commas. Then I read the string into a StringIO object. Finally I read the StringIO object with pandas and rename the columns.
import pandas as pd
import os,sys,io, re
temp = """
67 420 2012-09-30 00:00:00 65211
68 421 2012-09-09 00:00:00 29424
69 421 2012-09-16 00:00:00 29877
70 421 2012-09-23 00:00:00 30990
71 421 2012-09-30 00:00:00 61303
72 485 2012-09-09 00:00:00 71781
73 485 2012-09-16 00:00:00 NaN
74 485 2012-09-23 00:00:00 11072
75 485 2012-09-30 00:00:00 113702
76 489 2012-09-09 00:00:00 64731
77 489 2012-09-16 00:00:00 NaN"""
temp2 = re.sub(r" +", r",", temp)
temp2 = io.StringIO(temp2)
df = pd.read_csv(temp2, sep=",",header=None)
df = df.rename(columns={0:"id",1:"itm",2:"Date",3:"Time",4:"Amount"})
print(df)
id itm Date Time Amount
0 67 420 2012-09-30 00:00:00 65211.0
1 68 421 2012-09-09 00:00:00 29424.0
2 69 421 2012-09-16 00:00:00 29877.0
3 70 421 2012-09-23 00:00:00 30990.0
4 71 421 2012-09-30 00:00:00 61303.0
5 72 485 2012-09-09 00:00:00 71781.0
6 73 485 2012-09-16 00:00:00 NaN
7 74 485 2012-09-23 00:00:00 11072.0
8 75 485 2012-09-30 00:00:00 113702.0
9 76 489 2012-09-09 00:00:00 64731.0
10 77 489 2012-09-16 00:00:00 NaN
Upvotes: 0
Reputation: 13518
The way I proceed when I have to deal with this kind of data, which is indeed frequent in SO posts, is to:
first, copy (ctrl+c or right click) the data except the header row ( itm Date Amount
in your example);
then, run the following code:
import pandas as pd
df = pd.read_clipboard()
# save data which was misinterpreted as column header
first_row = df.columns.to_list()
# rename column headers as 0, 1, 2, ...
df.columns = range(df.shape[1])
# Insert first row, sort dataframe, then get rid of first column while saving to csv
df = (
pd.concat([pd.DataFrame(first_row).T, df], ignore_index=True)
.set_index(0)
.to_csv("temp.csv", index=False)
)
So that:
import pandas as pd
df = pd.read_csv("temp.csv")
print(df)
# Output
1 2 3 4
0 420 2012-09-30 00:00:00 65211.0
1 421 2012-09-09 00:00:00 29424.0
2 421 2012-09-16 00:00:00 29877.0
3 421 2012-09-23 00:00:00 30990.0
4 421 2012-09-30 00:00:00 61303.0
5 485 2012-09-09 00:00:00 71781.0
6 485 2012-09-16 00:00:00 NaN
7 485 2012-09-23 00:00:00 11072.0
8 485 2012-09-30 00:00:00 113702.0
9 489 2012-09-09 00:00:00 64731.0
10 489 2012-09-16 00:00:00 NaN
From there, you can do some additional processing, depending on the data you are dealing with.
In your example:
df = df.drop(columns="3").rename(columns={"1": "itm", "2": "Date", "4": "Amount"})
print(df)
# Output
itm Date Amount
0 420 2012-09-30 65211.0
1 421 2012-09-09 29424.0
2 421 2012-09-16 29877.0
3 421 2012-09-23 30990.0
4 421 2012-09-30 61303.0
5 485 2012-09-09 71781.0
6 485 2012-09-16 NaN
7 485 2012-09-23 11072.0
8 485 2012-09-30 113702.0
9 489 2012-09-09 64731.0
10 489 2012-09-16 NaN
Upvotes: 2