Reputation: 157
I am trying to compile multiple text files into a single data frame. However, when I join the data frames with Pandas Concat function, the shape of the resulting data frame adds new columns. In the code sample below, data frame 3 has 12 columns rather than 8. Why?
**Input:**
import pandas as pd
df1 = pd.read_csv('2011-12-01-data.txt',sep = None, engine = 'python')
df2 = pd.read_csv('2011-12-02-data.txt',sep = None, engine = 'python')
df3= pd.concat([df1, df2])
print(df1.shape)
print(df2.shape)
print(df3.shape)
**Output:**
df1 shape = (26986, 8)
df1 shape =(27266, 8)
df3 shape =(54252, 12)
The I am working with flight data available at http://lunadong.com/datasets/clean_flight.zip
Upvotes: 4
Views: 5873
Reputation: 11
User jezrael's answer solves the problem. But let me try to explain why pandas added new columns to your concatenated data frame and what went wrong.
When you set header = None, pandas reads the first line of your file as a header and sets it by default to names of each column. Based on your code, these are the two sets of columns one would get for each of your dataframes if header = None.
df1: ['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM', 'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5']
df2: ['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM', 'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5']
Finally, when you concatenated the two dataframes, all columns that were not common to df1 and df2 were appended as separate columns. 'aa','AA-1007-TPA-MIA', 'F78' and 'D5' were unique to df1 and df2 while everything else was appended to the list of columns.
This lead to 4(df1&df2) + 4(df1) + 4(df2) = 12 columns
Upvotes: 1
Reputation: 862681
I think you need header=None
parameter for default columns names 0-7
, because files has no headers. Also if there is separator tab
, is possible specify it.
df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', header=None)
df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', header=None)
df3= pd.concat([df1, df2])
print(df1.shape)
print(df2.shape)
print(df3.shape)
(26987, 8)
(27267, 8)
(54254, 8)
print(df1.columns)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
print(df2.columns)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
print(df3.columns)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
Another solution is specify names
parameter for new column names:
names= ['col1','col2','col3','col4','col5','col6','col7','col8']
df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', names=names)
df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', names=names)
df3= pd.concat([df1, df2])
print(df1.shape)
print(df2.shape)
print(df3.shape)
(26987, 8)
(27267, 8)
(54254, 8)
print(df1.columns)
print(df2.columns)
print(df3.columns)
Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
You get only 12 columns, because some values in first row of both dataframes are same, so from them was created columns names. After concat
columns are aligned only for this columns. If values was different, there was no align and you get NaN
s.
print(df1.columns)
Index(['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM',
'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5'],
dtype='object')
print(df2.columns)
Index(['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM',
'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5'],
dtype='object')
print(df3.columns)
Index(['12/01/2011 01:55 PM', '12/01/2011 02:07 PM', '12/01/2011 02:57 PM',
'12/01/2011 03:00 PM', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM',
'12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'AA-1007-TPA-MIA', 'D5',
'F78', 'aa'],
dtype='object')
print(df3.head())
12/01/2011 01:55 PM 12/01/2011 02:07 PM 12/01/2011 02:57 PM \
0 NaN 12/1/2011 2:07PM EST 12/1/2011 2:51PM EST
1 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00)
2 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00)
3 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00)
4 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00)
12/01/2011 03:00 PM 12/02/2011 01:55 PM 12/02/2011 02:13 PM \
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
12/02/2011 03:00 PM 12/02/2011 03:05 PM AA-1007-TPA-MIA D5 F78 \
0 NaN NaN AA-1007-TPA-MIA NaN NaN
1 NaN NaN AA-1007-TPA-MIA NaN NaN
2 NaN NaN AA-1007-TPA-MIA NaN NaN
3 NaN NaN AA-1007-TPA-MIA NaN NaN
4 NaN NaN AA-1007-TPA-MIA NaN NaN
aa
0 flightexplorer
1 airtravelcenter
2 myrateplan
3 helloflight
4 flytecomm
print(df3.tail())
12/01/2011 01:55 PM 12/01/2011 02:07 PM 12/01/2011 02:57 PM \
27261 NaN NaN NaN
27262 NaN NaN NaN
27263 NaN NaN NaN
27264 NaN NaN NaN
27265 NaN NaN NaN
12/01/2011 03:00 PM 12/02/2011 01:55 PM 12/02/2011 02:13 PM \
27261 NaN Dec 02 - 10:20pm Dec 02 - 10:23pm
27262 NaN 10:20pDec 2 10:23pDec 2
27263 NaN 2011-12-02 10:20 PM NaN
27264 NaN 2011-12-02 10:20 pm NaN
27265 NaN 2011-12-02 10:20PM CST 2011-12-02 10:31PM CST
12/02/2011 03:00 PM 12/02/2011 03:05 PM AA-1007-TPA-MIA D5 \
27261 Dec 02 - 11:59pm Dec 02 - 11:51pm* AA-2059-DFW-SLC A3
27262 11:43pDec 2 NaN AA-2059-DFW-SLC A3
27263 2011-12-02 11:59 PM NaN AA-2059-DFW-SLC NaN
27264 NaN NaN AA-2059-DFW-SLC NaN
27265 2011-12-02 11:35PM MST 2011-12-02 11:43PM MST AA-2059-DFW-SLC A3
F78 aa
27261 C20/C travelocity
27262 C20 orbitz
27263 NaN weather
27264 C20 dfw
27265 C20 flightwise
Upvotes: 2