Aditya Bhargava

Reputation: 67

Remove Column with Duplicate Values in Pandas

I have a database with a sample as below:

[image: sample of the source data]

Data frame is generated when I load data in Python as per below code

import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)

Output:

[image: resulting DataFrame]

Is there any way to avoid reading duplicate columns in pandas, or to remove the duplicate columns after reading? Please note: the column names differ once the data is read into pandas, so a command like df=df.loc[:,~df.columns.duplicated()] won't work. The actual database is very big and has many duplicate columns, all of them dates.

Upvotes: 3

Views: 2295

Answers (2)

jpp

Reputation: 164673

There are 2 ways you can do this.

Ignore columns when reading the data

pandas.read_csv has a usecols argument, which accepts a list of integer column positions.

So you can try:

# read only the header (no data rows) to work out the required columns
header = pd.read_csv('file.csv', header=0, nrows=0)
cols = [0] + list(range(1, len(header.columns), 2))

# pass the integer positions to usecols
df = pd.read_csv('file.csv', usecols=cols)

Remove columns from dataframe

You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.

# cols as defined in previous example; df is the full DataFrame read without usecols

df = df.iloc[:, cols]
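As a minimal sketch of the iloc approach, the snippet below runs it on a small in-memory sample. The sample data is hypothetical and assumes the layout described in the question, with a repeated Date column before every value column.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: a Date column repeated before every value column.
data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

df = pd.read_csv(StringIO(data))
# pandas renames the repeated header on read, so the columns are
# Date, Value1, Date.1, Value2

# Keep the first Date column plus every value column (positions 1, 3, ...).
cols = [0] + list(range(1, len(df.columns), 2))

df = df.iloc[:, cols]
print(df.columns.tolist())
```

This only holds if the columns really alternate Date, value, Date, value; a different layout needs a different position list.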

Upvotes: 3

Anton vBR

Reputation: 18916

One way to do it is to read only the first row (the header, read as data) and build a mask using drop_duplicates(). The resulting index is passed to usecols, so there is no need to work out the column positions beforehand. It works as long as the duplicated columns share the same header name.

from io import StringIO  # pd.compat.StringIO was removed from pandas

m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)

Full example:

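from io import StringIO  # pd.compat.StringIO was removed from pandas

import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# read the header row as data, transpose, and keep the positions of unique names
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)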

print(df)

#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1
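The header-based mask compares column names only. If two columns could share a name but hold different data, a content-based check is an alternative. The sketch below swaps in a different technique, transposing the frame and calling duplicated(), and the sample data is hypothetical. Transposing copies the data, so this may be costly on a very large file.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with a repeated Date column.
data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

df = pd.read_csv(StringIO(data))

# Transpose so each column becomes a row, then flag rows whose values
# repeat an earlier row (i.e. columns with identical content).
dup_mask = df.T.duplicated()

# Keep only the columns that are not content duplicates.
deduped = df.loc[:, ~dup_mask.to_numpy()]
print(deduped.columns.tolist())
```

Here Date.1 is dropped because its values match Date, while Value1 and Value2 are kept because their contents differ.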

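Another way to do it is to remove all columns whose name contains a dot (.). When pandas reads duplicate column names it appends a suffix such as .1 or .2 to the repeats, so they can be recognised by the dot. This should work in most cases, as the dot is rarely used in column names: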

df = df.loc[:,~df.columns.str.contains('.', regex=False)]

Full example:

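from io import StringIO  # pd.compat.StringIO was removed from pandas

import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(StringIO(data))
# drop every column whose (renamed) header contains a literal dot
df = df.loc[:,~df.columns.str.contains('.', regex=False)]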
print(df)

#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1
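If some genuine column names do contain a dot, a narrower filter may help. The sketch below assumes the repeats carry the numeric suffix pandas adds on read (.1, .2, ...) and matches only that pattern. The column name Rate.Avg is hypothetical, added to show a dotted name surviving the filter.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: a repeated Date column plus a legitimately dotted name.
data = """Date,Value1,Date,Rate.Avg
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

df = pd.read_csv(StringIO(data))
# columns after reading: Date, Value1, Date.1, Rate.Avg

# Flag only names ending in a dot followed by digits.
mangled = df.columns.str.contains(r"\.\d+$", regex=True)

df = df.loc[:, ~mangled]
print(df.columns.tolist())
```

A real column whose name happens to end in a dot and digits would still be dropped, so the pattern is a heuristic rather than a guarantee.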

Upvotes: 1
