Reputation: 32316
I have 2 csv files with different sequence of columns. For e.g. the first file starts with 10 digits mobile numbers while that column is at number 4 in the second file. I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
Second task and that may be slightly difficult is that the new files expected may have a totally different column sequence. In that case I need to extract 10 digits mobile number and 6 digits pincode column. I need to write the code that will guess the city column if it matches with any of the given city list. The new files are expected to have relevant column headings but the column heading may be slightly different. for e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
I have been suggested to use sed to rearrange the 10 digits column at the beginning. But I do also need to rearrange the text columns. For e.g. if a column matches the entries in the following list then it is undoubtedly model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If any column matches 10% of the entries with the above list then it is a "model" column and should be at number 3 followed by mobile and pincode.
Upvotes: 2
Views: 749
Reputation: 6383
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2]) #the columns will be ordered alphabetically
To get desired order,
result_df = combined[['mobile', 'pincode', 'model', 'Name', 'Address', 'Location', 'pincode', 'date']]
and then result_df.to_csv('oupput.csv', index=False)
to export to csv file.
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
for c in df:
if df[c].map(match_model).sum()/len(df) > 0.1:
print "Column %s is 'Model'"% c
df.rename(columns={c:'Model'}, inplace=True)
You can modify the matching function match_model
to use regex instead if you want.
Upvotes: 1