Reputation: 15
I've got a large .csv file (5GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times.
Each row of the table looks like this:
{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"
I've never used pandas or any data science library. So far I've come up with this plan:
1. Load the .csv file and add headers/column names
2. Drop unnecessary columns
3. Build a hashmap over the reduced dataframe and find the duplicates
4. Export the duplicates to a new .csv file
From my research I found that pandas is bad with very big files, so I used Dask:
import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
df.drop('ID', axis=1).head()
I also tried:
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)
Nothing worked.
The task is to show the properties that have been bought/sold two or more times. I decided to use only the address columns, because the data in every other column isn't consistent (ID is a unique transaction code; Date, offer type, etc. differ between sales of the same property).
I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap. I don't know if there's an easier or more efficient way to do this.
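To make the goal concrete, here is roughly the result I'm after (an untested plain-pandas sketch; the column names are the ones I assigned above):
import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
address_cols = ['Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County']

# load only the address columns to keep memory usage down
df = pd.read_csv('pp-complete.csv', header=None, names=column_names, usecols=address_cols, dtype=str)

# duplicated() hashes whole rows, so keep=False marks every row whose
# address occurs two or more times -- the "hashmap" step
dupes = df[df.duplicated(keep=False)]
dupes.to_csv('duplicates.csv', index=False)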
Upvotes: 1
Views: 764
Reputation: 16551
Some minor suggestions:
If 5GB is the full dataset, it's best to use plain pandas. The strategy you outlined might involve communication across partitions in Dask, so it's going to be computationally more expensive (or will require some work to make it more efficient). With pandas all the data will be in memory, so the sorting/duplication check will be fast.
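For the duplication check itself, a minimal in-memory sketch (assuming the address columns named in your question) counts sales per address and keeps the groups of size two or more:
import pandas as pd

address_cols = ['Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County']
df = pd.read_csv('pp-complete.csv', header=None, names=address_cols, usecols=[7, 8, 9, 10, 11, 12, 13], dtype=str)

# broadcast each group's size back onto its rows, then filter;
# dropna=False keeps addresses with empty fields grouped too
sizes = df.groupby(address_cols, dropna=False)[address_cols[0]].transform('size')
repeat_sales = df[sizes >= 2]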
In the code, make sure to assign the modified dataframe back: drop returns a new dataframe by default, so the result is typically assigned to replace the existing one:
# without "df = " part, the modification is not stored
df = df.drop(columns=['ID'])
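Alternatively, pandas' drop can modify the frame in place, which avoids the reassignment (same effect, just a matter of style):
df.drop(columns=['ID'], inplace=True)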
Also, to avoid loading the unneeded columns in the first place, use the usecols kwarg of pd.read_csv. Here's the rough idea:
column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15]  # 16 columns, so valid indexes are 0-15
indexes_to_keep = [i for i in range(len(column_names)) if i not in indexes_to_remove]
# header= expects a row number, not labels; pass the labels via names=
df = pd.read_csv('some_file.csv', header=None, names=column_names, usecols=indexes_to_keep)
Upvotes: 2