Reputation: 59
For a geopandas dataframe containing POLYGON and MULTIPOLYGON geometry data, I have tried to convert from another coordinate reference system (CRS) to EPSG:4326.
Because the geodataframe has approximately 200 thousand records, I converted it in chunks, running the following on each chunk:
small_gdf.to_crs('epsg:4326', inplace=True)
small_gdf.to_file(f'small_gdf_{filecounter}.shp')
This conversion process took approximately two full days. After applying pd.concat to all the small_gdf parts to rebuild the full geodataframe, the result contains only about 60% of the records in the original geodataframe. Could records have been dropped because the 'to_crs' conversion failed?
Meanwhile, I am going to add a new column to each 'small_gdf' and rerun the to_crs operation to trace which records are dropped during the conversion.
Code example (please excuse any typos; I had to retype it just for this post):
import os

import geopandas as gpd
import pandas as pd

gdf = gpd.read_file('bigShapefilePath.shp')
n_records = len(gdf)

# create (start, end) index tuples for each chunk
chunksize = 1000
list_start_end_idx_tuples = []
for start in range(0, n_records, chunksize):
    end = min(start + chunksize - 1, n_records - 1)
    list_start_end_idx_tuples.append((start, end))

# convert in chunks
parts_folderpath = <parts_folderpath>
file_counter = 1
for start, end in list_start_end_idx_tuples:
    # .copy() avoids SettingWithCopyWarning when adding the tracing column
    small_gdf = gdf.iloc[start:end + 1].copy()
    small_gdf['WITHIN_PART_IDX'] = range(len(small_gdf))
    small_gdf.to_crs('epsg:4326', inplace=True)
    small_gdf.to_file(f'{parts_folderpath}/small_gdf_part{file_counter}.shp')
    file_counter += 1
# find the file parts
full_folderpath = <full_folderpath>
list_smallgdf_filename = []
list_smallgdf_filenamenext = []
for dirpath, subdirs, filenames in os.walk(parts_folderpath):
    for filenamenext in filenames:
        if filenamenext.endswith('.shp'):  # skips .shp.xml and other sidecar files
            filename = filenamenext.split('.')[0]
            list_smallgdf_filename.append(filename)
            list_smallgdf_filenamenext.append(filenamenext)
# concat into the full gdf
list_part_gdfs = []
for i, filenamenext in enumerate(list_smallgdf_filenamenext):
    small_gdf = gpd.read_file(f'{parts_folderpath}/{filenamenext}')
    small_filename = list_smallgdf_filename[i]
    # e.g. 'small_gdf_part3' -> part number '3'
    part_num = small_filename.split('_part')[-1]
    small_gdf['PART_NUM'] = int(part_num)
    list_part_gdfs.append(small_gdf)
concat_gdf = pd.concat(list_part_gdfs)
concat_gdf.to_file(f'{full_folderpath}/concat_gdf.shp')
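As a sanity check (a minimal sketch reusing the variables above, and assuming part numbering starts at 1 as in the loop), you could compare each part's record count against the size of the chunk it was built from, and the concatenated total against the original:

# sanity check: does each part still have as many records as its source chunk?
for filenamenext in list_smallgdf_filenamenext:
    small_gdf = gpd.read_file(f'{parts_folderpath}/{filenamenext}')
    part_num = int(filenamenext.split('_part')[-1].split('.')[0])
    start, end = list_start_end_idx_tuples[part_num - 1]  # parts are numbered from 1
    expected = end - start + 1
    if len(small_gdf) != expected:
        print(f'{filenamenext}: expected {expected} records, found {len(small_gdf)}')
print(f'total: {len(concat_gdf)} of {n_records} records survived')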
Upvotes: 0
Views: 71
Reputation: 59
The issue was with the chunksize.
What happened: the chunksize was set to 1000. For each chunk of 1000 records that went through the 'to_crs' transformation, roughly the first 800 records survived and 'to_crs' apparently dropped the remaining ~200.
What solved the issue: drop the chunksize to 100. The number of chunks you later union with pd.concat increases by 10x, but records are no longer dropped during the 'to_crs' coordinate transformation.
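In code, the only change to the script from the question is the chunk size (a sketch; everything else in the loop stays the same):

chunksize = 100  # was 1000; at 100 no records were dropped during to_crs
list_start_end_idx_tuples = []
for start in range(0, n_records, chunksize):
    end = min(start + chunksize - 1, n_records - 1)
    list_start_end_idx_tuples.append((start, end))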
Upvotes: 0