Splitting a DataFrame based on duplicate values in columns

Question

Here's my starting dataframe:

StartDF = pd.DataFrame({'A': {0: 1, 1: 1, 2: 2, 3: 4, 4: 5, 5: 5, 6: 5, 7: 5}, 'B': {0: 2, 1: 2, 2: 4, 3: 2, 4: 2, 5: 4, 6: 4, 7: 5}, 'C': {0: 10, 1: 1000, 2: 250, 3: 100, 4: 550, 5: 100, 6: 3000, 7: 250}})

I need to create a list of individual dataframes based on duplicate values in columns A and B, so it should look like this:

df1 = pd.DataFrame({'A': {0: 1, 1: 1}, 'B': {0: 2, 1: 2}, 'C': {0: 10, 1: 1000}}) 
df2 = pd.DataFrame({'A': {0: 2}, 'B': {0: 4}, 'C': {0: 250}})
df3 = pd.DataFrame({'A': {0: 4}, 'B': {0: 2}, 'C': {0: 100}})
df4 = pd.DataFrame({'A': {0: 5}, 'B': {0: 2}, 'C': {0: 550}})
df5 = pd.DataFrame({'A': {0: 5, 1: 5}, 'B': {0: 4, 1: 4}, 'C': {0: 100, 1: 3000}}) 
df6 = pd.DataFrame({'A': {0: 5}, 'B': {0: 5}, 'C': {0: 250}})

I've seen a lot of answers that explain how to DROP duplicates, but I need to keep the duplicate values because the information in column C will usually be different between rows regardless of duplicates in columns A and B. All of the row data needs to be preserved in the new dataframes.

Additional note, the starting dataframe (StartDF) will change in length, so each time this is run, the number of individual dataframes created will be variable. Ultimately, I need to print the newly created dataframes to their own csv files (I know how to do this part). Just need to know how to break out the data from the original dataframe in an elegant way.

cs95 · Accepted Answer

You can use a groupby, iterate over each group and build a list using a list comprehension.

df_list = [g for _, g in df.groupby(['A', 'B'])]

print(*df_list, sep='

')

   A  B     C
0  1  2    10
1  1  2  1000

   A  B    C
2  2  4  250

   A  B    C
3  4  2  100

   A  B    C
4  5  2  550

   A  B     C
5  5  4   100
6  5  4  3000

   A  B    C
7  5  5  250

Splitting a DataFrame based on duplicate values in columns

Answers (1)

Related Questions