6:[["$","$Le",null,{}],["$","div",null,{"className":"min-h-screen bg-gray-100 p-6","children":[["$","$Lf",null,{}],["$","script",null,{"type":"application/ld+json","dangerouslySetInnerHTML":{"__html":"{\"@context\":\"https://schema.org\",\"@type\":\"QAPage\",\"mainEntity\":{\"@type\":\"Question\",\"name\":\"Sampling in pandas\",\"text\":\"

If I want to randomly sample a pandas dataframe I can use pandas.DataFrame.sample.

\\n\\n

Suppose I randomly sample 80% of the rows. How do I automatically get the other 20% of the rows that were not picked?

\\n\",\"author\":{\"@type\":\"Person\",\"name\":\"wwl\"},\"upvoteCount\":4,\"answerCount\":2,\"acceptedAnswer\":{\"@type\":\"Answer\",\"text\":\"

As Lagerbaer explains, one can add a column with a unique index to the dataframe, or randomly shuffle the entire dataframe. For the latter,

\\n\\n

df.reindex(np.random.permutation(df.index))\\n

\\n\\n

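works (np is numpy). The first 80% of the shuffled rows then serve as the sample, and the remaining 20% are the rows that were not picked.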

\\n\",\"author\":{\"@type\":\"Person\",\"name\":\"wwl\"},\"upvoteCount\":4}}}"}}],["$","div",null,{"className":"bg-white shadow-md rounded-lg p-6 mb-6 relative","children":[["$","div",null,{"className":"absolute top-4 right-4 flex flex-wrap space-x-2","children":[["$","span","python",{"className":"bg-blue-600 text-white text-sm px-3 py-1 rounded-full","children":["$","$L10",null,{"href":"/discussion/tag/python/1","children":"python"}]}],["$","span","pandas",{"className":"bg-blue-600 text-white text-sm px-3 py-1 rounded-full","children":["$","$L10",null,{"href":"/discussion/tag/pandas/1","children":"pandas"}]}]]}],["$","div",null,{"className":"flex items-center mb-4","children":[["$","img",null,{"src":"https://www.gravatar.com/avatar/4a59a3ba8c82706a312e18a6e56104c4?s=256&d=identicon&r=PG","alt":"wwl","className":"w-16 h-16 rounded-full border"}],["$","div",null,{"className":"ml-4","children":[["$","a",null,{"href":"https://stackoverflow.com/users/1393043/wwl","target":"_blank","rel":"noopener noreferrer","className":"text-lg font-semibold text-blue-600 hover:underline","children":"wwl"}],["$","p",null,{"className":"text-sm text-gray-500","children":["Reputation: ",2065]}]]}]]}],["$","h1",null,{"className":"text-2xl font-bold text-gray-800 mb-4","children":"Sampling in pandas"}],["$","p",null,{"className":"text-gray-700 mt-4","dangerouslySetInnerHTML":{"__html":"

If I want to randomly sample a pandas dataframe I can use pandas.DataFrame.sample.

\n\n

Suppose I randomly sample 80% of the rows. How do I automatically get the other 20% of the rows that were not picked?

\n"}}],["$","div",null,{"className":"text-gray-600 text-sm mt-4","children":[["$","p",null,{"children":["Upvotes: ",4]}],["$","p",null,{"children":["Views: ",763]}]]}]]}],["$","div",null,{"className":"container mx-auto","children":[["$","h2",null,{"className":"text-2xl font-semibold text-gray-800 mb-6","children":["Answers (",2,")"]}],[["$","div","39801503",{"className":"bg-white shadow-md rounded-lg p-6 mb-6","children":[["$","div",null,{"className":"flex items-center mb-4","children":[["$","img",null,{"src":"https://i.sstatic.net/FC40Y.jpg?s=256","alt":"boot-scootin","className":"w-12 h-12 rounded-full border"}],["$","div",null,{"className":"ml-4","children":[["$","a",null,{"href":"https://stackoverflow.com/users/5015569/boot-scootin","target":"_blank","rel":"noopener noreferrer","className":"text-lg font-semibold text-blue-600 hover:underline","children":"boot-scootin"}],["$","p",null,{"className":"text-sm text-gray-500","children":["Reputation: ",12515]}]]}]]}],["$","p",null,{"className":"text-gray-700 mb-4","dangerouslySetInnerHTML":{"__html":"

>>> import pandas as pd, numpy as np\n>>> df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b': [11,12,13,14,15,16,17,18,19,20]})\n>>> df\n    a   b\n0   1  11\n1   2  12\n2   3  13\n3   4  14\n4   5  15\n5   6  16\n6   7  17\n7   8  18\n8   9  19\n9  10  20\n\n# randomly sample 5 rows\n>>> sample = df.sample(5)\n>>> sample\n   a   b\n7  8  18\n2  3  13\n4  5  15\n0  1  11\n3  4  14\n\n# list comprehension to get indices not in sample's indices\n>>> idxs_not_in_sample = [idx for idx in df.index if idx not in sample.index]\n>>> idxs_not_in_sample\n[1, 5, 6, 8, 9]\n\n# locate the rows at the indices in the original dataframe that aren't in the sample\n>>> not_sample = df.loc[idxs_not_in_sample]\n>>> not_sample\n    a   b\n1   2  12\n5   6  16\n6   7  17\n8   9  19\n9  10  20\n
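
A shorter variant of the same idea, as a sketch only: DataFrame.drop can remove the sampled index labels in one call instead of the list comprehension. This assumes the index labels of df are unique; the frac and random_state values are illustrative choices and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(11, 21)})

# Randomly sample 80% of the rows; random_state is an arbitrary seed
# chosen here only to make the split reproducible.
sample = df.sample(frac=0.8, random_state=0)

# Drop the sampled index labels to keep the remaining 20%.
# Assumes df.index has no duplicate labels.
not_sample = df.drop(sample.index)
```

With duplicate index labels, drop would also remove unsampled rows that share a label, so resetting the index first may be safer in that case.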

\n"}}],["$","div",null,{"className":"text-gray-600 text-sm","children":["$","p",null,{"children":["Upvotes: ",2]}]}]]}],["$","div","39801461",{"className":"bg-white shadow-md rounded-lg p-6 mb-6","children":[["$","div",null,{"className":"flex items-center mb-4","children":[["$","img",null,{"src":"https://www.gravatar.com/avatar/4a59a3ba8c82706a312e18a6e56104c4?s=256&d=identicon&r=PG","alt":"wwl","className":"w-12 h-12 rounded-full border"}],["$","div",null,{"className":"ml-4","children":[["$","a",null,{"href":"https://stackoverflow.com/users/1393043/wwl","target":"_blank","rel":"noopener noreferrer","className":"text-lg font-semibold text-blue-600 hover:underline","children":"wwl"}],["$","p",null,{"className":"text-sm text-gray-500","children":["Reputation: ",2065]}]]}]]}],["$","p",null,{"className":"text-gray-700 mb-4","dangerouslySetInnerHTML":{"__html":"

As Lagerbaer explains, one can add a column with a unique index to the dataframe, or randomly shuffle the entire dataframe. For the latter,

\n\n

df.reindex(np.random.permutation(df.index))\n

\n\n

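works (np is numpy). The first 80% of the shuffled rows then serve as the sample, and the remaining 20% are the rows that were not picked.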
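
A minimal sketch of the shuffle-and-slice approach, assuming unique index labels; the example frame and the cut variable are hypothetical and only illustrate how the shuffled rows could be divided 80/20.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(11, 21)})

# Shuffle all rows by reindexing with a random permutation of the index.
# Assumes df.index has no duplicate labels (reindex requires that).
shuffled = df.reindex(np.random.permutation(df.index))

# Hypothetical cut point: the first 80% of the shuffled rows.
cut = int(0.8 * len(shuffled))
picked = shuffled.iloc[:cut]     # the 80% sample
remainder = shuffled.iloc[cut:]  # the other 20%
```

Slicing by position with iloc keeps the two parts disjoint by construction, so no membership test against the sample is needed.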

\n"}}],["$","div",null,{"className":"text-gray-600 text-sm","children":["$","p",null,{"children":["Upvotes: ",4]}]}]]}]]]}],["$","div",null,{"className":"bg-white shadow-md rounded-lg p-6 mt-6","children":[["$","h2",null,{"className":"text-2xl font-semibold text-gray-800 mb-4","children":"Related Questions"}],["$","ul",null,{"className":"list-disc list-inside","children":[["$","li","38061876",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/38061876","className":"text-blue-600 hover:underline","children":"Python sampling a dataframe"}]}],["$","li","67387252",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/67387252","className":"text-blue-600 hover:underline","children":"Column-wise sampling in pandas"}]}],["$","li","67362581",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/67362581","className":"text-blue-600 hover:underline","children":"Pandas Different Sampling Size"}]}],["$","li","66934765",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/66934765","className":"text-blue-600 hover:underline","children":"Sampling data from the pandas dataframe"}]}],["$","li","62023514",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/62023514","className":"text-blue-600 hover:underline","children":"Sample pandas dataframe by column value"}]}],["$","li","59292844",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/59292844","className":"text-blue-600 hover:underline","children":"Np random sampling in python"}]}],["$","li","46028283",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/46028283","className":"text-blue-600 hover:underline","children":"Random sampling pandas based on column values"}]}],["$","li","32683083",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/32683083","className":"text-blue-600 hover:underline","children":"How to sample on 
condition with pandas?"}]}],["$","li","30601048",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/30601048","className":"text-blue-600 hover:underline","children":"Random sampling and Pandas dataframes"}]}],["$","li","19214922",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/19214922","className":"text-blue-600 hover:underline","children":"sampling pandas dataframe by different frequencies"}]}]]}]]}]]}],["$","$L11",null,{}],["$","$L12",null,{}],["$","$L13",null,{}],["$","$L14",null,{}],["$","$L15",null,{}]]
