eilalan

Reputation: 689

Working with a dataframe / matrix to create an input for sklearn & TensorFlow

I am working with pandas / Python / NumPy / Datalab / BigQuery to generate an input table for machine learning. The data is genomic, and right now I am working with a small subset of 174 rows and 12430 columns.

The column names are extracted from BigQuery (df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard', use_cache=True)). The row names are extracted in the same way: samples_rows = bq.Query('SELECT sample_id FROM `speedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3` GROUP BY sample_id')

What would be the easiest way to create a dataframe / matrix whose rows and columns are named with the extracted values?

I explored pandas DataFrames and could not find a way to pass the names as parameters.

For an empty array, I was able to find the following (NumPy), but it has no names:

a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns

I know R very well (if there is no other way, I hope that I can use it with Datalab).

Any ideas?

Many thanks!

Upvotes: 1

Views: 195

Answers (1)

Ted Petrou

Reputation: 62017

If you have your column names and row names stored in lists then you can just use .loc to select the exact rows and columns you desire. Just make sure that the row names are in the index. You might need to do df.set_index('sample_id') to put the correct row name in the index.

Assuming the rows and columns are in variables row_names and col_names, do this.

df.loc[row_names, col_names]
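To make this concrete, here is a minimal sketch rather than a definitive recipe. The sample and feature names and the small frame df are made up for illustration; in practice row_names and col_names would come from the BigQuery results. It shows an all-NaN frame built directly with named rows and columns through the index and columns arguments of pd.DataFrame, followed by the set_index plus .loc selection described above.

```python
import numpy as np
import pandas as pd

# Hypothetical names standing in for the values pulled from BigQuery.
row_names = ["sample_a", "sample_b", "sample_c"]
col_names = ["feature_1", "feature_2"]

# An empty (all-NaN) frame whose rows and columns are already labelled.
empty = pd.DataFrame(np.nan, index=row_names, columns=col_names)

# A made-up frame that still carries sample_id as an ordinary column.
df = pd.DataFrame({
    "sample_id": ["sample_c", "sample_a", "sample_b"],
    "feature_1": [3.0, 1.0, 2.0],
    "feature_2": [30.0, 10.0, 20.0],
    "feature_3": [300.0, 100.0, 200.0],
})

# Move sample_id into the index, then pick rows and columns by label.
# The result follows the order of row_names and col_names.
selected = df.set_index("sample_id").loc[row_names, col_names]

# Plain NumPy matrix that can be passed on to scikit-learn or TensorFlow.
X = selected.to_numpy()
```

Note that .loc raises a KeyError if any requested label is missing, so it can be worth checking that every entry of row_names and col_names is actually present before selecting.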

Upvotes: 1
