Kobe-Wan Kenobi

Reputation: 3884

Running ML algorithms on existing dataframes

I'm new to Spark and I'm trying to figure out the usual procedure for doing data science with it. Concretely, I know how to create DataFrames out of existing data and then perform some analysis.

Now I'm trying to understand how to run ML algorithms on data that is already in DataFrames. When I look at the ML documentation, I see that the DataFrames there are created out of Vectors (dense or sparse), which is not the case with my existing DataFrames. How do I convert an existing DataFrame with a number of columns into a DataFrame where those columns are combined into a single vector column?

What is the usual procedure for doing exploratory analysis and some plots first, and then running ML on the same DataFrame?

Upvotes: 0

Views: 47

Answers (1)

user7337271

Reputation: 1712

  • org.apache.spark.ml.feature / pyspark.ml.feature contain a large number of feature extraction tools, which are extensively documented in the "Extracting, transforming and selecting features" section of the Spark ML guide.
  • Spark is not suitable for exploratory data analysis. Usually you use Spark to sample / clean / aggregate the data and then collect the result for visualization with independent local tools. Commercial environments (like Databricks) and some open source tools (like Apache Zeppelin) provide limited plotting tools which can be used directly on collected results.
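
To make both points concrete, here is a minimal PySpark sketch, not a definitive recipe. It assumes your DataFrame has a label column and that you only want the numeric columns as features; the helper names (`select_feature_columns`, `assemble_features`, `sample_for_plotting`) and the `NUMERIC_TYPES` set are made up for illustration. `VectorAssembler` from pyspark.ml.feature is the piece that merges several columns into one vector column.

```python
# Type names as they appear in df.dtypes. This set is an assumption for the
# sketch: string columns would need encoding (e.g. StringIndexer) first, and
# decimal columns are left out here for simplicity.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "boolean"}


def select_feature_columns(dtypes, label_col):
    """Pick numeric column names from (name, type) pairs such as df.dtypes."""
    return [name for name, dtype in dtypes
            if name != label_col and dtype in NUMERIC_TYPES]


def assemble_features(df, label_col, output_col="features"):
    """Return a DataFrame with a single vector column plus the label column."""
    # Imported inside the function so the helpers above work without Spark.
    from pyspark.ml.feature import VectorAssembler

    feature_cols = select_feature_columns(df.dtypes, label_col)
    assembler = VectorAssembler(inputCols=feature_cols, outputCol=output_col)
    return assembler.transform(df).select(output_col, label_col)


def sample_for_plotting(df, fraction=0.01, seed=42):
    """Collect a small random sample to the driver as a pandas DataFrame."""
    # toPandas() pulls everything to the driver, so sample or aggregate first.
    return df.sample(withReplacement=False, fraction=fraction, seed=seed).toPandas()
```

With an existing DataFrame `df` that has a `label` column, `assemble_features(df, "label")` should give you something you can pass to an estimator's `fit`, and `sample_for_plotting(df)` gives you a local pandas DataFrame for matplotlib or similar. The original `df` is not modified by either call, so you can explore and train from the same DataFrame.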

Upvotes: 1
