cahen

Reputation: 16676

PredictionIO text classification quick start failing when reading the data

I'm following this quick start after launching this ready-to-use PredictionIO Amazon EC2 instance, and after running these commands it fails at pio train:

pio app new MyTextApp
pio import --appid 1 --input data/stopwords.json
pio import --appid 1 --input data/emails.json
pio build
pio train

...

Data set is empty, make sure event fields match imported data.

Exception in thread "main" java.lang.IllegalStateException: Haven't seen any document yet.
    at org.apache.spark.mllib.feature.IDF$DocumentFrequencyAggregator.idf(IDF.scala:132)
    at org.apache.spark.mllib.feature.IDF.fit(IDF.scala:56)
    at uk.co.news.PreparedData.<init>(Preparator.scala:70)
    at uk.co.news.Preparator.prepare(Preparator.scala:47)
    at uk.co.news.Preparator.prepare(Preparator.scala:43)

Since there is no error when running the command to import the emails, I don't understand why the data set is still empty. I double-checked the emails.json file and the data is indeed there. This is the output of running

pio import --appid 1 --input data/emails.json

ubuntu@ip-172-31-0-60:~/pio-textclassification$ pio import --appid 1 --input data/emails.json
[INFO] [Runner$] Submission command: /opt/spark-1.4.1-bin-hadoop2.6/bin/spark-submit --class io.prediction.tools.imprt.FileToEvents --files file:/opt/PredictionIO/conf/log4j.properties --driver-class-path /opt/PredictionIO/conf file:/opt/PredictionIO/lib/pio-assembly-0.9.4.jar --appid 1 --input file:/home/ubuntu/pio-textclassification/data/emails.json --env PIO_ENV_LOADED=1,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_FS_BASEDIR=/home/ubuntu/.pio_store,PIO_HOME=/opt/PredictionIO,PIO_FS_ENGINESDIR=/home/ubuntu/.pio_store/engines,PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=PGSQL,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio,PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc,PIO_FS_TMPDIR=/home/ubuntu/.pio_store/tmp,PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=PGSQL,PIO_CONF_DIR=/opt/PredictionIO/conf
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://[email protected]:49257]
[INFO] [FileToEvents$] Events are imported.
[INFO] [FileToEvents$] Done.
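
As far as I understand, pio import treats the input as one event JSON object per line and stores each event as-is, so "Events are imported." does not by itself mean the fields match what the engine's DataSource later queries. Each line should roughly follow this shape (the values below are only an illustration, not copied from my file):

{"event": "my_event", "entityType": "my_entity_type", "entityId": "42", "properties": {"text": "some message body", "label": "spam"}, "eventTime": "2015-10-01T00:00:00.000Z"}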

EDIT:

pio build --verbose

showed an exception that was being swallowed. The problem is with the database connection, but it's still not clear what exactly is wrong, since parts of the exception are replaced with "..."

[DEBUG] [ConnectionPool$] Registered connection pool : ConnectionPool(url:jdbc:postgresql://localhost/pio, user:pio) using factory : <default>
[DEBUG] [ConnectionPool$] Registered singleton connection pool : ConnectionPool(url:jdbc:postgresql://localhost/pio, user:pio)
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed

  [SQL Execution]
   create table if not exists pio_meta_enginemanifests ( id varchar(100) not null primary key, version text not null, engineName text not null, description text, files text not null, engineFactory text not null); (10 ms)

  [Stack Trace]
    ...
    io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$1.apply(JDBCEngineManifests.scala:37)
    io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$1.apply(JDBCEngineManifests.scala:29)
    scalikejdbc.DBConnection$class.autoCommit(DBConnection.scala:222)
    scalikejdbc.DB.autoCommit(DB.scala:60)
    scalikejdbc.DB$$anonfun$autoCommit$1.apply(DB.scala:215)
    scalikejdbc.DB$$anonfun$autoCommit$1.apply(DB.scala:214)
    scalikejdbc.LoanPattern$class.using(LoanPattern.scala:18)
    scalikejdbc.DB$.using(DB.scala:138)
    scalikejdbc.DB$.autoCommit(DB.scala:214)
    io.prediction.data.storage.jdbc.JDBCEngineManifests.<init>(JDBCEngineManifests.scala:29)
    sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    io.prediction.data.storage.Storage$.getDataObject(Storage.scala:293)
    ...

[INFO] [RegisterEngine$] Registering engine JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL 8ccd38126d56ed48adaa9f85547131467f7629f7
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed

  [SQL Execution]
   update pio_meta_enginemanifests set engineName = 'pio-textclassification', description = 'pio-autogen-manifest', files = 'file:/home/ubuntu/pio-textclassification/target/scala-2.10/uk.co.news-assembly-0.1-SNAPSHOT-deps.jar... (192)', engineFactory = '' where id = 'JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL' and version = '8ccd38126d56ed48adaa9f85547131467f7629f7'; (3 ms)

  [Stack Trace]
    ...
    io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$7.apply(JDBCEngineManifests.scala:85)
    io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$7.apply(JDBCEngineManifests.scala:78)
    scalikejdbc.DBConnection$$anonfun$3.apply(DBConnection.scala:297)
    scalikejdbc.DBConnection$class.scalikejdbc$DBConnection$$rollbackIfThrowable(DBConnection.scala:274)
    scalikejdbc.DBConnection$class.localTx(DBConnection.scala:295)
    scalikejdbc.DB.localTx(DB.scala:60)
    scalikejdbc.DB$.localTx(DB.scala:257)
    io.prediction.data.storage.jdbc.JDBCEngineManifests.update(JDBCEngineManifests.scala:78)
    io.prediction.tools.RegisterEngine$.registerEngine(RegisterEngine.scala:50)
    io.prediction.tools.console.Console$.build(Console.scala:813)
    io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:698)
    io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:684)
    scala.Option.map(Option.scala:145)
    io.prediction.tools.console.Console$.main(Console.scala:684)
    io.prediction.tools.console.Console.main(Console.scala)
    ...

[DEBUG] [StatementExecutor$$anon$1] SQL execution completed

  [SQL Execution]
   INSERT INTO pio_meta_enginemanifests VALUES( 'JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL', '8ccd38126d56ed48adaa9f85547131467f7629f7', 'pio-textclassification', 'pio-autogen-manifest', 'file:/home/ubuntu/pio-textclassification/target/scala-2.10/uk.co.news-assembly-0.1-SNAPSHOT-deps.jar... (192)', ''); (1 ms)

  [Stack Trace]
    ...
    io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$2.apply(JDBCEngineManifests.scala:48)
    io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$2.apply(JDBCEngineManifests.scala:40)
    scalikejdbc.DBConnection$$anonfun$3.apply(DBConnection.scala:297)
    scalikejdbc.DBConnection$class.scalikejdbc$DBConnection$$rollbackIfThrowable(DBConnection.scala:274)
    scalikejdbc.DBConnection$class.localTx(DBConnection.scala:295)
    scalikejdbc.DB.localTx(DB.scala:60)
    scalikejdbc.DB$.localTx(DB.scala:257)
    io.prediction.data.storage.jdbc.JDBCEngineManifests.insert(JDBCEngineManifests.scala:40)
    io.prediction.data.storage.jdbc.JDBCEngineManifests.update(JDBCEngineManifests.scala:89)
    io.prediction.tools.RegisterEngine$.registerEngine(RegisterEngine.scala:50)
    io.prediction.tools.console.Console$.build(Console.scala:813)
    io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:698)
    io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:684)
    scala.Option.map(Option.scala:145)
    io.prediction.tools.console.Console$.main(Console.scala:684)
    ...

[INFO] [Console$] Your engine is ready for training.

Upvotes: 2

Views: 870

Answers (2)

cahen

Reputation: 16676

The solution was to change DataSource.scala to match the schema of the emails.json file before running pio build.

This is the only method I had to change in the file:

  private def readEventData(sc: SparkContext): RDD[Observation] = {
    // Get RDD of Events.
    PEventStore.find(
      appName = dsp.appName,
      entityType = Some("content"),
      eventNames = Some(List("e-mail"))
    )(sc).map(e => {
      // Convert the collected RDD of events to an RDD of Observation objects.
      val label: String = e.properties.get[String]("label")
      Observation(
        if (label == "spam") 1.0 else 0.0,
        e.properties.get[String]("text"),
        label
      )
    }).cache
  }

I had to change the previous values of entityType, eventNames and the label check to "content", "e-mail" and "spam" respectively, so that they match the fields in my emails.json.
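
With those values, each event line in emails.json has to carry the same names; roughly something like this (an illustrative sketch, not an exact line from my data set):

{"event": "e-mail", "entityType": "content", "entityId": "1", "properties": {"text": "Win a free prize now", "label": "spam"}}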

Upvotes: 0

Tom Chan

Reputation: 71

A few things to check:

  1. Does "pio app list" show that MyTextApp has appId 1?
  2. Download https://github.com/yipjustin/pio-event-distribution-checker and change its engine.json so that appId reads 1 (see the sketch below), then run "pio build" and "pio train" to see if the data is actually imported.
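
The appId lives under the datasource parameters in engine.json; a rough sketch of the relevant fragment (the surrounding keys depend on the template, so treat this as illustrative):

{
  "datasource": {
    "params": {
      "appId": 1
    }
  }
}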

P.S. There is a Google group (https://groups.google.com/forum/#!forum/predictionio-user) where your question will be answered more quickly by the community of PredictionIO users.

Upvotes: 1
