Reputation: 16676
I'm following this quick start after starting this ready-to-use PredictionIO Amazon EC2 instance and after running these commands it fails in the pio train
:
pio app new MyTextApp
pio import --appid 1 --input data/stopwords.json
pio import --appid 1 --input data/emails.json
pio build
pio train
...
Data set is empty, make sure event fields match imported data.
Exception in thread "main" java.lang.IllegalStateException: Haven't seen any document yet.
at org.apache.spark.mllib.feature.IDF$DocumentFrequencyAggregator.idf(IDF.scala:132)
at org.apache.spark.mllib.feature.IDF.fit(IDF.scala:56)
at uk.co.news.PreparedData.<init>(Preparator.scala:70)
at uk.co.news.Preparator.prepare(Preparator.scala:47)
at uk.co.news.Preparator.prepare(Preparator.scala:43)
Since there is no error when running the command to import emails, I don't understand why the data set is still empty. I double-checked the email.json
file and the data is indeed there and this is the result when running
pio import --appid 1 --input data/emails.json
ubuntu@ip-172-31-0-60:~/pio-textclassification$ pio import --appid 1 --input data/emails.json
[INFO] [Runner$] Submission command: /opt/spark-1.4.1-bin-hadoop2.6/bin/spark-submit --class io.prediction.tools.imprt.FileToEvents --files file:/opt/PredictionIO/conf/log4j.properties --driver-class-path /opt/PredictionIO/conf file:/opt/PredictionIO/lib/pio-assembly-0.9.4.jar --appid 1 --input file:/home/ubuntu/pio-textclassification/data/emails.json --env PIO_ENV_LOADED=1,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_FS_BASEDIR=/home/ubuntu/.pio_store,PIO_HOME=/opt/PredictionIO,PIO_FS_ENGINESDIR=/home/ubuntu/.pio_store/engines,PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=PGSQL,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio,PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc,PIO_FS_TMPDIR=/home/ubuntu/.pio_store/tmp,PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=PGSQL,PIO_CONF_DIR=/opt/PredictionIO/conf
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://[email protected]:49257]
[INFO] [FileToEvents$] Events are imported.
[INFO] [FileToEvents$] Done.
EDIT:
pio build --verbose
showed an exception that was being swallowed. The problem is with the database connection, but it's still not clear what is wrong since parts of the exception are being replaced with "..."
[DEBUG] [ConnectionPool$] Registered connection pool : ConnectionPool(url:jdbc:postgresql://localhost/pio, user:pio) using factory : <default>
[DEBUG] [ConnectionPool$] Registered singleton connection pool : ConnectionPool(url:jdbc:postgresql://localhost/pio, user:pio)
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed
[SQL Execution]
create table if not exists pio_meta_enginemanifests ( id varchar(100) not null primary key, version text not null, engineName text not null, description text, files text not null, engineFactory text not null); (10 ms)
[Stack Trace]
...
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$1.apply(JDBCEngineManifests.scala:37)
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$1.apply(JDBCEngineManifests.scala:29)
scalikejdbc.DBConnection$class.autoCommit(DBConnection.scala:222)
scalikejdbc.DB.autoCommit(DB.scala:60)
scalikejdbc.DB$$anonfun$autoCommit$1.apply(DB.scala:215)
scalikejdbc.DB$$anonfun$autoCommit$1.apply(DB.scala:214)
scalikejdbc.LoanPattern$class.using(LoanPattern.scala:18)
scalikejdbc.DB$.using(DB.scala:138)
scalikejdbc.DB$.autoCommit(DB.scala:214)
io.prediction.data.storage.jdbc.JDBCEngineManifests.<init>(JDBCEngineManifests.scala:29)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
io.prediction.data.storage.Storage$.getDataObject(Storage.scala:293)
...
[INFO] [RegisterEngine$] Registering engine JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL 8ccd38126d56ed48adaa9f85547131467f7629f7
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed
[SQL Execution]
update pio_meta_enginemanifests set engineName = 'pio-textclassification', description = 'pio-autogen-manifest', files = 'file:/home/ubuntu/pio-textclassification/target/scala-2.10/uk.co.news-assembly-0.1-SNAPSHOT-deps.jar... (192)', engineFactory = '' where id = 'JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL' and version = '8ccd38126d56ed48adaa9f85547131467f7629f7'; (3 ms)
[Stack Trace]
...
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$7.apply(JDBCEngineManifests.scala:85)
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$7.apply(JDBCEngineManifests.scala:78)
scalikejdbc.DBConnection$$anonfun$3.apply(DBConnection.scala:297)
scalikejdbc.DBConnection$class.scalikejdbc$DBConnection$$rollbackIfThrowable(DBConnection.scala:274)
scalikejdbc.DBConnection$class.localTx(DBConnection.scala:295)
scalikejdbc.DB.localTx(DB.scala:60)
scalikejdbc.DB$.localTx(DB.scala:257)
io.prediction.data.storage.jdbc.JDBCEngineManifests.update(JDBCEngineManifests.scala:78)
io.prediction.tools.RegisterEngine$.registerEngine(RegisterEngine.scala:50)
io.prediction.tools.console.Console$.build(Console.scala:813)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:698)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:684)
scala.Option.map(Option.scala:145)
io.prediction.tools.console.Console$.main(Console.scala:684)
io.prediction.tools.console.Console.main(Console.scala)
...
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed
[SQL Execution]
INSERT INTO pio_meta_enginemanifests VALUES( 'JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL', '8ccd38126d56ed48adaa9f85547131467f7629f7', 'pio-textclassification', 'pio-autogen-manifest', 'file:/home/ubuntu/pio-textclassification/target/scala-2.10/uk.co.news-assembly-0.1-SNAPSHOT-deps.jar... (192)', ''); (1 ms)
[Stack Trace]
...
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$2.apply(JDBCEngineManifests.scala:48)
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$2.apply(JDBCEngineManifests.scala:40)
scalikejdbc.DBConnection$$anonfun$3.apply(DBConnection.scala:297)
scalikejdbc.DBConnection$class.scalikejdbc$DBConnection$$rollbackIfThrowable(DBConnection.scala:274)
scalikejdbc.DBConnection$class.localTx(DBConnection.scala:295)
scalikejdbc.DB.localTx(DB.scala:60)
scalikejdbc.DB$.localTx(DB.scala:257)
io.prediction.data.storage.jdbc.JDBCEngineManifests.insert(JDBCEngineManifests.scala:40)
io.prediction.data.storage.jdbc.JDBCEngineManifests.update(JDBCEngineManifests.scala:89)
io.prediction.tools.RegisterEngine$.registerEngine(RegisterEngine.scala:50)
io.prediction.tools.console.Console$.build(Console.scala:813)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:698)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:684)
scala.Option.map(Option.scala:145)
io.prediction.tools.console.Console$.main(Console.scala:684)
...
[INFO] [Console$] Your engine is ready for training.
Upvotes: 2
Views: 870
Reputation: 16676
The solution was to change the DataSource.scala
to match the schema in the emails.json
file before running pio build
.
This is the only method I had to change in the file:
private def readEventData(sc: SparkContext) : RDD[Observation] = {
//Get RDD of Events.
PEventStore.find(
appName = dsp.appName,
entityType = Some("content"),
eventNames = Some(List("e-mail"))
// Convert collected RDD of events to and RDD of Observation
// objects.
)(sc).map(e => {
val label : String = e.properties.get[String]("label")
Observation(
if (label == "spam") 1.0 else 0.0,
e.properties.get[String]("text"),
label
)
}).cache
}
I had to change the previous values to "content", "e-mail" and "spam".
Upvotes: 0
Reputation: 71
A few things to check:
P.S. There is a google group (https://groups.google.com/forum/#!forum/predictionio-user) for which your question will be answered more quickly by the community of PredictionIO users.
Upvotes: 1