{"paragraphs":[{"text":"%md\n\n#Hortonworks Blog - Predicting Airline Delays\n\nThis notebook is based on Blog posts below, by [Ofer Mendelevitch](http://hortonworks.com/blog/author/ofermend/)\n[http://hortonworks.com/blog/data-science-apacheh-hadoop-predicting-airline-delays/](http://hortonworks.com/blog/data-science-apacheh-hadoop-predicting-airline-delays/)\n[http://hortonworks.com/blog/data-science-hadoop-spark-scala-part-2/](http://hortonworks.com/blog/data-science-hadoop-spark-scala-part-2/)","dateUpdated":"2016-01-26T06:56:20+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453834308164_195465260","id":"20160126-185148_1042529576","result":{"code":"SUCCESS","type":"HTML","msg":"
This notebook is based on the blog posts below by Ofer Mendelevitch\n
http://hortonworks.com/blog/data-science-apacheh-hadoop-predicting-airline-delays/\n
http://hortonworks.com/blog/data-science-hadoop-spark-scala-part-2/
In this demo, we show how to build a predictive model with Hadoop; this time we'll use Apache Spark and ML-Lib.
\nWe will show how to use Apache Spark via its Scala API to generate our feature matrix and also use ML-Lib (Spark's machine learning library) to build and evaluate our classification models.
\nRecall from part 1 that we are constructing a predictive model for flight delays. Our source dataset (linked from the original blog post) includes details about flights in the US from the years 1987-2008. We have also enriched the data with weather information: daily temperatures (min/max), wind speed, snow conditions and precipitation.
\nWe will build a supervised learning model to predict flight delays for flights leaving O'Hare International Airport (ORD). We will use the year 2007 data to build the model, and test its validity using data from 2008.
\nApache Spark's basic data abstraction is that of an RDD (resilient distributed dataset), which is a fault-tolerant collection of elements that can be operated on in parallel across your Hadoop cluster.
\nSpark's API (available in Scala, Python or Java) supports a variety of transformations such as map(), flatMap(), filter() and join() to create and manipulate RDDs. For a full description of the API, please check the Spark programming guide.
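\nAs a minimal illustration (not part of the original post), here is a tiny sketch of creating and transforming an RDD, assuming only the SparkContext sc that Zeppelin provides:
val nums = sc.parallelize(1 to 10)              // create an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)             // transformation: keep even numbers
val squares = evens.map(n => n * n)             // transformation: square each element
println(squares.collect().mkString(","))        // action: prints 4,16,36,64,100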
\nSimilar to the Scikit-learn demo, in our first iteration we generate the following features for each flight: the departure delay (from which we later derive the label), month, day of month, day of week, hour of the scheduled departure time, flight distance, and the number of days from the nearest US holiday.
\nWe will use Spark RDDs to perform the same pre-processing, transforming the raw flight delay dataset into the two feature matrices: data_2007 (our training set) and data_2008 (our test set).
\nThe case class DelayRec, which encapsulates a flight delay record and represents the feature vector, does most of the heavy lifting through its methods:
\nWith DelayRec in place, our processing takes the following steps (in the function prepFlightDelays): we read the raw CSV file into an RDD of lines, parse each line with opencsv's CSVReader into a DelayRec, and then filter out the header row, cancelled flights, and flights that did not originate at ORD.
\nFinally, we use the gen_features method to generate the final feature vector per row as an array of doubles, keyed by the flight date.
\n"},"dateCreated":"2016-01-26T01:41:44+0000","dateStarted":"2016-01-26T06:58:06+0000","dateFinished":"2016-01-26T06:58:07+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3202"},{"text":"%spark\n\nimport org.apache.spark.rdd._\nimport scala.collection.JavaConverters._\nimport au.com.bytecode.opencsv.CSVReader\n\nimport java.io._\nimport org.joda.time._\nimport org.joda.time.format._\nimport org.joda.time.format.DateTimeFormat\nimport org.joda.time.DateTime\nimport org.joda.time.Days\n\n\ncase class DelayRec(year: String,\n month: String,\n dayOfMonth: String,\n dayOfWeek: String,\n crsDepTime: String,\n depDelay: String,\n origin: String,\n distance: String,\n cancelled: String) {\n\n val holidays = List(\"01/01/2007\", \"01/15/2007\", \"02/19/2007\", \"05/28/2007\", \"06/07/2007\", \"07/04/2007\",\n \"09/03/2007\", \"10/08/2007\" ,\"11/11/2007\", \"11/22/2007\", \"12/25/2007\",\n \"01/01/2008\", \"01/21/2008\", \"02/18/2008\", \"05/22/2008\", \"05/26/2008\", \"07/04/2008\",\n \"09/01/2008\", \"10/13/2008\" ,\"11/11/2008\", \"11/27/2008\", \"12/25/2008\")\n\n def gen_features: (String, Array[Double]) = {\n val values = Array(\n depDelay.toDouble,\n month.toDouble,\n dayOfMonth.toDouble,\n dayOfWeek.toDouble,\n get_hour(crsDepTime).toDouble,\n distance.toDouble,\n days_from_nearest_holiday(year.toInt, month.toInt, dayOfMonth.toInt)\n )\n new Tuple2(to_date(year.toInt, month.toInt, dayOfMonth.toInt), values)\n }\n\n def get_hour(depTime: String) : String = \"%04d\".format(depTime.toInt).take(2)\n def to_date(year: Int, month: Int, day: Int) = \"%04d%02d%02d\".format(year, month, day)\n\n def days_from_nearest_holiday(year:Int, month:Int, day:Int): Int = {\n val sampleDate = new DateTime(year, month, day, 0, 0)\n\n holidays.foldLeft(3000) { (r, c) =>\n val holiday = DateTimeFormat.forPattern(\"MM/dd/yyyy\").parseDateTime(c)\n val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)\n math.min(r, distance)\n }\n }\n }\n\n// function to do a preprocessing step for a given file\ndef prepFlightDelays(infile: String): RDD[DelayRec] = {\n val data = sc.textFile(infile)\n\n data.map { line =>\n val reader = new CSVReader(new StringReader(line))\n reader.readAll().asScala.toList.map(rec => DelayRec(rec(0),rec(1),rec(2),rec(3),rec(5),rec(15),rec(16),rec(18),rec(21)))\n }.map(list => list(0))\n .filter(rec => rec.year != \"Year\")\n .filter(rec => rec.cancelled == \"0\")\n .filter(rec => rec.origin == \"ORD\")\n}\n\nval data_2007tmp = prepFlightDelays(\"/tmp/airflightsdelays/flights_2007.csv.bz2\")\nval data_2007 = data_2007tmp.map(rec => rec.gen_features._2)\nval data_2008 = prepFlightDelays(\"/tmp/airflightsdelays/flights_2008.csv.bz2\").map(rec => rec.gen_features._2)\n\ndata_2007tmp.toDF().registerTempTable(\"data_2007tmp\")\n\ndata_2007.take(5).map(x => x mkString \",\").foreach(println)","dateUpdated":"2016-01-28T02:30:04+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453806329440_-894892047","id":"20160126-110529_1047309575","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.rdd._\nimport scala.collection.JavaConverters._\nimport au.com.bytecode.opencsv.CSVReader\nimport java.io._\nimport org.joda.time._\nimport org.joda.time.format._\nimport org.joda.time.format.DateTimeFormat\nimport org.joda.time.DateTime\nimport org.joda.time.Days\ndefined 
class DelayRec\nprepFlightDelays: (infile: String)org.apache.spark.rdd.RDD[DelayRec]\ndata_2007tmp: org.apache.spark.rdd.RDD[DelayRec] = MapPartitionsRDD[6] at filter at ...\n\nWith the data_2007 dataset (which we'll use for training) and the data_2008 dataset (which we'll use for validation) as RDDs, we now build a predictive model using Spark's ML-Lib machine learning library.
\nML-Lib is Spark’s scalable machine learning library, which provides learning algorithms and utilities for classification, regression, clustering, collaborative filtering, dimensionality reduction, and more.
\nIf you compare ML-Lib to Scikit-learn, ML-Lib has historically lacked a few important algorithms such as Random Forest and Gradient Boosted Trees. Having said that, we see a strong pace of innovation from the ML-Lib community, and more algorithms and features continue to be added; Random Forest, for example, is now available and is used later in this notebook.
\nTo use ML-Lib's machine learning algorithms, we first parse our feature matrices into RDDs of LabeledPoint objects (for both the training and test datasets). LabeledPoint is ML-Lib's abstraction for a feature vector accompanied by a label. We consider flight delays of 15 minutes or more as “delays” and mark them with a label of 1.0, and delays of under 15 minutes as “non-delays” with a label of 0.0.
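\nAs a small illustration (not from the original post), a LabeledPoint simply pairs a label with a dense feature vector; for example, a hypothetical 20-minute delay would be labeled as follows:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val delayMinutes = 20.0                                                            // hypothetical departure delay
val lp = LabeledPoint(if (delayMinutes >= 15) 1.0 else 0.0, Vectors.dense(1.0, 2.0, 3.0))  // label 1.0 means "delayed"
println(lp)                                                                        // (1.0,[1.0,2.0,3.0])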
\nWe also use ML-Lib's StandardScaler class to normalize our feature values for both the training and validation sets. This is important because our linear models are trained with Stochastic Gradient Descent, which is known to perform best when feature vectors are normalized.
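\nFor intuition, here is a minimal standalone sketch (not from the original post, using the same mllib API as the paragraph below) of what StandardScaler does; each feature is rescaled to roughly zero mean and unit standard deviation:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
val vecs = sc.parallelize(Seq(Vectors.dense(1.0, 10.0), Vectors.dense(3.0, 30.0)))  // toy feature vectors
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vecs)          // learn per-feature mean and std
println(scaler.transform(Vectors.dense(2.0, 20.0)))                                 // standardized vector, here [0.0,0.0]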
\n"},"dateCreated":"2016-01-26T01:43:35+0000","dateStarted":"2016-01-26T01:43:54+0000","dateFinished":"2016-01-26T01:43:54+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3207"},{"text":"%spark\n\nimport org.apache.spark.mllib.regression.LabeledPoint\nimport org.apache.spark.mllib.linalg.Vectors\nimport org.apache.spark.mllib.feature.StandardScaler\n\ndef parseData(vals: Array[Double]): LabeledPoint = {\n LabeledPoint(if (vals(0)>=15) 1.0 else 0.0, Vectors.dense(vals.drop(1)))\n}\n\n// Prepare training set\nval parsedTrainData = data_2007.map(parseData)\nparsedTrainData.cache\nval scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedTrainData.map(x => x.features))\nval scaledTrainData = parsedTrainData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\nscaledTrainData.cache\n\n// Prepare test/validation set\nval parsedTestData = data_2008.map(parseData)\nparsedTestData.cache\nval scaledTestData = parsedTestData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\nscaledTestData.cache\n\nscaledTrainData.take(3).map(x => (x.label, x.features)).foreach(println)","dateUpdated":"2016-01-28T03:15:56+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453806769475_-1419256657","id":"20160126-111249_801843407","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.regression.LabeledPoint\nimport org.apache.spark.mllib.linalg.Vectors\nimport org.apache.spark.mllib.feature.StandardScaler\nparseData: (vals: Array[Double])org.apache.spark.mllib.regression.LabeledPoint\nparsedTrainData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[31] at map atNote that we use the RDD cache method to ensure that these computed RDDs (parsedTrainData, scaledTrainData, parsedTestData and scaledTestData) are cached in memory by Spark and not re-computed with each iteration of stochastic gradient descent.
\nWe also define a Metrics class to evaluate the classification metrics: precision, recall, accuracy and the F1-measure.
\n"},"dateCreated":"2016-01-26T01:44:07+0000","dateStarted":"2016-01-26T01:44:11+0000","dateFinished":"2016-01-26T01:44:11+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3209"},{"text":"%spark\n\n// Function to compute evaluation metrics\ndef eval_metrics(labelsAndPreds: RDD[(Double, Double)]) : Tuple2[Array[Double], Array[Double]] = {\n val tp = labelsAndPreds.filter(r => r._1==1 && r._2==1).count.toDouble\n val tn = labelsAndPreds.filter(r => r._1==0 && r._2==0).count.toDouble\n val fp = labelsAndPreds.filter(r => r._1==1 && r._2==0).count.toDouble\n val fn = labelsAndPreds.filter(r => r._1==0 && r._2==1).count.toDouble\n\n val precision = tp / (tp+fp)\n val recall = tp / (tp+fn)\n val F_measure = 2*precision*recall / (precision+recall)\n val accuracy = (tp+tn) / (tp+tn+fp+fn)\n new Tuple2(Array(tp, tn, fp, fn), Array(precision, recall, F_measure, accuracy))\n}\n\nimport org.apache.spark.rdd._\nimport org.apache.spark.rdd.RDD\n\nclass Metrics(labelsAndPreds: RDD[(Double, Double)]) extends java.io.Serializable {\n\n private def filterCount(lftBnd:Int,rtBnd:Int):Double = labelsAndPreds\n .map(x => (x._1.toInt, x._2.toInt))\n .filter(_ == (lftBnd,rtBnd)).count()\n\n lazy val tp = filterCount(1,1) // true positives\n lazy val tn = filterCount(0,0) // true negatives\n lazy val fp = filterCount(0,1) // false positives\n lazy val fn = filterCount(1,0) // false negatives\n\n lazy val precision = tp / (tp+fp)\n lazy val recall = tp / (tp+fn)\n lazy val F1 = 2*precision*recall / (precision+recall)\n lazy val accuracy = (tp+tn) / (tp+tn+fp+fn)\n}","dateUpdated":"2016-01-28T03:47:49+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453806783573_-1125461739","id":"20160126-111303_373848071","result":{"code":"SUCCESS","type":"TEXT","msg":"eval_metrics: (labelsAndPreds: org.apache.spark.rdd.RDD[(Double, Double)])(Array[Double], Array[Double])\nimport org.apache.spark.rdd._\nimport org.apache.spark.rdd.RDD\ndefined class Metrics\n"},"dateCreated":"2016-01-26T11:13:03+0000","dateStarted":"2016-01-28T03:47:49+0000","dateFinished":"2016-01-28T03:47:54+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3210"},{"text":"%md\n\nML-Lib supports a few algorithms for supervised learning, among those are Linear Regression and Logistic Regression, Naive Bayes, Decision Tree, SVM, Random Forest and Gradient Boosted Trees. We will demonstrate the use of Logistic Regression, Decision Tree and Random Forest.\n\nLet's see how to build these models with ML-Lib:","dateUpdated":"2016-01-26T01:55:43+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":false,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453815855273_-2135049276","id":"20160126-134415_537440182","result":{"code":"SUCCESS","type":"HTML","msg":"ML-Lib supports a few algorithms for supervised learning, among those are Linear Regression and Logistic Regression, Naive Bayes, Decision Tree, SVM, Random Forest and Gradient Boosted Trees. We will demonstrate the use of Logistic Regression, Decision Tree and Random Forest.
\nLet's see how to build these models with ML-Lib:
\n"},"dateCreated":"2016-01-26T01:44:15+0000","dateStarted":"2016-01-26T01:44:26+0000","dateFinished":"2016-01-26T01:44:26+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3211"},{"text":"%spark \n\nimport org.apache.spark.mllib.classification.LogisticRegressionWithSGD\n\n// Build the Logistic Regression model\nval model_lr = LogisticRegressionWithSGD.train(scaledTrainData, numIterations=100)\n\n// Predict\nval labelsAndPreds_lr = scaledTestData.map { point =>\n val pred = model_lr.predict(point.features)\n (pred, point.label)\n}\nval m_lr = eval_metrics(labelsAndPreds_lr)._2\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_lr(0), m_lr(1), m_lr(2), m_lr(3)))\n","dateUpdated":"2016-01-28T03:47:57+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453806804858_-1669481360","id":"20160126-111324_1100040136","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.classification.LogisticRegressionWithSGD\nmodel_lr: org.apache.spark.mllib.classification.LogisticRegressionModel = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 6, numClasses = 2, threshold = 0.5\nlabelsAndPreds_lr: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[240] at map atLet's inspect the feature weights from this model:
\n"},"dateCreated":"2016-01-26T01:46:13+0000","dateStarted":"2016-01-26T01:46:24+0000","dateFinished":"2016-01-26T01:46:25+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3213"},{"text":"println(model_lr.weights)","dateUpdated":"2016-01-28T03:57:34+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453815995596_1785773523","id":"20160126-134635_1124548361","result":{"code":"SUCCESS","type":"TEXT","msg":"[-0.05853381854183226,0.006414916916192362,-0.03848401341104848,0.41060777495363165,0.05420780154644833,-0.0013592202581950504]\n"},"dateCreated":"2016-01-26T01:46:35+0000","dateStarted":"2016-01-28T03:57:34+0000","dateFinished":"2016-01-28T03:57:36+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3214"},{"text":"%md\nWe have built a model using Logistic Regression with SGD using 100 iterations, and then used it to predict flight delays over the validation set to measure performance: precision, recall, F1 and accuracy. \n\nNext, let's try the Support Vector Machine:","dateUpdated":"2016-01-26T01:56:48+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453816001347_-1902814182","id":"20160126-134641_1463115936","result":{"code":"SUCCESS","type":"HTML","msg":"We have built a model using Logistic Regression with SGD using 100 iterations, and then used it to predict flight delays over the validation set to measure performance: precision, recall, F1 and accuracy.
\nNext, let's try the Support Vector Machine:
\n"},"dateCreated":"2016-01-26T01:46:41+0000","dateStarted":"2016-01-26T01:56:00+0000","dateFinished":"2016-01-26T01:56:00+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3215"},{"text":"%spark\n\nimport org.apache.spark.mllib.classification.SVMWithSGD\n\n// Build the SVM model\nval svmAlg = new SVMWithSGD()\nsvmAlg.optimizer.setNumIterations(100)\n .setRegParam(1.0)\n .setStepSize(1.0)\nval model_svm = svmAlg.run(scaledTrainData)\n\n// Predict\nval labelsAndPreds_svm = scaledTestData.map { point =>\n val pred = model_svm.predict(point.features)\n (pred, point.label)\n}\nval m_svm = eval_metrics(labelsAndPreds_svm)._2\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_svm(0), m_svm(1), m_svm(2), m_svm(3)))","dateUpdated":"2016-01-28T03:57:42+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453806830510_999198308","id":"20160126-111350_883085981","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.classification.SVMWithSGD\nsvmAlg: org.apache.spark.mllib.classification.SVMWithSGD = org.apache.spark.mllib.classification.SVMWithSGD@5c7af3d2\nres48: svmAlg.optimizer.type = org.apache.spark.mllib.optimization.GradientDescent@2a805d12\nmodel_svm: org.apache.spark.mllib.classification.SVMModel = org.apache.spark.mllib.classification.SVMModel: intercept = 0.0, numFeatures = 6, numClasses = 2, threshold = 0.0\nlabelsAndPreds_svm: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[448] at map atNext, let's try a Decision Tree model:
\n"},"dateCreated":"2016-01-26T01:47:25+0000","dateStarted":"2016-01-28T11:47:45+0000","dateFinished":"2016-01-28T11:47:56+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3217"},{"text":"%spark\n\nimport org.apache.spark.mllib.tree.DecisionTree\n\n// Build the Decision Tree model\nval numClasses = 2\nval categoricalFeaturesInfo = Map[Int, Int]()\nval impurity = \"gini\"\nval maxDepth = 10\nval maxBins = 100\nval model_dt = DecisionTree.trainClassifier(parsedTrainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)\n\n// Predict\nval labelsAndPreds_dt = parsedTestData.map { point =>\n val pred = model_dt.predict(point.features)\n (pred, point.label)\n}\nval m_dt = eval_metrics(labelsAndPreds_dt)._2\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_dt(0), m_dt(1), m_dt(2), m_dt(3)))","dateUpdated":"2016-01-28T04:04:07+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453806849360_-1435710560","id":"20160126-111409_574051782","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.tree.DecisionTree\nnumClasses: Int = 2\ncategoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map()\nimpurity: String = gini\nmaxDepth: Int = 10\nmaxBins: Int = 100\nmodel_dt: org.apache.spark.mllib.tree.model.DecisionTreeModel = DecisionTreeModel classifier of depth 10 with 1851 nodes\nlabelsAndPreds_dt: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[526] at map atFinally, let's try the new Random Forest implementation. A Random Forest is an ensemble method that uses Decision Trees as the underlying “weak” classifier. Let's see how it works with Spark:
\n"},"dateCreated":"2016-01-26T11:14:29+0000","dateStarted":"2016-01-26T01:56:36+0000","dateFinished":"2016-01-26T01:56:37+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3219"},{"text":"%spark\n\nimport org.apache.spark.mllib.tree.RandomForest\nimport org.apache.spark.mllib.tree.configuration.Strategy\n\nval treeStrategy = Strategy.defaultStrategy(\"Classification\")\nval numTrees = 100 \nval featureSubsetStrategy = \"auto\" // Let the algorithm choose\nval model_rf = RandomForest.trainClassifier(parsedTrainData, treeStrategy, numTrees, featureSubsetStrategy, seed = 123)\n\n// Predict\nval labelsAndPreds_rf = parsedTestData.map { point =>\n val pred = model_rf.predict(point.features)\n (point.label, pred)\n}\nval m_rf = new Metrics(labelsAndPreds_rf)\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\"\n .format(m_rf.precision, m_rf.recall, m_rf.F1, m_rf.accuracy))","dateUpdated":"2016-01-28T04:04:33+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453807165952_-475317653","id":"20160126-111925_1859203454","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.tree.RandomForest\nimport org.apache.spark.mllib.tree.configuration.Strategy\ntreeStrategy: org.apache.spark.mllib.tree.configuration.Strategy = org.apache.spark.mllib.tree.configuration.Strategy@375a1c99\nnumTrees: Int = 100\nfeatureSubsetStrategy: String = auto\nmodel_rf: org.apache.spark.mllib.tree.model.RandomForestModel = \nTreeEnsembleModel classifier with 100 trees\n\nlabelsAndPreds_rf: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[565] at map atNote that overall accuracy of decision tree is higher than logistic regression, and Random Forest has the highest accuracy overall.
\n"},"dateCreated":"2016-01-26T01:57:24+0000","dateStarted":"2016-01-26T04:54:27+0000","dateFinished":"2016-01-26T04:54:28+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3221"},{"text":"%md \n\n## Building a richer model with flight delays, weather data using Apache Spark and ML-Lib\n\nAnother common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features, thus achieving better predictive performance overall for our model. Our idea is to layer-in weather data. We can get this data from a publicly available dataset here: http://www.ncdc.noaa.gov/cdo-web/datasets/\n\nWe will look at daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). Clearly, weather conditions in the destination airport also affect delays, but for simplicity of this demo we just include weather at the origin (ORD).\n\nTo accomplish this with Apache Spark, we rewrite our previous *preprocess_spark* function to extract the same base features from the flight delay dataset, and also join those with five variables from the weather datasets: minimum and maximum temperature for the day, precipitation, snow and wind speed. Let's see how this is accomplished.","dateUpdated":"2016-01-26T04:57:07+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453820656966_-36936379","id":"20160126-150416_1572405220","result":{"code":"SUCCESS","type":"HTML","msg":"Another common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features, thus achieving better predictive performance overall for our model. Our idea is to layer-in weather data. We can get this data from a publicly available dataset here: http://www.ncdc.noaa.gov/cdo-web/datasets/
\nWe will look at daily temperatures (min/max), wind speed, snow conditions and precipitation at the flight's origin airport (ORD). Clearly, weather conditions at the destination airport also affect delays, but for simplicity of this demo we include only the weather at the origin.
\nTo accomplish this with Apache Spark, we rewrite our previous pre-processing code as a preprocess_spark function that extracts the same base features from the flight delay dataset and joins them with five variables from the weather datasets: minimum and maximum temperature for the day, precipitation, snow and wind speed. Let's see how this is accomplished.
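\nBefore the full function, here is a minimal sketch (with made-up values, not real data) of the date-keyed join pattern it relies on: both RDDs are keyed by the flight date string, and each join appends one weather value to the feature array:
import org.apache.spark.SparkContext._   // enables pair-RDD operations such as join
val delays  = sc.parallelize(Seq(("20070101", Array(5.0, 1.0)), ("20070102", Array(0.0, 2.0))))  // (date, features)
val minTemp = sc.parallelize(Seq(("20070101", -139.0), ("20070102", -80.0)))                     // (date, TMIN)
val joined  = delays.join(minTemp).map { case (date, (feats, tmin)) => (date, feats :+ tmin) }   // append TMIN per date
joined.collect().foreach { case (d, f) => println(d + " -> " + f.mkString(",")) }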
\n"},"dateCreated":"2016-01-26T03:04:16+0000","dateStarted":"2016-01-26T04:57:04+0000","dateFinished":"2016-01-26T04:57:05+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3222"},{"text":"%spark\n\nimport org.apache.spark.SparkContext._\nimport scala.collection.JavaConverters._\nimport au.com.bytecode.opencsv.CSVReader\nimport java.io._\n\n// function to do a preprocessing step for a given file\n\ndef preprocess_spark(delay_file: String, weather_file: String): RDD[Array[Double]] = { \n // Read wether data\n val delayRecs = prepFlightDelays(delay_file).map{ rec => \n val features = rec.gen_features\n (features._1, features._2)\n }\n\n // Read weather data into RDDs\n val station_inx = 0\n val date_inx = 1\n val metric_inx = 2\n val value_inx = 3\n\n def filterMap(wdata:RDD[Array[String]], metric:String):RDD[(String,Double)] = {\n wdata.filter(vals => vals(metric_inx) == metric).map(vals => (vals(date_inx), vals(value_inx).toDouble))\n }\n\n val wdata = sc.textFile(weather_file).map(line => line.split(\",\"))\n .filter(vals => vals(station_inx) == \"USW00094846\")\n val w_tmin = filterMap(wdata,\"TMIN\")\n val w_tmax = filterMap(wdata,\"TMAX\")\n val w_prcp = filterMap(wdata,\"PRCP\")\n val w_snow = filterMap(wdata,\"SNOW\")\n val w_awnd = filterMap(wdata,\"AWND\")\n\n delayRecs.join(w_tmin).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n .join(w_tmax).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n .join(w_prcp).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n .join(w_snow).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n .join(w_awnd).map(vals => vals._2._1 ++ Array(vals._2._2))\n}\n\nval data_2007 = preprocess_spark(\"/tmp/airflightsdelays/flights_2007.csv.bz2\", \"/tmp/airflightsdelays/weather_2007.csv.gz\")\nval data_2008 = preprocess_spark(\"/tmp/airflightsdelays/flights_2008.csv.bz2\", \"/tmp/airflightsdelays/weather_2008.csv.gz\")\n\ndata_2007.take(5).map(x => x mkString \",\").foreach(println)","dateUpdated":"2016-01-28T07:53:15+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827271404_1914194739","id":"20160126-165431_1775141609","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.SparkContext._\nimport scala.collection.JavaConverters._\nimport au.com.bytecode.opencsv.CSVReader\nimport java.io._\npreprocess_spark: (delay_file: String, weather_file: String)org.apache.spark.rdd.RDD[Array[Double]]\ndata_2007: org.apache.spark.rdd.RDD[Array[Double]] = MapPartitionsRDD[699] at map atNote that the minimum and maximum temparature variables from the weather dataset are measured here in Celsius and multiplied by 10. So for example -139.0 would translate into -13.9 Celsius.
\n"},"dateCreated":"2016-01-26T05:01:21+0000","dateStarted":"2016-01-26T05:36:31+0000","dateFinished":"2016-01-26T05:36:33+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3224"},{"text":"%md \n\n## Modeling with weather data\n\nWe are going to repeat the previous models of Logist Regression, decision tree and Random Forest with our enriched feature set. As before, we create an RDD of *LabeledPoint* objects, and normalize our dataset with ML-Lib's *StandardScaler*:","dateUpdated":"2016-01-26T05:37:00+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827694629_-1995069745","id":"20160126-170134_1079378135","result":{"code":"SUCCESS","type":"HTML","msg":"We are going to repeat the previous models of Logist Regression, decision tree and Random Forest with our enriched feature set. As before, we create an RDD of LabeledPoint objects, and normalize our dataset with ML-Lib's StandardScaler:
\n"},"dateCreated":"2016-01-26T05:01:34+0000","dateStarted":"2016-01-26T05:36:59+0000","dateFinished":"2016-01-26T05:36:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3225"},{"text":"%spark \n\nimport org.apache.spark.mllib.regression.LabeledPoint\nimport org.apache.spark.mllib.linalg.Vectors\nimport org.apache.spark.mllib.feature.StandardScaler\n\ndef parseData(vals: Array[Double]): LabeledPoint = {\n LabeledPoint(if (vals(0)>=15) 1.0 else 0.0, Vectors.dense(vals.drop(1)))\n}\n\n// Prepare training set\nval parsedTrainData = data_2007.map(parseData)\nval scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedTrainData.map(x => x.features))\nval scaledTrainData = parsedTrainData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\nparsedTrainData.cache\nscaledTrainData.cache\n\n// Prepare test/validation set\nval parsedTestData = data_2008.map(parseData)\nval scaledTestData = parsedTestData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\nparsedTestData.cache\nscaledTestData.cache\n\nscaledTrainData.take(5).map(x => (x.label, x.features)).foreach(println)","dateUpdated":"2016-01-28T08:27:13+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827692372_-2060492383","id":"20160126-170132_96277418","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.regression.LabeledPoint\nimport org.apache.spark.mllib.linalg.Vectors\nimport org.apache.spark.mllib.feature.StandardScaler\nparseData: (vals: Array[Double])org.apache.spark.mllib.regression.LabeledPoint\nparsedTrainData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[742] at map atNext, let's build a Logistic Regression model using this enriched feature matrix:
\n"},"dateCreated":"2016-01-26T05:01:29+0000","dateStarted":"2016-01-26T05:37:37+0000","dateFinished":"2016-01-26T05:37:37+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3227"},{"text":"%spark\n\nimport org.apache.spark.mllib.classification.LogisticRegressionWithSGD\n\n// Build the Logistic Regression model\nval model_lr = LogisticRegressionWithSGD.train(scaledTrainData, numIterations=100)\n\n// Predict\nval labelsAndPreds_lr = scaledTestData.map { point =>\n val pred = model_lr.predict(point.features)\n (point.label, pred)\n}\nval m_lr = new Metrics(labelsAndPreds_lr)\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\"\n .format(m_lr.precision, m_lr.recall, m_lr.F1, m_lr.accuracy))","dateUpdated":"2016-01-28T08:29:25+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827753089_271009331","id":"20160126-170233_386400294","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.classification.LogisticRegressionWithSGD\nmodel_lr: org.apache.spark.mllib.classification.LogisticRegressionModel = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 11, numClasses = 2, threshold = 0.5\nlabelsAndPreds_lr: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[951] at map atNow let's try the decision tree:
\n"},"dateCreated":"2016-01-26T05:02:29+0000","dateStarted":"2016-01-26T05:45:49+0000","dateFinished":"2016-01-26T05:45:52+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3230"},{"text":"%spark\n\nimport org.apache.spark.mllib.tree.DecisionTree\n\n// Build the Decision Tree model\nval numClasses = 2\nval categoricalFeaturesInfo = Map[Int, Int]()\nval impurity = \"gini\"\nval maxDepth = 10\nval maxBins = 100\nval model_dt = DecisionTree.trainClassifier(parsedTrainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)\n\n// Predict\nval labelsAndPreds_dt = parsedTestData.map { point =>\n val pred = model_dt.predict(point.features)\n (point.label, pred)\n}\nval m_dt = new Metrics(labelsAndPreds_dt)\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\"\n .format(m_dt.precision, m_dt.recall, m_dt.F1, m_dt.accuracy))","dateUpdated":"2016-01-28T08:46:53+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827801012_-345312434","id":"20160126-170321_239192730","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.tree.DecisionTree\nnumClasses: Int = 2\ncategoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map()\nimpurity: String = gini\nmaxDepth: Int = 10\nmaxBins: Int = 100\nmodel_dt: org.apache.spark.mllib.tree.model.DecisionTreeModel = DecisionTreeModel classifier of depth 10 with 1855 nodes\nlabelsAndPreds_dt: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[994] at map atAnd finally, let's try the Random Forest model:
\n"},"dateCreated":"2016-01-26T05:03:24+0000","dateStarted":"2016-01-26T05:48:26+0000","dateFinished":"2016-01-26T05:48:28+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3232"},{"text":"%spark\n\nimport org.apache.spark.mllib.tree.RandomForest\nimport org.apache.spark.mllib.tree.configuration.Strategy\n\nval treeStrategy = Strategy.defaultStrategy(\"Classification\")\nval model_rf = RandomForest.trainClassifier(parsedTrainData, treeStrategy, \n numTrees = 100, featureSubsetStrategy = \"auto\", seed = 125)\n\n// Predict\nval labelsAndPreds_rf = parsedTestData.map { point =>\n val pred = model_rf.predict(point.features)\n (point.label, pred)\n}\nval m_rf = new Metrics(labelsAndPreds_rf)\nprintln(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\"\n .format(m_rf.precision, m_rf.recall, m_rf.F1, m_rf.accuracy))","dateUpdated":"2016-01-28T09:03:55+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827806563_128713535","id":"20160126-170326_2085582197","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.mllib.tree.RandomForest\nimport org.apache.spark.mllib.tree.configuration.Strategy\ntreeStrategy: org.apache.spark.mllib.tree.configuration.Strategy = org.apache.spark.mllib.tree.configuration.Strategy@4677a66f\nmodel_rf: org.apache.spark.mllib.tree.model.RandomForestModel = \nTreeEnsembleModel classifier with 100 trees\n\nlabelsAndPreds_rf: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[1037] at map atAs expected, the improved feature set increased the accuracy of our model for both SVM and Decision Tree models.
\n"},"dateCreated":"2016-01-26T05:03:56+0000","dateStarted":"2016-01-26T05:04:23+0000","dateFinished":"2016-01-26T05:04:25+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3234"},{"text":"%md \n\n## Summary\n\nIn this IPython notebook we have demonstrated how to build a predictive model in Scala with Apache Hadoop, Apache Spark and its machine learning library: ML-Lib. \n\nWe have used Apache Spark on our HDP cluster to perform various types of data pre-processing and feature engineering tasks. We then applied a few ML-Lib machine learning algorithms such as support vector machines and decision tree to the resulting datasets and showed how through iterations we continuously add new and improved features resulting in better model performance.","dateUpdated":"2016-01-26T05:57:21+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453827834237_-1740583801","id":"20160126-170354_1443133917","result":{"code":"SUCCESS","type":"HTML","msg":"In this IPython notebook we have demonstrated how to build a predictive model in Scala with Apache Hadoop, Apache Spark and its machine learning library: ML-Lib.
\nWe have used Apache Spark on our HDP cluster to perform various data pre-processing and feature engineering tasks. We then applied several ML-Lib machine learning algorithms (logistic regression, support vector machines, decision trees and random forests) to the resulting datasets, and showed how successive iterations of adding new and improved features resulted in better model performance.
\n"},"dateCreated":"2016-01-26T05:03:54+0000","dateStarted":"2016-01-26T05:57:21+0000","dateFinished":"2016-01-26T05:57:23+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3235"},{"config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453831482516_267580823","id":"20160126-180442_1308451960","dateCreated":"2016-01-26T06:04:42+0000","status":"READY","progressUpdateIntervalMs":500,"$$hashKey":"object:3236"}],"name":"Demos / Spark / ML / Predicting Airline Delays","id":"2BB5CUPUW","angularObjects":{},"config":{"looknfeel":"default"},"info":{}}