Problem description:

I have created a DataFrame by reading a CSV file with sqlContext, and I need to convert one column of the table to an RDD and then to a dense Vector to perform matrix multiplication.

I am finding it difficult to do so.

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/home/project/SparkRead/train.csv")

val result1 = sqlContext.sql("SELECT Sales from train").rdd

How do I convert this to a dense Vector?

Answer:

You can combine DataFrame columns into a Vector column using VectorAssembler. Check out the code below:

val df = spark.read.
  format("com.databricks.spark.csv").
  option("header","true").
  option("inferSchema","true").
  load("/tmp/train.csv")

// assuming input
// a,b,c,d
// 1,2,3,4
// 1,1,2,3
// 1,3,4,5

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val assembler = new VectorAssembler().
    setInputCols(Array("a", "b", "c", "d")).
    setOutputCol("vect")

val output = assembler.transform(df)

// show the result
output.show()

// +---+---+---+---+-----------------+
// |  a|  b|  c|  d|             vect|
// +---+---+---+---+-----------------+
// |  1|  2|  3|  4|[1.0,2.0,3.0,4.0]|
// |  1|  1|  2|  3|[1.0,1.0,2.0,3.0]|
// |  1|  3|  4|  5|[1.0,3.0,4.0,5.0]|
// +---+---+---+---+-----------------+
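Since the original goal was matrix multiplication, the assembled column can then be pulled out as an RDD of vectors and wrapped in a distributed matrix. A minimal sketch, assuming Spark 2.x, where `RowMatrix` still lives in the older `mllib` package and so each `ml` vector must be converted with `Vectors.fromML` (the 4x2 local matrix below is an arbitrary example, not from the original question):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Matrices, Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Extract the assembled column as an RDD of ml vectors
val vectRdd = output.select("vect").rdd.map(_.getAs[Vector](0))

// RowMatrix expects mllib vectors, so convert each row
val mat = new RowMatrix(vectRdd.map(v => OldVectors.fromML(v)))

// Multiply by a local 4x2 matrix (values are given in column-major order)
val local = Matrices.dense(4, 2, Array(1.0, 0.0, 0.0, 1.0,
                                       0.0, 1.0, 1.0, 0.0))
val product = mat.multiply(local)

product.rows.collect().foreach(println)
```

For the single-column case in the question, the same idea applies directly to the SQL result, e.g. `result1.map(row => OldVectors.dense(row.getDouble(0)))`.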