Problem description:

I'm trying to adapt the code here from a Scala version to a PySpark version. Here's the code I'm using:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("Parse Xml File")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

sc._jsc.hadoopConfiguration().set('stream.recordreader.class', 'org.apache.hadoop.streaming.StreamXmlRecordReader')
sc._jsc.hadoopConfiguration().set('stream.recordreader.begin', '<page>')
sc._jsc.hadoopConfiguration().set('stream.recordreader.end', '</page>')

xml_sdf = sc.newAPIHadoopFile(xml_data_path,
                              'org.apache.hadoop.streaming.StreamInputFormat',
                              'org.apache.hadoop.io.Text',
                              'org.apache.hadoop.io.Text')

print("Found {0} records.".format(xml_sdf.count()))

sc.stop()

The error I'm getting is:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.

: java.lang.ClassCastException: org.apache.hadoop.streaming.StreamInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat

Is there a different input format / settings that I can use to make it work?

Answer:

The easiest solution is to use the spark-xml package. In your case (every record starts with <page>), the code below will load the data into a DataFrame:

df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='page') \
    .load('samplexml.xml')
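
For reference, a minimal end-to-end sketch of how this could replace the StreamInputFormat approach from the question. It assumes the spark-xml package has been added to the classpath (for example via spark-submit --packages) and that xml_data_path and rowTag='page' match your data, as in the original post:

# Sketch only: assumes the job is launched with spark-xml on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-xml_2.11:<version-matching-your-build> parse_xml.py
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("Parse Xml File")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

xml_data_path = 'samplexml.xml'   # placeholder path; use the same path as in the question

xml_df = (sqlContext.read
          .format('com.databricks.spark.xml')
          .options(rowTag='page')           # each <page>...</page> element becomes one row
          .load(xml_data_path))

print("Found {0} records.".format(xml_df.count()))
xml_df.printSchema()                        # nested XML elements show up as struct columns

sc.stop()

With a DataFrame you can then filter or select fields by column name instead of parsing the raw <page> text yourself.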