Read Parquet Files with SparkSQL
SparkSQL is a Spark module for working with structured data, and it can also be used to read columnar data formats such as Parquet files. Here are a number of useful commands that can be run from the spark-shell:
// Set the context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Read the Parquet file from HDFS and print its schema
val df = sqlContext.read.parquet("hdfs://user/myfolder/part-r-00033.gz.parquet")
df.printSchema()
// Show the top 10 rows from the Parquet file without truncating column values
df.show(10, false)
// Convert to JSON and print the contents of one record
df.toJSON.take(1).foreach(println)
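Because the Parquet file is loaded as a DataFrame, it can also be queried with plain SQL by registering it as a temporary view. Below is a minimal sketch, assuming Spark 2.x (where createOrReplaceTempView is available); the view name parquet_data is just a placeholder:
// Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("parquet_data")
// Run a SQL query against the Parquet data and show the result
sqlContext.sql("SELECT COUNT(*) FROM parquet_data").show()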