Read Parquet Files with SparkSQL

SparkSQL is a Spark module for working with structured data, and it can also read columnar data formats such as Parquet. Here are a number of useful commands that can be run from the spark-shell:

#Set the context

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
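On Spark 2.0 and later, the spark-shell already provides a SparkSession named `spark`, and the `SQLContext` constructor used above is deprecated. A minimal alternative, assuming such a shell, is to reuse the context that the session wraps, so the remaining commands work unchanged:

```scala
// Assumption: Spark 2.0+ spark-shell, where `spark` (a SparkSession) is predefined.
// The session exposes its underlying SQLContext, so no new context is constructed.
val sqlContext = spark.sqlContext
```

In that setup `spark.read.parquet(...)` can also be called directly, without going through `sqlContext`.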

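#Read the parquet file in HDFS and print its schema (hdfs:/// with three slashes resolves against the default NameNode)

val df = sqlContext.read.parquet("hdfs:///user/myfolder/part-r-00033.gz.parquet")
df.printSchema()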

#Show the top 10 rows of data from the parquet file (false turns off truncation of long values)

df.show(10, false)

#Convert to JSON and print out the content of 1 record
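A minimal sketch of this step, assuming `df` is the DataFrame holding the Parquet data: `toJSON` turns each row into a JSON string, and `take(1)` brings a single record back to the driver for printing.

```scala
// Assumption: `df` is a DataFrame already loaded in the spark-shell session.
// toJSON yields one JSON string per row; take(1) fetches only the first record.
df.toJSON.take(1).foreach(println)
```

Using `take(1)` rather than `collect()` avoids pulling the whole file into the driver when only one record is needed.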


Posted on January 26, 2019.