APIs for Spark Development: Java vs Scala

  • Apache Spark is an open-source cluster computing platform for data processing tasks
  • It integrates with the Apache Hadoop ecosystem (HDFS, YARN) and goes beyond MapReduce with stream processing (Spark Streaming) and in-memory iterative computation (e.g. machine learning tasks)
  • Apache Spark was initially written in Scala and ships with interactive Scala and Python shells (REPL: Read-Eval-Print Loop). It includes the following APIs:
    • Java
    • Scala
    • Python

Spark APIs: Java vs Scala

Java

  • Java is more verbose than Scala, and the extra boilerplate makes it more error-prone; lambdas and streams are supported only from Java 8 onward
  • More established programming language, with many experienced developers on the market
  • Full Java/Scala interoperability: implicit conversions exist between the major collection types
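
To make the verbosity point concrete, here is a minimal sketch in plain Java (no Spark involved) that extracts the first column from a few CSV-like lines twice: once with an explicit loop, as it would be written before Java 8, and once with a Java 8 stream and lambdas. The class name, method names and sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FirstColumn {

    // Pre-Java 8 style: explicit loop and a mutable accumulator.
    public static List<String> firstColumnLoop(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.isEmpty()) {
                result.add(line.split(",")[0]);
            }
        }
        return result;
    }

    // Java 8 style: the same filter-then-map pipeline with a stream and lambdas.
    public static List<String> firstColumnStream(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isEmpty())
                .map(line -> line.split(",")[0])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two records and one blank line.
        List<String> lines = Arrays.asList("alice,30", "", "bob,25");
        System.out.println(firstColumnLoop(lines));   // [alice, bob]
        System.out.println(firstColumnStream(lines)); // [alice, bob]
    }
}
```

The stream version reads much like the filter/map chains used with Spark's own APIs, which is why Java 8 narrows, but does not close, the conciseness gap with Scala.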

Scala

  • Blend of functional and object-oriented features lets Scala (short for "scalable language") scale from short scripts to large systems
  • Functions are first-class values: every value is an object and every operation is a method call
  • Scala’s type inference removes much explicit type annotation and contributes to more readable programs
  • Scala’s traits tame multiple inheritance
  • Scala code is concise and backed by an advanced static type system

Scala API at work: Loading CSV Files

[Code listing image: ScalaApiExample]

Java API at work: Loading CSV Files

[Code listing image: Java]
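
As a minimal sketch of what loading a CSV file through the Java API can look like (this is not the original listing): it assumes Spark 2.x with the spark-sql module on the classpath, a local master, and a hypothetical file data/people.csv whose first row is a header. The class name and file path are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JavaCsvExample {
    public static void main(String[] args) {
        // Entry point for the DataFrame/Dataset API (available since Spark 2.0).
        SparkSession spark = SparkSession.builder()
                .appName("JavaCsvExample")
                .master("local[*]") // assumption: run locally on all cores
                .getOrCreate();

        // Read the CSV, treating the first row as column names
        // and letting Spark infer the column types.
        Dataset<Row> people = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data/people.csv"); // hypothetical path

        people.printSchema();
        people.show(5);

        spark.stop();
    }
}
```

In Java the result has to be declared as Dataset<Row>, whereas Scala exposes the same object under the DataFrame type alias and can infer the declared type, which is one small instance of the conciseness difference discussed above.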

Final Remarks

  • Scala’s strengths lie in scalability, conciseness and advanced static typing, together with full Java interoperability
  • Scala can have a challenging learning curve and still has a limited community presence (compared to Java and Python)
  • Scala developers are still a niche, whereas there is a rich market of Java and Python professionals
  • Python is still a strong choice because of the easy transition from OOP languages and the number of available statistical and data science libraries

Posted on May 13, 2017.