Install Hortonworks HDP 3.1.0 on a Cluster of VMware Virtual Machines

Hortonworks Tutorials

This post describes how to install Hortonworks HDP 3.1.0 on a cluster of three VMware virtual machines. The process has four major steps: 1) set up the cluster environment; 2) set up a local repository for both the Ambari and HDP stacks; 3) install the Ambari server and agents; 4) install, configure, and deploy the cluster. The same process may work for other versions as well. Check product version compatibility in the Hortonworks support matrix: https://supportmatrix.hortonworks.com/

Set Up Scala Development Environment for Apache Spark in Standalone Mode

Apache Spark Tutorials

With Apache Spark installed through the steps described in the previous post, this post walks through setting up a Scala development environment for Spark and building a WordCount application with Maven and SBT. Although Spark can be programmed in Java, Scala, or Python, this post focuses on Scala, for three reasons: 1) Spark itself is written in Scala; 2) Scala's functional programming model is a good fit for distributed processing, so it needs less code and boilerplate than Java; 3) Scala compiles to Java bytecode, which generally runs faster than Python.

Install Python 3 Distributions on a Linux Server

As a data scientist, I often need to run Python scripts on Linux servers. The vast majority of CentOS/RHEL-based Linux servers ship with Python 2.6.6, while most of my Python applications are written in Python 3. In addition, I don't have sudo privileges on most of these servers, so installing Python 3 takes a small workaround. This post gives step-by-step instructions for installing Python 3.6 on a CentOS/RHEL-based Linux server, using either Anaconda or the gzipped Python source tarball.