Big Data | Learn for Master - Part 4
  • Spark: Solve Task not serializable Exception

One of the most frequently encountered exceptions when you use Spark is the Task not serializable exception:

    org.apache.spark.SparkException: Task not serializable

This exception happens when you create a non-serializable object on the driver and then try to use it inside a task, such as a reducer, that runs on the executors.

    Here is an example to produce such an exception:

Suppose we have a non-serializable class named MyTokenlizer, and you submit a Spark job that uses an instance of it inside a transformation:
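A quick way to reproduce the same failure is in spark-shell; this is only a sketch (the class body, data, and transformation are assumptions, not the original post's code, which used a submitted job):

# feed a small Scala session to spark-shell; MyTokenlizer is deliberately not Serializable
spark-shell <<'EOF'
class MyTokenlizer {                        // note: does NOT extend Serializable
  def tokenize(s: String): Array[String] = s.split(" ")
}
val tokenizer = new MyTokenlizer            // created on the driver
val rdd = sc.parallelize(Seq("a b", "c d"))
// the closure captures tokenizer, so Spark must serialize it for the task -> exception
rdd.map(line => tokenizer.tokenize(line)).collect()
EOF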

    Now you will get the org.apache.spark.SparkException: Task not serializable exception.

To solve this exception,

    [Read More...]
  • How to package a Scala project to a Jar file with SBT

When you develop a Spark project in Scala, you have to package your project into a jar file. This tutorial describes how to use SBT
to compile and run a Scala project, and how to package the project as a jar file. This will help you create your own Spark project and package it into a jar file.

    The directory structure of a typical SBT project

Here is an example of a typical SBT project, which has the following directory structure.

.
|-- build.sbt
|-- lib
|-- project
|-- src
|   |-- main
|   |   |-- java (main Java source files)
|   |   |-- resources (files to include in the main jar)
|   |   |-- scala (main Scala source files)
|   |-- test
|       |-- java (test Java source files)
|       |-- resources (files to include in the test jar)
|       |-- scala (test Scala source files)
|-- target

You can use the following commands to create this directory structure:

#!/bin/bash
cd ~/hello_world
# brace expansion below requires bash (not plain sh)
mkdir -p src/{main,test}/{java,resources,scala}
mkdir lib project target

# create an initial build.sbt file
echo 'name := "MyProject"
version := "1.0"
scalaVersion := "2.10.0"' > build.sbt
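Once the skeleton exists, packaging is a single command. A minimal sketch (the jar name and path follow from the name, version, and scalaVersion set above; sbt lowercases the project name for the artifact):

# compile the project and package it into a jar
cd ~/hello_world
sbt clean package

# sbt writes the jar under target/scala-<binary version>/
ls target/scala-2.10/myproject_2.10-1.0.jar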
    [Read More...]
  • Move Hive Table from One Cluster to Another

This tutorial uses examples to describe how to move a Hive table from one cluster to another. The basic idea is to use the EXPORT and IMPORT commands.

    The EXPORT command exports the data of a table or partition, along with the metadata, into a specified output location. This output location can then be moved over to a different Hadoop or Hive instance and imported from there with the IMPORT command.

    Export Syntax

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
  TO 'export_target_path' [ FOR replication('eventid') ]

    Import Syntax

IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
  FROM 'source_path'
  [LOCATION 'import_target_path']

    Examples to Move Hive Table from one cluster (grid) to another

Suppose you have two clusters: cluster A and cluster B.
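A minimal sketch of the whole move (the database, table, paths, and namenode addresses are assumptions):

# on cluster A: export the table data plus metadata to HDFS
hive -e "EXPORT TABLE mydb.events TO '/tmp/export/events'"

# copy the exported directory from cluster A to cluster B
hadoop distcp hdfs://clusterA-nn:8020/tmp/export/events hdfs://clusterB-nn:8020/tmp/export/events

# on cluster B: import it as a (new) table
hive -e "IMPORT TABLE events FROM '/tmp/export/events'"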

    [Read More...]
  • Append to a Hive partition from Pig

When we use Hive, we can append data to a table easily, but when we use Pig (via HCatalog) to insert data into a Hive table, we are not allowed to append data to a partition that already contains data.

In this post, I describe a method that helps you append data to an existing partition by using a dummy partition column named run, which records the run number each time you append data to the partition.

For example, we create the following partitioned Hive table:
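A sketch of such a table (the table name, columns, and storage format are assumptions; the point is the extra run partition column):

# partition by date plus a dummy run column so each append targets a fresh partition
hive -e '
CREATE TABLE imps_part (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (`date` STRING, run STRING)
STORED AS ORC'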

The Pig script then looks like the following:
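A sketch of the script (the input path, field names, and delimiter are assumptions); it stores into the table above through HCatStorer, targeting run=1 under the date partition:

# write the pig script inline for the example
cat > append_imps.pig <<'EOF'
-- load the new rows, then append them under a fresh run number
data = LOAD 'input/imps_20160605.txt' USING PigStorage('\t')
       AS (user_id:chararray, url:chararray);
STORE data INTO 'imps_part'
      USING org.apache.hive.hcatalog.pig.HCatStorer('date=20160605,run=1');
EOF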

    Now we can run the pig script using the following command:
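For example (assuming the append_imps.pig file from the sketch above):

# -useHCatalog puts the HCatalog jars on Pig's classpath
pig -useHCatalog append_imps.pig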

After the script runs, the appended rows appear in the table under the run partition we wrote to.

Each time you want to append data to the partition DATE=20160605,

    [Read More...]
  • Set variable for hive script

When we run Hive scripts, such as loading data into a Hive table, we often need to pass parameters to them by defining our own variables.

Here are some examples to show how to pass parameters or user-defined variables to Hive.

Use hiveconf for variable substitution

For example, you can define a variable DATE and then reference it as ${hiveconf:DATE}.

You can even pass the variable from the command line:
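A sketch (the table name and the script file load_data.hql are assumptions):

# pass DATE on the command line; the query refers to it as ${hiveconf:DATE}
hive -hiveconf DATE=20160605 -e 'SELECT * FROM imps_part WHERE `date` = "${hiveconf:DATE}"'

# the same works for a script file
hive -hiveconf DATE=20160605 -f load_data.hql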

    Use env and system variables

You can also use env and system variables, like this: ${env:USER}

    You can run the following command to see all the available variables:
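For example, from the shell:

# list every variable Hive can see (hiveconf, system, env, ...)
hive -e 'set -v'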

If you are at the hive prompt,

    [Read More...]
  • An Example to Create a Partitioned Hive Table

Partitioning is a very useful feature of Hive. Without partitions, it is hard to reuse a Hive table if you use HCatalog to store data into it from Apache Pig, because you will get exceptions when you insert data into a non-partitioned Hive table that is not empty.

In this post, I use an example to show how to create a partitioned table and populate data into it.

Let's suppose you have a dataset of user impressions. For instance, a sample of the dataset might look like this:

id  user_id  user_lang  user_device  time_stamp    url               date      country
1   u1       en         iphone       201503210011  http://xxx/xxx/1  20150321  US
2   u1       en         ipad         201503220111  http://xxx/xxx/2  20150322  US
3   u2       en         desktop      201503210051  http://xxx/xxx/3  20150321  CA
4   u3       en         iphone       201503230021  http://xxx/xxx/4  20150323  HK
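A partitioned table for this dataset might look like the following sketch (the table name imps_part, the ORC format, and the choice of date and country as partition columns are assumptions):

# partition columns are declared separately from the data columns
hive -e '
CREATE TABLE imps_part (
  id          INT,
  user_id     STRING,
  user_lang   STRING,
  user_device STRING,
  time_stamp  BIGINT,
  url         STRING
)
PARTITIONED BY (`date` STRING, country STRING)
STORED AS ORC'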

    If you use Pig to analyze the data,

    [Read More...]
• Exceptions When Deleting Rows from a Hive Table

It's straightforward to delete data from a traditional relational table using SQL. However, deleting rows from a Hive table can cause several exceptions.

For example, suppose we have an imps_part table and we want to delete some rows from it.
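A minimal sketch of such an attempt (the predicate is an assumption):

# a plain SQL-style delete against the partitioned table
hive -e "DELETE FROM imps_part WHERE user_id = 'u1'"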

When you run this simple delete command, you get:

FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations


Someone suggested using the following command:
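The suggestion is usually an INSERT OVERWRITE that keeps only the rows you want; a sketch (the predicate is an assumption):

# overwrite the table with everything except the rows to delete
hive -e "INSERT OVERWRITE TABLE imps_part SELECT * FROM imps_part WHERE user_id <> 'u1'"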

    This will result in the following exception:
    FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned.

    [Read More...]
  • Save data to Hive table Using Apache Pig

We have described how to load data from a Hive table using Apache Pig. In this post, I will use an example to show how to save data to a Hive table using Pig.

Before saving data to Hive, you need to first create a Hive table. Please refer to this post on how to create a Hive table.

Suppose we use Apache Pig to load some data from a text file. We can then save the data to the Hive table using the following script.

The store_student.pig script looks like this:
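A sketch of the script (the input path, field names, and the student table are assumptions):

cat > store_student.pig <<'EOF'
-- load a comma-delimited text file, then store the rows into the Hive table
student = LOAD 'input/student.txt' USING PigStorage(',')
          AS (name:chararray, age:int, gpa:double);
STORE student INTO 'student' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF

# run it with the HCatalog jars on the classpath
pig -useHCatalog store_student.pig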

Note: You must specify the table name in single quotes: STORE data INTO 'tablename'.

    [Read More...]
  • Apache Pig Load ORC data from Hive Table

There are cases where your data is stored in a Hive table and you may want to process it using Apache Pig. In this post, I use an example to describe how to read Hive ORC data using Apache Pig.

1. We first create a Hive table stored as ORC, and load some data into it.
2. Then, we develop an Apache Pig script to load the data from the Hive ORC table (a sketch follows below).
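Step 2 as a sketch (the database and table names are assumptions); HCatLoader reads the table through its Hive metadata, so the ORC details stay hidden from the script:

cat > load_orc.pig <<'EOF'
-- read the ORC-backed Hive table via HCatalog and print it
student = LOAD 'default.student' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP student;
EOF

pig -useHCatalog load_orc.pig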

    Optimized Row Columnar (ORC) file format

    The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data.

    [Read More...]
  • How to get hive table delimiter or schema

When you have a Hive table, you may want to check its delimiter or detailed information such as its schema. There are two solutions:

    Get the delimiter of a Hive Table

To get the field delimiter of a Hive table, we can use the DESCRIBE FORMATTED command.

    Here is an example:
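(The table name student is an assumption; the delimiter shows up as field.delim in the Storage Desc Params section.)

# print detailed table info and pull out the field delimiter line
hive -e 'DESCRIBE FORMATTED student' | grep -i 'field.delim'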

Get the schema of a Hive Table

Another solution is to use SHOW CREATE TABLE:
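For example (again with an assumed table name):

# emit the full DDL, including columns, delimiters, and storage format
hive -e 'SHOW CREATE TABLE student'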

This will generate complete information about the table.

    [Read More...]
Page 4 of 5