Big Data | Learn for Master
  • run pySpark on Oozie

    In this post, I first give a working example of running pySpark on Oozie. Then I show how to run pySpark on Oozie using your own Python installation (e.g., Anaconda). In this way, you can use numpy, pandas, and other Python libraries in your pySpark program.

    The syntax for creating a Spark action in an Oozie workflow

    As described in the Oozie documentation, here are the meanings of these elements.

    The prepare element, if present, indicates a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.
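
    As a minimal sketch of that syntax (the workflow name, script name, and paths below are hypothetical), a Spark action running a pySpark script looks roughly like this:

        <workflow-app name="pyspark-demo" xmlns="uri:oozie:workflow:0.5">
            <start to="spark-node"/>
            <action name="spark-node">
                <spark xmlns="uri:oozie:spark-action:0.1">
                    <job-tracker>${jobTracker}</job-tracker>
                    <name-node>${nameNode}</name-node>
                    <prepare>
                        <!-- delete the old output path before the job starts -->
                        <delete path="${nameNode}/user/${wf:user()}/output"/>
                    </prepare>
                    <master>yarn-cluster</master>
                    <name>pyspark-example</name>
                    <jar>my_script.py</jar>
                    <spark-opts>--num-executors 2 --executor-memory 2G</spark-opts>
                </spark>
                <ok to="end"/>
                <error to="fail"/>
            </action>
            <kill name="fail">
                <message>Spark action failed</message>
            </kill>
            <end name="end"/>
        </workflow-app>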

    [Read More...]
  • Parameter Server Resource Roundup

    An introduction to parameter servers
    Author: Superjom
    Link: https://www.zhihu.com/question/26998075/answer/40577680
    Source: Zhihu
    Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please cite the source.
    See Mu Li's paper "Parameter Server for Distributed Machine Learning", which includes an introduction to his framework.
    I later read the Microsoft Research Project Adam paper; the overall approach is quite similar, but that paper is richer in detail, so some complementary information from it is described below as well.

    Concept:
    A parameter server is a programming framework that makes it easy to write distributed parallel programs; its focus is the distributed storage of large-scale parameters and support for coordinating access to them.

    Industry needs to train large machine learning models, and widely used models have two characteristics at scale:
    1. The parameters are huge, exceeding what a single machine can hold (e.g., large logistic regression models and neural networks).
    2. The training data is huge, so distributed parallelism is needed for speed (big data).

    Under these requirements, current MapReduce-like frameworks are not a good fit, so people have to implement distributed parallel programs themselves. In fact, before Hadoop appeared, large-scale data processing always required hand-written distributed programs (e.g., with MPI); later, Google engineers summarized and abstracted that workflow into the MapReduce framework, which unified the field.

    The parameter server is, like MapReduce, one of the frameworks that has been abstracted out of the ongoing practice of large-scale machine learning. What it chiefly supports is distributing the parameters; after all, a huge model is essentially a huge set of parameters.

    Parameter Server (Mu Li)
    ------------------------
    Architecture:
    The nodes in the cluster fall into two kinds: worker nodes and parameter-server nodes. Worker nodes learn from the training-data blocks assigned to them locally and update the corresponding parameters; parameter-server nodes store the global parameters in a distributed fashion, each holding a portion, and serve parameter queries and update requests from the worker nodes.

    In short, worker nodes do the computing and update the parameters, while parameter-server nodes store the parameters.

    Redundancy and recovery:
    As in MapReduce, each parameter is replicated on several different nodes of the parameter-server cluster (three replicas is a fine choice), so when a node fails, the redundant copies keep the service available. When a new node joins, it copies the failed node's parameters from the replicas, and the replacement takes its place in the ranks.

    Parallel computation:
    Parallel computation happens mainly on the worker nodes. As in MapReduce, when tasks are assigned, the data is split across the workers. Likewise, before learning starts, the parameter server splits the large-scale training data across the worker nodes; each worker then learns from its local data alone and, after each step, uploads the parameters' update gradients to the corresponding parameter-server nodes.
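
    To make the pattern concrete, here is a toy sketch in plain Python (no real networking, replication, or asynchrony; all names and the least-squares objective are made up for illustration):

        import numpy as np

        class ServerShard:
            """A parameter-server node holding one slice of the global parameters."""
            def __init__(self, size):
                self.params = np.zeros(size)

            def pull(self):
                # serve a parameter query from a worker
                return self.params.copy()

            def push(self, grad, lr=0.1):
                # apply a gradient update sent by a worker
                self.params -= lr * grad

        def worker_step(local_block, shards):
            """One worker: learn from its local data block, then upload gradients."""
            X, y = local_block
            w = np.concatenate([s.pull() for s in shards])   # pull global params
            grad = X.T @ (X @ w - y) / len(y)                # local least-squares gradient
            offset = 0
            for s in shards:                                 # push each slice to its shard
                n = len(s.params)
                s.push(grad[offset:offset + n])
                offset += n

        rng = np.random.default_rng(0)
        shards = [ServerShard(3), ServerShard(3)]            # 2 server nodes, 6 params total
        blocks = [(rng.normal(size=(20, 6)), rng.normal(size=20)) for _ in range(4)]
        for _ in range(50):                                  # training rounds
            for block in blocks:                             # each worker takes a step
                worker_step(block, shards)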

    The detailed workflow:

    1. Distribute the training data across the worker nodes: node 1, node 2, node 3, ..., node i, ..., node N.

    2.

    [Read More...]
  • Resources for Learning Big Data

    Big Data is a really hot topic! There are millions of job openings. This post aims to introduce some of the most important techniques related to Big Data.  If you want to learn Big Data and find a job related to Big Data, please read the articles and books recommended here. 

    The first key concept of Big Data is MapReduce, the core of Hadoop, an open-source framework that has become the foundation of the Big Data ecosystem.

    • MapReduce was invented by Google. It’s a paradigm for writing distributed systems inspired by some elements of functional programming.
    [Read More...]
  • Append to a Hive partition from Pig

    When we use Hive, we can easily append data to a table, but when we use Pig (i.e., HCatalog) to insert data into a Hive table, we are not allowed to append data to a partition that already contains data.

    In this post, I describe a method that lets you append data to an existing partition by using a dummy partition column named run, which records the run number of each append to that partition.

    For example, we create the following partitioned Hive table:
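
    A sketch of such a table (the table and column names are hypothetical; note the extra run partition column next to date):

        create table imps_part (
            id int,
            user_id string,
            url string
        )
        partitioned by (`date` string, run int)
        row format delimited fields terminated by '\t';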

    Then the Pig script looks like the following:
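
    A sketch of the Pig script, passing the target partition to HCatStorer (the input path and field names are assumptions):

        -- append_imps.pig: load new rows and store them into one
        -- date/run partition of the Hive table via HCatalog
        data = LOAD '/user/hdfs/imps_new.txt' USING PigStorage('\t')
               AS (id:int, user_id:chararray, url:chararray);
        STORE data INTO 'imps_part'
              USING org.apache.hive.hcatalog.pig.HCatStorer('date=20160605,run=1');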

    Now we can run the Pig script using the following command:
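
    For example, with the script name assumed above:

        pig -useHCatalog append_imps.pig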

    Then the newly stored rows appear in the table under the date=20160605, run=1 partition.

    Each time you want to append data to the partition DATE=20160605,

    [Read More...]
  • Set variable for hive script

    When we run Hive scripts, such as the one in Load data into Hive table, we often need to pass parameters to the script by defining our own variables.

    Here are some examples to show how to pass parameters or user defined variables to hive. 

    Use hiveconf for variable substitution

    For example, you can define a variable DATE and then reference it as ${hiveconf:DATE}:
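
    For instance, inside a script (the table and column are hypothetical):

        -- set the variable, then reference it through the hiveconf namespace
        set DATE=20160605;
        select * from imps_part where `date` = '${hiveconf:DATE}';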

    You can even pass the variable from the command line:
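
    Along these lines (the script name is made up):

        hive -hiveconf DATE=20160605 -f load_data.hql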

    Use env and system variables

    You can also use env and system variables, like ${env:USER}:
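
    For example (the table is hypothetical):

        -- the current OS user, from the environment
        select * from imps_part where user_id = '${env:USER}';
        -- Java system properties work the same way, e.g. '${system:user.name}'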

    You can run the following command to see all the available variables:
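
    For example, from the shell:

        hive -e "set;"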

    If you are on the hive prompt,

    [Read More...]
  • An Example to Create a Partitioned Hive Table

    Partitioning is a very useful feature of Hive. Without partitions, it is hard to reuse a Hive table when you use HCatalog to store data into it from Apache Pig, as you will get exceptions when you insert data into a non-partitioned Hive table that is not empty.

    In this post, I use an example to show how to create a partitioned table and populate data into it.

    Let’s suppose you have a dataset for user impressions. For instance, a sample of the data set might be like this:

    id | user_id | user_lang | user_device | time_stamp   | url              | date     | country
    ---+---------+-----------+-------------+--------------+------------------+----------+--------
    1  | u1      | en        | iphone      | 201503210011 | http://xxx/xxx/1 | 20150321 | US
    2  | u1      | en        | ipad        | 201503220111 | http://xxx/xxx/2 | 20150322 | US
    3  | u2      | en        | desktop     | 201503210051 | http://xxx/xxx/3 | 20150321 | CA
    4  | u3      | en        | iphone      | 201503230021 | http://xxx/xxx/4 | 20150323 | HK
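
    Given this dataset, the partitioned table might be created like this (a sketch; the table name is made up, and date and country become partition columns instead of regular columns):

        create table user_impressions (
            id int,
            user_id string,
            user_lang string,
            user_device string,
            time_stamp bigint,
            url string
        )
        partitioned by (`date` string, country string)
        row format delimited fields terminated by '\t'
        stored as textfile;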

    If you use Pig to analyze the data,

    [Read More...]
  • Exceptions When Delete rows from Hive Table

    It’s straightforward to delete data from a traditional relational table using SQL. However, deleting rows from a Hive table can cause several exceptions.

    For example, suppose we have an imps_part table and we want to delete the rows in it.
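
    The simple delete looks like this (the predicate is hypothetical):

        delete from imps_part where `date` = '20160605';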

    When you run this simple delete command, you get:

    FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations

     

    Someone suggested using the following command:
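
    The suggestion is typically an INSERT OVERWRITE that selects nothing back into the table, roughly:

        insert overwrite table imps_part
        select * from imps_part where 1 = 2;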

    This will result in the following exception:
    FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned.

    [Read More...]
  • Save data to Hive table Using Apache Pig

    We have described how to load data from a Hive table using Apache Pig; in this post, I use an example to show how to save data to a Hive table using Pig.

    Before saving data to Hive, you first need to create a Hive table. Please refer to this post on how to create a Hive table.

    Suppose we use Apache Pig to load some data from a text file; we can then save the data to the Hive table using the following script.

    The store_student.pig script is like this:
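
    A sketch of store_student.pig (the input path, field names, and the student table are assumptions):

        -- load tab-delimited text, then store it into the Hive table via HCatalog
        data = LOAD '/user/hdfs/student.txt' USING PigStorage('\t')
               AS (id:int, name:chararray, score:double);
        STORE data INTO 'student'
              USING org.apache.hive.hcatalog.pig.HCatStorer();

    Run it with pig -useHCatalog store_student.pig so the HCatalog jars are on the classpath.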

    Note: You must specify the table name in single quotes: STORE data INTO 'tablename'.

    [Read More...]
  • How to get hive table delimiter or schema

    When you have a Hive table, you may want to check its delimiter or detailed information such as its schema. There are two solutions:

    Get the delimiter of a Hive Table

    To get the field delimiter of a Hive table, one command that works is SHOW CREATE TABLE (replace table_name with your table):
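
        show create table table_name;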

    In the output, look for the FIELDS TERMINATED BY clause (or the field.delim SerDe property); that value is the table's delimiter.

    Get the schema of Hive Table

    Another solution is describe formatted, again with table_name replaced by your table:
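
        describe formatted table_name;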

    This will generate complete information about the table.

    [Read More...]
  • How to load data from a text file to Hive table

    In this post, I describe how to insert data from a text file into a Hive table.

    Suppose you have a tab-delimited file:

    Create a Hive table stored as a text file.
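
    A sketch, assuming three tab-separated columns (all names are hypothetical):

        create table student_text (
            id int,
            name string,
            score double
        )
        row format delimited fields terminated by '\t'
        stored as textfile;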

    Load the text file (stored locally) into the Hive table:
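
    Assuming a local file path (hypothetical) and the table sketched above:

        load data local inpath '/path/to/student.txt' into table student_text;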

    Create a Hive table stored as a sequence file.
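
    Same columns as above, but stored as a sequence file:

        create table student_seq (
            id int,
            name string,
            score double
        )
        stored as sequencefile;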

    Now you can load into the sequence table from the text table:
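
    Using the table names from the sketches above:

        insert overwrite table student_seq
        select * from student_text;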

     

    [Read More...]
Page 1 of 2