HBase Storage and PIG

We’ve been using PIG for analytics and for processing data for use in our site for some time now. PIG is a high-level language for building data analysis programs that run across a distributed Hadoop cluster. It has allowed us to scale up our data processing while decreasing the amount of time it takes to run jobs.

When it came time to update our runtime data storage for the site, it was natural for us to consider using HBase to achieve horizontal scalability. HBase is a distributed, versioned, column-oriented store based on Hadoop. One of the great advantages of using HBase is the ability to integrate it with our existing PIG data processing. In this post I will introduce you to the basics of working with HBase from your PIG scripts.

Getting Started

Before getting into the details of using HBaseStorage, make sure the following two environment variables are set so that HBaseStorage can work correctly.

export HBASE_HOME=/usr/lib/hbase

export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

First, you will need to let HBaseStorage know where to find the HBase configuration, thus the HBASE_HOME environment variable. Second, the PIG_CLASSPATH needs to be extended to include the classpath for loading HBase. If you are using PIG 0.8.x there is a slight variation:

export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"
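If HBASE_HOME points at the wrong directory, the backtick command above fails and the classpath silently ends up without the HBase jars. A minimal sketch of a sanity check you could run first is shown here; check_hbase_env is a hypothetical helper name of my own, not something shipped with HBase or PIG, and it only assumes the bin/hbase launcher used in the exports above.

```shell
# Hypothetical helper (not part of HBase or PIG): confirm that HBASE_HOME
# is set and contains an executable bin/hbase launcher.
check_hbase_env() {
    if [ -z "${HBASE_HOME:-}" ]; then
        echo "HBASE_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "${HBASE_HOME}/bin/hbase" ]; then
        echo "no executable launcher at ${HBASE_HOME}/bin/hbase" >&2
        return 1
    fi
    return 0
}

# Example use (extend the classpath only when the check passes):
#   check_hbase_env && \
#     export PIG_CLASSPATH="$("${HBASE_HOME}/bin/hbase" classpath):${PIG_CLASSPATH}"
```

The function only reports problems and returns a status, so it is safe to source from a login script.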

Hello World

Let’s write a simple script to load some data from a file and write it out to an HBase table. To begin, use the shell to create your table:

jhoover@jhoover2:~$ hbase shell

HBase Shell; enter 'help' for list of supported commands.

Type "exit" to leave the HBase Shell

Version 0.90.3-cdh3u1, r, Mon Jul 18 08:23:50 PDT 2011

hbase(main):002:0> create 'sample_names', 'info'

0 row(s) in 0.5580 seconds

Next, we’ll put some simple data in a file 'sample_data.csv':

1, John, Smith

2, Jane, Doe

3, George, Washington

4, Ben, Franklin

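Then we’ll write a simple script to extract this data and write it into fixed columns in HBase. HBaseStorage uses the first field of each tuple (listing_id here) as the row key and maps the remaining fields, in order, to the columns listed in its argument: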

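-- Load the comma-separated input file.
raw_data = LOAD 'sample_data.csv' USING PigStorage(',') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray );

-- The first field (listing_id) becomes the HBase row key; the remaining
-- fields are written to info:fname and info:lname, in that order.
STORE raw_data INTO 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname');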

Save the script as hbase_sample.pig and run it in local mode:

jhoover@jhoover2:~/hbase_sample$ pig -x local hbase_sample.pig


Job Stats (time in seconds):

JobId Alias Feature Outputs

job_local_0001 raw_data MAP_ONLY hbase://sample_names,


Successfully read records from: "file:///autohome/jhoover/hbase_sample/sample_data.csv"


Successfully stored records in: "hbase://sample_names"

Job DAG:


You can then see the results of your script in the hbase shell:

hbase(main):001:0> scan 'sample_names'


1 column=info:fname, timestamp=1356134399789, value= John

1 column=info:lname, timestamp=1356134399789, value= Smith

2 column=info:fname, timestamp=1356134399789, value= Jane

2 column=info:lname, timestamp=1356134399789, value= Doe

3 column=info:fname, timestamp=1356134399789, value= George

3 column=info:lname, timestamp=1356134399789, value= Washington

4 column=info:fname, timestamp=1356134399789, value= Ben

4 column=info:lname, timestamp=1356134399789, value= Franklin

4 row(s) in 0.4850 seconds
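Note the space after "value=" in the scan output: each stored value begins with a space (" John", " Smith") because the input file has a space after every comma and PigStorage splits on the comma alone. If you would rather store the bare values, one option is to clean the file before loading it. The following is a minimal sketch; trim_csv_fields is a hypothetical helper name, and it assumes no field legitimately starts with whitespace.

```shell
# Hypothetical helper: remove the spaces and tabs that follow each comma,
# so that PigStorage(',') yields "John" rather than " John".
trim_csv_fields() {
    sed 's/,[[:space:]]*/,/g' "$1"
}

# Example use (writes a cleaned copy next to the original):
#   trim_csv_fields sample_data.csv > sample_data_clean.csv
```

Whitespace inside a field is left alone, since only the run immediately after a comma is removed.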

Sample Code

You can download the sample code from this blog post here.

Next: Column Families

PIG 0.9.0 adds the ability to treat entire column families as maps. In my next post I’ll share some examples, as well as some UDFs we wrote to support that.

Have any questions or tips of your own? Let me know here, or follow me on Twitter at @sublogical or check out my personal blog!

posted: Developer's Corner
