
Starbase Encoding Gotcha

February 24, 2014

There’s a library for Python called Starbase that makes it easy to interact with HBase through its REST server.

While trying this library, I discovered a potential gotcha if you expect compatibility with the Java library for HBase: any value you insert via Starbase that is not already a string is converted to a string before it is written to HBase, whereas the Java library encodes values such as integers as raw bytes. It’s easy to work around; you just need to be aware of it.

For example, say I have an integer value stored in a Python variable. Inserting it into the column “info:MyInt” of row “fromStarbase” with Starbase looks something like:

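from starbase import Connection

# Assumes the HBase REST server is reachable at Starbase's default host and port.
c = Connection()
my_int_value = 12345
t = c.table('testTable')
t.insert('fromStarbase', { 'info:MyInt':my_int_value })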

As the Starbase library prepares this data for submission, it calls str() on the variable my_int_value, converting it to the string "12345".

Using the Java API, the same action looks something like:

int my_int_value = 12345;
Put p = new Put(Bytes.toBytes("fromJava"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("MyInt"),
    Bytes.toBytes(my_int_value));
HTableInterface table = pool.getTable("testTable");
table.put(p);

This uses the function Bytes.toBytes(int) to encode the integer value. In this case, it does not convert it to a string, but rather to an array of four bytes in big-endian order.
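To see the two encodings side by side without touching HBase, here is a minimal sketch using Python's standard struct module. It assumes Python 3 and that the format code '>i' (big-endian, signed, 32-bit) matches what Bytes.toBytes(int) produces; the helper names are made up for illustration and are not part of Starbase or HBase.

```python
import struct

def as_text_bytes(value):
    """Bytes that result when the value is passed through str() first."""
    return str(value).encode("ascii")

def as_java_int_bytes(value):
    """Four bytes, big-endian two's complement (assumed to match Bytes.toBytes(int))."""
    return struct.pack(">i", value)

print(as_text_bytes(12345))      # b'12345'
print(as_java_int_bytes(12345))  # b'\x00\x0009'
```

12345 is 0x00003039, and the bytes 0x30 and 0x39 happen to be the ASCII characters "0" and "9", which is why the raw encoding displays as \x00\x0009.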

Note that neither approach is incorrect, because HBase deals in sequences of bytes and is agnostic as to how you choose to encode data. However, I suspect the Java approach is more widely used by convention, and therefore somewhat expected.

On the other hand, the HBase shell also converts values entered in the shell to a string, so maybe people do expect it:

put 'testTable', 'fromShell', 'info:MyInt', 12345

Scanning the table from the shell shows how the value written from Java differs from the other two:

hbase(main):016:0> scan 'testTable'
ROW            COLUMN+CELL
 fromJava      column=info:MyInt, ts=1393306182999, value=\x00\x0009
 fromShell     column=info:MyInt, ts=1393301137823, value=12345
 fromStarbase  column=info:MyInt, ts=1393301137823, value=12345
3 row(s) in 0.0230 seconds

Converting to a string does have the benefit of making the value human-readable from the shell. But for most data I would prefer the Java convention. To make Starbase and Java share the same encoding for integers, encode the value before inserting it:

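def encode_int(intval):
    """Encode a 32-bit integer as a 4-character string in big-endian order."""
    # Shift by 24, 16, 8 and 0 bits to pick out each byte, most significant first.
    chars = [ chr((intval >> i) & 0xff) for i in range(24, -1, -8) ]
    return ''.join(chars)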
 
my_int_value = 12345
my_encoded_int_value = encode_int(my_int_value)
t = c.table('testTable')
t.insert('fromStarbaseEnc', { 'info:MyInt':my_encoded_int_value })

Now the value is encoded the same way as the one written by the Java client:

hbase(main):017:0> scan 'testTable'
ROW               COLUMN+CELL
 fromJava         column=info:MyInt, ts=1393306182999, value=\x00\x0009
 fromShell        column=info:MyInt, ts=1393301137823, value=12345
 fromStarbase     column=info:MyInt, ts=1393301137823, value=12345
 fromStarbaseEnc  column=info:MyInt, ts=1393306195342, value=\x00\x0009
4 row(s) in 0.0250 seconds

The Java library uses similar binary encodings for longs, floats and doubles, so for full compatibility you need analogous encoders in Python.
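As a rough sketch of what those encoders could look like, the standard struct module covers all three types. This assumes Python 3 and that Java's Bytes.toBytes(long), Bytes.toBytes(float) and Bytes.toBytes(double) produce big-endian two's-complement and IEEE 754 bytes respectively; the function names are hypothetical, and how Starbase passes bytes objects through to the REST server is something to check rather than take for granted.

```python
import struct

# Format codes: '>' = big-endian, q = 8-byte signed integer,
# f = 4-byte IEEE 754 float, d = 8-byte IEEE 754 double.

def encode_long(value):
    """Eight bytes, big-endian (assumed to match Bytes.toBytes(long))."""
    return struct.pack(">q", value)

def encode_float(value):
    """Four bytes, IEEE 754 single precision, big-endian."""
    return struct.pack(">f", value)

def encode_double(value):
    """Eight bytes, IEEE 754 double precision, big-endian."""
    return struct.pack(">d", value)

def decode_long(raw):
    """Inverse of encode_long, handy when reading values back."""
    return struct.unpack(">q", raw)[0]

print(encode_long(12345))   # b'\x00\x00\x00\x00\x00\x0009'
print(encode_float(1.0))    # b'?\x80\x00\x00'
print(encode_double(1.0))   # b'?\xf0\x00\x00\x00\x00\x00\x00'
```

Unlike the shift-and-mask approach, struct.pack raises struct.error when a value does not fit the target width, which may be preferable to silently truncating it.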

One Comment
  1. Mark
    August 14, 2014 8:28 PM

    Awesome post. Well-written, too. Would love for you to add encoders for some of the other data types, not just ints. Or at least tell us how you figured it out. Did you look at the Java code and translate it to Python? Thanks, though, this was very helpful.
