Starbase Encoding Gotcha
There’s a library for Python called Starbase that makes it easy to interact with HBase over the REST server.
While trying this library, I discovered a potential gotcha if you’re expecting to be compatible with the Java library for HBase: if you insert any data that is not already a string via Starbase, it is converted to a string before it is written to HBase, whereas the Java library encodes integers as raw bytes. It’s easy to work around; you just need to be aware of it.
For example, say I have an integer value stored in a Python variable. Inserting it into a column called “info:MyInt” for row “fromStarbase” with Starbase looks something like:
my_int_value = 12345
t = c.table('testTable')
t.insert('fromStarbase', { 'info:MyInt':my_int_value })
As the Starbase library prepares this data for submission, it calls str() on the variable my_int_value, converting it to the string "12345".
Using the Java API, the same action looks something like:
int my_int_value = 12345;
Put p = new Put(Bytes.toBytes("fromJava"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("MyInt"), Bytes.toBytes(my_int_value));
HTableInterface table = pool.getTable("testTable");
table.put(p);
This uses the function Bytes.toBytes(int) to encode the integer value. In this case, it does not convert it to a string, but rather to an array of four bytes in big-endian order.
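To make the difference concrete, here is a minimal sketch in Python 3 that builds both byte sequences for the same integer. It uses the standard struct module as a stand-in for Bytes.toBytes(int); the variable names as_text and as_binary are mine, for illustration only.

```python
import struct

my_int_value = 12345

# What a str() conversion yields: five ASCII digit characters.
as_text = str(my_int_value).encode("ascii")

# A 4-byte big-endian signed integer (struct format ">i"),
# the same layout the Java snippet above writes.
as_binary = struct.pack(">i", my_int_value)

print(len(as_text), repr(as_text))      # 5 b'12345'
print(len(as_binary), repr(as_binary))  # 4 b'\x00\x0009'
```

The last two bytes of the binary form are 0x30 and 0x39, which happen to be the ASCII codes for "0" and "9". That is why the value prints as \x00\x0009.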
Note that neither approach is incorrect, because HBase deals in sequences of bytes and is agnostic as to how you choose to encode data. However, I suspect the Java approach is more widely used by convention, and therefore somewhat expected.
On the other hand, the HBase shell also converts values entered in the shell to a string, so maybe people do expect it:
put 'testTable', 'fromShell', 'info:MyInt', 12345
Doing a scan of the table from the shell, we can see how the value written from Java differs from the other two:
hbase(main):016:0> scan 'testTable'
ROW            COLUMN+CELL
 fromJava      column=info:MyInt, ts=1393306182999, value=\x00\x0009
 fromShell     column=info:MyInt, ts=1393301137823, value=12345
 fromStarbase  column=info:MyInt, ts=1393301137823, value=12345
3 row(s) in 0.0230 seconds
Converting to a string does have the benefit of making the value human-readable from the shell, but for most data I would prefer the Java convention. To make Starbase and Java share the same encoding for integers, encode the value before inserting it:
def encode_int(intval):
    """Encode integer as a string in big-endian order."""
    chars = [ chr((intval >> i) & 0xff) for i in range(24, -1, -8) ]
    return ''.join(chars)

my_int_value = 12345
my_encoded_int_value = encode_int(my_int_value)
t = c.table('testTable')
t.insert('fromStarbaseEnc', { 'info:MyInt':my_encoded_int_value })
Now the value is encoded the same way the Java client encodes it:
hbase(main):017:0> scan 'testTable'
ROW               COLUMN+CELL
 fromJava         column=info:MyInt, ts=1393306182999, value=\x00\x0009
 fromShell        column=info:MyInt, ts=1393301137823, value=12345
 fromStarbase     column=info:MyInt, ts=1393301137823, value=12345
 fromStarbaseEnc  column=info:MyInt, ts=1393306195342, value=\x00\x0009
4 row(s) in 0.0250 seconds
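Reading such a value back needs the reverse step. Below is a rough sketch of a decoder in Python 3, assuming the cell value is available as a bytes object; decode_int is a hypothetical helper name, and how Starbase hands values back to you is not covered here.

```python
import struct

def decode_int(raw):
    """Decode a 4-byte big-endian signed integer."""
    if len(raw) != 4:
        raise ValueError("expected 4 bytes, got %d" % len(raw))
    # ">i" means big-endian, signed, 4 bytes.
    return struct.unpack(">i", raw)[0]

# The bytes shown by the shell as \x00\x0009 are 0x00 0x00 0x30 0x39.
print(decode_int(b"\x00\x0009"))  # 12345
```

The length check matters because a cell written as text, such as "12345", would otherwise fail with a less helpful struct error or, for a 4-character string, silently decode to the wrong number.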
The Java library uses similar binary encodings for longs, floats and doubles, so for completeness you need the analogous encoders in Python.
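As a starting point, here is one possible sketch of those encoders using the standard struct module in Python 3. The function names are my own, and the formats assume the Java side writes 8-byte big-endian longs and IEEE 754 floats and doubles in big-endian order. These functions return bytes objects; whether your version of Starbase passes bytes through unchanged is something to verify with a scan from the shell.

```python
import struct

def encode_long(value):
    """Encode an integer as an 8-byte big-endian signed long."""
    return struct.pack(">q", value)

def encode_float(value):
    """Encode a number as a 4-byte big-endian IEEE 754 float."""
    return struct.pack(">f", value)

def encode_double(value):
    """Encode a number as an 8-byte big-endian IEEE 754 double."""
    return struct.pack(">d", value)

print(repr(encode_long(12345)))  # b'\x00\x00\x00\x00\x00\x0009'
print(len(encode_float(1.0)))    # 4
print(len(encode_double(1.0)))   # 8
```

Using struct rather than manual shifting keeps each encoder to a single line and raises struct.error when a value does not fit the target type.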
Awesome post. Well-written, too. Would love for you to add encoders for some of the other data types, not just ints. Or at least tell us how you figured it out. Did you look at the Java code and translate it to Python? Thanks, though, this was very helpful.