HADOOP SERIALIZATION

Serialization is the process of transforming structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of transforming a byte stream back into a series of structured objects. Serialization is used in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.

Hadoop makes heavy use of network transmission to execute its jobs. As Doug Cutting (the creator of Hadoop) explains in one of his posts, java.io.Serializable is too heavyweight for Hadoop's needs, so a new interface was developed: Writable. Every object you need to emit from mappers to reducers, or to write as output, has to implement this interface so that Hadoop can transmit the data between the nodes in the cluster.

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to convert the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.

In general, an RPC serialization format is expected to be:

  • Compact: To use network bandwidth efficiently.
  • Fast: Very little performance overhead is expected for the serialization and deserialization processes.
  • Extensible: To adapt to new changes and requirements.
  • Interoperable: The format needs to support clients that are written in different languages from the server.

It might be assumed that the data format chosen for persistent storage would have different requirements from a serialization framework. After all, the lifespan of an RPC is less than a second, whereas persistent data may be read years after it was written. But it turns out that the four desirable properties of an RPC's serialization format are also crucial for a persistent storage format.

Hadoop uses its own serialization format, which is known as Writables. It is certainly compact and fast, but not so easy to extend or use from languages other than Java.


The Writable Interface

The Writable interface has two methods: write(), which writes an object's state to a DataOutput binary stream, and readFields(), which reads its state back from a DataInput binary stream.

package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable {
	void write(DataOutput out) throws IOException;
	void readFields(DataInput in) throws IOException;
}
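To make this concrete, here is a minimal sketch of a custom Writable holding two int fields. The class name and fields are hypothetical, chosen only to illustrate how write() and readFields() mirror each other:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type: a pair of ints carried between mappers and reducers.
public class IntPairWritable implements Writable {
	private int first;
	private int second;

	// Hadoop instantiates Writables by reflection, so a no-arg constructor is needed.
	public IntPairWritable() {
	}

	public IntPairWritable(int first, int second) {
		this.first = first;
		this.second = second;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(first);   // write the fields in a fixed order...
		out.writeInt(second);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		first = in.readInt();  // ...and read them back in exactly the same order
		second = in.readInt();
	}
}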

Let's see what we can do with a particular Writable. We will use IntWritable, a wrapper for a Java int. We can create one and set its value in two ways. The first is to use the set() method:

IntWritable writable = new IntWritable();
writable.set(57);

The other way is to use the constructor that takes the integer value:

IntWritable writable = new IntWritable(57);

Now let us look at serialization in action by writing a helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream to capture the bytes of the serialized stream:

public static byte[] serialize(Writable writable) throws IOException {
	ByteArrayOutputStream outStream = new ByteArrayOutputStream();
	DataOutputStream dataOut = new DataOutputStream(outStream);
	writable.write(dataOut);
	dataOut.close();
	return outStream.toByteArray();
}

Now that we have serialization covered, let's write the corresponding helper for deserialization:

public static byte[] deserialize(Writable writable, byte[] bytes)
throws IOException {
	ByteArrayInputStream inStream = new ByteArrayInputStream(bytes);
	DataInputStream dataIn = new DataInputStream(inStream);
	writable.readFields(dataIn);
	dataIn.close();
	return bytes;
}
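Putting the two helpers together, a quick round trip might look like the sketch below (the assertions use the same assertThat style as the comparator examples later on; an IntWritable serializes to the 4 bytes of a Java int):

IntWritable original = new IntWritable(163);
byte[] bytes = serialize(original);            // 4 bytes, written with DataOutput.writeInt()
assertThat(bytes.length, is(4));

IntWritable restored = new IntWritable();
deserialize(restored, bytes);                  // readFields() repopulates the empty instance
assertThat(restored.get(), is(163));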

WritableComparable and comparators

The WritableComparable interface is a subinterface of both the Writable and java.lang.Comparable interfaces:

package org.apache.hadoop.io;
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
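For example, the hypothetical IntPairWritable sketched earlier could be used as a MapReduce key by declaring it a WritableComparable and adding a compareTo() method (again only a sketch):

public class IntPairWritable implements WritableComparable<IntPairWritable> {
	// write() and readFields() as shown before ...

	@Override
	public int compareTo(IntPairWritable other) {
		int cmp = Integer.compare(first, other.first);                  // order by first field...
		return cmp != 0 ? cmp : Integer.compare(second, other.second); // ...then by second
	}
}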

MapReduce has a sorting phase during which keys must be compared with one another, so comparison of types is crucial for MapReduce. Hadoop provides RawComparator, an extension of Java's Comparator:

public interface RawComparator<T> extends Comparator<T> {
	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

Using this interface you can compare records read from a stream without deserializing them into objects, thereby avoiding any overhead of object creation. For example, the comparator for IntWritables implements the raw compare() method by reading an integer from each of the byte arrays b1 and b2 and comparing them directly from the given start positions (s1 and s2) and lengths (l1 and l2).
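The snippet below is not the actual IntWritable.Comparator source, just a sketch of the same idea: the ints are read straight out of the byte arrays with WritableComparator's static readInt() helper, so no IntWritable objects are ever created.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Sketch of a raw comparator that compares serialized IntWritables byte-for-byte.
public class RawIntComparator extends WritableComparator {
	public RawIntComparator() {
		super(IntWritable.class);
	}

	@Override
	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
		int left = readInt(b1, s1);    // decode 4 bytes at the given offset
		int right = readInt(b2, s2);
		return Integer.compare(left, right);
	}
}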

WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions. First, it provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method. Second, it acts as a factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use:

RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

The comparator can be used to compare two IntWritable objects:

IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));

or their serialized representations:

byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length),
greaterThan(0));

Writable Classes

The org.apache.hadoop.io package comes with a large selection of Writable classes. There are Writable wrappers for all the Java primitive types except char (a char can be stored in an IntWritable). All have a get() and a set() method for retrieving and storing the wrapped value.

Java primitive     Writable implementation
boolean            BooleanWritable
byte               ByteWritable
short              ShortWritable
int                IntWritable, VIntWritable
float              FloatWritable
long               LongWritable, VLongWritable
double             DoubleWritable

Java class         Writable implementation
String             Text
byte[]             BytesWritable
Object             ObjectWritable
null               NullWritable

Java collection    Writable implementation
array              ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable
Map                MapWritable
SortedMap          SortedMapWritable
enum set           EnumSetWritable
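
As a quick, illustrative sketch of a few of these wrappers in use (all classes are from org.apache.hadoop.io):

Text greeting = new Text("hadoop");           // Writable wrapper for String, stored as UTF-8
greeting.set("serialization");                // wrappers are mutable and reusable

NullWritable nothing = NullWritable.get();    // singleton placeholder when no value is needed

MapWritable map = new MapWritable();          // a java.util.Map whose keys and values are Writables
map.put(new Text("count"), new IntWritable(42));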