Streaming big JSON files the good way with PHP

JSON is now a really popular way to share data between components. I’m working on a small EAI (Enterprise Application Integration) tool. Its basic purpose is to transform data and move it from one application to another. The original data is exported from an SQL instance as JSON, parsed by the EAI, then transformed, tested and stored in another database.

The problem

This seems really simple at first glance, but when you work with big files you cannot simply json_decode a file and put the result where you need it. That works with small files, whose decoded size fits comfortably in your computer’s memory. With big files it’ll eat up memory and might bring the server down, or worse, eat up resources that other applications need.

If we want something that’ll work with files of any size, we need another way of doing this: streaming, that is reading and processing the file piece by piece instead of loading it whole.
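To make the cost concrete, here is a minimal sketch of the naive approach. The temporary file, the object count and the field name are made up for the illustration; the only point is that the whole file and then the whole decoded array have to sit in memory before the first object can be processed.

```php
<?php
// Build a throwaway JSON file: an array of 50,000 small objects (hypothetical data)
$file = tempnam(sys_get_temp_dir(), 'json');
$rows = array();
for ($i = 0; $i < 50000; $i++) {
    $rows[] = array('name' => "This is object number $i");
}
file_put_contents($file, json_encode($rows));
unset($rows);

$before = memory_get_usage();

// Naive approach: read the whole file, then decode the whole document at once
$data = json_decode(file_get_contents($file), true);

$used = memory_get_usage() - $before;
printf("File size: %d bytes, memory taken after decoding: %d bytes\n", filesize($file), $used);

unlink($file);
```

The numbers depend on the PHP version and on the data, so treat them as an order of magnitude rather than a measurement.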

A problem within the problem

Streaming a file is easy, but this is JSON, not CSV: a record doesn’t necessarily fit on one line. Take this JSON sample, which is how the data will usually be formatted:

[ //an array
  { //of objects
    "name": "This is the first object"
  },
  { //later I'll call this a root object
    "name": "This is the second object"
  }
]

We could write an object per line, and this would be easier to parse:

[
    {"name": "This is the first object"},
    {"name": "This is the second object"}
]

With the following algorithm (really simplified):

for each line
    # the file as a whole is valid json, so we need to skip the [ and ] lines
    if not first and not last
        strip the trailing comma
        decode json
        process object
    endif
endfor
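In PHP, the algorithm above could be sketched as follows. This is only an illustration of the "one object per line" idea, not the approach I ended up using; the temporary file and its content are made up so that the snippet runs on its own.

```php
<?php
// Sample file with one root object per line (hypothetical data)
$file = tempnam(sys_get_temp_dir(), 'json');
file_put_contents($file, implode("\n", array(
    '[',
    '    {"name": "This is the first object"},',
    '    {"name": "This is the second object"}',
    ']',
)));

$objects = array();
$handle = fopen($file, 'r');

while (($line = fgets($handle)) !== false) {
    // Drop surrounding whitespace and the comma that separates two objects
    $line = rtrim(trim($line), ',');

    // Skip the enclosing brackets and empty lines
    if ($line === '' || $line === '[' || $line === ']') {
        continue;
    }

    $object = json_decode($line, true);
    if ($object === null) {
        // Not a complete JSON value: the "one object per line" assumption is broken
        throw new RuntimeException("Cannot decode line: $line");
    }

    // "process object": collected here for the demo, a real script would
    // handle each object and let it go to keep memory flat
    $objects[] = $object;
}

fclose($handle);
unlink($file);

print_r($objects);
```

Only one line is held in memory at a time, which is what makes this attractive when the file format can be trusted.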

But you can’t be sure that the file structure will always follow your application’s specs, nor can you be sure that a value will never contain a newline (a pretty-printed nested object or array, for instance).

Json streaming parser

After some digging I came upon a JSON streaming parser in PHP that uses a listener to process objects and arrays as they are encountered while reading from the file pointer. Read more about the script in the author’s blog post.

It’s like a SAX parser, but for JSON.

The problem with the listener is that the code is not commented (not at all!) and the documentation is thin. After a few days I managed to do what I wanted. First, we need to take a look at the Listener interface:

<?php

interface JsonStreamingParser_Listener {
  public function file_position($line, $char);
  //this is called when the document starts
  public function start_document();
  
  //this is called on EOF
  public function end_document();

  //the start/end of an object
  public function start_object();
  public function end_object();
  
  //the start/end of an array
  public function start_array();
  public function end_array();
  
  // Called when a key is found
  public function key($key);
  
  // Called when a value is found
  public function value($value);
  
  public function whitespace($whitespace);
}

Seems pretty easy at first glance. I want to track objects and arrays and fill them with the proper keys and values. When a full root object has been detected, I want the listener to invoke a callback that will handle this object.
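Before building anything, it helps to see the order in which these methods get called. Here is a sketch of a listener that only records event names. The EventLogger class is hypothetical, the interface is redeclared only so the snippet runs without the library, and the events are fired by hand in the order a SAX-style parser would be expected to emit them for a one-object array; it is not output captured from the library.

```php
<?php
// Same interface as above, redeclared only so the snippet runs without the library
if (!interface_exists('JsonStreamingParser_Listener')) {
    interface JsonStreamingParser_Listener {
        public function file_position($line, $char);
        public function start_document();
        public function end_document();
        public function start_object();
        public function end_object();
        public function start_array();
        public function end_array();
        public function key($key);
        public function value($value);
        public function whitespace($whitespace);
    }
}

// Hypothetical listener that records the name of every event it receives
class EventLogger implements JsonStreamingParser_Listener {
    public $events = array();

    public function file_position($line, $char) {}
    public function whitespace($whitespace) {}

    public function start_document() { $this->events[] = 'start_document'; }
    public function end_document() { $this->events[] = 'end_document'; }
    public function start_object() { $this->events[] = 'start_object'; }
    public function end_object() { $this->events[] = 'end_object'; }
    public function start_array() { $this->events[] = 'start_array'; }
    public function end_array() { $this->events[] = 'end_array'; }
    public function key($key) { $this->events[] = "key($key)"; }
    public function value($value) { $this->events[] = 'value(' . json_encode($value) . ')'; }
}

// Events for [{"name": "This is the first object"}], fired by hand (assumed order)
$logger = new EventLogger();
$logger->start_document();
$logger->start_array();
$logger->start_object();
$logger->key('name');
$logger->value('This is the first object');
$logger->end_object();
$logger->end_array();
$logger->end_document();

echo implode("\n", $logger->events), "\n";
```

The listener never sees a finished object: it only gets a flat sequence of events, so it has to remember by itself where it is in the document.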

Now, let’s take a look at the example.php file from the repository (I’ve kept only the interesting parts, full file here):

<?php

class ArrayMaker implements JsonStreamingParser_Listener {
  private $_json;

  private $_stack;
  private $_key;

  public function get_json() {
    return $this->_json;
  }

  public function start_document() {
    $this->_stack = array();

    $this->_key = null;
  }

  public function start_object() {
    array_push($this->_stack, array());
  }

  public function end_object() {
    $obj = array_pop($this->_stack);
    if (empty($this->_stack)) {
      // doc is DONE!
      $this->_json = $obj;
    } else {
      $this->value($obj);
    }
  }

  public function start_array() {
    $this->start_object();
  }

  public function end_array() {
    $this->end_object();
  }

  // Key will always be a string
  public function key($key) {
    $this->_key = $key;
  }

  // Note that value may be a string, integer, boolean, null
  public function value($value) {
    $obj = array_pop($this->_stack);
    if ($this->_key) {
      $obj[$this->_key] = $value;
      $this->_key = null;
    } else {
      array_push($obj, $value);
    }
    array_push($this->_stack, $obj);
  }

}

As stated in its header comment, this isn’t really useful on its own, but it shows a way to build a PHP array from a JSON file. It doesn’t work perfectly, as mentioned in #24, but it helped me understand the listener.

To build a perfect array (by perfect I mean that json_encode($result) should produce the same result as the input JSON), I chose to work with pointers, that is PHP references, tracking every level of the PHP array while it’s being built.
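For readers who don’t use references every day, here is a minimal sketch of the idea with plain variables. The names ($stack, $pointer, $levels) and the data are only illustrative: the pointer is an alias that moves down into the array being built, and stored references let it jump back up.

```php
<?php
$stack = array();

// $pointer is an alias of $stack: writing through it writes into $stack
$pointer =& $stack;
$pointer[0] = array();

// Move the alias one level down, onto the array we just created
$pointer =& $pointer[0];
$pointer['name'] = 'This is the first object';

// Keep a reference to this level so we can come back to it later
$levels = array();
$levels[1] =& $pointer;

// Go one level deeper...
$pointer['tags'] = array();
$pointer =& $pointer['tags'];
$pointer[] = 'a';
$pointer[] = 'b';

// ...and jump back to the stored level
$pointer =& $levels[1];
$pointer['done'] = true;

// Everything written through $pointer ended up inside $stack
echo json_encode($stack), "\n";
```

The listener relies on the same three moves: alias the current array, descend into a new one, and jump back to a stored level when an array or object ends.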

This is the result; please read the comments if you want to understand how it works. It’s a small draft that can still be improved (e.g. end_array and end_object should get back to the nearest pointer).

<?php
require_once './vendor/autoload.php';

/**
 * This implementation allows to process an object at a specific level
 * when it has been fully parsed
 */
class ObjectListener implements JsonStreamingParser_Listener {

    /** @var string Current key **/
    private $_key;

    /** @var int Array nesting level **/
    private $array_level = 0;
    /** @var int Object nesting level **/
    private $object_level = 0;

    /** @var array Pointer that aliases the current array that represents an object or an array **/
    private $pointer;

    /**
     * @var array $array_pointers Stores different array pointers 
     * according to the deep level
     * @var array $object_pointers Stores different objects pointers 
     * according to the deep level
     * Those are used to track pointers, it's easy to go forward 
     * or backwards by using this as they are only references.
     */
    private $array_pointers, $object_pointers;

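    /** @var array Main array that holds the object currently being built **/
    private $stack = array();

    /** @var callable Called with the stack each time a root object is complete **/
    private $callback;

    /** @var array Reset in start_document(), otherwise unused **/
    private $keys = array();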

    /**
     * @param callable $callback the function called when a json 
     * object has been fully parsed
     *
     * @throws InvalidArgumentException if callback isn't callable
     *
     * @return void
     */
    public function __construct($callback)
    {

        if(!is_callable($callback)) {
            throw new \InvalidArgumentException("Callback should be a callable function");
        }

        $this->callback = $callback;
    }

    public function file_position($line, $char) {
    }

    /**
     * Document start
     * Initialize every variable and place the pointer on the stack
     *
     * @return void
     */
    public function start_document() {

        $this->stack = array();
        $this->array_pointers = array();
        $this->array_level = 0;
        $this->object_level = 0;
        $this->object_pointers = array();
        $this->keys = array();
        $this->_key = null;

        $this->pointer =& $this->stack;
    }

    /**
     * Document end (EOF)
     *
     * @return void
     */
    public function end_document() {
        // release memory
        $this->start_document();
    }

    /**
     * Start object
     * An object began...
     *
     * @return void
     */
    public function start_object() {
        //Increase the object level
        $this->object_level++;

        //Point on the current array
        $this->pointer =& $this->array_pointers[$this->array_level];

        //Get the current index
        $array_index = isset($this->pointer) ? count($this->pointer) : 0;

        //Build an array on this index
        $this->pointer[$array_index] = array();

        //Pointer is now this new array
        $this->pointer =& $this->pointer[$array_index];

        //Store it
        $this->object_pointers[$this->object_level] =& $this->pointer;
    }

    /**
     * End Object
     * An object ended
     *
     * @return void
     */
    public function end_object() {

        $this->pointer =& $this->array_pointers[$this->array_level];
        
        //We've reached a complete object in the root array: invoke the callback
        if($this->array_level == 1 && $this->object_level == 1) {
            call_user_func($this->callback, $this->stack);
            array_shift($this->stack[0]); //release this item from memory
        }

        $this->object_level--;
    }

    /** 
     * Start array
     * An array began...
     *
     * @return void
     */
    public function start_array() {
        $this->array_level++;

        //If we have a key it's our index ("0" and "" are valid keys too)
        if($this->_key !== null) {
            $index = $this->_key;
            $this->_key = null;
        } else {
            $index = isset($this->pointer) ? count($this->pointer) : 0;
        }

        //This is our array, point on it
        $this->pointer[$index] = array();
        $this->pointer =& $this->pointer[$index];

        //Store the pointer
        $this->array_pointers[$this->array_level] =& $this->pointer;

    }

    /**
     * End array
     *
     * Now it ended...
     * @todo, according to both levels, point to the nearest one array 
     * or object
     * @return void
     */
    public function end_array() {
        //Point on the last known object 
        $this->pointer =& $this->object_pointers[$this->object_level];
        $this->array_level--;
    }

    /**
     * Called when a key is found
     * @param string $key
     * @return void
     */
    public function key($key) {
        $this->_key = $key;
    }

    /**
     * Called when a value is found
     * @param mixed $value may be a string, integer, boolean, null
     * @return void
     */
    public function value($value) {

        if($this->_key !== null) {
            $this->pointer[$this->_key] = $value;
            //The key has been consumed, don't let it leak onto the next value
            $this->_key = null;
        } else {
            $this->pointer[] = $value;
        }
    }

    public function whitespace($whitespace) {
    }
}

Happy results

With this listener, I ran a small memory benchmark. For the first run I removed the array_shift call (in ObjectListener#end_object) so that the full stack is kept in memory. The callback was called on every root object and wrote the timestamp and the current memory usage:

[Figure: “Bad stream”, memory usage over time without array_shift]

Then I put the array_shift call back to release memory:

[Figure: “Good stream”, memory usage over time with array_shift]

And it’s doing a great job :). I tested this with a 1 MB JSON file, which was parsed in 1 s on my computer. htop showed high CPU usage during the test.
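The benchmark callback described above can be sketched like this. It is a reconstruction of the idea, not the exact script I ran: the "timestamp;memory" line format is an assumption, the log goes to php://memory instead of a real file so the snippet is self-contained, and the listener is simulated by calling the closure by hand.

```php
<?php
// In-memory log so the snippet leaves nothing on disk (a real run would use a file)
$log = fopen('php://memory', 'w+');

// One "timestamp;memory" line per root object handed over by the listener
$callback = function ($obj) use ($log) {
    fwrite($log, sprintf("%.4f;%d\n", microtime(true), memory_get_usage()));
};

// Simulate the listener calling back for three root objects
for ($i = 0; $i < 3; $i++) {
    $callback(array(array('name' => "object $i")));
}

// Read the log back
rewind($log);
$lines = explode("\n", trim(stream_get_contents($log)));
fclose($log);

print_r($lines);
```

Plotting the second column against the first gives a curve that keeps climbing when the stack is never released, and stays flat when each root object is shifted off after the callback.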

If you want to test this on a big file, you can use an 18 MB Shakespeare JSON file, also used in the Elasticsearch tutorial.
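If you’d rather generate a test file of an arbitrary size, something like the following sketch works. The write_test_file helper, the field names and the output path are all hypothetical; the file is written one object at a time so that generating it doesn’t need much memory either.

```php
<?php
// Hypothetical helper: writes $count small objects as one JSON array, one at a time
function write_test_file($path, $count) {
    $handle = fopen($path, 'w');
    fwrite($handle, '[');
    for ($i = 0; $i < $count; $i++) {
        // A comma goes before every object except the first one
        if ($i > 0) {
            fwrite($handle, ',');
        }
        fwrite($handle, "\n  " . json_encode(array('id' => $i, 'name' => "Object number $i")));
    }
    fwrite($handle, "\n]\n");
    fclose($handle);
}

$path = sys_get_temp_dir() . '/big-example.json';
write_test_file($path, 100000);

printf("%s: %d bytes\n", $path, filesize($path));
```

Raise the count until the file is bigger than what json_decode can comfortably handle on your machine.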

If you want to test the listener, here’s how to use it:

<?php

$testfile = __DIR__.'/example.json'; //https://gist.github.com/soyuka/a1d83ff9ff1a6c5cc269

$listener = new ObjectListener(function($obj) {
    var_dump($obj);
});

$stream = fopen($testfile, 'r');
try {
    $parser = new JsonStreamingParser_Parser($stream, $listener);
    $parser->parse();
} catch (Exception $e) {
    fclose($stream);
    throw $e;
}

// Also close the stream when parsing succeeds
fclose($stream);

Up to date gist: https://gist.github.com/soyuka/4468eab47aceb6abd1bf