I recently had to work on translating JSON into a relational, table-like structure stored as Delta Lake tables.
There are generally two ways to transform JSON into a table-like structure. Option 1: flatten the JSON (see https://github.com/amirziai/flatten) and generate unique IDs based on the values within each record. Option 2: destructure the JSON by pulling out each nested level.
Option 1 output example:
{"id" = "b3a0351cf4033d9484363bb6d4092d28" "id": "01", "name": "Jeff", "department_name": "Strategic Hazard Intervention Espionage Logistics Directorate", "department_locations_0": "Fort Meade, Maryland, U.S. 39°6'32"N 76°46'17"W", "department_locations_1": "Seattle, WA"}
This gives you a wide table (a table with many columns), which is useful for performing some data transformations. Depending on the nesting, though, it can contain a lot of duplicated/redundant data.
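As a rough, self-contained sketch of Option 1 (the `flatten` helper and the `row_id` naming here are my own, not the linked library's API):

```python
import hashlib
import json

def flatten(obj, parent_key="", sep="_"):
    """Recursively flatten nested dicts/lists into a single-level dict."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else k
            items.update(flatten(v, new_key, sep))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{parent_key}{sep}{i}", sep))
    else:
        items[parent_key] = obj
    return items

record = {"id": "01", "name": "Jeff",
          "department": {"locations": ["Fort Meade", "Seattle"]}}
flat = flatten(record)
# Generate a unique row id from the flattened values.
flat["row_id"] = hashlib.md5(
    json.dumps(flat, sort_keys=True).encode("utf-8")).hexdigest()
```

Here `flat` ends up with keys like `department_locations_0`, matching the wide-table shape above.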
The second option is more suitable if you want to query the data in a relational manner. The __id__ is an identifying hash computed from each nested JSON level. Pandas provides a normalize method (`json_normalize`) that does something similar.
{"name": "location", "table": ["Fort Meade, Maryland, U.S. 39°6'32'N 76°46'17'W, "Seattle, WA"], "parent_id": "1cd7de4b25fe5c1e134c5d3d43e2ec39", "id": "4b58bb58656974d2bec3258ba6e1d92c"}
{"name": "department", "table": {"name": "Strategic Hazard Intervention Espionage Logistics Directorate"}, "id": "1cd7de4b25fe5c1e134c5d3d43e2ec39", "parent_id": "f2198883d8412e528c0d0128ea2998d6"}
{"name": "root", "table": {"id": "01", "name": "Jeff"}, "id": "f2198883d8412e528c0d0128ea2998d6", "parent_id": None}
The default hashing algorithm is MD5, which is sufficiently unique for our needs. The input is the data at that level of the JSON. If the JSON was loaded from a file, the filename can also be used as part of the hash. Depending on the needs, duplicated data rows could be created, or a single row if we used a __child_id__ field rather than a __parent_id__ field. Keeping a deterministic hash that is consistently generated from the input is important when we want to update existing entries. A GUID/UUID would be unique, but it is regenerated every time, so we could only ever append to the existing data.
def hash(self, data: Union[DestructureType, Dict[str, ExpectedTypes], JsonType], seed: str = '') -> str:
    """
    Returns a semantically invariant and deterministic hash of the passed-in JsonTree.
    :param: data - JsonType - The data to be hashed.
    :param: seed - str - A unique string to be appended to the data prior to hashing.
    """
    if seed is None:
        seed = ''
    d: JsonTree = self.is_json(data)  # type: ignore
    return self._hasher(bytes(repr(self.__sorted_dict_str(d)), self._encoding)
                        + bytes(seed, self._encoding)).hexdigest()
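The key property is that the same input always yields the same hash regardless of key order. A standalone sketch of that idea (not the library's exact `__sorted_dict_str` implementation):

```python
import hashlib

def dict_hash(data, seed=""):
    """Deterministic, key-order-invariant MD5 of a JSON-like structure."""
    def canonical(obj):
        # Sort dict keys recursively so key order never changes the repr.
        if isinstance(obj, dict):
            return {k: canonical(obj[k]) for k in sorted(obj)}
        if isinstance(obj, list):
            return [canonical(v) for v in obj]
        return obj
    return hashlib.md5((repr(canonical(data)) + seed).encode("utf-8")).hexdigest()

# Key order does not change the hash ...
assert dict_hash({"a": 1, "b": 2}) == dict_hash({"b": 2, "a": 1})
# ... but a seed (e.g. the source filename) does.
assert dict_hash({"a": 1}, seed="file.json") != dict_hash({"a": 1})
```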
We iterate through the JSON properties or dictionary items and then recursively traverse the non-primitive properties.
def destructure(self, data: DestructureType, parent_name: str = ROOT) -> List[DestructureType]:
    """
    Recursively extracts nested dictionaries from a dictionary and converts them to a list
    of dictionaries, where each nested dictionary is a level with a reference back to its parent.
    :param: data - Dictionary of DestructureType
    :param: parent_name - Used to determine nested object or root.
    :returns: List of DestructureType - dictionaries of primitive types.
    """
    result: List[DestructureType] = []
    level: DestructureType = {}
    # Note that the root ROW_ID is the hash of the dictionary.
    # Subsequent recursive calls will have the hash of the nested
    # dictionaries/lists. Hashing only the primitive values could be
    # problematic since there may not be any.
    if parent_name == ROOT and self.__hash_mode == Destructor.NESTED:
        id_hash = self.d_hasher.hash(data)
    else:
        id_hash = self.d_hasher.hash(data, self._unique_id)
    if self.__hash_mode == Destructor.NESTED:
        self._unique_id = id_hash  # For use by recursive calls.
    self.__depth += 1
    for key, value in data.items():
        if self.__depth == self._max_depth:
            raise RecursionError(f"{DEPTH_ERROR}{self._max_depth}")
        # We exclude keys here instead of above so the hash still covers the original data.
        if key in self._keys_to_exclude:
            continue
        if isinstance(value, dict):
            result.extend(self._process_nested_collections(id_hash, key, value))  # type: ignore
        elif isinstance(value, list):
            if all(type(v) in Destructor.primitive_types for v in value):
                child_hash = self.d_hasher.hash(value, self._unique_id)
                result.append({NAME: key, TABLE: value, PARENT_ID: id_hash, ROW_ID: child_hash})
            else:
                # Mix of primitive and dict/list values.
                for v in value:
                    if type(v) in Destructor.primitive_types:
                        child_hash = self.d_hasher.hash(v, self._unique_id)
                        result.append({NAME: key, TABLE: v, PARENT_ID: id_hash, ROW_ID: child_hash})
                    else:
                        result.extend(self._process_nested_collections(id_hash, key, v))
                        # result.extend([self.destructure(val, key) for val in v])
        else:
            level[key] = value
    # Siblings
    level = {NAME: parent_name, TABLE: level, ROW_ID: id_hash}
    result.append(level)
    if parent_name == ROOT:
        # Root could contain immediate child hashes.
        level[NAME] = ROOT
        level[PARENT_ID] = None
        level[ROW_ID] = id_hash
        self.__depth = 0
    return result
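The code above leans on class state and helpers. Stripped down, the core idea (hash each level, emit a row, recurse into nested dicts) can be sketched like this; the key names and hashing are simplified, and `_process_nested_collections` is inlined as a plain recursive call:

```python
import hashlib

def destructure(data, name="root", parent_id=None):
    """Pull each nested dict out into its own row, linked by parent_id.

    Simplified sketch: a real implementation would sort keys recursively
    before hashing and handle lists of dicts, depth limits, etc.
    """
    row_id = hashlib.md5(repr(sorted(data.items())).encode("utf-8")).hexdigest()
    level, rows = {}, []
    for key, value in data.items():
        if isinstance(value, dict):
            rows.extend(destructure(value, key, row_id))  # recurse into nested level
        elif isinstance(value, list):
            child_id = hashlib.md5(repr(value).encode("utf-8")).hexdigest()
            rows.append({"name": key, "table": value,
                         "parent_id": row_id, "id": child_id})
        else:
            level[key] = value  # primitive: stays at this level
    rows.append({"name": name, "table": level, "parent_id": parent_id, "id": row_id})
    return rows
```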
The logic to perform the destructuring/normalization can be found here:
https://github.com/Wind010/destructure
Using this code, you can produce the second output shown above:
import json
d_hasher = DictionaryHasher()
destructor = Destructor(d_hasher)
json_str = '{"id":"01", "name": "Jeff", "department": {"name": "Strategic Hazard Intervention Espionage Logistics Directorate", "location": "Fort Meade, Maryland, U.S. 39°6′32″N 76°46′17″W" } }'
data = json.loads(json_str)
list_of_dicts = destructor.destructure(data)
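From here, each distinct __name__ becomes a candidate table. One hypothetical way to group the rows before writing them out (the sample rows below mimic the destructured output; the flattened column names are my own):

```python
from collections import defaultdict

# Sample rows shaped like the output of destructure() above.
rows = [
    {"name": "department",
     "table": {"name": "Strategic Hazard Intervention Espionage Logistics Directorate"},
     "id": "1cd7de4b25fe5c1e134c5d3d43e2ec39",
     "parent_id": "f2198883d8412e528c0d0128ea2998d6"},
    {"name": "root", "table": {"id": "01", "name": "Jeff"},
     "id": "f2198883d8412e528c0d0128ea2998d6", "parent_id": None},
]

tables = defaultdict(list)
for row in rows:
    # Flatten each level into one record: its own columns plus the linking ids.
    tables[row["name"]].append(
        {**row["table"], "row_id": row["id"], "parent_id": row["parent_id"]})

# Each entry in `tables` can now be written out as its own Delta Lake table,
# e.g. spark.createDataFrame(tables["department"]).write.format("delta")...
```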