trade-tariff-backend: Goods Nomenclature Nested Set
Tariff Data
The Tariff data we receive from CDS or Taric contains an ordered list of all Goods Nomenclatures. The order is determined by the concatenation of their goods_nomenclature_item_id
and their producline_suffix
. Commodities position within the logical hierarchy is determined by any given commodities indentation. This data is held in a second table called goods_nomenclatures_indents
- these indicate the depth of nesting of the Goods Nomenclatures they belong to, eg
0100000000-80 # Indent 0
- 0101000000-80 # Indent 0
- - 0101010000-10 # Indent 1
- - - 0101010000-80 # Indent 2, note: item_id matches parent but suffix is higher
- - - 0101010001-80 # Indent 2
- - 0101010100-80 # Indent 1
A classic Nested Set model
The Nested Set model page on wikipedia explains this well.
A Nested Set assumes an ordered identifier for records within the hierarchical database table.
Those records hold left + right (or start + end) values defining a range exclusively containing the identifiers of descendent records within the hierarchy.
Depth within the hierarchy must be dynamically computed by determining ancestors for a record. Ancestors are any other records whose own range fully encompasses the range (left + right) of the target record.
Reading the data
The Tariff data's hierarchy can be thought of as a modified form of that Nested Set model.
- The data is an ordered list of identifiers (item_id + suffix)
- The 'start of range' value (or 'left') is defined as a records identifier + 1
- We do not need to compute depth, that is provided by the indents table
- The 'end of range' value is 1 less then the lowest identifier which has the same depth as the origin record, and which is greater than the records own identifier.
Descendants and Ancestors can be fetched non-recursively by implementing the above rules.
Tree Nodes materialized view
To facilitate fast retrieval of data, a goods_nomenclature_tree_nodes
materialized view is added as a facade in front of the existing indents table. This provides an indexable ordered identifier (called position
) for all goods nomenclatures which can be used for efficient JOIN
-ing.
There are 3 key changes in this tree_nodes
view relative to the goods_nomenclature_indents
db view sitting behind it.
- Added a
position
column - this is a string concatenation of the 10 digitgoods_nomenclature_item_id
and the 2 digitproducline_suffix
- then cast to an integer. The rationale behind this is -;- it provides an ordered identifier, that follows the correct sequence of the dataset.
- it does not need to preserve the original information, it just needs to maintain the same order
- it is a single field allowing for easy use of MAX/MIN sql operators
- it is an integer for index performance - in initial prototyping this was a string because the source fields are strings but it was found queries were 4x faster when cast to an integer. Using a decimal instead (ie 1010101010.80) was found to be slower then using integers but faster than strings.
- Added a
depth
column - this isnumber_indents + 2
for all Headings and their descendents.- Chapters are the exception to this, they have a position of
number_indents + 1
, ie1
. - This makes the field an absolute definition of the depth, unlike the
number_indents
field which is 0 for both Chapters and Headings despite them being at different conceptual depths in the hierarchy - At present
depth == 0
is a single conceptual root of the hierarchy encompassing all Chapters and all of their descendants
- Chapters are the exception to this, they have a position of
- Populated
validity_end_date
intree_nodes
- this is never populated in the indents table- the missing end date means that we need to join the indents table on to itself to fetch only the most recent indent
- this JOIN is instead done during the generation of the
tree_nodes
view to both simplify querying thetree_nodes
data and to avoid repeatedly doing the JOIN during hierarchy lookups
Note: The position
identifier is not unique because there may be multiple tree_nodes
records for a goods nomenclature, all with the same position
but different validity dates. A unique identifier would be either the triple of the identifier and dates, ie position + validity_start_date + validity_end_date
or the goods_nomenclature_indent_sid
which refers to a tree_nodes
record for a given point in time.
Indexing
The primary index used for hierarchy lookups is depth
+ position
.
- Using
depth
as the first index key allows for efficient segmenting of the dataset and aligns with most lookups trying to min/max records at a specific depth. *position
is the second index key to allow finding the min/max after the subset of records at the expected depth is selected
validity_start_date
and validity_end_date
are not in the index because once the Postgres has filtered by depth
and position
, there are not many records to walk through so there is not much performance benefit. There would be a memory cost though, because every entry in the index would also require 2 dates that will be much larger then the int
+ bigint
for depth
+ position
There are also 2 other indexes
-
goods_nomenclature_sid
- this allows for efficient JOINs to the goods_nomenclatures table -
oid
(unique) - this is theoid
from the indents table. Refreshing a materialized view concurrently (ie without blocking reads from the view) requires the materialized view to have a unique index.
Querying in SQL
Note: origin
record references the tree_nodes
record you are fetching relatives for, eg ancestors of origin
, or children of origin
Ancestors
These can be queried by fetching the maximium position
at every depth
, where -;
-
depth
is less than thedepth
of the origintree_node
record - and the
position
is less than theposition
of the origin record
Next sibling
The tree_nodes
record with
- same
depth
as the origin record - the lowest
position
that is still greater than the origin recordsposition
Previous sibling
Note: Due to how we read the tree this is less useful then next sibling
The tree_nodes
record with
- same
depth
as the origin record - and has the highest
position
that is still less than the origin recordsposition
Children
This is every tree_nodes
record where -;
- the child nodes
depth
is exactly 1 greater than thedepth
of the origin record - and the child nodes
position
is greater than theposition
of the origintree_nodes
record - and the child nodes
position
is less than theposition
of next sibling of the origin record
Descendents
This is every tree_nodes
record where -;
- the child nodes
depth
is greater than thedepth
of the origin record - and the child nodes
position
is greater than theposition
of the origintree_nodes
record - and the child nodes
position
is less than theposition
of next sibling of the origin record
Goods Nomenclatures
The above describes how tree_nodes
can be related to each other. To find relatives on goods_nomenclatures
records you can JOIN to tree_nodes
via goods_nomenclature_sid
, JOIN the tree_nodes
to themselves via position
and depth
, then in turn back onto goods_nomenclatures
via the relatives goods_nomenclature_sid
.
+----+ +----------+ +-----------+ +---------+
| GN | -> | Origin | -> | Related | -> | Related |
+----+ | TreeNode | | TreeNodes | | GNs |
+----------+ +-----------+ +---------+
Querying in Ruby
There are abstractions for all for the above encapsulated in the Sequel relationships on the GoodsNomenclatures
model
Note: These are all eager loadable
Tree relationships
-
parent
- returns the parent -
ancestors
- returns a list of ancestors - starting at root of tree -
children
- all immediate children -
descendants
- all descendants of a goods nomenclature, at any depth
Populators
GoodsNomenclature records loaded for one relationship are often relevant to others, so where possible the fetching a relationship will also populate related relationships on the data model. Eg, fetching #ancestors
will also populate the #parent
relationship since that is the closest of the ancestors.
-
ancestors
also populatesparent
on self and all ancestors -
descendants
also populates-
parent
for all descendants -
children
for self plus all descendants -
ancestors
for all descendants if self already has ancestors loaded
-
The above means you can get a nice recursive tree of children, so in the following example the first line will generate 2 queries and the second line will generate 0 queries.
chapter = Chapter.actual.by_code('01').eager(:descendants).take
chapter.children.first.children.first.children.first.parent.children.second
And if you eager load #ancestors
before #descendants
then that too will be populated, so the following example triggers 3 queries for line 1, and 0 for line 2 or any subsequent movement around the eager loaded hierarchy.
chapter = Chapter.actual.by_code('01').eager(ancestors, :descendants).take
chapter.children.first.children.first.children.map(&ancestors)
Useful methods
-
#leaf?
- tells you whether a Goods Nomenclature is branch or leaf (ie, no children). This benefits from eager loading (children) and the Populators (see above) -
#declarable?
- replacement for#declarable
but eager loadable, internally relies upon#leaf?
-
#number_indents
- if data is loaded via the nested set relationships then this is populated automatically without needing to eager loadgoods_nomenclature_indents
-
#depth
- internal reference for the depth of a goods nomenclature, normallynumber_indents
+ 2 except for chapters which arenumber_indents
+ 1 -
#goods_nomenclature_class
- this now utilises leaf? internally so benefits from eager loading#children
or#descendants
the same -
.declarable
- Dataset method to filter by only declarable goods nomenclatures - this does do a left join to check for child_nodes but it skips any rows which have children so shouldn't impact results -
.with_leaf_column
- Dataset method which includes a virtualleaf
column showing whether an record has any children. This can be utilised bydeclarable?
to determining declarability without requiring additional queries. Carries a performance cost but can provideleaf
for 24k commodities in ~0.5 seconds
Measures
There are two new eager loadable measures relationships
-
measures
- all measures directly on a goods nomenclature -
overview_measures
- the overview measures directly on a goods nomenclature
These are different from the existing #measures
and #overview_measures
because they only load measures directly referencing the goods nomenclature.
#measures
and #overview_measures
would also include the measures against the goods nomenclatures ancestors. This was changed because it allows for eager loading and avoids repeat loads of the same measures for all descendants of the goods nomenclature they apply to.
There are two model methods which replicate the old behaviour -;
-
#applicable_measures
- concatenation of measures against both self and ancestors -
#applicable_overview_measures
- concatenation of overview measures against both self and ancestors
Eager loading measures
The measures will need eager loading for both self and ancestors, eg
Chapter.actual
.by_code('01')
.eager(:measures, ancestors: :measures)
.take
This makes it possible to eager load measures for all descendants as well -;
Chapter.actual
.by_code('01')
.eager(:measures,
ancestors: :measures,
descendants: :measures)
.take
If you need to eager load relationships below measures, you'll need to duplicate parts of the eager load block with variables/constants, eg
MEASURE_EAGER = {
measures: [:measure_type,
{ measure_conditions: :measure_condition_code }]
}
Chapter.actual
.by_code('01')
.eager(MEASURE_EAGER,
ancestors: MEASURE_EAGER,
descendants: MEASURE_EAGER)
.take
Some examples
HeadingsService::PrecacheService.rb
HeadingsService::Serialization::NsNondeclarableService
Api::V2::ChaptersController