We were asked by Bitext, a company that specialises in large-scale text analysis using Deep Linguistics Analysis, to make a bespoke visualisation that would allow them to express the output of their text analytics service.
Their service analyses text and automatically extracts several attributes of the text, such as entities and concepts, as well as additional information that allows them to categorise sentences and paragraphs by their main topic. Bitext also provide sentiment analysis by identifying the topic or concept an opinion is about and detecting exactly which attributes or features of that topic are seen as positive or negative. As an alternative to a sentence-by-sentence, phrase-by-phrase visualisation, Bitext were keen to illustrate how the pieces of information present in the text relate to one another and how they could be summarised in a visually appealing, easy-to-understand way.
The challenge for us here at AltViz was to represent both relationships within the text and the number of times a certain attribute was present. We chose a node graph to represent the information.
However, given the large amount of data, we were faced with the issue of making a graph that was nice and easy to understand, while still representing all the information.
A graph is a visual representation of information, where the relationships between entities are of importance.
Entities are represented by graph 'nodes' (also known as 'vertices') and their relationships with other entities are represented pairwise by lines connecting them to each other that are referred to as 'edges'.
A node's neighbours are all other nodes connected to it by an edge.
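To make the terminology concrete, here is a minimal Python sketch that derives each node's neighbours from a list of edges. The function name `build_neighbours` and the example attribute names are our own illustrations, not part of the Bitext output.

```python
from collections import defaultdict


def build_neighbours(edges):
    """Map each node to the set of nodes it shares an edge with."""
    neighbours = defaultdict(set)
    for a, b in edges:
        # Edges are undirected, so each endpoint is a neighbour of the other.
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)


# Hypothetical example data: each pair is one edge.
edges = [("price", "hotel"), ("hotel", "staff"), ("staff", "friendly")]
neighbours = build_neighbours(edges)
print(sorted(neighbours["hotel"]))  # ['price', 'staff']
```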
Intuitively, we have an idea of what a 'nice, easy to understand' graph is. A formal definition could be a graph that follows these rules:
Force-directed graph drawing algorithms have been developed to draw graphs in an aesthetically pleasing way. Force-directed graphs are created by assigning repulsive and attractive forces to the set of nodes and the set of edges according to their relative positions. These forces are then used to simulate the motion of the nodes in space in order to find a configuration where the energy is minimised, which corresponds to a graph that looks nice and is easy to understand.
We use two classes in our implementation: Node and Graph. The Node class defines the characteristics of each node, in this case its position, and the Graph class holds all the nodes and edges.
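As a rough illustration of these two classes, the Python sketch below stores a position on each node and keeps nodes and edges on the graph. The field and method names (`add_node`, `add_edge`, `neighbours`) are assumptions made for this example rather than the names used in our implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A single node; here its only characteristic is its position."""
    name: str
    x: float = 0.0
    y: float = 0.0


@dataclass
class Graph:
    """Holds every node and every (undirected) edge."""
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: set = field(default_factory=set)    # frozenset({name_a, name_b})

    def add_node(self, name, x=0.0, y=0.0):
        # Re-adding an existing name returns the existing node unchanged.
        if name not in self.nodes:
            self.nodes[name] = Node(name, x, y)
        return self.nodes[name]

    def add_edge(self, a, b):
        self.add_node(a)
        self.add_node(b)
        if a != b:  # ignore self-loops in this sketch
            self.edges.add(frozenset((a, b)))

    def neighbours(self, name):
        found = set()
        for edge in self.edges:
            if name in edge:
                found |= edge - {name}
        return found
```

For example, after `g = Graph()`, `g.add_edge("hotel", "staff")` and `g.add_edge("hotel", "price")`, calling `g.neighbours("hotel")` returns the set containing `"staff"` and `"price"`.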
We use two forces in our model:
For each node in our graph, we calculate the combined repulsive and attractive forces exerted on it by the other nodes in the graph. For the repulsive force, we consider only nodes with an equal number of neighbours; for the attractive force, only those nodes that are its neighbours.
Then we multiply the sum of both forces on each node by a fixed parameter to obtain a measure of its 'velocity' and then update the current position of a node with the sum of its current position and its velocity. We also set a maximum value that the velocity can take, so that the nodes don't zoom about the graph too wildly.
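The update described above can be sketched in Python as follows. The exact force laws are assumptions made for illustration: an inverse-square repulsion and a linear, spring-like attraction. The parameter names (`repulsion`, `spring`, `damping`, `max_speed`) and their default values are likewise hypothetical.

```python
import math


def step(positions, edges, repulsion=1.0, spring=0.1, damping=0.5, max_speed=1.0):
    """Return new positions after one update.

    positions: dict of name -> (x, y); edges: iterable of (a, b) pairs.
    """
    neighbours = {name: set() for name in positions}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    updated = {}
    for n, (x, y) in positions.items():
        fx = fy = 0.0
        for m, (mx, my) in positions.items():
            if m == n:
                continue
            dx, dy = x - mx, y - my
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Repulsion (assumed inverse-square), applied only between
            # nodes with an equal number of neighbours, as in the text.
            if len(neighbours[m]) == len(neighbours[n]):
                fx += repulsion * dx / dist**3
                fy += repulsion * dy / dist**3
            # Attraction (assumed linear spring), applied only to neighbours.
            if m in neighbours[n]:
                fx -= spring * dx
                fy -= spring * dy
        # Velocity is the summed force times a fixed parameter, capped.
        vx, vy = damping * fx, damping * fy
        speed = math.hypot(vx, vy)
        if speed > max_speed:
            vx, vy = vx * max_speed / speed, vy * max_speed / speed
        updated[n] = (x + vx, y + vy)
    return updated
```

All new positions are computed from the old ones before any node moves, so the result does not depend on the order in which nodes are visited.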
Finally, after all nodes have had their positions updated we re-centre the graph and re-scale it to fit the display area.
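One way to re-centre and re-scale is to map the bounding box of the node positions onto the display area. The Python sketch below does this with a uniform scale so the layout is not distorted; the function name `fit_to_display` and the `margin` parameter are assumptions for this example.

```python
def fit_to_display(positions, width, height, margin=20.0):
    """Centre the layout and scale it uniformly to fit a width x height area."""
    xs = [x for x, _ in positions.values()]
    ys = [y for _, y in positions.values()]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Fall back to a span of 1.0 when all nodes share a coordinate.
    span_x = (max_x - min_x) or 1.0
    span_y = (max_y - min_y) or 1.0
    # Use the smaller scale so the graph fits in both directions.
    scale = min((width - 2 * margin) / span_x, (height - 2 * margin) / span_y)
    cx, cy = (min_x + max_x) / 2, (min_y + max_y) / 2
    return {
        name: (width / 2 + (x - cx) * scale, height / 2 + (y - cy) * scale)
        for name, (x, y) in positions.items()
    }
```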
So far we have an algorithm that updates the position of each node once. However, as the nodes move they can drift closer to other nodes, so we need to repeat the process until the system converges on a stable state.
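A simple convergence test, sketched below in Python, is to stop once no node moves more than a small tolerance in a single update. The stopping rule, the `tolerance` and `max_iterations` parameters, and the `step_fn` callback are all assumptions for illustration; `step_fn` stands in for whatever function performs one position update.

```python
import math


def run_until_stable(positions, step_fn, tolerance=1e-3, max_iterations=1000):
    """Apply step_fn repeatedly until the largest node movement is tiny.

    Returns the final positions and the number of iterations performed.
    """
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        updated = step_fn(positions)
        largest_move = max(
            (math.dist(positions[n], updated[n]) for n in positions),
            default=0.0,
        )
        positions = updated
        if largest_move < tolerance:
            break
    return positions, iteration


# Toy update used only to exercise the loop: every node moves halfway
# towards the origin on each call.
def halve(positions):
    return {n: (x / 2, y / 2) for n, (x, y) in positions.items()}


final, iterations = run_until_stable({"a": (8.0, 0.0)}, halve)
```

The cap on iterations guards against layouts that oscillate rather than settle.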
Having all the graph information (the position of all nodes and edges) at any point in time, we use a renderer to display the graph.
Now that we have an algorithm for a nice, easy to understand graph, we can use it to display our data.
We chose to make each attribute that is returned by the Bitext analytics API a node in the graph. If an attribute appears more than once in the text, instead of creating a new node, we increase the size of the existing node. This way, more frequent attributes are shown as bigger nodes.
We decided to use the sentence as our connecting structure. This means that attributes that appear in the same sentence in the text appear as connected nodes in the graph. If two attributes appear together in more than one sentence, the edge between them becomes thicker.
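The mapping from analysed text to node sizes and edge thicknesses can be sketched in Python as below. The input shape (a list of sentences, each a list of attribute strings) is an assumption for illustration and is not the actual format returned by the Bitext analytics API.

```python
from collections import Counter
from itertools import combinations


def build_weighted_graph(sentences):
    """Count attribute occurrences (node size) and co-occurrences (edge weight)."""
    node_size = Counter()
    edge_weight = Counter()
    for attributes in sentences:
        # Every occurrence of an attribute grows its node.
        node_size.update(attributes)
        # Each pair of distinct attributes in the same sentence shares an edge;
        # sorting gives each pair one canonical (a, b) key.
        for a, b in combinations(sorted(set(attributes)), 2):
            edge_weight[(a, b)] += 1
    return node_size, edge_weight


# Hypothetical attributes extracted from three sentences.
sentences = [["hotel", "staff"], ["hotel", "staff", "price"], ["hotel"]]
node_size, edge_weight = build_weighted_graph(sentences)
print(node_size["hotel"])                # 3
print(edge_weight[("hotel", "staff")])   # 2
```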
Finally, the Bitext analytics API categorises the attributes it analyses into groups. Nodes of the same colour represent attributes that belong to the same group. For example, the Bitext analytics API classifies the concepts extracted from the text into groups according to their part of speech (noun, verbal, adjectival, and adverbial), and the visualisation assigns a particular colour to each group.
Because rearranging the graph is an iterative process, there is no need to create all the nodes at once. They can also be added one by one while the model rearranges itself. As we move through the analysed text, new nodes are added as relevant attributes appear.
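A minimal Python sketch of this incremental approach follows. New nodes are given a random starting position (an assumption; any placement rule would do), and `relax` is a hypothetical callback that rearranges the layout given the current positions and edges.

```python
import random


def add_incrementally(sentences, relax, seed=0):
    """Yield (positions, edges) after each sentence is added and relaxed."""
    rng = random.Random(seed)
    positions, edges = {}, set()
    for attributes in sentences:
        # Only attributes not seen before become new nodes.
        for name in attributes:
            if name not in positions:
                positions[name] = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        # Connect every pair of attributes that share this sentence.
        unique = sorted(set(attributes))
        edges.update((a, b) for i, a in enumerate(unique) for b in unique[i + 1:])
        # Let the layout rearrange itself before the next sentence arrives.
        positions = relax(positions, edges)
        yield dict(positions), set(edges)
```

Each yielded pair is a snapshot that a renderer could draw as one frame, so the graph appears to grow as the text is read.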
The final result is a tool that visualises the data in a force-directed graph, which you can view in the short video below.