# Natural Language Toolkit: Tree Transformations
#
# Copyright (C) 20052007 Oregon Graduate Institute
# Author: Nathan Bodenstab <bodenstab@cslu.ogi.edu>
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
r"""
A collection of methods for tree (grammar) transformations used
in parsing natural language.
Although many of these methods are technically grammar transformations
(ie. Chomsky Norm Form), when working with treebanks it is much more
natural to visualize these modifications in a tree structure. Hence,
we will do all transformation directly to the tree itself.
Transforming the tree directly also allows us to do parent annotation.
A grammar can then be simply induced from the modified tree.
The following is a short tutorial on the available transformations.
1. Chomsky Normal Form (binarization)
It is well known that any grammar has a Chomsky Normal Form (CNF)
equivalent grammar where CNF is defined by every production having
either two nonterminals or one terminal on its right hand side.
When we have hierarchically structured data (ie. a treebank), it is
natural to view this in terms of productions where the root of every
subtree is the head (left hand side) of the production and all of
its children are the right hand side constituents. In order to
convert a tree into CNF, we simply need to ensure that every subtree
has either two subtrees as children (binarization), or one leaf node
(nonterminal). In order to binarize a subtree with more than two
children, we must introduce artificial nodes.
There are two popular methods to convert a tree into CNF: left
factoring and right factoring. The following example demonstrates
the difference between them. Example::
Original RightFactored LeftFactored
A A A
/  \ / \ / \
B C D ==> B A<CD> OR A<BC> D
/ \ / \
C D B C
2. Parent Annotation
In addition to binarizing the tree, there are two standard
modifications to node labels we can do in the same traversal: parent
annotation and Markov orderN smoothing (or sibling smoothing).
The purpose of parent annotation is to refine the probabilities of
productions by adding a small amount of context. With this simple
addition, a CYK (insideoutside, dynamic programming chart parse)
can improve from 74% to 79% accuracy. A natural generalization from
parent annotation is to grandparent annotation and beyond. The
tradeoff becomes accuracy gain vs. computational complexity. We
must also keep in mind data sparcity issues. Example::
Original Parent Annotation
A A^<?>
/  \ / \
B C D ==> B^<A> A<CD>^<?> where ? is the
/ \ parent of A
C^<A> D^<A>
3. Markov orderN smoothing
Markov smoothing combats data sparcity issues as well as decreasing
computational requirements by limiting the number of children
included in artificial nodes. In practice, most people use an order
2 grammar. Example::
Original No Smoothing Markov order 1 Markov order 2 etc.
__A__ A A A
/ /\ \ / \ / \ / \
B C D E F ==> B A<CDEF> ==> B A<C> ==> B A<CD>
/ \ / \ / \
C ... C ... C ...
Annotation decisions can be thought about in the vertical direction
(parent, grandparent, etc) and the horizontal direction (number of
siblings to keep). Parameters to the following functions specify
these values. For more information see:
Dan Klein and Chris Manning (2003) "Accurate Unlexicalized
Parsing", ACL03. https://www.aclweb.org/anthology/P031054
4. Unary Collapsing
Collapse unary productions (ie. subtrees with a single child) into a
new nonterminal (Tree node). This is useful when working with
algorithms that do not allow unary productions, yet you do not wish
to lose the parent information. Example::
A

B ==> A+B
/ \ / \
C D C D
"""
from nltk.tree import Tree
[docs]def collapse_unary(tree, collapsePOS=False, collapseRoot=False, joinChar="+"):
"""
Collapse subtrees with a single child (ie. unary productions)
into a new nonterminal (Tree node) joined by 'joinChar'.
This is useful when working with algorithms that do not allow
unary productions, and completely removing the unary productions
would require loss of useful information. The Tree is modified
directly (since it is passed by reference) and no value is returned.
:param tree: The Tree to be collapsed
:type tree: Tree
:param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie.
PartofSpeech tags) since they are always unary productions
:type collapsePOS: bool
:param collapseRoot: 'False' (default) will not modify the root production
if it is unary. For the Penn WSJ treebank corpus, this corresponds
to the TOP > productions.
:type collapseRoot: bool
:param joinChar: A string used to connect collapsed node values (default = "+")
:type joinChar: str
"""
if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1:
nodeList = [tree[0]]
else:
nodeList = [tree]
# depthfirst traversal of tree
while nodeList != []:
node = nodeList.pop()
if isinstance(node, Tree):
if (
len(node) == 1
and isinstance(node[0], Tree)
and (collapsePOS == True or isinstance(node[0, 0], Tree))
):
node.set_label(node.label() + joinChar + node[0].label())
node[0:] = [child for child in node[0]]
# since we assigned the child's children to the current node,
# evaluate the current node again
nodeList.append(node)
else:
for child in node:
nodeList.append(child)
#################################################################
# Demonstration
#################################################################
def demo():
"""
A demonstration showing how each tree transform can be used.
"""
from copy import deepcopy
from nltk import tree, treetransforms
from nltk.draw.tree import draw_trees
# original tree from WSJ bracketed text
sentence = """(TOP
(S
(S
(VP
(VBN Turned)
(ADVP (RB loose))
(PP
(IN in)
(NP
(NP (NNP Shane) (NNP Longman) (POS 's))
(NN trading)
(NN room)))))
(, ,)
(NP (DT the) (NN yuppie) (NNS dealers))
(VP (AUX do) (NP (NP (RB little)) (ADJP (RB right))))
(. .)))"""
t = tree.Tree.fromstring(sentence, remove_empty_top_bracketing=True)
# collapse subtrees with only one child
collapsedTree = deepcopy(t)
treetransforms.collapse_unary(collapsedTree)
# convert the tree to CNF
cnfTree = deepcopy(collapsedTree)
treetransforms.chomsky_normal_form(cnfTree)
# convert the tree to CNF with parent annotation (one level) and horizontal smoothing of order two
parentTree = deepcopy(collapsedTree)
treetransforms.chomsky_normal_form(parentTree, horzMarkov=2, vertMarkov=1)
# convert the tree back to its original form (used to make CYK results comparable)
original = deepcopy(parentTree)
treetransforms.un_chomsky_normal_form(original)
# convert tree back to bracketed text
sentence2 = original.pprint()
print(sentence)
print(sentence2)
print("Sentences the same? ", sentence == sentence2)
draw_trees(t, collapsedTree, cnfTree, parentTree, original)
if __name__ == "__main__":
demo()
__all__ = ["chomsky_normal_form", "un_chomsky_normal_form", "collapse_unary"]