public final class CharsToNameCanonicalizer
extends java.lang.Object
For optimal performance, usage pattern should be one where matches
should be very common (especially after "warm-up"), and as with most hash-based
maps/sets, that hash codes are uniformly distributed. Also, collisions
are slightly more expensive than with HashMap or HashSet, since hash codes
are not used in resolving collisions; that is, equals() comparison is
done with all symbols in same bucket index.
Finally, rehashing is also more expensive, as hash codes are not
stored; rehashing requires all entries' hash codes to be recalculated.
Reason for not storing hash codes is reduced memory usage, hoping
for better memory locality.
Usual usage pattern is to create a single "master" instance, and either use that instance in sequential fashion, or to create derived "child" instances, which after use, are asked to return possible symbol additions to master instance. In either case benefit is that symbol table gets initialized so that further uses are more efficient, as eventually all symbols needed will already be in symbol table. At that point no more Symbol String allocations are needed, nor changes to symbol table itself.
Note that while individual SymbolTable instances are NOT thread-safe (much like generic collection classes), concurrently used "child" instances can be freely used without synchronization. However, using master table concurrently with child instances can only be done if access to master instance is read-only (i.e. no modifications done).
Modifier and Type | Class and Description |
---|---|
(package private) static class |
CharsToNameCanonicalizer.Bucket
This class is a symbol table entry.
|
private static class |
CharsToNameCanonicalizer.TableInfo
Immutable value class used for sharing information as efficiently
as possible, by only require synchronization of reference manipulation
but not access to contents.
|
Modifier and Type | Field and Description |
---|---|
private CharsToNameCanonicalizer.Bucket[] |
_buckets
Overflow buckets; if primary doesn't match, lookup is done
from here.
|
private boolean |
_canonicalize
Whether any canonicalization should be attempted (whether using
intern or not.
|
private int |
_flags |
private boolean |
_hashShared
Flag that indicates whether underlying data structures for
the main hash area are shared or not.
|
private int |
_indexMask
Mask used to get index from hash values; equal to
_buckets.length - 1 , when _buckets.length is
a power of two. |
private int |
_longestCollisionList
We need to keep track of the longest collision list; this is needed
both to indicate problems with attacks and to allow flushing for
other cases.
|
private java.util.BitSet |
_overflows
Lazily constructed structure that is used to keep track of
collision buckets that have overflowed once: this is used
to detect likely attempts at denial-of-service attacks that
uses hash collisions.
|
private CharsToNameCanonicalizer |
_parent
Sharing of learnt symbols is done by optional linking of symbol
table instances with their parents.
|
private int |
_seed
Seed value we use as the base to make hash codes non-static between
different runs, but still stable for lifetime of a single symbol table
instance.
|
private int |
_size
Current size (number of entries); needed to know if and when
rehash.
|
private int |
_sizeThreshold
Limit that indicates maximum size this instance can hold before
it needs to be expanded and rehashed.
|
private java.lang.String[] |
_symbols
Primary matching symbols; it's expected most match occur from
here.
|
private java.util.concurrent.atomic.AtomicReference<CharsToNameCanonicalizer.TableInfo> |
_tableInfo
Member that is only used by the root table instance: root
passes immutable state info child instances, and children
may return new state if they add entries to the table.
|
private static int |
DEFAULT_T_SIZE
Default initial table size.
|
static int |
HASH_MULT |
(package private) static int |
MAX_COLL_CHAIN_LENGTH
Also: to thwart attacks based on hash collisions (which may or may not
be cheap to calculate), we will need to detect "too long"
collision chains.
|
(package private) static int |
MAX_ENTRIES_FOR_REUSE
Let's only share reasonably sized symbol tables.
|
private static int |
MAX_T_SIZE
Let's not expand symbol tables past some maximum size;
this should protected against OOMEs caused by large documents
with unique (~= random) names.
|
Modifier | Constructor and Description |
---|---|
private |
CharsToNameCanonicalizer(CharsToNameCanonicalizer parent,
int flags,
int seed,
CharsToNameCanonicalizer.TableInfo parentState)
Internal constructor used when creating child instances.
|
private |
CharsToNameCanonicalizer(int seed)
Main method for constructing a root symbol table instance.
|
Modifier and Type | Method and Description |
---|---|
private java.lang.String |
_addSymbol(char[] buffer,
int start,
int len,
int h,
int index) |
private java.lang.String |
_findSymbol2(char[] buffer,
int start,
int len,
CharsToNameCanonicalizer.Bucket b) |
private void |
_handleSpillOverflow(int bindex,
CharsToNameCanonicalizer.Bucket newBucket) |
int |
_hashToIndex(int rawHash)
Helper method that takes in a "raw" hash value, shuffles it as necessary,
and truncates to be used as the index.
|
private static int |
_thresholdSize(int hashAreaSize) |
int |
bucketCount()
Method for checking number of primary hash buckets this symbol
table uses.
|
int |
calcHash(char[] buffer,
int start,
int len)
Implementation of a hashing method for variable length
Strings.
|
int |
calcHash(java.lang.String key) |
int |
collisionCount()
Method mostly needed by unit tests; calculates number of
entries that are in collision list.
|
private void |
copyArrays()
Method called when copy-on-write is needed; generally when first
change is made to a derived symbol table.
|
static CharsToNameCanonicalizer |
createRoot()
Method called to create root canonicalizer for a
JsonFactory
instance. |
protected static CharsToNameCanonicalizer |
createRoot(int seed) |
java.lang.String |
findSymbol(char[] buffer,
int start,
int len,
int h) |
int |
hashSeed() |
CharsToNameCanonicalizer |
makeChild(int flags)
"Factory" method; will create a new child instance of this symbol
table.
|
int |
maxCollisionLength()
Method mostly needed by unit tests; calculates length of the
longest collision chain.
|
boolean |
maybeDirty() |
private void |
mergeChild(CharsToNameCanonicalizer.TableInfo childState)
Method that allows contents of child table to potentially be
"merged in" with contents of this symbol table.
|
private void |
rehash()
Method called when size (number of entries) of symbol table grows
so big that load factor is exceeded.
|
void |
release()
Method called by the using code to indicate it is done with this instance.
|
protected void |
reportTooManyCollisions(int maxLen) |
int |
size() |
public static final int HASH_MULT
private static final int DEFAULT_T_SIZE
private static final int MAX_T_SIZE
static final int MAX_ENTRIES_FOR_REUSE
static final int MAX_COLL_CHAIN_LENGTH
Note: longest chain we have been able to produce without malicious intent has been 38 (with "com.fasterxml.jackson.core.main.TestWithTonsaSymbols"); our setting should be reasonable here.
Also note that value was lowered from 255 (2.3 and earlier) to 100 for 2.4
private final CharsToNameCanonicalizer _parent
release
),
parent's shared tables may be updated from the child instance.private final java.util.concurrent.atomic.AtomicReference<CharsToNameCanonicalizer.TableInfo> _tableInfo
private final int _seed
private final int _flags
private boolean _canonicalize
NOTE: non-final since we may need to disable this with overflow.
private java.lang.String[] _symbols
private CharsToNameCanonicalizer.Bucket[] _buckets
Note: Number of buckets is half of number of symbol entries, on assumption there's less need for buckets.
private int _size
private int _sizeThreshold
private int _indexMask
_buckets.length - 1
, when _buckets.length is
a power of two.private int _longestCollisionList
private boolean _hashShared
This flag needs to be checked both when adding new main entries, and when adding new collision list queues (i.e. creating a new collision list head entry)
private java.util.BitSet _overflows
private CharsToNameCanonicalizer(int seed)
private CharsToNameCanonicalizer(CharsToNameCanonicalizer parent, int flags, int seed, CharsToNameCanonicalizer.TableInfo parentState)
private static int _thresholdSize(int hashAreaSize)
public static CharsToNameCanonicalizer createRoot()
JsonFactory
instance. Root instance is never used directly; its main use is for
storing and sharing underlying symbol arrays as needed.protected static CharsToNameCanonicalizer createRoot(int seed)
public CharsToNameCanonicalizer makeChild(int flags)
Note: while this method is synchronized, it is generally not safe to both use makeChild/mergeChild, AND to use instance actively. Instead, a separate 'root' instance should be used on which only makeChild/mergeChild are called, but instance itself is not used as a symbol table.
public void release()
private void mergeChild(CharsToNameCanonicalizer.TableInfo childState)
Note that caller has to make sure symbol table passed in is really a child or sibling of this symbol table.
public int size()
public int bucketCount()
public boolean maybeDirty()
public int hashSeed()
public int collisionCount()
size()
- 1), but should usually be much lower, ideally 0.public int maxCollisionLength()
size()
- 1 in the pathological casepublic java.lang.String findSymbol(char[] buffer, int start, int len, int h)
private java.lang.String _findSymbol2(char[] buffer, int start, int len, CharsToNameCanonicalizer.Bucket b)
private java.lang.String _addSymbol(char[] buffer, int start, int len, int h, int index)
private void _handleSpillOverflow(int bindex, CharsToNameCanonicalizer.Bucket newBucket)
public int _hashToIndex(int rawHash)
public int calcHash(char[] buffer, int start, int len)
len
- Length of String; has to be at least 1 (caller guarantees
this pre-condition)public int calcHash(java.lang.String key)
private void copyArrays()
private void rehash()
protected void reportTooManyCollisions(int maxLen)