Refactoring otr4j
Sun, Jul 2, 2017 ❝Experiences refactoring and improving otr4j library.❞Contents
A while back I published the Object Oriented Programming series articles. As a use case to support this series, I’ve looked into the code base of the Java implementation of otr (off-the-record), called otr4j, and used it for verifying and validating ideas described in this series. This article describes the findings while refactoring and eventually improving otr4j.
The work discussed here can be found at https://gitlab.com/cobratbq/otr4j. (Most recent commit at time of writing c607f0d50b6e791a23be30ca7f123504a5bd4cf2)
Introduction
otr4j is a suitable project for refactoring with the added requirement of forming an overview of OOP-related issues, improvements and curiosities. A project that is too large, depending on how well it is modularized, may make you use all the mental energy for understanding the code base, leaving no room for observing the OOP interaction in the project. otr4j as a code base is small enough to be understood and evaluated at the same time. Going in, I was expecting to focus solely on OOP-like or non-OOP-like constructs. During refactoring it became clear that there was room for improving the library itself, and focus slowly shifted from purely refactoring to actual technical improvements and functional additions.
This article describes refactoring and improvements as a way to document the over 400 commits, some of which are very minor changes. During the whole process, I’ve taken an effort to make changes easily reviewable. This means that some minor changes do get their own commits to make changes transparent. Invariably there are some large commits, especially those that introduce the State pattern for Message states, Socialist Millionaire Protocol states and later on also Authenticated Key Exchange (AKE). See the OTRv3 specification for more information on messaging states, AKE states and socialist millionaire protocol (SMP) states.
Apart from syntactic changes, there are some semantic changes. Some of these semantic changes do result in modified behavior of otr4j, although only very minimally. Semantic changes are introduced in favor of a better and more stable protocol handling logic, and are documented here.
The process
The process I’ve started with consisted of 3 basic steps. From those basic steps, the rest followed naturally.
- Add braces to make scopes explicit.
- Make variables, fields and parameters
final
such that “moving parts” become more visible. - Observe code and add markers (FIXME, TODO) for any part of code that needs more attention in the future.
Steps 1 and 2 are used to generate some overview of complexity of the code base. Step 3 is crucial for tagging pieces of code that warrant future attention such that we can forget about them for now and keep a clear focus on the task at hand. As I was going through the initial process, I have read significant parts of the code base. However, every time I was able to focus on a single kind of improvement.
The goal is to refactor and improve the code base while preserving backward compatibility for as long as possible. We only trade off backward compatibility for soundness/correctness of the logic.
Next sections discuss the various kinds of improvements that were identified and performed. The sections are divided by type of improvement.
Results
This “engineering exercise” in refactoring the otr4j code base had the following results as a direct result:
- Discovered and fixed bugs in the code that verifies the computational components in the algorithm of the Socialist Millionaire Protocol. Specifically, fixed a deviation in the bounds checking code. The logic was off-by-one on both the upper and lower bound, for group elements as well as exponents.
- Call back to user code under controlled circumstances. This avoids bad user code from interfering with protocol message handling. In communication protocols we expect both endpoints to be in sync. We need to avoid unintended interruptions as they can interfere with message processing, causing the local protocol instance to become out-of-sync with the other endpoint.
- Deserialization of OTR-encoded data was refined to do extra size verification.
- Deserialization of OTR-encoded code had incorrect interpretation of (4-byte) integer-sized data, due to Java having only signed integers. Any integer of
0x80000000
and over would be interpreted as negative values and would be used as such in various pieces of code such as sizes for array allocations. - Error in interpretation of Instance tags due to Java’s signed integers. Over half of valid instance tags would be rejected as invalid. However, in the otr4j code, this issue would occur only for the local user’s instance tag.
- Made various values such as constants used in cryptographic logic final as to avoid modification. Additionally, reduced visibility of constants that are only relevant to the otr4j package itself, specifically arrays - which are always mutable - that are used for cryptographic purposes.
- Deleted completely unused source code.
- Formalized fact that communication protocols can have unexpected state changes. There is an intended race condition between message from remote users being processed and commands from the local user being processed.
IncorrectStateException
represents the call to a method at an incorrect time as would be possible when a state change occurred just before the method is executed. This is a checked exception to make it apparent that this is an event that realistically can happen and logic should be able to handle the exceptional cases when it occurs. - Better adherence to OTR specifications: any unexpected SMP message results in aborting SMP exchange.
- Restructuring of protocol implementations for Message state, Socialist Millionaire Protocol and Authenticated Key Exchange, represented as states from a state machine:
- Atomic state transitions.
- Logic bundled with data, isolated in a single state object, guarantees that we cannot misidentify currently active state.
- More efficient use of memory, since we only allocate memory relevant to active state.
- Most data stored in immutable fields, allowing for more straight-forward code.
- States correctly initialized on start.
- Introduction of
IncorrectStateException
checked exception to formalize possibility that message state transitions can happen ad-hoc. Calling a method in an incorrect state is not necessarily a programmer error. - Improved exception handling to prevent interrupting protocol message processing.
- Finalize methods that we need to be sure are executed as they are, for example to reset state and clear memory.
- Let methods that verify cryptographic data before use throw exceptions instead of returning result values. This way we cannot mistakenly forget to check the result or incorrectly interpret the result and we can be sure the process is interrupted when incorrect data is encountered.
- Fix bad logic when initiating new session. It would previously throw an exception if either OTRv2 or OTRv3 was disallowed, instead of when both OTRv2 and OTRv3 are disallowed.
- Improved on used return values by reducing unnecessary use of
null
. Now returns 0-sized arrays if suitable and resorts tonull
only if it actually makes sense. - In general, used more OOP-compliant code, for example to clear the authentication state and partially copy the authentication state from the master session.
- Extending static analysis capabilities by applying JSR-305 annotations
@Nonnull
and@Nullable
. - Added more tests.
Discussing the details
In this section we’ll go into the changes in more detail. Although most of the existing otr4j framework is in tact, I did make a number of small public API breaking changes. Let’s start with the trivial language features.
Language features
Explicit scopes
We add braces everywhere to make scopes explicit. This is done simply to prevent making mistakes when moving code around, by inadvertently overlooking an implicit end of scope.
final
variables, fields, parameters
Final variables make programs easy to read and predictable, since a variable is assigned exactly once. By making these variables final we can identify the “moving parts” in the application. Additionally, we have the advantage of having only to determine once whether a variable is (potentially) null
.
Apart from identifying variables that are used multiple times, it also uncovers variables/fields that should be used only once but aren’t, because final
cannot be applied without static analysis pointing out errors.
Among others, making important crypto-related constants final, such as the GENERATOR_S
constant from the Socialist Millionaire Protocol logic. Additionally, with more final fields we help static analyzers reach larger parts of the code.
final
methods, classes
We finalize methods that are not intended to be overridden. For this change we focus on methods that are expected to work as defined in the base class and results in problems if the base implementation logic is not executed.
- AuthContext:
reset()
is made final such that we can be sure that once reset is called, we have a clean resetted authentication context instance. - SM: All
proof...
andcheck...
methods. These methods are used to verify data after exchanges with the counter party. We must be sure that methods perform all the checks that we expect to verify exchange information that we rely on to verify the identity. - OtrCryptoEngine, and other utility classes: The classes contain only utility methods. There is no reason to extend from this class, so we finalize it to make it formal.
@Override
annotations
Everywhere where a method is overridden, the @Override
annotation is added. This ensures that if we ever change signatures, we detect overriding methods that have stopped overriding due to the method signature change.
Reducing visibility of constants
-
Reduce accessibility of methods that should only be needed in very local scope. (class or package)
-
The cryptographic logic contains a constant for the zero counter value. This is an array and was previously more visible than strictly necessary. Given that arrays are always mutable, it means that the zero counter could be redefined at any point in time.
Replace return codes with exceptions
For checking logic in the Socialist Millionaire Protocol, throw exceptions instead of returning a boolean value as a good/bad indicator. That way it is not possible to mistakenly forget to check the return value or to wrongly interpret the return value.
Extract constants for string literals
Extract constants for string literals that are used for requesting specific cryptographic algorithms, for example to be used as: key factories, hashes, HMAC factories, etc. These are defined as constants as they are fixed by the OTR specification.
Initialization and intelligent reduction of exceptions
The OTR specification is fixed. We have a set number of constants for cryptographic primitives. Therefore, we are able to catch a certain category of exceptions that are only relevant in the dynamic use case: NoSuchAlgorithmException. These exceptions are thrown in case we specify an unknown/invalid type of algorithm. However we know beforehand if the algorithm is available, as we only rely on a predefined fixed set. (That is, if the algorithm is unknown, it is a programmer mistake instead of a user mistake.)
To verify that the running JVM knows all required algorithms, we use a static initializer. The static initializer attempts instantiation, allowing us to discover missing features early. Throughout all further use of cryptographic logic, we can wrap the NoSuchAlgorithmException with an IllegalStateException as we can reasonably assume that the exception will not happen. This significantly reduces the amount of error handling.
Explicit serialVersionUID
Exceptions are required to be serializable. Originally, serialVersionUID
was almost never defined. As a consequence the JVM would on-the-fly generate an UID. The way this UID is generated, can be different for different JVMs and different versions of JVMs. By defining serialVersionUID
ourselves, we ensure that this is the actual UID that is used.
Discovering and fixing bugs
- Correctly close streams after processing. I.e. exactly once, instead of at every end of loop iteration or not at all.
- Remove private getters/setters that do not surve any purpose: by assigning to fields directly, we can make some variables
final
. - Catch most specific exceptions. If there are any remaining programming errors, we can find out now that we let more (unexpected) exceptions pass through. Furthermore, we avoid unnecessary wrapping of
OtrException
. - Improved thread-safety:
- Applied locking more consistently and in more relevant cases. (Collections of listeners, slave sessions, etc.)
- Copy listeners before processing to reduce time of locking.
- Made message state instance volatile such that we always have a guaranteed up-to-date message state.
- Initialize collections in fields to ensure they always exist. Guarantee non-null values in more cases. Simplifies code by removing null checks.
- Fix forgotten
break
in switch-statement that caused a warning to be shown in duplicate.
SMP: intermediate value verification
Fix incorrect handling of SMP values. Verification of exponent and group element were incorrect. Specifically, the upper and lower bound both were incorrectly checked resulting in an off-by-one in the verification logic.
Deserialization value cases
A bug in the deserialization of SMP communications causes data incorrectly interpreted as negative integer value causes exceptions during array allocation or array index addressing.
Integer value interpretation
Fix and document cases where 4-byte unsigned integer data is read from OTR communications. Currently, assumed integer limitation of Java Integer.MAX_VALUE
as maximum readable integer value. This is a limitation of otr4j and is documented. At this point there is no reason to jump through hoops in order to fix this limitation, as we are already working with very large values.
InstanceTag parsing
Fix OtrAssembler parsing code w.r.t. InstanceTag values. Integer.parseInt
only reads 31 bits. This code would reject more than 50% of valid InstanceTags. (This is a follow up of an earlier fix where InstanceTag values were generated in far too small range.)
Fixing encodings/locales
Fixed remaining unicode string handling issues, encoding-/Locale-related issues. Use of predefined encoding constants to simplify code and decrease amount of necessary exception handling. Furthermore, we ensure that the encoding/locale is specified everywhere, such that we are never OS-dependent for network communications / byte processing.
Fixing initiation of new session
Fix incorrect policy verification logic for starting a new session: only throw exception if both OTRv2 and OTRv3 are not supported. Previously threw an exception if either OTRv2 or OTRv3 was disallowed. This does not make sense, since we can perfectly operate with only one allowed protocol version as long as the other endpoint supports it.
Null handling
Fixed (for as far as possible) handling of null
in plain message. null
characters in message content are not allowed due to null being used as a separator of message and TLVs.
Missing verification of receiving counter
In order to reuse the same encryption key for multiple messages many ciphers rely on a counter value that is incremented. This counter ensures that the ciphertext is always slightly different even if everything else is the same, i.e. both secret key and plain text content. It is recommended to verify the counter value against all previous messages received with the same secret key. This functionality was missing.
Improving logic
Improved logging
Fixed logging statements to be (more) useful: log exceptions instead of e.printStackTrace()
, log exceptions with root cause, log more exceptions and add useful explanation to the nature of the cause.
Improved method arguments and return values
- Provide empty collections instead of
null
. - Return empty arrays when we are expected to return arrays. Reduce the amount of
null
values used in the code.
Better exception handling
Introduced checked exceptions for check...
methods of SMP logic. This way we need to actively work around the thrown exception in order to nullify a failed check. This prevents us from mistakenly forgetting to check the result.
Improved state transitions
Restructured code for Message state, AKE (Authenticated Key Exchange) and Socialist Millionaire Protocol to State pattern. As a welcome consequence we now group data with corresponding logic and we only need to instantiate fields when this particular state is created. We moved whole batches of logic into a specific state, reducing and simplifying the logic that is independent of the messaging state. Furthermore, many fields can become immutable (final
) for that particular state and do not need to be preserved over state transitions. State transitions become atomic. States themselves manage state transitions such that the states are self-sufficient.
Particularly for the Message state state implementations, we group all cryptographically-oriented logic with the message state encrypted, since that is the only state where this logic matters. Therefore we cannot mistakenly misidentify the state in which this logic can be active. This protects the programmer in case of enhancements. Some methods are only relevant in some states. In all other cases we signal with IncorrectStateException
.
Particularly to the Socialist Millionaire Protocol state implementations, for any unexpected message we immediately abort SMP and reset to initial state. This is the default implementation logic, available in the AbstractState
class. If no overriding implementation is provided by the particular state implementation, this will be the prevailing logic. Hence any unexpected method call results in an abort of the SMP exchange.
Redesigned slave session management
OTRv3 introduced support for slave sessions. This is the mechanism to cope with multiple chat clients responding to the same OTR query message. OTRv2 would do its best to perform authentication, but would get confusing/duplicate messages as more than 1 chat client would respond to the message that was sent to a “single user account”. By introducing slave sessions, it made it possible to distinguish between various chat clients on same user account.
Additionally, with (slave) session management now in place, it should even be possible to maintain an OTRv2 and multiple OTRv3 sessions simultaneously.
Improving semantics
Recognize multi-threading nature of protocol implementation
Introduction of IncorrectStateException
to formalize potential last-second state transitions. Given that this is a communication protocol, a state transition can happen at any moment as it might be initiated by the counter party. With the introduction of IncorrectStateException
we formalize those actions that may fail at the last moment due to an unexpected state transition.
Prevent courtesy calls to listeners from interfering with process
otr4j allows for an Engine Host and any number of listeners to be registered. These listeners will be informed at key events in the communication protocol. Interaction with listeners are improved in following ways:
- Listeners are called after crucial otr4j logic is executed: We first ensure correct and full execution before calling out to listeners, for as far as this is possible.
- Calls to listeners (EngineHost and other listeners) are executed safely. Faulty implementations of listeners that unexpectedly throw an exception will not be able to interfere with the execution of the protocol logic.
Any exceptions will be caught and logged to identify the faulty listener. Execution will proceed as normal, since the protocol is not expected to deviate from the specification in case of errors in the client. We ensure that clients at both ends are kept in sync regardless of client errors. - Given previous improvement, we can now be sure that resets/clean-ups are executed.
Trigger listeners after successful processing
State implementations are now completely self-sufficient. Triggering events to Engine Host and other interested listeners, will happen after the message is fully processed. This may result in slightly different behavior.
For one particular case, this difference is mitigated. This is the case for aborting an SMP exchange. To mitigate the change in behavior we let the abort call return false
if no state transition was necessary, i.e. no SMP exchange was previously in progress, or true
if an SMP exchange was previously in progress. We only inform listeners in case we actually interrupted an active SMP exchange.
Improved static analysis
In an earlier section we described how we use final
to create an overview of variables/fields/parameters used once and those that are updated and reused multiple times. Using final
also improves static analysis.
Nullability analysis
Introducing JSR-305. Using annotations @Nonnull
and @Nullable
to indicate guarantees regarding null
values. This helps analysis of potential null
-related bugs.
Miscellaneous
- Removed unused utility class once used in SM protocol implementation.
- Removed unused
main()
methods that were once used during initial development. - Remove 7 files with regard to
OtrKeyManager
. This is dead code that was completely disconnected from the other otr4j source code. - Removed implied modifiers in interface definitions.
- Some checked casting to satisfy static analysis. Current code relies on a data value to indicate the type of object. As we do not check the type before casting, this results in static analysis warnings. By checking before casting, we fix these warnings.
- Replaced obsolete collections: Vector, Hashtable.
- Verify references (checks for
null
) before adding them to collections. - Fix incorrect variables in logging, incorrect variable names.
- Short-hand notations for readability.
Future work
Some plans that are still brewing for improving otr4j (in order as they popped up in my mind):
- Integrate this work in jitsi’s otr plugin (and refine the otr plugin in the process).
- Investigate and implement otr version 4 support, as the spec seems to be stabilizing. There are some required dependencies still missing though.
- Hook otr4j up to a fuzzer and investigate edge case parsing/processing issues. There are bound to be edge cases (actually I think there are some known ones) that trigger otr4j logic for user messages.
- Implement a miniature “simulator” that sets up a basic server and connects two otr chat clients and lets them converse. This would allow us to do interoperability testing between otr implementations. A small script would be required to instruct client agents how to behave in order to perform automated chat actions. The runner can set up a trivial chat server in order to facilitate the conversation as well as mimic transport-level issues such as mangled and dropped packets.