Converting emails from top-posted to bottom-posted

October 29, 2017

I have been an adherent of bottom-posting, or, to be more precise, of interleaved-posting, for as long as I can remember. I don't recall why I selected this posting style when I first started sending emails; I guess it just felt more intuitive and natural.

These days I find top-posted messages so annoying to deal with that I manually convert them to bottom-posted before replying. Unfortunately, many modern email clients and web interfaces promote a top-posting style, so I end up performing such conversions more often than I would like. A few weeks ago I received an 8-level-deep top-posted email and decided that it was time for me to welcome our automated email conversion overlords. The result is the top2bottom command-line tool.

There are many forms of top-posting, but the one I have to deal with the most, and therefore the one that the top2bottom tool handles, looks like:

2017-10-28 B wrote:
> bbbbb
>
> bbbbb
>
> 2017-10-28 A wrote:
>
>> aaaaaa
>> aaaaaa
>>
>

The number of quote-prefix markers (>) in the beginning of each line indicates the quote-depth of that line.
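As a rough illustration, the quote-depth of a line can be computed by counting the markers in its leading prefix. This is a minimal Python sketch, not the actual top2bottom code; the function name `quote_depth` and the decision to tolerate spaces between markers (as in `> > text`) are assumptions.

```python
def quote_depth(line: str) -> int:
    """Return the number of '>' markers in the line's leading quote prefix."""
    # The prefix is the leading run of '>' and space characters (an assumption:
    # some clients write "> > text" instead of ">> text").
    prefix_length = len(line) - len(line.lstrip("> "))
    return line[:prefix_length].count(">")


print(quote_depth("2017-10-28 B wrote:"))  # unquoted line
print(quote_depth(">> aaaaaa"))            # line quoted twice
```

Only the leading prefix is inspected, so a `>` that appears later in the message content does not affect the result.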

This form consists of two kinds of line blocks:

attribution blocks, denoted by aD, where D is the quote-depth of the block
body blocks, denoted by bD, where D is the quote-depth of the block

In a top-posted message the blocks are interleaved:

b0, a0, b1, a1, b2, a2, ...

To convert such a message to bottom-posted we need to move all attribution blocks to the top in quote-depth order, followed by all body blocks in reverse quote-depth order:

a0, a1, a2, ..., b2, b1, b0
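The reordering can be sketched as a sort. The example below is only an illustration under assumed names: blocks are represented as hypothetical `(kind, depth)` tuples, and a key function is used in place of a comparison function, which is the more common idiom in Python.

```python
def block_order(block):
    """Sort key: attribution blocks first by increasing depth,
    then body blocks by decreasing depth."""
    kind, depth = block
    if kind == "a":
        return (0, depth)
    return (1, -depth)


# Blocks in the order they appear in a top-posted message.
top_posted = [("b", 0), ("a", 0), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
bottom_posted = sorted(top_posted, key=block_order)
print(bottom_posted)
# [('a', 0), ('a', 1), ('a', 2), ('b', 2), ('b', 1), ('b', 0)]
```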

The block reordering operation can be easily implemented by sorting with an appropriate comparison function. The more interesting and non-trivial part is figuring out which lines belong to which block. To do this, we first need to define some terms:

An empty line is a line that has no message content, although it may have one or more quote-prefix markers (>), i.e., a non-zero quote-depth.
A paragraph is a sequence of non-empty, consecutive lines of the same quote-depth.
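These two definitions translate fairly directly into code. The following is a minimal sketch with assumed helper names (`quote_depth`, `is_empty`, `paragraphs`), not code taken from top2bottom.

```python
from itertools import groupby


def quote_depth(line: str) -> int:
    """Number of '>' markers in the leading quote prefix."""
    prefix_length = len(line) - len(line.lstrip("> "))
    return line[:prefix_length].count(">")


def is_empty(line: str) -> bool:
    """True if nothing but quote markers and whitespace is on the line."""
    return line.lstrip("> ").strip() == ""


def paragraphs(lines):
    """Group consecutive non-empty lines of the same quote-depth."""
    def key(line):
        return (is_empty(line), quote_depth(line))

    return [list(group) for (empty, _), group in groupby(lines, key=key) if not empty]


message = [
    "2017-10-28 B wrote:",
    "> bbbbb",
    ">",
    "> bbbbb",
    ">",
    "> 2017-10-28 A wrote:",
    ">",
    ">> aaaaaa",
    ">> aaaaaa",
    ">>",
    ">",
]
for paragraph in paragraphs(message):
    print(paragraph)
```

For the example message shown earlier this yields five paragraphs, the last of which contains both `>> aaaaaa` lines.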

Now we can provide precise definitions for attribution and body blocks:

An attribution block of quote-depth D is the sequence of all consecutive lines of quote-depth D which start from the beginning of the last paragraph before a quote-depth increase from D to D+1, and end at the last line before the quote-depth increase.
A body block of quote-depth D is the sequence of all lines (not necessarily consecutive) of quote-depth D that do not belong to an attribution block.

Based on these definitions, I experimented with a number of classification algorithms. For my first attempt I used a chain of functional operations to process the message, but I found it unintuitive to express the intricacies of block detection in this manner. My next attempt was based on a finite state machine and performed a single pass over the message from bottom to top. The finite state machine approach worked, but it was more complex and difficult to reason about than I would have liked. Finally, I developed an algorithm that I found to be both simpler and more obviously correct than the alternatives.

The algorithm works by processing lines from top to bottom, marking them as body lines of their respective quote-depth. When a quote-depth increase is detected, and before processing the line with the increased quote-depth, the algorithm backtracks until it reaches the beginning of a paragraph, marking all lines it visits as attribution lines. It then continues processing from the point it stopped before backtracking.
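The backtracking step can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's actual implementation: the names `classify`, `quote_depth` and `is_empty` are hypothetical, labels are `(kind, depth)` tuples, and the special handling of the empty line before an attribution block is left out.

```python
def quote_depth(line: str) -> int:
    prefix_length = len(line) - len(line.lstrip("> "))
    return line[:prefix_length].count(">")


def is_empty(line: str) -> bool:
    return line.lstrip("> ").strip() == ""


def classify(lines):
    """Label every line as ('a', depth) or ('b', depth)."""
    labels = []
    prev_depth = 0
    for i, line in enumerate(lines):
        depth = quote_depth(line)
        if i > 0 and depth > prev_depth:
            # Quote-depth increase: backtrack over the lines just above.
            j = i - 1
            # First any empty lines directly before the increase...
            while j >= 0 and quote_depth(lines[j]) == prev_depth and is_empty(lines[j]):
                labels[j] = ("a", prev_depth)
                j -= 1
            # ...then up to the beginning of the last paragraph.
            while j >= 0 and quote_depth(lines[j]) == prev_depth and not is_empty(lines[j]):
                labels[j] = ("a", prev_depth)
                j -= 1
        # Every line starts out as a body line of its own depth.
        labels.append(("b", depth))
        prev_depth = depth
    return labels


message = [
    "2017-10-28 B wrote:",
    "> bbbbb",
    ">",
    "> bbbbb",
    ">",
    "> 2017-10-28 A wrote:",
    ">",
    ">> aaaaaa",
    ">> aaaaaa",
    ">>",
    ">",
]
for label, line in zip(classify(message), message):
    print(label, line)
```

On the example message, the two `wrote:` lines and the empty line after the second one are labelled as attribution lines, and everything else as body lines.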

The empty line (if any) just before an attribution block needs special care. It is an artifact of top-posting and serves to separate the preceding message body from the attribution lines. In top2bottom I opted to repurpose this line as an extra space after a quote-depth change in the bottom-posted output. This works well because top-posted replies usually don't start with an empty line, so adding this line makes the bottom-posted version look better. In algorithmic terms, this line is marked as belonging to a special before-body block (denoted by bD-), which is placed just before the bD block in the final output.
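To show where the before-body block ends up, here is a small sketch of the final ordering step. The labels are written by hand for the example message, the kind names `"a"`, `"b-"` and `"b"` are assumptions made for illustration, and the output is what this sketch produces rather than a guaranteed match for the tool's output.

```python
# Within the same depth, the before-body block precedes the body block.
KIND_RANK = {"b-": 0, "b": 1}


def output_order(item):
    """Sort key: a0, a1, ..., then for each depth from deepest to
    shallowest the before-body block followed by the body block."""
    kind, depth, _text = item
    if kind == "a":
        return (0, depth, 0)
    return (1, -depth, KIND_RANK[kind])


labelled = [
    ("a", 0, "2017-10-28 B wrote:"),
    ("b", 1, "> bbbbb"),
    ("b", 1, ">"),
    ("b", 1, "> bbbbb"),
    ("b-", 1, ">"),  # empty line just before the attribution block
    ("a", 1, "> 2017-10-28 A wrote:"),
    ("a", 1, ">"),
    ("b", 2, ">> aaaaaa"),
    ("b", 2, ">> aaaaaa"),
    ("b", 2, ">>"),
    ("b", 1, ">"),
]

# sorted() is stable, so lines keep their relative order within a block.
converted = [text for _, _, text in sorted(labelled, key=output_order)]
print("\n".join(converted))
```

The repurposed empty line lands between the end of the depth-2 quote and the first `> bbbbb` line, giving the extra space after the quote-depth change.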

Here is the algorithm in action:

[Figure: the top2bottom algorithm in action]

The algorithm works well if the input is top-posted, but what happens if the input has some other form? It turns out that the algorithm works well for two additional classes of messages that are not strictly top-posted and are often encountered in the wild. The first class consists of messages that end with a cascade of empty lines of decreasing quote-depth, which is a result of (inadvertently) adding empty lines to the end of a message when replying. The example used above to show the algorithm in action actually belongs to this class. The second class comprises messages that started as (strictly) bottom-posted, but for which the replies changed to top-posted at some point.

There is, however, a third class of messages that is also frequently seen, but that the algorithm described so far cannot handle without some enhancements. This class is composed of messages that started as interleaved-posted and then turned to top-posted. To handle this class we need a way to detect the section of the message that is interleaved-posted and ensure that we don't reorder its parts, but rather treat it as an indivisible whole.

Fortunately, there is an easy way to decide which lines of the message belong to the interleaved section. The quote-depth curve of interleaved messages has a characteristic pattern of multiple peaks (lines with the highest quote-depth locally) and valleys (lines with lower quote-depths between successive peaks). In a deeply interleaved message the base quote-depth of the interleaved section is the quote-depth of the lowest valley. All lines with quote-depth equal to or higher than the base quote-depth belong to the interleaved section of the message. In the following example, the lowest (and only) valley has a quote-depth of 2, so we treat the marked part of the message, which consists of all lines with quote-depth 2 or more, as indivisible:

2017-10-28 C wrote:      0
> ccccc                    1
>                          1
> 2017-10-28 B wrote:      1      ___
>> 2017-10-28 A wrote:       2     |
>>> aaaaa                      3   |
>>> aaaaa                      3   |
>>                           2     |
>> bbbbb                     2 ----|
>>                           2     |
>>> aaaaa                      3   |
>>                           2     |
>> bbbbb                     2    _|_
>                          1

The latest version of top2bottom implements the aforementioned enhancement, and, with it, can handle almost all of the top-posted messages of this form that I have encountered. The few messages which top2bottom refuses to convert are either badly formatted or use a different form of top-posting.
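One possible way to find the base quote-depth of the interleaved section from the quote-depth curve is sketched below. The post does not spell out how top2bottom detects peaks and valleys, so the run-based peak detection and the function name `interleaved_base_depth` are assumptions; the input is the depth column of the interleaved example above.

```python
from itertools import groupby


def interleaved_base_depth(depths):
    """Return the quote-depth of the lowest valley between successive
    peaks, or None if the curve has fewer than two peaks."""
    # Collapse consecutive equal depths so each run appears once.
    runs = [depth for depth, _ in groupby(depths)]
    # A peak is a run that is higher than both of its neighbours.
    peaks = [
        i for i, depth in enumerate(runs)
        if (i == 0 or runs[i - 1] < depth)
        and (i == len(runs) - 1 or runs[i + 1] < depth)
    ]
    if len(peaks) < 2:
        return None
    # The valley between two successive peaks is the lowest run between them.
    valleys = [min(runs[a + 1:b]) for a, b in zip(peaks, peaks[1:])]
    return min(valleys)


# Quote-depths of the lines in the interleaved example, top to bottom.
depths = [0, 1, 1, 1, 2, 3, 3, 2, 2, 2, 3, 2, 2, 1]
base = interleaved_base_depth(depths)
section = [i for i, depth in enumerate(depths) if depth >= base]
print(base)     # 2
print(section)  # [4, 5, 6, 7, 8, 9, 10, 11, 12]
```

A purely top-posted message has a single peak, so the function returns `None` and no part of the message needs to be treated as indivisible.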

I have been using top2bottom for a few weeks now and I am very pleased with it. I use Vim to compose and reply to emails, so my preferred way to convert emails is the Vim filter command: :%!top2bottom. I hope you find top2bottom useful!

A closing note: while doing some research for this post I came across another, much older program that also performs top-to-bottom email conversion, and which, unsurprisingly, is also named top2bottom. The code for it was posted in an old Red Hat list post. I have tried it on a few sample top-posted emails, but it doesn't seem to work well, at least not for the kinds of emails that I get. YMMV.