Efficient saving of multi-source buffers

Preface

Several years ago I was working on libbls, a library implementing editable buffers that can efficiently hold contents originating from both memory and file sources. Typical clients of this library are programs that need to edit arbitrary portions of very large files, like hex editors or wave editors, while minimizing the required memory usage.

One of the more interesting problems I encountered was devising an efficient way to save buffers to their corresponding files in place, while minimizing the extra memory (or disk space) used in the process. This post describes how this problem, which turns out to be NP-hard, can be modeled as a graph, and how we can provide a reasonable solution by using a variation of a standard graph algorithm.

Introduction

Buffers in libbls are described in terms of segments that contain data from either memory or file sources. In order to conserve memory (and sometimes to even make the editing feasible) the data from file sources is not loaded into main memory. Instead, the buffer keeps information about the contained files and ranges and reads the data in vm-page-sized chunks when needed. So, a buffer could look like:

B:  |F1:0-9   |M1:10-18|F1:10-19 |M2:10-19|F2:0-6|

Where F1 and F2 are file sources, M1 and M2 are memory sources and S:B-E denotes a segment representing bytes from the range [B, E] of source S.
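
To make the discussion below a bit more concrete, here is a rough sketch of such a data model in Rust. The names and types are illustrative only, and not the actual libbls API (libbls itself is written in C):

use std::path::PathBuf;
use std::rc::Rc;

enum Source {
    Memory(Vec<u8>),
    File(PathBuf),
}

struct Segment {
    source: Rc<Source>,
    start: u64, // first byte of the range in the source (inclusive)
    end: u64,   // last byte of the range in the source (inclusive)
}

// A buffer is just an ordered sequence of segments; reading the buffer
// means reading each segment's range from its source, in order.
struct Buffer {
    segments: Vec<Segment>,
}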

In the simple case, saving a buffer to a file consists of just reading data from the various sources and writing them out to the target file. Things become more complicated when the target file happens to be one of the file sources used in the buffer.

This post aims to illustrate the issues that can arise in such cases and propose an elegant way to resolve them.

Illustrating the problem

Saving the buffer to a file typically involves reading each segment in order and writing it out to the file. However, if the segment source is the same as the file that we are writing to, and depending on the ordering of segments, we may face a serious problem.

To illustrate the problem, consider a buffer B containing segment [0-9] of file F at range [10-19] of the buffer, and assume that we want to write the buffer back to file F:

    S
F:  |0-9      |.........|

              S             
B:  |.........|F:0-9    |
    ^         ^         ^
    0         10        20

Writing the buffer segments in order leads to a situation where we first write out buffer range [0-9] to file F, overwriting the contents of the file at that range. So, when moving on to the next segment, trying to write out the range [10-19] of the buffer, the data we are going to read from the file (and therefore write) is not the original data of [F:0-9], but the data belonging to the previous segment of the buffer we just wrote at positions [0-9].

The solution in this case is to first write out range [10-19] of the buffer to the file, so that the data we use is the original file data, and only then write out any other segments.

The problem can become even more complicated when we have more segments from the target file. A particularly interesting case arises when a segment from the buffer which belongs to the target file happens to overlap with another segment of F which is also present in the buffer:

    S                   T
F:  |0-9      |.........|20-29    |.........|
                      
            T                     S 
B:  |.......|F:20-29  |...........|F:0-9    |
    ^       ^         ^           ^
    0       8         18          30

In this case segment S in F overlaps with segment T in B. We can try to adapt the strategy used in the previous case and first write out the two target file segments. This works, but only if we are extra careful: if we first write segment T, then when we try to read the data of segment S we will read wrong data (file range 8-9 will contain data from segment T). If we do it the other way around everything works wonderfully.

Taking the last example one step further, consider what happens if we have cyclic overlaps:

    S                   T
F:  |0-9      |.........|20-29    |....|

            T                S 
B:  |.......|F:20-29  |......|F:0-9    |
    ^       ^         ^      ^     ^
    0       8         18     25    30

Now segment S in F overlaps with segment T in B, and segment T in F overlaps with segment S in B. In this case there is no way to order the writes of segments S and T so that we end up with the correct data.

A straightforward but wasteful way of tackling this problem is to save the whole buffer to a temporary file and then move the file to its final location. This works, and in some cases is preferred since it has the benefit of atomicity, but requires extra space for the temporary file. When buffers reach many GiB in size this method may prove to be unfeasible.

A more efficient method is to try to eliminate the cyclic overlaps by removing at least one of the overlaps involved. In the previous case we can replace segment T in B with two segments M and T1 so that no overlap occurs. Segment M will contain a part of T's data, but will actually store it in memory (or in a temporary file if necessary), and the T1 segment will just be T minus the range stored in M. This approach still requires some amount of extra memory, but in most cases this amount is much smaller than the size of the whole buffer. Using this scheme we have:

    S                     T1
F:  |0-9      |...........|22-29  |....|

            M T1              S 
B:  |.......|.|F:22-29|......|F:0-9    |
    ^       ^         ^      ^     ^
    0       8         18     25    30

We have managed to eliminate the first overlap and now we can safely save the buffer by first writing T1 then S and then the remaining segments of B.

Modeling the problem

The examples presented in the previous section show representative but simple cases of the problems we can run into. In order to be able to handle cases of arbitrary complexity it is useful to create a model of the problem. We can then use this model to come up with algorithms that provide generic solutions.

The model we are going to present uses graphs. In these overlap graphs, vertices represent segments from the file we are going to save to that are also present in the buffer we are going to save. Edges between vertices denote segment overlap: an edge from vertex P to vertex Q denotes that the range of segment P in the target file overlaps with the range that segment Q occupies in the buffer, i.e., the range Q will be written to. An edge from P to Q therefore expresses an ordering constraint: P must be written out before Q, otherwise writing Q would overwrite data that P still needs. Every edge carries with it the size of the overlapping range.

Here are some buffers and their respective overlap graphs (P* denotes a self-loop on P):

    P              Q                   R
F:  |0-9      |....|15-24    |.........|35-42  |

                             P
B1: |........................|F:0-9    |.......|   P

              P              Q
B2: |.........|F:0-9     |...|F:15-24  |.......|   Q --> P

         P                Q
B3: |....|F:0-9    |......|F:15-24  |..........|   P* Q*

    R        P                    Q
B4: |F:35-42|F:0-9    |...........|F:15-24  |..|   R <-- P* <-- Q
                                                    \___________^
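
In code, both the existence and the size of an overlap reduce to simple arithmetic on inclusive ranges. Here is a small illustrative sketch (in Rust, with names of my choosing; this is not the actual libbls code):

// Inclusive byte ranges, matching the S:B-E notation used above.
#[derive(Clone, Copy)]
struct Range {
    start: u64,
    end: u64, // inclusive
}

fn ranges_overlap(a: Range, b: Range) -> bool {
    a.start <= b.end && b.start <= a.end
}

// Size of the overlap between two inclusive ranges (0 if they are disjoint).
fn overlap_size(a: Range, b: Range) -> u64 {
    if ranges_overlap(a, b) {
        a.end.min(b.end) - a.start.max(b.start) + 1
    } else {
        0
    }
}

An edge P -> Q then exists exactly when overlap_size(P's file range, Q's buffer range) is non-zero, and that size is the edge's weight.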

Solving the problem

Using the overlap graph we can now answer the question: in what order should the vertices (segments) be written to the file so that the correct data is written?

If the graph has no cycles then we can process the vertices in topological order. Topological order guarantees that each vertex is processed only when its dependencies have already been processed and written to file, and thus that no unwritten file segment in the buffer overlaps with the destination range of that vertex.

As a special case, we can process a vertex that has a self-loop but no other incoming edge. This is achieved by first writing to the file the non-overlapping part of the vertex and then the rest.

In order to handle graphs with cycles (except self-loops) we must find a way to break the cycles. This can be achieved by removing the correct edges. An edge from P to Q can be removed by replacing Q in the buffer by the segments Q1 and M as described previously. M contains an in memory copy of data of the overlapping range. Q1 is the part of Q that remains. This technique eliminates the overlap (because the overlapping part is now stored in memory) and removes the edge. By choosing the edges to remove wisely, we can minimize the amount of extra memory we are going to need.
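
As an illustration, here is a sketch of how breaking an edge could look, reusing the Segment and Source types from the earlier sketch. For simplicity it assumes the overlapping range sits at the start of Q, as in the T/M/T1 example above; a complete implementation would handle an overlap at any position within the segment:

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::rc::Rc;

// Break an overlap by splitting a file-backed segment Q into a memory
// segment M (holding the overlapping bytes) and a shrunk file segment Q1.
fn break_overlap(q: &Segment, overlap_len: u64) -> std::io::Result<(Segment, Segment)> {
    let path = match &*q.source {
        Source::File(path) => path,
        Source::Memory(_) => unreachable!("only file segments take part in the graph"),
    };

    // Read the overlapping bytes into memory; they become segment M.
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(q.start))?;
    let mut data = vec![0u8; overlap_len as usize];
    file.read_exact(&mut data)?;

    let m = Segment {
        source: Rc::new(Source::Memory(data)),
        start: 0,
        end: overlap_len - 1,
    };
    // Q1 is Q minus the range that is now stored in memory.
    let q1 = Segment {
        source: Rc::clone(&q.source),
        start: q.start + overlap_len,
        end: q.end,
    };
    Ok((m, q1))
}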

Once we have an acyclic graph (self-loops allowed) we can process it as described previously.

Let's manually apply this method to an example:

    P              Q                   R
F:  |0-9      |....|15-24    |.........|35-42  |

    R        P                    Q
B4: |F:35-42|F:0-9    |...........|F:15-24  |..|   R <-- P* <-- Q
                                                    \___________^

We first have to make the graph acyclic. The cycle Q -> P -> R -> Q gives us a few choices for which edge to break: Q->P with an overlap of 3 bytes, R->Q with an overlap of 5 bytes, or P->R with an overlap of 8 bytes.

To minimize the required extra memory we break the Q->P edge by storing the last three bytes of P in memory as segment M, so we get:

    P1             Q                   R
F:  |0-6   |.......|15-24    |.........|35-42  |

    R       P1     M              Q
B4: |F:35-42|F:0-6 |..|...........|F:15-24  |..|   Q <-- R <-- P1

Note that coincidentally we also resolved the self-loop of P. Now that we have an acyclic graph we can process the buffer segments in topological order, thus we first store P1 in range [8-14] of the file, next is R at [0-7] and finally Q at range [30-39], followed by all other portions of the buffer.

Implementation Details

libbls implements the solution described above when saving a file in place. The high level logic for the algorithm resides in bless_buffer_save. The implementation of the algorithm within that function can be conceptually split into three steps. Some implementation details for each step are given below, along with links to the functions that provide the relevant functionality.

Step 1: Creating the overlap graph

Relevant functions:

buffer_file.c: create_overlap_graph
overlap_graph.c: overlap_graph_add_segment

To create the overlap graph we have to scan the buffer segment collection for segments that belong to the file we want to save to. Each new segment that is found must be checked against all previously found file segments for overlap. This is an O(n + k*f(k)) algorithm, where n is the number of segments in the segment collection and k the number of segments belonging to the file. If we check for overlap naively then f(k) = O(k) and therefore the algorithm is O(n + k*k). We can use an interval tree so that f(k) = O(log k) and the algorithm becomes O(n + k*log k). As we don't expect k to grow too large, the implementation currently uses the naive approach for simplicity.
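
Here is an illustrative sketch of the naive construction, reusing the Range helpers from the earlier snippet (the vertex and graph types are, again, mine and not the actual libbls ones):

// Each vertex is a file-backed segment: its source range in the target file
// and its destination range (i.e. its position in the buffer).
struct Vertex {
    file_range: Range,
    buffer_range: Range,
}

struct OverlapGraph {
    vertices: Vec<Vertex>,
    edges: Vec<(usize, usize, u64)>, // (from, to, overlap size); self-loops allowed
}

// Naive O(k^2) construction: check every pair of file segments. An edge
// P -> Q exists when P's data in the file overlaps Q's destination range.
fn create_overlap_graph(vertices: Vec<Vertex>) -> OverlapGraph {
    let mut edges = Vec::new();
    for (p, vp) in vertices.iter().enumerate() {
        for (q, vq) in vertices.iter().enumerate() {
            let size = overlap_size(vp.file_range, vq.buffer_range);
            if size > 0 {
                edges.push((p, q, size));
            }
        }
    }
    OverlapGraph { vertices, edges }
}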

Step 2: Making the graph acyclic

Relevant functions:

overlap_graph.c: overlap_graph_remove_cycles

In the second step the goal is to remove cycles from the graph, while trying to minimize the total weight of the removed edges in order to minimize extra memory usage. However, it turns out that the problem we are trying to solve, minimum feedback arc set, is NP-hard! Don't despair though; we can get reasonable results with only a little extra work.

The undirected counterpart of the minimum feedback arc set problem is the minimum (or maximum) spanning tree (MST) problem, which can be solved efficiently using various methods. We can use these methods to give a solution (although not the best one) for our problem, too. We can do that because MST algorithms remove all undirected cycles, and therefore all directed ones, too. The caveat is that they remove more than we need them to: they also break undirected cycles that are not directed ones.

MST algorithms work by processing edges in non-decreasing or non-increasing order of weight, adding each edge to the already formed tree (Prim) or forest (Kruskal) as long as its endpoints aren't already connected by an undirected path. Because our goal is only to avoid directed cycles, we can be more lax with the rules for adding an edge. A simple but useful observation is that if the destination node of an edge has no outgoing edges then a directed cycle cannot be formed, regardless of whether there is a pre-existing undirected path between the two nodes. Similarly, if the source node of an edge has no incoming edges a directed cycle cannot be formed either, regardless of the pre-existence of an undirected path between the source and destination nodes.

We can use the previous observations to decrease the number of edges that are incorrectly deleted by the MST algorithms when acting on directed graphs. Although we don't completely eliminate the erroneous deletions, these rules give reasonable results while keeping the complexity the same as in the undirected case.

Of the MST algorithms, Kruskal's algorithm can be used unchanged in directed graphs because it works directly on edges and doesn't care about their direction. It also works on disconnected graphs, since it finds a spanning forest. For these reasons, it was selected as the basis for the algorithm used by libbls.
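
Below is an illustrative sketch of such a Kruskal-based cycle removal; it is not the actual overlap_graph_remove_cycles code, but it follows the same idea: process the edges heaviest first, accept an edge if the standard MST rule or one of the two relaxed rules allows it, and mark the rest for removal:

// Plain union-find over vertex indices, used for the undirected
// connectivity check of the standard MST rule.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p == x {
            x
        } else {
            let root = self.find(p);
            self.parent[x] = root; // path compression
            root
        }
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        self.parent[ra] = rb;
    }
}

// Decide which edges to remove. Edges are processed heaviest first, so the
// edges that end up being removed (the ones failing all acceptance rules)
// tend to be the light ones, keeping the required extra memory low.
fn edges_to_remove(
    num_vertices: usize,
    mut edges: Vec<(usize, usize, u64)>, // (from, to, weight)
) -> Vec<(usize, usize, u64)> {
    edges.sort_by(|a, b| b.2.cmp(&a.2));

    let mut uf = UnionFind::new(num_vertices);
    let mut out_degree = vec![0u32; num_vertices];
    let mut in_degree = vec![0u32; num_vertices];
    let mut removed = Vec::new();

    for (u, v, w) in edges {
        if u == v {
            continue; // self-loops are kept; they are resolved at write time
        }
        let accept = uf.find(u) != uf.find(v) // standard MST rule
            || out_degree[v] == 0 // relaxed rule: destination has no outgoing edge
            || in_degree[u] == 0; // relaxed rule: source has no incoming edge
        if accept {
            uf.union(u, v);
            out_degree[u] += 1;
            in_degree[v] += 1;
        } else {
            removed.push((u, v, w));
        }
    }
    removed
}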

Step 3: Removing edges and saving to file

Relevant functions:

overlap_graph.c: overlap_graph_get_removed_edges
buffer_file.c: break_edge
buffer_util.c: segcol_store_in_memory, segcol_store_in_file
overlap_graph.c: overlap_graph_get_vertices_topo

In the previous step we calculated which edges would need to be removed to create an acyclic graph. In this step we first perform the actual removal of these edges by storing the associated data in memory or a temporary file as described previously.

Removing the edges has the unfortunate side effect of further altering the graph, as segments are removed, added or changed in the segment collection. This means that after removing the edges we must re-calculate the overlap graph for the segment collection. There may be some way to avoid re-calculating the whole graph, making only local changes to the existing graph as vertices are altered, but we will use the simple (although more costly) way for now.

We then use this new overlap graph, which is guaranteed not to have any cycles, to get the vertices in topological order and save them to the target file.
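
As a sketch, this final pass could use Kahn's algorithm to produce the topological order, writing each segment as it becomes ready (OverlapGraph is the illustrative type from before, and write_segment stands in for the actual file I/O, including the two-step write for a vertex with a self-loop):

// Kahn's algorithm over the (acyclic, self-loops aside) overlap graph.
fn write_in_topo_order<F: FnMut(usize)>(g: &OverlapGraph, mut write_segment: F) {
    let n = g.vertices.len();
    let mut in_degree = vec![0usize; n];
    let mut successors: Vec<Vec<usize>> = vec![Vec::new(); n];

    for &(u, v, _) in &g.edges {
        if u == v {
            continue; // self-loops don't constrain the order between vertices
        }
        successors[u].push(v);
        in_degree[v] += 1;
    }

    // Vertices with no incoming edges can be written immediately.
    let mut ready: Vec<usize> = (0..n).filter(|&v| in_degree[v] == 0).collect();

    while let Some(u) = ready.pop() {
        write_segment(u);
        for &v in &successors[u] {
            in_degree[v] -= 1;
            if in_degree[v] == 0 {
                ready.push(v);
            }
        }
    }
}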

mda-rs: Custom Mail Delivery Agents in Rust

I had been stubbornly using procmail for two decades to filter my email, before deciding that it's finally time for a change. Procmail served me well through the years, but, little by little, the annoyances started adding up. The fact that procmail has been unmaintained since 2001 inspired less and less confidence as the years passed, and the requirement in many cases for external processing before any useful matching could be performed (e.g., dealing with MIME content transfer encoded data, or non us-ascii character sets) was a constant pain point. As the complexity of my rules grew, I even had to resort to an external program (a custom C++ program in my case) to perform some of the mailbox selection logic, using the AUTOFOLDER feature of procmail.

At that point, and given all the functionality that I had to implement myself, I seriously started questioning the value procmail was providing to me and started looking for alternatives. I evaluated fdm and maildrop, finding in both a lot that I liked; first and foremost a less arcane filtering language. In the end, I found maildrop to be a closer match to my preferences and requirements, and I especially appreciated the automatic MIME content decoding.

I briefly considered switching to maildrop, but a few experiments on my set of filtering rules indicated that maildrop's performance was significantly worse compared to procmail's, even though with procmail I had to call out to a few more external programs to achieve similar functionality. I also gave fdm a go, and it was even slower than maildrop. I receive a lot of emails each day, mostly from various FOSS mailing lists, and the performance penalty adds up. While waiting a few dozen extra seconds daily wouldn't have been the end of the world, the thought that my original and much older solution was faster disturbed me.

Meanwhile, I noticed that the control flow statements in maildrop's filtering language were similar to those of a general purpose programming language, and, in fact, with the integration with regular expressions, they looked a bit like perl. So, I thought, what if instead of using a domain specific language I could use a general purpose language, augmented by some kind of library to implement the same filtering and delivery functionality? My thoughts then drifted to other things, but the seed of that idea had already been planted in my mind, it seems, as a few years later I found myself implementing the code that would become mda-rs.

With mda-rs I wanted to create an experience as close as possible to using an interpreted domain specific language, the approach followed by typical MDAs, while still having the performance and power of a full, compiled language at one's fingertips. One aspect of this experience was providing an API that feels like a natural fit for the intended purpose. The other aspect was providing a straightforward way to build a custom MDA. For this second aspect, the simplicity of Rust's cargo was one of the reasons I decided to use Rust for this project.

Another decision catering to a similar ease-of-use goal was that the user shouldn't be required to use external programs to transform the data just so they could perform basic matching. To this end, mda-rs, like maildrop, normalizes the email before processing, by decoding and converting all text MIME content parts (including MIME encoded-words in headers) to plain UTF-8.

Of course, I also wanted the resulting custom MDAs to be fast; performance was my main disappointment with other MDAs, after all. Performance requires an efficient implementation, but also an API that encourages performant use. An example of the effect of such a concern on the mda-rs API is that the API provides access to the header fields by name, so that one can perform targeted per-header-field string searches, which can be much faster than regular expression searches of the whole header.

Finally, an important requirement for all MDAs is that the email delivery is durable, i.e., that no email will be lost in case of a system failure. In other words, at any point in time at least one of the local filesystem and the email server should have a complete and fully accessible copy of the email.

While investigating the best way to provide such durability guarantees, I noticed that all three MDAs mentioned previously fsync the written email file, as expected. However, they do not fsync the containing directory, which could potentially lead to the file not appearing on the filesystem [1]. The exact guarantees in this scenario seem to depend on the filesystem, but, to provide maximum durability, mda-rs fsyncs both the file and the containing directory by default. This does incur a performance penalty, so I have also included an option to use the file-sync-only approach if preferred.
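
For illustration, here is roughly what the full-durability delivery step looks like on Unix, in simplified form; the actual mda-rs implementation differs in details such as unique maildir file naming and error recovery:

use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Simplified sketch of a durable maildir delivery: fsync the file, move it
// from tmp/ to new/, then fsync the destination directory so that the
// directory entry itself reaches the disk.
fn deliver_durably(maildir: &Path, name: &str, data: &[u8]) -> std::io::Result<()> {
    let tmp_path = maildir.join("tmp").join(name);
    let new_path = maildir.join("new").join(name);

    let mut file = File::create(&tmp_path)?;
    file.write_all(data)?;
    file.sync_all()?; // fsync the email file itself

    fs::rename(&tmp_path, &new_path)?;

    // On Unix a directory can be opened and fsynced like a file.
    File::open(new_path.parent().unwrap())?.sync_all()?;
    Ok(())
}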

Since mda-rs was written to cater primarily to my personal needs, it currently has some limitations, or unimplemented functionality, if you will. The most prominent one is that it delivers to maildir only, since that's the only mailbox format I use. The second is that there is no built-in way to change the email data (e.g., add a header field) except by filtering through an external tool, since this is another feature I don't use.

Here is a small taste of what a custom MDA looks like with mda-rs:

use std::path::PathBuf;

use mda::{Email, EmailRegex, Result, DeliveryDurability};

fn main() -> Result<()> {
    let root = PathBuf::from("/tmp/my-personal-mail");

    let mut email = Email::from_stdin_filtered(&["/usr/bin/bogofilter", "-ep"])?;
    // Use quicker (but possibly less durable) delivery.
    email.set_delivery_durability(DeliveryDurability::FileSyncOnly);

    let from = email.header_field("From").unwrap_or("");
    let bogosity = email.header_field("X-Bogosity").unwrap_or("");

    if bogosity.contains("Spam, tests=bogofilter") ||
       from.contains("@banneddomain.com") {
        email.deliver_to_maildir(root.join("spam"))?;
        return Ok(());
    }

    let cc = email.header_field("Cc").unwrap_or("");
    let to = email.header_field("To").unwrap_or("");

    if to.contains("myworkemail@example.com") ||
       cc.contains("myworkemail@example.com") {
        if email.body().search("URGENCY RATING: (CRITICAL|URGENT)")? {
            email.deliver_to_maildir(root.join("inbox/myemail/urgent"))?;
        } else {
            email.deliver_to_maildir(root.join("inbox/myemail/normal"))?;
        }
        return Ok(());
    }

    email.deliver_to_maildir(root.join("inbox/unsorted"))?;

    Ok(())
}

and a corresponding minimal Cargo.toml:

[package]
name = "my-mda"
version = "0.1.0"
edition = "2018"

[dependencies]
mda = "0.1"

To provide an idea of the performance gains to expect, I benchmarked a us-ascii only version of my personal mail filtering rules on a sample of 250 of my recently received emails using all the aforementioned MDAs. Of course, the performance results will vary depending on both the rules and the emails themselves, but in my experience what is presented below seems to be a typical outcome. I'd be very interested to know how mda-rs worked for others on the performance front.

In the benchmark I included runs both with and without filtering for spam, since such filtering takes up a significant amount of processing and affects the relative results. For mda-rs I included two versions, one that mostly uses per-header-field searches, and a second one that uses regular expressions exclusively, and thus is a bit closer to how the other MDAs work. For each mda-rs version I ran the benchmark both in file sync only mode, which is how the other MDAs work, and in full sync (file and directory) mode, which is the default for mda-rs. Note that, for this benchmark, both maildrop and mda-rs performed MIME decoding and translation to UTF-8 (internally and automatically), whereas neither procmail nor fdm did so, and no external program was used to provide such functionality. I verified that the delivery results were the same for all MDAs.

[Figure: mda-benchmark — benchmark results for procmail, maildrop, fdm and mda-rs]

mda-rs wins hands down when operating in file sync mode, while at the same time doing more work (normalizing the email) compared to the next fastest contestant, procmail. In full sync mode, the extra directory syncs prove to be expensive enough to overshadow the other mda-rs performance gains, and procmail takes the lead. Still, even in this case, the performance is acceptable and much better compared to both maildrop and fdm.

If you would like to learn more about how to use mda-rs, you can find detailed documentation at: https://docs.rs/mda

The mda-rs project repository is at: https://github.com/afrantzis/mda-rs

The mda-rs crates.io entry is at: https://crates.io/crates/mda

Enjoy!

[1] See https://lwn.net/Articles/457667/

Dynamic programming for fun and profit

Imagine that you decide to make an investment. You have 1720 currency units (CU) at your disposal and the investment options that are available are:

Option #   min amount   max amount   interest   sign-up bonus
1          50           99           1%         0.5
2          100          299          1.1%       1
3          300          499          1.6%       2.2
4          500          999          1.8%       2.5
5          1000         1999         2%         4

You may use multiple investment options, each one possibly multiple times. Each option accepts a minimum and maximum amount of CUs, and provides an interest on that amount plus an additional bonus amount which does not depend on the invested amount. How will you invest your money? Take your time and continue reading when ready...

Let's see how you did! If you invested all 1720 CU in option 5 then... you chose poorly. Your profit is going to be 38.4 CU; not bad, but you can do better. If you invested 1020 in option 5 and 700 in option 4, that's a bit better at 39.5 CU, but still not optimal. What about 1020 in option 5, 500 in option 4 and 200 in option 2? That's actually a bit worse at 39.1 CU! I'll spare you the agony and reveal the best solution: 1020 in option 5, 300 in option 3 twice, and 100 in option 2, giving a profit of 40.5 CU. This is about 5.5% better than the straightforward solution of assigning everything to option 5.

As this short exercise has hopefully exhibited, finding the optimal portfolio of options is far from trivial. Is there a way to programmatically find the optimal solution?
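
To make things concrete, the snippets in the rest of this post will use the following Rust sketch of the problem data (the names are mine, invented for this post):

// The investment options from the table above, as data.
#[derive(Clone, Copy)]
struct InvestOption {
    min: u32,      // minimum accepted amount, in CU
    max: u32,      // maximum accepted amount, in CU
    interest: f64, // e.g. 0.016 for 1.6%
    bonus: f64,    // sign-up bonus, in CU
}

const OPTIONS: [InvestOption; 5] = [
    InvestOption { min: 50, max: 99, interest: 0.010, bonus: 0.5 },
    InvestOption { min: 100, max: 299, interest: 0.011, bonus: 1.0 },
    InvestOption { min: 300, max: 499, interest: 0.016, bonus: 2.2 },
    InvestOption { min: 500, max: 999, interest: 0.018, bonus: 2.5 },
    InvestOption { min: 1000, max: 1999, interest: 0.020, bonus: 4.0 },
];

// Profit of a single (option, amount) pair: interest plus sign-up bonus.
fn pair_profit(o: &InvestOption, amount: u32) -> f64 {
    amount as f64 * o.interest + o.bonus
}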

The road to dynamic programming

First, let's define what we are trying to find. A portfolio is a multiset of $(o, n)$ pairs, each pair representing that we are assigning $n$ CU to option $o$. For a portfolio to be acceptable, each pair must be valid (the amount $n$ must be within the range option $o$ accepts) and the total sum of all pairs must not exceed the total amount of money we want to use. We are trying to find the portfolio which provides the maximum profit.

Brute force

To understand the solution space a bit more, let's devise a way to enumerate all the portfolios, which is in fact not trivial.

One way to think about this is that to produce a portfolio we first have to split our total amount in some way, and assign an option to each element in the split. Each such split is formally called a partition of $n$. The number of partitions of a number $n$ is given by the partition function $p(n)$, which grows spectacularly fast. For each such partition we also need to factor in all the different acceptable option permutations.

Another way to think about this, which turns out to be more helpful, is with a recursive, constructive view of the problem. Let $P_n$ be the set of portfolios for $n$ total CU. From this set, the subset of portfolios that use option $o$ with $k$ CU consists of all portfolios for $n-k$ total CU combined with the pair $(o,k)$. Expressing this with set notation we have ($\uplus$ is the multiset sum operator, which in our case is used to add items to the multiset):

$$P_n(o,k) = \{\, Q \uplus \{(o,k)\} \mid Q \in P_{n-k} \,\}$$

The whole set $P_n$ is the union of all $P_n(o,k)$ sets for all valid combinations of $o$ and $k$:

$$O_k = \{\, o \in \mathrm{Options} \mid o \text{ accepts } k \,\}$$
$$P_n = \bigcup_{1 \le k \le n} \; \bigcup_{o \in O_k} P_n(o,k)$$

Using this recursive set description of the solution space we can write a program to enumerate all the solutions for a particular amount and set of options. This description provides a straightforward, albeit rather inefficient, way to solve this problem: by brute force — evaluate all solutions and use the best. In pseudocode this would look like:

solve_brute_force(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                for s in (solve_brute_force(n-i) ∪ {∅}):
                    solutions = solutions ∪ {s ⊎ {(o,i)}}

    return solutions

solve_driver(n):
    return argmax(solve_brute_force(n), profit)

Optimal substructure

Let's see if this problem has any interesting properties we can take advantage of to find the optimal solution more efficiently.

If we look at $P_n(o,k)$ we see that we are combining $(o,k)$ with all portfolios for $n-k$, and we ultimately check all these combinations to find the best solution. This is wasteful, however, since the choice of $(o,k)$ doesn't affect the profit of any of the members of $P_{n-k}$. To put it more formally, the profit function is an additive multiset function:

$$\mathit{Profit}(Q \uplus \{(o,k)\}) = \mathit{Profit}(Q) + \mathit{Profit}((o,k)), \qquad Q \in P_{n-k}$$

We can thus get the best total value from $P_n(o,k)$ by combining $(o,k)$ with a member of $P_{n-k}$ that has the highest profit. Any other choice would yield a lower total profit. We have thus shown that our problem exhibits the optimal substructure property: an optimal solution includes optimal solutions to subproblems.

Armed with this revelation we can now produce a much more efficient formulation:

$$P_n(o,k) = Q \uplus \{(o,k)\}, \qquad Q \text{ any one element of } \operatorname*{arg\,max}_{S \in P_{n-k}} \mathit{Profit}(S)$$

This leads us to the following pseudocode:

solve_optimal_substructure(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪
                            {solve_optimal_substructure(n-i) ⊎
                             {(o,i)}}

    return argmax(solutions, profit)

solve_driver(n):
    return solve_optimal_substructure(n)

Overlapping subproblems

Another interesting thing to note is that to solve for $n$ we need to solve for $n-1, \ldots, 1$. Similarly, to solve for $n-1$ we need to solve for $n-2, \ldots, 1$, and so on. We are solving the same problems over and over again! Our problem thus exhibits the property of overlapping subproblems. We can take advantage of this property by storing results in a cache and reusing them instead of recalculating them.

Updating our original pseudocode to use a cache we have:

solve_overlapping_subproblems(n):
    if n in CACHE: return CACHE[n]

    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                for s in (solve_overlapping_subproblems(n-i) ∪ {∅}):
                    solutions = solutions ∪ {s ⊎ {(o,i)}}

    CACHE[n] = solutions
    return CACHE[n]

solve_driver(n):
    return argmax(solve_overlapping_subproblems(n), profit)

Dynamic programming

If we combine both optimizations, taking advantage of both the optimal substructure property and the overlapping subproblems property, we reach the dynamic programming solution:

solve_dp_recursive(n):
    if n in CACHE: return CACHE[n]
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪
                            {solve_dp_recursive(n-i) ⊎ {(o,i)}}

    CACHE[n] = argmax(solutions, profit)
    return CACHE[n]

solve_driver(n):
    return solve_dp_recursive(n)

Alternatively, we can express this without recursion:

solve_dp_iterative(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪ {CACHE[n-i] ⊎ {(o,i)}}

    return argmax(solutions, profit)

solve_driver(n):
    CACHE[0] = ∅
    for i in [1..n]:
        CACHE[i] = solve_dp_iterative(i)
    return CACHE[n]
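
For completeness, here is a runnable Rust rendering of the iterative version, reusing the OPTIONS and pair_profit definitions from the earlier sketch. It returns an optimal portfolio as (option index, amount) pairs together with its profit; for the example above, solve_dp(1720) should recover the 40.5 CU optimum:

// cache[i] holds an optimal portfolio for i total CU along with its profit.
fn solve_dp(n: usize) -> (Vec<(usize, u32)>, f64) {
    let mut cache: Vec<(Vec<(usize, u32)>, f64)> = vec![(Vec::new(), 0.0)];

    for i in 1..=n {
        // Start from the empty portfolio: leftover CU may stay uninvested.
        let mut best: (Vec<(usize, u32)>, f64) = (Vec::new(), 0.0);
        for k in 1..=i {
            for (oi, o) in OPTIONS.iter().enumerate() {
                if (o.min as usize..=o.max as usize).contains(&k) {
                    let profit = cache[i - k].1 + pair_profit(o, k as u32);
                    if profit > best.1 {
                        let mut portfolio = cache[i - k].0.clone();
                        portfolio.push((oi, k as u32));
                        best = (portfolio, profit);
                    }
                }
            }
        }
        cache.push(best);
    }
    cache.pop().unwrap()
}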

Scaling down

Instances of this problem with a large n can still be prohibitively expensive to solve, even when using the dynamic programming approach I just described. Not all is lost, however, if we are willing to relax our requirement for a guaranteed optimal solution.

One approach to cutting down our solution time, while still producing a possibly good solution, is to scale down our problem. We do this by dividing the min amount, max amount and bonus values of all options, as well as the target amount we are solving for, by a particular scale factor. Another way to view this is that scaling down by a factor of scale corresponds to performing the algorithms with a minimum step of scale CU instead of 1 CU.

For example if we scaled down our original problem by 10 we would solve for 172 (= 1720 / 10) and use the following scaled down options:

Option #   min amount   max amount   interest   sign-up bonus
1          5            9            1%         0.05
2          10           29           1.1%       0.1
3          30           49           1.6%       0.22
4          50           99           1.8%       0.25
5          100          199          2%         0.4

When we find the optimal solution we can scale it back up using the same scale factor.

The expectation is that the optimal solution for the scaled down version of our problem is going to be in the proximity of the optimal solution for the original version, and assuming that the Profit function is generally smooth, the profit is also going to be near the optimal one.
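
In code, the scaling transformation is straightforward; note that the interest, being a rate, stays unchanged, while the amounts and the bonus are divided by the scale factor (again reusing the InvestOption sketch):

// Scale the options down by `scale`, so the problem can be solved for
// n / scale with a correspondingly smaller cache and iteration count.
fn scale_down(options: &[InvestOption], scale: u32) -> Vec<InvestOption> {
    options
        .iter()
        .map(|o| InvestOption {
            min: o.min / scale,
            max: o.max / scale,
            interest: o.interest,
            bonus: o.bonus / scale as f64,
        })
        .collect()
}

Solving the scaled problem for n / scale and then multiplying the amounts in the resulting portfolio (and, approximately, the profit) by scale gives the scaled-up solution.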

Getting greedy

Another approach to solving this problem faster is to explore whether there is a greedy solution for it. Below I will describe an approach that works in many cases, but is not guaranteed to provide the optimal solution.

Taking inspiration from the greedy solution to the fractional knapsack problem, at each step we greedily select to put as many CUs as possible in the option with the best marginal profit. The key observation here is that in our case each option has two different marginal profits. The first one involves investing in a new instance of an option using its minimum acceptable amount. The second one involves investing more in an instance of an option we have already invested in. In the first case the sign up bonus kicks in and increases the marginal profit. In the second case the marginal profit is simply the interest.

For our original scenario we have:

Option #   min amount   max amount   min marginal profit   add marginal profit
1          50           99           2%                    1%
2          100          299          2.1%                  1.1%
3          300          499          2.3333%               1.6%
4          500          999          2.3%                  1.8%
5          1000         1999         2.4%                  2%

For a total amount of 1720 this method works flawlessly. We first select to add $(o_5, 1000)$ to our portfolio for a marginal profit of 2.4%. From the remaining 720 we add $(o_3, 300)$ for a marginal profit of 2.3333%. From the remaining 420 we again choose $(o_3, 300)$. We now have 120 left, for which we choose $(o_2, 100)$, and the final 20 we add to the $(o_5, 1000)$ instance we already have.

Unfortunately, this method doesn't find the optimal solution in all cases. Take for example a total amount of 500. The greedy algorithm chooses $(o_3, 300)$, $(o_2, 100)$, $(o_2, 100)$ for a total profit of 11.2. Alas, the optimal solution is simply $(o_4, 500)$ for a total profit of 11.5.

Perhaps there is some other greedy approach that provides optimal solutions in all cases. Still, the described greedy approach is useful if we want to quickly get a possibly good solution. We could also try both the greedy approach and the scaling approach to raise our confidence level in the generated solution.
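
Here is a sketch of this greedy heuristic, once more reusing the earlier definitions. New instances are opened at an option's minimum amount, where the sign-up bonus boosts the margin, while top-ups of existing instances earn only the plain interest:

fn solve_greedy(mut left: u32) -> Vec<(usize, u32)> {
    let mut portfolio: Vec<(usize, u32)> = Vec::new(); // (option index, amount)

    while left > 0 {
        // Best new instance by marginal profit: interest + bonus / min.
        let best_new = OPTIONS
            .iter()
            .enumerate()
            .filter(|(_, o)| o.min <= left)
            .map(|(oi, o)| (o.interest + o.bonus / o.min as f64, oi))
            .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap());

        // Best top-up: the existing instance with spare capacity and the
        // highest interest.
        let best_add = portfolio
            .iter()
            .enumerate()
            .filter(|(_, &(oi, amount))| amount < OPTIONS[oi].max)
            .map(|(pi, &(oi, _))| (OPTIONS[oi].interest, pi))
            .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap());

        match (best_new, best_add) {
            (Some((new_margin, oi)), add)
                if add.map_or(true, |(add_margin, _)| new_margin >= add_margin) =>
            {
                // Open a new instance at the minimum amount.
                portfolio.push((oi, OPTIONS[oi].min));
                left -= OPTIONS[oi].min;
            }
            (_, Some((_, pi))) => {
                // Put as much as possible into the chosen instance.
                let oi = portfolio[pi].0;
                let top_up = (OPTIONS[oi].max - portfolio[pi].1).min(left);
                portfolio[pi].1 += top_up;
                left -= top_up;
            }
            _ => break, // nothing accepts the remaining amount
        }
    }
    portfolio
}

For 1720 this reproduces the walkthrough above, and for 500 it produces the suboptimal portfolio, as described.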

Can we do better?

We have mostly focused on solving our investment problem using a dynamic programming approach. Taking a step back, we may ask if this is indeed the best way to get an optimal solution. Can we approach this problem from a completely different angle to improve performance?

It turns out that the answer is yes! With some creativity our problem can be expressed as an integer linear program (ILP), for which we have algorithms that work quite efficiently in practice. I plan to present this approach in more detail in a future post.

An ounce of action...

If you want to try out the ideas mentioned in this post, I have created a program implementing all the algorithms and variations. You can find it at: https://github.com/afrantzis/invest.

Enjoy!
