Interactive time series graph of confirmed COVID-19 cases per country
March 9, 2020
I have been looking for more information about the trends of the number of
confirmed COVID-19 cases per country. I could find the numbers for each day,
but I wanted a visual representation of the information in order to readily
compare the trends in various regions.
Johns Hopkins University (JHU) has set up a great
dashboard
tracking the COVID-19 cases worldwide. It contains a time series graph of the
overall cases, but doesn't provide such graphs for each country.
Thankfully, JHU has made all its data publicly available (for
educational/academic purposes) in their JHU CSSE COVID-19
repository.
Since I couldn't find a visualization that matched my needs, I decided
to create my own interactive graph:
Interactive time series graph of confirmed COVID-19 cases per country
The graph is implemented as a web page that uses client-side processing (i.e.,
JavaScript) to plot the time series for one or more countries. The JHU
data is fetched and processed by the client on each page load, so the graph
will display the latest data without requiring any server-side updates.
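The per-row processing is straightforward: the JHU time series CSV has a few fixed leading columns (Province/State, Country/Region, Lat, Long) followed by one cumulative count per date. The page does this in JavaScript; the sketch below shows the same idea in Rust. The `parse_row` helper and the naive comma splitting are illustrative assumptions, not code from the actual page — a real parser would also handle quoted fields such as "Korea, South".

```rust
// Illustrative sketch: parse one row of a JHU-style time series CSV
// (Province/State,Country/Region,Lat,Long,count1,count2,...).
// Naive splitting; quoted fields with embedded commas are not handled.
fn parse_row(row: &str) -> (String, Vec<u64>) {
    let fields: Vec<&str> = row.split(',').collect();
    let country = fields[1].to_string();
    let counts = fields[4..]
        .iter()
        .map(|s| s.parse().unwrap_or(0)) // non-numeric cells become 0
        .collect();
    (country, counts)
}

fn main() {
    let row = ",Italy,41.87,12.56,3,17,79,157";
    let (country, counts) = parse_row(row);
    println!("{}: {:?}", country, counts); // Italy: [3, 17, 79, 157]
}
```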
For example, this is a plot from that page for France, Italy and Spain as of
2020-03-08:

Given the time offset between the trend lines, it's not always clear how the
trends compare. To be able to compare trends more easily, I added an option to
synchronize the displayed time series so that they start at (approximately) the
same confirmed case count. Here is the same series as above but synchronized
at 200 cases:

This plot indicates that France and Italy are following roughly the same trend,
whereas Spain is faring a bit better.
Before you draw any conclusions about trends that show up after
synchronization, keep in mind two important caveats.
First, since our sampling granularity is one day, and jumps in confirmed cases
between successive days can be large, starting points for each series after
synchronization may in fact be different enough to skew results. For example,
we may want to synchronize at around 200, but if one series jumps from 50 to
300, the starting point for that series is not going to be near 200.
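The synchronization step itself is simple to sketch: drop each series' prefix up to the first day on which the count reaches the threshold. The helper below is an illustrative Rust sketch (not code from the page), with made-up counts; the second series demonstrates the granularity caveat just described.

```rust
// Illustrative sketch of the synchronization option: keep each series
// from the first day its count reaches the threshold onwards.
fn synchronize(series: &[u64], threshold: u64) -> &[u64] {
    match series.iter().position(|&count| count >= threshold) {
        Some(first) => &series[first..],
        None => &[], // the series never reached the threshold
    }
}

fn main() {
    // Hypothetical daily counts for two regions, synchronized at 200.
    let a = [10, 60, 210, 400, 800];
    let b = [5, 50, 300, 700, 1500]; // jumps straight from 50 past 200
    // Both series now "start at 200", but at 210 vs 300 respectively:
    // the day-level sampling granularity skews the starting points.
    println!("{:?}", synchronize(&a, 200)); // [210, 400, 800]
    println!("{:?}", synchronize(&b, 200)); // [300, 700, 1500]
}
```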
Second, using different synchronization points may provide different views of
the trends. The same France-Italy-Spain plot synchronized at 40 cases gives:

In this plot we can see that France and Spain initially followed a similar
trend, but then France started diverging upwards. This is somewhat apparent in
the original, non-synchronized plot, but only because the France and Spain
trend lines have a small time offset.
I hope people find this interactive graph informative!
Efficient saving of multi-source buffers
December 16, 2019
Preface
Several years ago I was working on libbls, a library implementing
editable buffers that can efficiently hold contents originating from both
memory and file sources. Typical clients of this library are programs that
need to edit arbitrary portions of very large files, like hex editors or wave
editors, while minimizing the required memory usage.
One of the more interesting problems I encountered was devising an efficient
way to save buffers to their corresponding files in place, while minimizing the
extra memory (or disk space) used in the process. This post describes how this
problem, which turns out to be NP-hard, can be modeled as a graph, and how we
can provide a reasonable solution by using a variation of a standard graph
algorithm.
Introduction
Buffers in libbls are described in terms of segments that contain data from
either memory or file sources. In order to conserve memory (and sometimes to
even make the editing feasible) the data from file sources is not loaded into
main memory. Instead, the buffer keeps information about the contained files
and ranges and reads the data in vm-page-sized chunks when needed. So, a buffer
could look like:
B: |F1:0-9 |M1:10-18|F1:10-19 |M2:10-19|F2:0-6|
Where F1 and F2 are file sources, M1 and M2 are memory sources and S:B-E
denotes a segment representing bytes from the range [B, E] of source S.
In the simple case, saving a buffer to a file consists of just reading data
from the various sources and writing them out to the target file. Things become
more complicated when the target file happens to be one of the file sources
used in the buffer.
This post aims to illustrate the issues that can arise in such cases and
propose an elegant way to resolve them.
Illustrating the problem
Saving the buffer to a file typically involves reading each segment in order
and writing it out to the file. However, if the segment source is the same as
the file that we are writing to, and depending on the ordering of segments, we
may face a serious problem.
To illustrate the problem, consider a buffer B containing segment [0-9] of file
F at range [10-19] of the buffer, and assume that we want to write the buffer
back to file F:
       S
F: |0-9      |.........|
              S
B: |.........|F:0-9    |
   ^         ^         ^
   0         10        20
Writing the buffer segments in order leads to a situation where we first write
out buffer range [0-9] to file F, overwriting the contents of the file at that
range. So, when moving on to the next segment, trying to write out the range
[10-19] of the buffer, the data we are going to read from the file (and
therefore write) is not the original data of [F:0-9], but the data belonging to
the previous segment of the buffer we just wrote at positions [0-9].
The solution in this case is to first write out range [10-19] of the buffer to
the file, so that the data we use is the original file data, and only then
write out any other segments.
The problem can become even more complicated when we have more segments from
the target file. A particularly interesting case arises when a segment from the
buffer which belongs to the target file happens to overlap with another segment
of F which is also present in the buffer:
     S                     T
F: |0-9      |.........|20-29    |.........|
            T                      S
B: |.......|F:20-29  |...........|F:0-9    |
   ^       ^         ^           ^
   0       8         18          30
In this case segment S in F overlaps with segment T in B. We can try to adapt
the strategy used in the previous case and first write out the two target
file segments. This works, but only if we are careful about the order: if we
first write segment T, then when we try to read the data of segment S we will
read wrong data (file range [8-9] will contain data from segment T). If we
write S first instead, everything works wonderfully.
Taking the last example one step further, consider what happens if we have
cyclic overlaps:
     S                     T
F: |0-9      |.........|20-29    |...|
            T                S
B: |.......|F:20-29  |......|F:0-9    |
   ^       ^         ^      ^         ^
   0       8         18     25        35
Now segment S in F overlaps with segment T in B, and segment T in F overlaps
with segment S in B. In this case there is no way to order the writes of
segments S and T so that we end up with the correct data.
A straightforward but wasteful way of tackling this problem is to save the
whole buffer to a temporary file and then move the file to its final location.
This works, and in some cases is preferred since it has the benefit of
atomicity, but requires extra space for the temporary file. When buffers reach
many GiB in size this method may prove to be infeasible.
A more efficient method is to try to eliminate the cyclic overlaps by removing
at least one of the overlaps involved. In the previous case we can replace
segment T in B with two segments M and T1 so that no overlap occurs. Segment M
will contain a part of T's data but will actually store it in memory (or a
temporary file if necessary) and the T1 segment will just be a T minus the
range stored in M. This approach still requires some amount of extra memory,
but in most cases this amount is much smaller than the size of the whole
buffer. Using this scheme we have:
     S                      T1
F: |0-9      |...........|22-29  |...|
            M T1             S
B: |.......|.|F:22-29|......|F:0-9    |
   ^       ^         ^      ^         ^
   0       8         18     25        35
We have managed to eliminate the first overlap and now we can safely save the
buffer by first writing T1 then S and then the remaining segments of B.
Modeling the problem
The examples presented in the previous section show representative but simple
cases of the problems we can run into. In order to be able to handle cases of
arbitrary complexity it is useful to create a model of the problem. We can then
use this model to come up with algorithms that provide generic solutions.
The model we are going to present uses graphs. In these overlap graphs
vertices represent segments from the file we are going to save to, that are
also present in the buffer we are going to save. Edges between vertices denote
segment overlap: an edge from vertex P to vertex Q denotes that segment P in
the target file overlaps with segment Q in the buffer. Every edge carries with
it the size of the overlapping range.
Here are some buffers and their respective overlap graphs (P* denotes a
self-loop on P):
     P          Q                   R
F:  |0-9      |....|15-24    |.........|35-42  |

                             P
B1: |.......................|F:0-9    |......|        P

               P              Q
B2: |.........|F:0-9    |....|F:15-24  |......|       Q --> P

         P              Q
B3: |...|F:0-9    |.....|F:20-29  |.........|         P*  Q*

     R       P                    Q
B4: |F:35-42|F:0-9    |...........|F:15-24  |..|      R <-- P* <-- Q
                                                       \____________^
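To make the model concrete, here is a minimal sketch of overlap-graph construction, under assumed types: each vertex is a segment of the target file that also appears in the buffer, with a src range (its bytes in the target file) and a dst range (the bytes it will occupy after saving). An edge P -> Q, weighted by the overlap size, is added whenever P's src range intersects Q's dst range. The Segment type and the positions in the example are illustrative, not libbls's actual data structures.

```rust
// Illustrative sketch of overlap-graph construction, not the libbls code.
#[derive(Clone, Copy)]
struct Segment {
    src: (u64, u64), // inclusive byte range in the target file (source data)
    dst: (u64, u64), // inclusive byte range it occupies after saving
}

// Size of the intersection of two inclusive ranges (0 if disjoint).
fn overlap(a: (u64, u64), b: (u64, u64)) -> u64 {
    let lo = a.0.max(b.0);
    let hi = a.1.min(b.1);
    if lo <= hi { hi - lo + 1 } else { 0 }
}

// Naive O(k*k) construction over the k file segments; an interval tree
// would bring this down to O(k log k).
fn overlap_graph(segments: &[Segment]) -> Vec<(usize, usize, u64)> {
    let mut edges = Vec::new();
    for (p, sp) in segments.iter().enumerate() {
        for (q, sq) in segments.iter().enumerate() {
            let w = overlap(sp.src, sq.dst);
            if w > 0 {
                edges.push((p, q, w)); // includes self-loops when p == q
            }
        }
    }
    edges
}

fn main() {
    // Roughly the B2 example: P = F:0-9 destined for buffer range [10-19],
    // Q = F:15-24 destined for [25-34] (positions chosen for illustration).
    let p = Segment { src: (0, 9), dst: (10, 19) };
    let q = Segment { src: (15, 24), dst: (25, 34) };
    let edges = overlap_graph(&[p, q]);
    // A single edge Q -> P with a 5-byte overlap ([15-19]).
    println!("{:?}", edges); // [(1, 0, 5)]
}
```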
Solving the problem
Using the overlap graph we can now answer the question: in what order should
the vertices (segments) be written to the file so that the correct data is
written?
If the graph has no cycles then we can process the vertices in topological
order. Topological order guarantees that each vertex is processed only when its
dependencies have already been processed and written to file, and thus that no
unwritten file segment in the buffer overlaps with the destination range of
that vertex.
As a special case, we can process a vertex that has a self-loop but no other
incoming edge. This is achieved by first writing to the file the
non-overlapping part of the vertex and then the rest.
In order to handle graphs with cycles (except self-loops) we must find a way to
break the cycles. This can be achieved by removing the correct edges. An edge
from P to Q can be removed by replacing Q in the buffer by the segments Q1 and
M as described previously. M contains an in memory copy of
data of the overlapping range. Q1 is the part of Q that remains. This technique
eliminates the overlap (because the overlapping part is now stored in memory)
and removes the edge. By choosing the edges to remove wisely, we can minimize
the amount of extra memory we are going to need.
Once we have an acyclic graph (self-loops allowed) we can process it as
described previously.
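The processing order can be sketched with Kahn's topological sort, skipping self-loops (which, as noted, are handled by writing the non-overlapping part of the vertex first). This is an illustrative sketch, not the libbls implementation; an edge (p, q) means p must be written before q, and a remaining directed cycle leaves some vertices unordered.

```rust
// Illustrative Kahn's topological sort over the overlap graph.
// Self-loops are ignored; they are handled separately when writing.
fn topo_order(n: usize, edges: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut indeg = vec![0usize; n];
    for &(p, q) in edges {
        if p != q {
            indeg[q] += 1;
        }
    }
    let mut ready: Vec<usize> = (0..n).filter(|&v| indeg[v] == 0).collect();
    let mut order = Vec::new();
    while let Some(v) = ready.pop() {
        order.push(v);
        for &(p, q) in edges {
            if p == v && p != q {
                indeg[q] -= 1;
                if indeg[q] == 0 {
                    ready.push(q);
                }
            }
        }
    }
    // A cycle (other than a self-loop) leaves some vertices unprocessed.
    if order.len() == n { Some(order) } else { None }
}

fn main() {
    // An acyclic chain 0 -> 2 -> 1 has a valid write order...
    println!("{:?}", topo_order(3, &[(0, 2), (2, 1)])); // Some([0, 2, 1])
    // ...while a three-vertex directed cycle has none.
    println!("{:?}", topo_order(3, &[(1, 0), (0, 2), (2, 1)])); // None
}
```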
Let's manually apply this method to an example:
     P          Q                   R
F:  |0-9      |....|15-24    |.........|35-42  |
     R       P                    Q
B4: |F:35-42|F:0-9    |...........|F:15-24  |..|      R <-- P* <-- Q
                                                       \____________^
We first have to make the graph acyclic. We have a few choices for which edge
to break:
- Q->P, a 3 byte overlap
- P->R, a 7 byte overlap
- R->Q, a 5 byte overlap
To minimize the required extra memory we break the Q->P edge by storing the
last three bytes of P in memory as segment M, so we get:
     P1            Q                R
F:  |0-6   |.......|15-24    |.........|35-42  |
     R       P1    M              Q
B4: |F:35-42|F:0-6 |..|...........|F:15-24  |.|       Q <-- R <-- P1
Note that coincidentally we also resolved the self-loop of P. Now that we have
an acyclic graph we can process the buffer segments in topological order, thus
we first store P1 in range [8-14] of the file, next is R at [0-7] and finally Q
at range [30-39], followed by all other portions of the buffer.
Implementation Details
libbls implements the solution described above when saving a file in place. The
high level logic for the algorithm resides in
bless_buffer_save. The implementation of the algorithm
within that function can be conceptually split into three steps. Some
implementation details for each step are given below, along with links to the
functions that provide the relevant functionality.
Step 1: Creating the overlap graph
Relevant functions:
buffer_file.c: create_overlap_graph
overlap_graph.c: overlap_graph_add_segment
To create the overlap graph we have to scan the buffer segment collection for
segments that belong to the file we want to save to. Each new segment that is
found must be checked against all previously found file segments for overlap.
This is an O(n + k*f(k)) algorithm where n is the number of segments in the
segment collection and k the number of segments belonging to the file. If we
check for overlap naively then f(k) = O(k) and therefore the algorithm is
O(n + k*k). We can use an interval tree so that f(k) = O(log k) and the
algorithm becomes O(n + k*log k). As we don't expect k to grow too large,
the implementation currently uses the naive approach for simplicity.
Step 2: Making the graph acyclic
Relevant functions:
overlap_graph.c: overlap_graph_remove_cycles
In the second step the goal is to remove cycles from the graph, while trying to
minimize the total weight of the removed edges in order to minimize extra
memory usage. However, it turns out that the problem we are trying to solve,
minimum feedback arc set, is NP-hard! Don't despair though; we can get
reasonable results with only a little extra work.
The undirected counterpart of the minimum feedback arc set problem is the
minimum (or maximum) spanning tree (MST) problem, which can be efficiently
solved by using various methods. We can use these methods to give a solution
(but not the best) for our problem, too. We can do that because MST algorithms
remove all undirected cycles and therefore all directed ones, too. The caveat
is that they remove more than we need them to: undirected cycles that are not
also directed cycles.
MST algorithms work by processing edges in non-decreasing or non-increasing
order of weight and adding them to the already formed tree (Prim) or trees
(Kruskal) if they aren't already connected to them by an undirected path.
Because our goal is to avoid directed cycles we can be more lax with the rules
of adding an edge. A simple but useful observation is that if the destination
node of an edge has no outgoing edges then a directed cycle cannot be formed,
regardless of whether there is a pre-existing undirected path between the two
nodes. Similarly, if the source node of an edge has no incoming edges a
directed cycle cannot be formed either, regardless of the pre-existence of an
undirected path between the source and destination nodes.
We can use the previous observations to decrease the number of edges that are
incorrectly deleted by the MST algorithms when acting on directed graphs.
Although we don't completely eliminate the erroneous deletions, these rules
give reasonable results while keeping the complexity the same as in the
undirected case.
Of the MST algorithms, Kruskal's algorithm can be used unchanged in directed
graphs because it works directly on edges and doesn't care about their
direction. It also works on disconnected graphs, since it finds a spanning
forest. For these reasons, it was selected as the basis for the algorithm used
by libbls.
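Putting the relaxed rules together with Kruskal's edge ordering gives a sketch like the following (illustrative, not the actual overlap_graph_remove_cycles code): edges are visited heaviest-first, an edge is kept when it provably cannot close a directed cycle (its endpoints are not yet connected by an undirected path, or its destination has no outgoing kept edge, or its source has no incoming kept edge), and the edges failing all three tests are the ones to break by buffering their overlap in memory.

```rust
// Union-find with path compression, used to track undirected connectivity.
fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    let p = parent[x];
    if p != x {
        let root = find(parent, p);
        parent[x] = root;
        root
    } else {
        x
    }
}

// Illustrative relaxed-Kruskal cycle removal: returns the edges to break.
// Edges are (src, dst, weight); self-loops are handled separately and skipped.
fn edges_to_remove(n: usize, edges: &[(usize, usize, u64)]) -> Vec<(usize, usize, u64)> {
    let mut sorted: Vec<_> = edges.to_vec();
    sorted.sort_by(|a, b| b.2.cmp(&a.2)); // heaviest first: prefer keeping them
    let mut parent: Vec<usize> = (0..n).collect();
    let mut out_deg = vec![0usize; n];
    let mut in_deg = vec![0usize; n];
    let mut removed = Vec::new();
    for (src, dst, w) in sorted {
        if src == dst {
            continue; // a self-loop never participates in a larger cycle
        }
        let (rs, rd) = (find(&mut parent, src), find(&mut parent, dst));
        // Keep the edge if no directed cycle can possibly be closed.
        if rs != rd || out_deg[dst] == 0 || in_deg[src] == 0 {
            parent[rs] = rd;
            out_deg[src] += 1;
            in_deg[dst] += 1;
        } else {
            removed.push((src, dst, w));
        }
    }
    removed
}

fn main() {
    // The B4 cycle from the manual example, with vertices 0=P, 1=Q, 2=R:
    // Q->P (3 bytes), P->R (7 bytes), R->Q (5 bytes).
    let removed = edges_to_remove(3, &[(1, 0, 3), (0, 2, 7), (2, 1, 5)]);
    println!("{:?}", removed); // [(1, 0, 3)]
}
```

As in the manual solution, the heuristic keeps the two heavier edges and returns the 3-byte Q->P edge for removal.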
Step 3: Removing edges and saving to file
Relevant functions:
overlap_graph.c: overlap_graph_get_removed_edges
buffer_file.c: break_edge
buffer_util.c: segcol_store_in_memory,
segcol_store_in_file
overlap_graph.c: overlap_graph_get_vertices_topo
In the previous step we calculated which edges would need to be removed to
create an acyclic graph. In this step we first perform the actual removal of
these edges by storing the associated data in memory or a temporary file as
described previously.
Removing the edges has the unfortunate side effect of further altering the
graph as segments are removed, added or changed in the segment collection. This
means that after removing the edges we must re-calculate the overlap graph for
the segment collection. There may be some way to avoid re-calculating the
whole graph, making only local changes to the existing graph as vertices are
altered, but we use the simple (although more costly) way for now.
We then use this new overlap graph, which is guaranteed not to have any cycles,
to get the vertices in topological order and save them to the target file.
mda-rs: Custom Mail Delivery Agents in Rust
October 7, 2019
I had been stubbornly using procmail for two decades to filter my email, before
deciding that it's finally time for a change. Procmail served me well through
the years, but, little by little, the annoyances started adding up. The fact
that procmail has been unmaintained since 2001 inspired less and less confidence as
years passed, and the requirement in many cases for external processing before
any useful matching could be performed (e.g., dealing with MIME content
transfer encoded data, or non us-ascii character sets) was a constant pain
point. As the complexity of my rules grew, I even had to resort to an external
program (a custom C++ program in my case) to perform some of the mailbox
selection logic, using the AUTOFOLDER feature of procmail.
At that point, and given all the functionality that I had to implement myself,
I seriously started questioning the value procmail was providing to me and
started looking for alternatives. I evaluated fdm and maildrop, finding in both
a lot that I liked; first and foremost a less arcane filtering language. In the
end, I found maildrop to be a closer match to my preferences and requirements,
and I especially appreciated the automatic MIME content decoding.
I briefly considered switching to maildrop, but a few experiments on my set of
filtering rules indicated that maildrop's performance was significantly worse
compared to procmail, even though for procmail I had to call out to a few more
external programs to achieve similar functionality. I also gave fdm a go, and
it was even slower than maildrop. I receive a lot of emails each day, mostly
from various FOSS mailing lists, and the performance penalty adds up. While
waiting for a few dozen seconds daily wouldn't have been the end of the world,
the thought that my original and much older solution was faster disturbed me.
Meanwhile, I noticed that the control flow statements in maildrop's filtering
language were similar to those of a general-purpose programming language, and,
in fact, with the integration with regular expressions, they looked a bit like
perl. So, I thought, what if instead of using a domain specific language, I
could use a general purpose language, augmented by some kind of library to
implement the same filtering and delivery functionality. My thoughts then
drifted to other things, but the seed of that idea had already been planted in
my mind it seems, as a few years later I found myself implementing the code
that would become mda-rs.
With mda-rs I wanted to create an experience as close as possible to using an
interpreted domain specific language, the approach followed by typical MDAs,
while still having the performance and power of a full, compiled language at
the fingertips. One aspect of this experience was providing an API that feels
like a natural fit for the intended purpose. The other aspect was providing a
straightforward way to build a custom MDA. For this second aspect, the
simplicity of Rust's cargo was one of the reasons I decided to use Rust for
this project.
Another decision catering to a similar ease-of-use goal was that the user
shouldn't be required to use external programs to transform the data just so
they could perform basic matching. To this end, mda-rs, like maildrop,
normalizes the email before processing, by decoding and converting all text
MIME content parts (including MIME encoded-words in headers) to plain UTF-8.
Of course, I also wanted the resulting custom MDAs to be fast; performance was
my main disappointment with other MDAs after all. Performance requires an
efficient implementation, but also an API that encourages performant use.
An example of the effect of this concern on the mda-rs API is that the API
provides access to the header fields by name, so that one can perform targeted
per-header-field string searches, which can be much faster than regular
expression searches of the whole header.
Finally, an important requirement for all MDAs is that the email delivery is
durable, i.e., that no email will be lost in case of a system failure. In other
words, at any point in time at least one of the local filesystem and the email
server should have a complete and fully accessible copy of the email.
While investigating the best way to provide such durability guarantees, I
noticed that all three MDAs mentioned previously are fsyncing the written email
file, as expected. However, they are not fsyncing the containing directory,
which could potentially lead to the file not appearing on the filesystem.
The exact guarantees in this scenario seem to be dependent on the filesystem,
but, to provide maximum durability, mda-rs fsyncs both the file and the
containing directory by default. This does incur a performance penalty, so I
have also included an option to use the file-sync-only approach if preferred.
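The file-plus-directory approach can be sketched as follows. This is an illustrative sketch, not the actual mda-rs delivery code (real maildir delivery also involves tmp/new subdirectories and unique filenames); on Linux a directory can be opened read-only and fsynced like any other file descriptor.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Illustrative durable write: fsync the file, then fsync the containing
// directory so the new directory entry itself reaches the disk.
fn write_durably(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    file.sync_all()?; // fsync the file data and metadata
    if let Some(dir) = path.parent() {
        File::open(dir)?.sync_all()?; // fsync the containing directory
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical destination; not an actual maildir layout.
    let dir = std::env::temp_dir().join("mda-demo");
    std::fs::create_dir_all(&dir)?;
    write_durably(&dir.join("msg"), b"Subject: hello\n\nbody\n")?;
    Ok(())
}
```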
Since mda-rs was written to cater primarily to my personal needs, it currently
has some limitations, or unimplemented functionality, if you will. The most
prominent one is that it delivers to maildir only, since that's the only
mailbox format I use. The second is that there is no built-in way to change the
email data (e.g., add a header field) except by filtering through an external
tool, since this is another feature I don't use.
Here is a small taste of what a custom MDA might look like with mda-rs:
use std::path::PathBuf;
use mda::{Email, EmailRegex, Result, DeliveryDurability};

fn main() -> Result<()> {
    let root = PathBuf::from("/tmp/my-personal-mail");

    let mut email = Email::from_stdin_filtered(&["/usr/bin/bogofilter", "-ep"])?;

    // Use quicker (but possibly less durable) delivery.
    email.set_delivery_durability(DeliveryDurability::FileSyncOnly);

    let from = email.header_field("From").unwrap_or("");
    let bogosity = email.header_field("X-Bogosity").unwrap_or("");

    if bogosity.contains("Spam, tests=bogofilter") ||
       from.contains("@banneddomain.com") {
        email.deliver_to_maildir(root.join("spam"))?;
        return Ok(());
    }

    let cc = email.header_field("Cc").unwrap_or("");
    let to = email.header_field("To").unwrap_or("");

    if to.contains("myworkemail@example.com") ||
       cc.contains("myworkemail@example.com") {
        if email.body().search("URGENCY RATING: (CRITICAL|URGENT)")? {
            email.deliver_to_maildir(root.join("inbox/myemail/urgent"))?;
        } else {
            email.deliver_to_maildir(root.join("inbox/myemail/normal"))?;
        }
        return Ok(());
    }

    email.deliver_to_maildir(root.join("inbox/unsorted"))?;

    Ok(())
}
and a corresponding minimal Cargo.toml:
[package]
name = "my-mda"
version = "0.1.0"
edition = "2018"
[dependencies]
mda = "0.1"
To provide an idea of the performance gains to expect, I benchmarked a us-ascii
only version of my personal mail filtering rules on a sample of 250 of my
recently received emails using all the aforementioned MDAs. Of course, the
performance results will vary depending on both the rules and the emails
themselves, but in my experience what is presented below seems to be a typical
outcome. I'd be very interested to know how mda-rs worked for others on the
performance front.
In the benchmark I included runs both with and without filtering for spam,
since such filtering takes up a significant amount of processing and affects
the relative results. For mda-rs I included two versions, one that mostly uses
per-header-field searches, and a second one that uses regular expressions
exclusively, and thus is a bit closer to how the other MDAs work. For each
mda-rs version I ran the benchmark both in file sync only mode, which is how
the other MDAs work, and full sync (file and directory) mode, which is the
default for mda-rs. Note that, for this benchmark, both maildrop and mda-rs
performed MIME decoding and translation to UTF-8 (internally and
automatically), whereas neither procmail nor fdm did so, and no external
program was used to provide such functionality. I verified that the delivery
results were the same for all MDAs.

mda-rs wins hands down when operating in file sync mode, while at the same time
doing more work (normalizing the email) compared to the next-fastest contestant,
procmail. In full sync mode, the extra directory syncs prove to be expensive
enough to overshadow other mda-rs performance gains, and procmail takes the
lead. Still, even in this case, the performance is acceptable and much better
compared to both maildrop and fdm.
If you would like to learn more about how to use mda-rs, you can find detailed
documentation at:
https://docs.rs/mda
The mda-rs project repository is at:
https://github.com/afrantzis/mda-rs
The mda-rs crates.io entry is at:
https://crates.io/crates/mda
Enjoy!
See https://lwn.net/Articles/457667/