Allo-Media

An event driven architecture — part 3

2022-09-26T00:00:00+00:00

In this new post, let’s talk about the actual implementation of the core principles of the event driven architecture.

Topology

The topology describes how the bus is implemented on the message broker.

As the message broker, we chose RabbitMQ for its reliability record and ease of use.

In RabbitMQ, the topology is set up by instantiating exchanges and queues.

Exchanges are a kind of router. Client applications bind queues to them, by subscribing to particular routing keys, to store their messages until they are processed. It is good practice to consider a queue as private to its logical service (although it is shared by the workers of that logical service). We use the message type name as routing key.

The topology is made of three exchanges, plus dead letter routing:

a reliable events exchange with:
- persistent consumer queues (survive broker and service restarts)
- persistent messages (survive broker and service restarts)
- message processing acknowledgment by clients
- message confirmation by broker (for early detection of network transient failures)
- topic routing (to allow wildcard monitoring)
a reliable commands exchange with:
- persistent consumer queues (survive broker and service restarts)
- persistent messages (survive broker and service restarts)
- message processing acknowledgment by clients
- message confirmation by broker (for early detection of network transient failures)
- topic routing
a logs exchange with:
- no persistence (to avoid over-flooding the broker memory in case nobody is consuming the logs)
- no message acknowledgment or confirmation (for speed)
dead letter routing: when a message is permanently refused, it is automatically routed to the dead-letter exchange.
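
To make the "topic routing (to allow wildcard monitoring)" point more concrete, here is a minimal sketch that imitates locally how a topic binding pattern is matched against a routing key: `*` stands for exactly one dot-separated word and `#` for zero or more words. The real matching is done by RabbitMQ itself; the function name `topic_matches` is ours and only serves the illustration.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a topic binding pattern matches a routing key."""

    def match(p, k):
        if not p:
            # Pattern exhausted: it matches only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        # '*' matches exactly one word, anything else must be equal.
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


# A monitoring tool bound with '#' sees every message type.
print(topic_matches("#", "ConversationStarted"))
# A binding on 'asr.*' only sees the commands addressed to 'asr'.
print(topic_matches("asr.*", "asr.TranscribeAudio"))
print(topic_matches("asr.*", "ConversationStarted"))
```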

The Result of a Command is sent to the commands exchange too.

Instances of the same logical service are called workers of that service, and they share the same queue. The broker guarantees that a message is handed to one and only one worker of the pool attached to the queue at any given time.

The message processing acknowledgment by clients is a very useful mechanism: it ensures no data is lost (a message is guaranteed to be processed at least once) and allows efficient load balancing. Indeed, the message broker won’t remove a message from the queue until the worker that took it for processing tells it that it has finished its job. During this time, the message is reserved. If the worker that took the message goes down before acknowledging, the message becomes available for any other worker, or for the same worker when it comes back. Moreover, the broker knows at any given time which workers are busy and which ones are idle, so it can better share the load between them.
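
The reserve / acknowledge cycle can be sketched with a toy in-memory queue. This is only a simulation written for this post (the class `AckQueue` and its method names are hypothetical, and the real work is done by the broker), but it shows why a message survives a worker crash.

```python
from collections import deque


class AckQueue:
    """Toy in-memory queue imitating reserve/acknowledge semantics."""

    def __init__(self):
        self._ready = deque()
        self._reserved = {}  # delivery tag -> (worker, message)
        self._next_tag = 0

    def publish(self, message):
        self._ready.append(message)

    def reserve(self, worker):
        """Hand the next ready message to a worker; it stays reserved until acked."""
        if not self._ready:
            return None
        message = self._ready.popleft()
        self._next_tag += 1
        self._reserved[self._next_tag] = (worker, message)
        return self._next_tag, message

    def ack(self, tag):
        """The worker finished its job: the message is removed for good."""
        del self._reserved[tag]

    def worker_down(self, worker):
        """A worker died before acknowledging: its messages become available again."""
        for tag, (owner, message) in reversed(list(self._reserved.items())):
            if owner == worker:
                del self._reserved[tag]
                self._ready.appendleft(message)

    def pending(self):
        """Number of messages not yet acknowledged."""
        return len(self._ready) + len(self._reserved)


queue = AckQueue()
queue.publish("ConversationStarted #1")
tag, message = queue.reserve("worker-a")
queue.worker_down("worker-a")            # crash before the acknowledgment
tag2, message2 = queue.reserve("worker-b")
queue.ack(tag2)                          # processed: now it can be forgotten
```

Nothing is lost in this scenario: the second worker receives the very message the first one had reserved, which is also why a message may be processed more than once.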

Note that this is the base topology. As the topology is created by the services themselves instead of a central configuration, they can extend it locally (i.e. on their side) for their own needs. That also means they must all agree on the base topology described above to join the bus.

Behaviors

The topology describes how the broker routes the messages and what guarantees it must provide.

The EDA also requires that the services implement some basic rules to ensure reliability and performance:

a message must be acknowledged once it is completely processed and never earlier;
the service must keep sent messages in cache until they are acknowledged by the broker, and resend them otherwise;
the service should automatically reconnect to the broker if it is disconnected, without losing its cache;
in case of unexpected failure when processing a message, the service should requeue it once before permanently rejecting it;
the service should catch unrecoverable errors (e.g. illegal messages, inconsistent data…), log them and acknowledge the faulty message so that it is neither requeued nor routed to the dead-letter exchange.
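
The last two rules can be summarized as a small decision function. This is a sketch under assumptions: `UnrecoverableError`, `Action` and `handle` are names made up for the example, and we assume the service knows whether the message was already requeued once (for instance through the redelivered flag that AMQP attaches to a delivery).

```python
import logging
from enum import Enum

logger = logging.getLogger("eda.worker")


class Action(Enum):
    ACK = "ack"
    REQUEUE = "requeue"
    REJECT = "reject"  # permanently refused, hence dead-lettered


class UnrecoverableError(Exception):
    """Hypothetical exception for illegal messages or inconsistent data."""


def handle(message, process, redelivered):
    """Run `process` on a message and decide what to tell the broker."""
    try:
        process(message)
    except UnrecoverableError as exc:
        # Retrying would not help: log it and acknowledge the faulty message.
        logger.error("dropping faulty message %r: %s", message, exc)
        return Action.ACK
    except Exception:
        # Unexpected failure: requeue once, then reject permanently.
        logger.exception("unexpected failure on %r", message)
        return Action.REJECT if redelivered else Action.REQUEUE
    # Acknowledge only once the message is completely processed.
    return Action.ACK


def process_ok(message):
    pass


def process_with_bug(message):
    raise RuntimeError("boom")


def process_illegal(message):
    raise UnrecoverableError("missing field")


print(handle("m1", process_ok, redelivered=False))
print(handle("m2", process_with_bug, redelivered=False))
print(handle("m2", process_with_bug, redelivered=True))
print(handle("m3", process_illegal, redelivered=False))
```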

Framework and tools

To avoid code duplication, we developed mini-frameworks (in Python, Elixir and Rust) that implement this base topology and all the required behaviors of the services. We’ll talk about them in the next post: Framework and tools!

An event driven architecture — part 2

2022-02-02T00:00:00+00:00

In the previous post of this series, we explained why we ditched our old architecture based on synchronous REST services for a completely asynchronous event-driven architecture.

Today, we address the core design principles that were crucial to the success of this undertaking.

Business Services and Data Processing Services

We make a distinction between Business Services and Data Processing Services (aka utility services) to cleanly separate business logic from data processing complexity.

Data Processing Services

Data Processing Services are expected to be pure, stateless services that provide some kind of algorithmic data processing (computations, transformations…). Moreover, they are also context free: they should not depend on business rules, assumptions or external data sources. Everything they need to do their processing must be in the message they receive. They should not have to query a third party to get more data. Data Processing Services are a kind of universal library and can even be provided by third parties.

Examples of Data-Processing services:

A speech to text service has many different applications. All it needs as inputs are audio and a language reference. It doesn’t need to persist any data.
An image thumbnailing service. It only takes an image and target dimensions as inputs. It has no side effects and may be used in many different businesses.

Business Services

Business Services implement the customers’ workflows and only focus on business rules and requirements to orchestrate and implement the value added to our customers’ audio and data. They make use of the Data Processing Services as a library for that. They are very specific to us: you would never want to externalize your business services.

Business Services persist the data they produce and are the single trusted source of truth for it.

Business Services build and maintain their own customer configuration from events on the bus.

Examples of Business Services:

At Allo-Media, we have a business service to tag incoming calls. It listens for call transcripts and publishes qualification tags. It knows about our customers’ needs and tags the calls accordingly. It persists the tags and is the single source of truth for them.
A shopping cart service for an online shop. For each online user, it maintains the state of their shopping cart by listening to UI events like ItemSelected, ItemRemoved or stock events like ItemStockLeft…

All services must be idempotent, that is, if they receive exactly the same message twice, they must behave identically and produce the same outputs.
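
One common way to obtain idempotence is to derive the output deterministically from the message and to key every write on an identifier carried by the message, so that a second delivery overwrites instead of duplicating. The sketch below illustrates this with a toy tagging service; the message fields, the `KEYWORDS` rules and the `CallTagged` event are hypothetical.

```python
KEYWORDS = {"refund", "invoice", "cancel"}  # hypothetical tagging rules


class TaggingService:
    """Toy business service that tags call transcripts."""

    def __init__(self):
        self.tags_by_call = {}  # the persisted state, keyed by call id

    def on_transcript(self, message):
        call_id = message["call_id"]
        words = set(message["transcript"].lower().split())
        tags = sorted(words & KEYWORDS)
        # Upsert keyed on the call id: replaying the message changes nothing.
        self.tags_by_call[call_id] = tags
        return {"type": "CallTagged", "call_id": call_id, "tags": tags}


service = TaggingService()
message = {"call_id": "c-1", "transcript": "I want a refund for this invoice"}
first = service.on_transcript(message)
second = service.on_transcript(message)  # same message delivered twice
print(first == second, service.tags_by_call)
```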

Events, Commands and Results

In the same way we have two different kinds of services, we have two different kinds of messages: Commands (and their Results) and Events.

Events

Events are business messages published on the bus by business services to tell the world what happened.

A business service owns the type of Events it emits. It knows nothing about the services that will process them. It subscribes to the types of Events it needs but knows nothing about their origins.

An Event type defines the meaning of the events of that type and their data schema. They must be documented.

The type of an actual event message is given by its name (also known as the routing key, because it is used by the subscription routing). The event type name must be in the form SubjectPastParticiple. For example, ConversationStarted, CustomerCreated, ShoppingCartValidated… If you’re not able to immediately give a name to your event type, it means it is not well defined, or that it is not an event. You may need to refine your service or split it, because you have probably not analyzed your value chain deeply enough.

Commands and results

Commands are utility messages consumed by Data Processing Services. Think of an order you place with a supplier: you don’t know who will fulfill it, nor how or when, but you’ll get what you ordered in your mailbox sometime later.

A Data Processing Service owns the types of the commands it consumes and their results. A command is always addressed to the logical service that owns it.

A Command type defines the meaning of the command, its data schema and its result data schema. It must be documented.

The type of an actual command message is given by its name. It is in the form VerbObject. For example: AnnotateText, TranscribeAudio… As commands are addressed to a particular logical service, the routing key of a command is in the form logical_service_name.CommandName. For example: asr.TranscribeAudio.

The command contains the return “address” to which the result is to be sent and a reference set by the sender that is returned as-is, along with the command outcome. That reference is called the correlation identifier and it is very important for the sender: as all communications are asynchronous, the service requesting the command needs a way to reconcile the received result with the initial request it made.

A Result is a message that is associated with, and specific to, a command. It contains the outcome of the processing (either the successful outcome or an error) and the correlation identifier. Result messages can’t exist without a previous command.
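
Here is a minimal sketch of how a sender may use the correlation identifier to reconcile a Result with the Command it issued. The classes and field names (`Command`, `Result`, `Requester`, `reply_to`…) are assumptions made for the illustration, not our actual message schema, and the data processing service is faked by a plain function.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    routing_key: str      # e.g. "asr.TranscribeAudio"
    reply_to: str         # the return "address" for the result
    correlation_id: str   # set by the sender, returned as-is
    payload: dict


@dataclass(frozen=True)
class Result:
    correlation_id: str
    payload: Optional[dict] = None
    error: Optional[str] = None


class Requester:
    """Remembers what each pending command was sent for."""

    def __init__(self, reply_to):
        self.reply_to = reply_to
        self.pending = {}  # correlation id -> context of the request

    def send(self, routing_key, payload, context):
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = context
        return Command(routing_key, self.reply_to, correlation_id, payload)

    def on_result(self, result):
        # The correlation id is the only link back to the initial request.
        context = self.pending.pop(result.correlation_id)
        return context, result


def fake_asr(command):
    """Stand-in for a data processing service: echoes the correlation id."""
    return Result(command.correlation_id, payload={"text": "hello"})


requester = Requester(reply_to="tagging-service")
command = requester.send("asr.TranscribeAudio", {"audio": "..."}, context="call c-1")
context, result = requester.on_result(fake_asr(command))
print(context, result.payload)
```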

Error results are expected and documented: they are “normal” errors, not bug reports. Exceptions caused by bugs must not produce an error result. In case of an unexpected error, the service requeues the input command to retry it once; if the second try raises an unexpected error again, the message is refused and goes into the dead letter queue for investigation. Exceptions are always logged.

Logs

We can also have logging messages to easily collect application logs.

All the messages that “cascade” from the same source event share a common identifier, called the conversation identifier, which has the following properties:

it is unique in time;
it is created by an Event (never by Commands) that is published for reasons external to the bus and not as a reaction to other Events; we call that Event the initial event.
any message (Event, Command or Result) created as a reaction to another message M takes and repeats the conversation ID of M as is.

All the messages that follow from a given initial event share the same conversation ID, and no other message does. That way, we can easily trace and debug the actual pipeline of each incoming call, for example.
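
The propagation rule is simple enough to be sketched in a few lines. The `Message` class and the two helper functions below are hypothetical, and we assume a random UUID is good enough to get an identifier that is unique in time.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    type: str
    conversation_id: str
    payload: dict = field(default_factory=dict)


def initial_event(type_name, payload):
    """Only an initial event creates a new conversation identifier."""
    return Message(type_name, str(uuid.uuid4()), payload)


def react_to(source, type_name, payload):
    """Any message created as a reaction repeats the conversation ID as is."""
    return Message(type_name, source.conversation_id, payload)


started = initial_event("ConversationStarted", {"call": "c-1"})
command = react_to(started, "TranscribeAudio", {"audio": "..."})
tagged = react_to(command, "CallTagged", {"tags": ["refund"]})
other = initial_event("ConversationStarted", {"call": "c-2"})

# Filtering on the conversation ID gives the trace of one incoming call.
trace = [m.type for m in (started, command, tagged, other)
         if m.conversation_id == started.conversation_id]
print(trace)
```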

Finally, the message schemas must be forward compatible:

a new version of a Message schema for an application can add fields but must not remove or redefine existing ones;
the implementation of a Message decoder must ignore unknown fields without crashing.
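
On the decoder side, ignoring unknown fields can look like the following sketch. It assumes JSON-encoded messages and a hypothetical `ConversationStarted` schema; the `channel` field plays the role of a field added by a newer version of the producer.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ConversationStarted:
    """Version of the schema known to this (older) consumer."""
    conversation_id: str
    customer: str


def decode(cls, raw):
    """Build `cls` from a JSON message, silently dropping unknown fields."""
    data = json.loads(raw)
    known = {f.name for f in fields(cls)}
    return cls(**{name: value for name, value in data.items() if name in known})


# A newer producer added a 'channel' field: the old consumer must not crash.
raw = '{"conversation_id": "42", "customer": "Acme", "channel": "phone"}'
event = decode(ConversationStarted, raw)
print(event)
```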

The detailed documentation of the actual messages must be kept up to date, in a place that developers can easily reach.

In the next post in this series, we’ll see how we implemented those principles and behaviors in the actual architecture.

An event driven architecture

2020-02-17T11:53:00+00:00

At Allo-Media, like many other businesses, our value chain looks like a pipeline: we collect conversations (mainly by phone) sent to us by our customers, we transcribe them, we tag the transcripts with named entities, we anonymize both the transcript and the audio, then we qualify the content with semantic tags, and finally we index them and provide a UI and an API to consult, search and analyze the conversations. All those steps are completed automatically by NLP and AI algorithms.

Such pipelines are well suited for service based architectures. If you need to add a new feature, you introduce a new service into the pipeline.

Our first take at it was based on REST services.

Unfortunately, as you can see on the diagram above, that approach had many drawbacks:

it introduces strong coupling between components, as almost every service has to know about the other related services, their addresses, their purposes, their APIs…;
load balancing requires ad hoc solutions (like for the Transcription Pool Manager with Celery);
high availability is tricky because of the synchronous communication: if the requested service is down, the caller has to implement complex “retry later” strategies or give up! And so on for each service;
upgrading or adding new services is a lot of work as it impacts other services and requires carefully coordinated releases. Plus, you have to provide them with IP and DNS addresses.

So one year later, as our activity grew and development accelerated, we quickly realized that we needed:

maximum service decoupling;
easy distribution;
no-brainer load balancing;
one to one, one to many, many to one, many to many asynchronous communications;
high availability: hot restart of services, transparent addition or removal of service instances (workers), resilience to (reasonable) downtime of some services;
support for heavy payloads (megabytes of mp3 audio);
no data loss, whatever happens.

Enter the event driven architecture

The best way to achieve those goals is to free your mind from the classical pipeline point of view and instead see the value chain as an ecosystem of business services, each focused on providing a specific value, reacting to events (inputs) and producing new events (outputs). This new metaphor has not only technical benefits, but also business and organizational ones. By reasoning in terms of business units of your value chain, it’s easier to identify the people involved, the business experts who are the references for the job, the exact value added by the service, etc.

Here is the diagram of our new architecture:

In this new architecture, events are precisely defined messages that stream on a message bus, and each logical service (implementing one such business service as explained above) subscribes to the events that are relevant to it, without needing any knowledge about what produced them and how. Services also push their own events on the bus, without caring about what consumes them.

That way, we completely decouple the services from each other, and the message broker running the message bus provides us with load balancing, distribution and high availability for free.

Now, the messages are the API, the only business and technical reference.

After much thinking and experimenting, we came up with core design principles that are very important for the success of such an event driven architecture, and after 4 months of production use, we are very glad we complied with them from the start. But that’s the subject of another blog post coming soon. Stay tuned!

ElasticSearch Percolator Use Case for Document Classification

2019-12-05T11:00:00+00:00

Currently at Allo-Media, we use Elasticsearch in its usual workflow: we create an index, store documents holding the metadata of our phone call audio transcripts, and then search through these documents with business criteria like: “Give me all phone calls from client Acme, where the customer speaks about the French strike”.

The percolator feature of Elasticsearch makes a reverse search possible. We store search queries as documents in their own index, and then we can percolate new call documents to retrieve the search queries that match them. One use case for the percolator is document classification.

For example, say that we want to tag with Check sent all documents mentioning that the user has already sent a bank check. We would have the following search query:

("I've sent" | "I've already sent") ("check")

So first, we need to create an index to store the search queries with the following mapping:

PUT /search-perco
{
  "mappings": {
    "_doc": {
      "properties": {
        "tag_uuid": {
          "type": "text"
        },
        "tag_name": {
          "type": "text"
        },
        "content": {
          "type": "text"
        },
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}

The tag_* fields are used for document classification.
The query field of type percolator is used to index the search query documents, storing a query DSL in JSON.
The content field is the field targeted by the stored queries; it is mapped here so that the percolated documents can be analyzed (preprocessed) before matching.

Once the index is created, we can now store our search query documents, like the following one:

PUT /search-perco/_doc/1?refresh
{
  "query": {
      "bool": {
          "must": [
              {
                  "simple_query_string": {
                      "query": "(\"I've sent\" | \"I've already sent\") (\"check\")",
                      "fields": [
                          "content"
                      ],
                      "default_operator": "and"
                  }
              }
          ]
      }
  },
  "tag_uuid": "2f86ad85-4c09-4ef3-bb6e-100d129018e9",
  "tag_name": "Check sent"
}
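
Since the search query itself contains double quotes, they have to be escaped inside the JSON body. When such documents are produced by a program, letting a JSON serializer do the escaping is less error prone than writing it by hand. Here is a small sketch in Python that only builds the request body (the helper name `stored_query` is ours, and sending the body to Elasticsearch is left out):

```python
import json


def stored_query(tag_uuid, tag_name, query_string):
    """Build the body of a search query document for the search-perco index."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "simple_query_string": {
                            "query": query_string,
                            "fields": ["content"],
                            "default_operator": "and",
                        }
                    }
                ]
            }
        },
        "tag_uuid": tag_uuid,
        "tag_name": tag_name,
    }


doc = stored_query(
    "2f86ad85-4c09-4ef3-bb6e-100d129018e9",
    "Check sent",
    '("I\'ve sent" | "I\'ve already sent") ("check")',
)
body = json.dumps(doc, indent=2)  # the inner double quotes come out as \"
print(body)
```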

And if we search through this index, we will retrieve our newly added search query document:

GET search-perco/_search
{
  "query": {"match_all": {}}
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "search-perco",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "query": ...
          "tag_uuid": "2f86ad85-4c09-4ef3-bb6e-100d129018e9",
          "tag_name": "Check sent",
        }
      }
    ]
  }
}

Now it’s time to percolate call documents via the percolate query:

GET /search-perco/_search
{
  "_source": [
    "tag_uuid",
    "tag_name"
  ],
  "query": {
    "percolate": {
      "field": "query",
      "documents": [{
        "unique_id": "2f86ad85-4c09-4ef3-bb6e-100d129018e7",
        "timestamp": "2018-01-02T18:13:30+00:00",
        "duration": 322,
        "transcribed": true,
        "client_name": "Acme",
        "content": "I've already sent to you a bank check last week..."
      }]
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Elasticsearch returns the following response:

{
  "took": 37,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "search-perco",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "tag_name": "Check sent",
          "tag_uuid": "2f86ad85-4c09-4ef3-bb6e-100d129018e9"
        },
        "fields": {
          "_percolator_document_slot": [
            0
          ]
        },
        "highlight": {
          "content": [
            "<em>I've already sent</em> to you a bank <em>check</em> last week..."
          ]
        }
      }
    ]
  }
}

So here we see that our call document matched the search query tagged Check sent. We can use the highlighter to highlight the terms that matched from the search query documents. The _percolator_document_slot field is useful when we send several documents in the documents field of the percolate query: it tells which of them matched. And max_score and _score give you the relevance score of the matched documents. If you don’t need the score, you can skip its computation by running the percolate query in a filter context.
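
As an illustration of the filter context, the sketch below builds the percolate request body in Python and, when scores are not wanted, wraps the percolate query in a constant_score filter, which is, as far as we know, the form suggested by the Elasticsearch documentation. The helper name `percolate_body` and its parameters are ours, and only the body is built: no request is sent.

```python
import json


def percolate_body(documents, scored=True, field="query"):
    """Build a percolate search body, optionally in a filter context."""
    percolate = {"percolate": {"field": field, "documents": documents}}
    if scored:
        query = percolate
    else:
        # Filter context: matching queries are returned without relevance scoring.
        query = {"constant_score": {"filter": percolate}}
    return {"_source": ["tag_uuid", "tag_name"], "query": query}


call = {
    "client_name": "Acme",
    "content": "I've already sent to you a bank check last week...",
}
scored = percolate_body([call])
unscored = percolate_body([call], scored=False)
print(json.dumps(unscored, indent=2))
```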

We can also percolate existing documents by providing the index where they are stored, and their ids:

GET /search-perco/_search
{
    "query" : {
        "percolate" : {
            "field": "query",
            "index" : "call-index",
            "id" : "2"
        }
    }
}

You should take care to optimize text analysis at percolate time, as suggested in the docs (Percolator optimization).


Stateful components in Elm

2019-07-16T09:00:00+00:00

It’s often claimed that Elm developers should avoid thinking of their views as stateful components. While this is indeed a general best design practice, sometimes you may want to make your views reusable (e.g. across pages or projects), and if they come with a state… you end up copying and pasting a lot of things.

We recently published elm-daterange-picker, a date range picker written in Elm. It was the perfect occasion to investigate what a reasonable API for a reusable stateful view component would look like.

Many component/widget-oriented Elm packages feature a rather raw Elm Architecture (TEA) API, directly exposing Model, Msg(..), init, update and view, so you can basically import what defines an actual application and embed it within your own application.

With these, you usually end up writing things like this:

import Counter


type alias Model =
    { counter : Counter.Model
    , value : Maybe Int
    }


type Msg
    = CounterMsg Counter.Msg


init : () -> ( Model, Cmd Msg )
init _ =
    ( { counter = Counter.init, value = Nothing }
    , Cmd.none
    )


update : Msg -> Model -> ( Model, Cmd Msg )
update msg model =
    case msg of
        CounterMsg counterMsg ->
            let
                ( newCounterModel, newCounterCommands ) =
                    Counter.update counterMsg model.counter
            in
            ( { model
                | counter = newCounterModel
                , value =
                    case counterMsg of
                        Counter.Apply value ->
                            Just value

                        _ ->
                            Nothing
              }
            , newCounterCommands |> Cmd.map CounterMsg
            )


view : Model -> Html Msg
view model =
    div []
        [ Counter.view model.counter
            |> Html.map CounterMsg
        , text (model.value |> Maybe.map String.fromInt |> Maybe.withDefault "")
        ]

This certainly works, but let’s be frank for a minute and admit this is super verbose and not very developer friendly:

You need to Cmd.map and Html.map here and there
You need to pattern match Counter.Msg to intercept whatever event interests you…
… meaning Counter exposes all Msgs, which are implementation details you now rely on.

There’s another way, which Evan explained in his now deprecated elm-sortable-table package. Among the many good points he makes, one idea struck me as brilliantly simple yet effective to simplify the API design of such stateful view components:

State updates can be managed right from event handlers!

Let’s imagine a simple counter; what if, when clicking the increment button, instead of calling onClick with some Increment message, we called a user-provided one with the new counter state updated accordingly?

-- Counter.elm
view : (Int -> msg) -> Int -> Html msg
view toMsg counter =
    button [ onClick (toMsg (counter + 1)) ]
        [ text "increment" ]

Or if you want to use an opaque type, which is an excellent idea for maintaining the smallest API surface area:

-- Counter.elm
type State
    = State Int

view : (State -> msg) -> State -> Html msg
view toMsg (State value) =
    button [ onClick (toMsg (State (value + 1))) ]
        [ text "increment" ]

Note that as we’re dealing with a counter state, we didn’t bother having anything else than a simple Int for representing it. But you could of course have a record or anything you want.

Handling internal state updates can then simply be a matter of creating internal, unexposed Msg and update functions:

-- Counter.elm
type State
    = State Int

type Msg
    = Dec
    | Inc

update : Msg -> Int -> Int
update msg value =
    case msg of
        Dec ->
            value - 1

        Inc ->
            value + 1

view : (State -> msg) -> State -> Html msg
view toMsg (State value) =
    div []
        [ button [ onClick (toMsg (State (update Dec value))) ]
            [ text "decrement" ]
        , button [ onClick (toMsg (State (update Inc value))) ]
            [ text "increment" ]
        ]

We should also expose helpers to retrieve (or set) values from the opaque State type:

-- Counter.elm
init : State
init =
    State 0  -- assumption: the counter starts at zero

getValue : State -> Int
getValue (State value) =
    value

So for instance, to use this Counter component in your own application, you just have to write this:

import Counter

type alias Model =
    { counter : Counter.State
    , value : Maybe Int
    }


type Msg
    = CounterChanged Counter.State


init : () -> ( Model, Cmd Msg )
init _ =
    ( { counter = Counter.init, value = Nothing }
    , Cmd.none
    )


update : Msg -> Model -> ( Model, Cmd Msg )
update msg model =
    case msg of
        CounterChanged state ->
            ( { model | counter = state, value = Just (Counter.getValue state) }
            , Cmd.none
            )


view : Model -> Html Msg
view model =
    div []
        [ Counter.view CounterChanged model.counter
        , text (model.value |> Maybe.map String.fromInt |> Maybe.withDefault "")
        ]

Notice how our update function is dramatically simpler to write and to understand. Also, there is no need to import (and rely on) a lot from the package module, which makes it easier both to consume and to maintain, thanks to the opaque State type encapsulating implementation details.

Of course a counter wouldn’t be worth creating a package for, though it may highlight the concept better. Don’t hesitate to read elm-daterange-picker’s source code and demo code to look at a real world application of this design principle.

Text2num version 1.0.0 released!

2018-10-02T11:53:00+00:00

The output of speech-to-text systems is entirely made of words, without punctuation or capitalization. This makes visual scanning for numbers quite cumbersome, especially in transcriptions of real life dialogues, as they also contain a lot of filler words (like « heu, ben, bah… ») and the syntax or grammar is not always correct. Besides that, text mining tools and techniques often, if not always, expect numbers to be in decimal digit representation.

That’s why we decided to add a transformation pass to our speech-to-text engine in order to convert all spoken numbers into their digit spelling.

Here at Allo-Media, we are fond of Open Source Software. So we first looked at the state of the art of Python libraries for parsing words into numbers. There was at least one for the English language, but we didn’t find any for French. Therefore, we decided to build our own library and contribute it back to the community.

We could have ported the Word2number library, but it has some flaws:

It is unable to detect by itself the bounds of a number expression;
its algorithm is weak (e.g. w2n.word_to_num('hundred five fifty') == 550);
French has some peculiarities like « quatre-vingt-dix-neuf » vs « nonante-neuf ».

So we started a linguistic parser from scratch that is able to identify numbers and correctly isolate contiguous ones in a sequence. Moreover, we wanted it to be able to parse different flavors of French (e.g. soixante-dix and septante for 70, etc.).

If you are interested in linguistics, septante for 70 and nonante for 90 are used in Belgium, Switzerland, Luxembourg, Aosta Valley, Jersey French and, to a lesser extent, in the French regions of Savoie, Franche-Comté and even sometimes in Lorraine and Provence (source: Wikipedia). The rest of the French speaking world uses soixante-dix and quatre-vingt-dix respectively. The usage area of huitante and octante instead of quatre-vingts is even more restricted.

As French spelling is a touchy topic, the parser is tolerant and accepts both the 1990 spelling reform and prior rules. It even has an optional relaxed mode that parses quatre vingt as quatre-vingt for cases where you prefer to use some punctuation or timing information to help disambiguate and compensate for a wobbly transcription.

Here are two samples of what you can expect from it:

>>> from text_to_num import text2num
>>> text2num('quatre-vingt-quinze')
95

>>> text2num('nonante-cinq')
95

>>> text2num('mille neuf cent quatre-vingt dix-neuf')
1999

>>> text2num("cinquante et un million cinq cent soixante dix-huit mille trois cent deux")
51578302

>>> text2num('cent cinq cinquante')
AssertionError

>>> from text_to_num import alpha2digit
>>> alpha2digit('cent cinq cinquante')
'105 50'

>>> sentence = (
...         "Huit cent quarante-deux pommes, vingt-cinq chiens, mille trois chevaux, "
...         "douze mille six cent quatre-vingt-dix-huit clous.\n"
...         "Quatre-vingt-quinze vaut nonante-cinq. On tolère l'absence de tirets avant les unités : "
...         "soixante seize vaut septante six.\n"
...         "Nombres en série : douze quinze zéro zéro quatre vingt cinquante-deux cent trois cinquante deux "
...         "trente et un.\n"
...         "Ordinaux: cinquième troisième vingt et unième centième mille deux cent trentième.\n"
...         "Décimaux: douze virgule quatre-vingt dix-neuf, cent vingt virgule zéro cinq ; "
...         "mais soixante zéro deux."
...     )
>>> print(alpha2digit(sentence))

842 pommes, 25 chiens, 1003 chevaux, 12698 clous.
95 vaut 95. On tolère l'absence de tirets avant les unités : 76 vaut 76.
Nombres en série : 12 15 004 20 52 103 52 31.
Ordinaux: 5ème 3ème 21ème 100ème 1230ème.
Décimaux: 12,99, 120,05 ; mais 60 02.

As you see, we support decimal numbers as well as ordinal numbers.

The algorithm is quite robust and is based on the observation that big numbers are structured like a sum of decreasing powers of thousand in the language, each power of thousand being multiplied by a number from 1 (maybe omitted) to 999. The problem is thus “reduced” to recognizing powers of thousands (mille, million, milliard) and being able to parse numbers from 1 to 999.

Example: trois millions cinq cent vingt-trois mille deux cent quarante -> 3 × 1 000 000 + 523 × 1000 + 240.

Parsing numbers between 1 and 999 is more difficult. The basic idea is that we expect between 0 and 9 hundreds, followed by a ten expression (vingt, trente, …) or none and some optional units (from 1 to 9) or extended units (from 1 to 19). The “hard” part is to detect illegal combinations and the end of the number.
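
To illustrate the decomposition into powers of one thousand, here is a deliberately naive sketch. It is not the code of the library: it knows only a small vocabulary, performs no detection of illegal combinations or of the end of a number (the "hard" part mentioned above), and does not check that the powers are decreasing. The function names are ours.

```python
UNITS = {
    "un": 1, "une": 1, "deux": 2, "trois": 3, "quatre": 4, "cinq": 5, "six": 6,
    "sept": 7, "huit": 8, "neuf": 9, "dix": 10, "onze": 11, "douze": 12,
    "treize": 13, "quatorze": 14, "quinze": 15, "seize": 16,
}
TENS = {"vingt": 20, "vingts": 20, "trente": 30, "quarante": 40,
        "cinquante": 50, "soixante": 60}
POWERS = {"mille": 10**3, "million": 10**6, "millions": 10**6,
          "milliard": 10**9, "milliards": 10**9}


def parse_below_1000(words):
    """Add up a list of words worth 0 to 999 (no legality checks)."""
    total = 0
    for word in words:
        if word == "et":
            continue  # as in "vingt et un"
        if word in ("cent", "cents"):
            total = max(total, 1) * 100  # "cent" alone means 1 x 100
        elif word in ("vingt", "vingts") and total % 100 == 4:
            total += 76  # "quatre-vingt": the 4 just read becomes 80
        elif word in TENS:
            total += TENS[word]
        elif word in UNITS:
            total += UNITS[word]
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total


def parse_number(text):
    """Sum the groups of words, each multiplied by its power of one thousand."""
    total, group = 0, []
    for word in text.lower().replace("-", " ").split():
        if word in POWERS:
            # An omitted multiplier means 1, as in "mille trois".
            multiplier = parse_below_1000(group) if group else 1
            total += multiplier * POWERS[word]
            group = []
        else:
            group.append(word)
    return total + parse_below_1000(group)


print(parse_number("trois millions cinq cent vingt-trois mille deux cent quarante"))
```

The real parser is much stricter than this sketch: for instance, it refuses 'cent cinq cinquante' as a single number, whereas the sketch would happily add the words up.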

As the needs arise, we may develop parsers for other languages on this base, including a robust English one with all the desired features.

The library is distributed under the MIT license.

If you are interested in more details or want to contribute, you can check the sources on GitHub and the contribution guide.

If you just want to use it, it’s just a pip install text2num away and the documentation is on ReadTheDocs.

Enjoy!

From Python to Go to Rust: an opinionated journey

2018-03-22T08:00:00+00:00

When looking for a new backend language, I naturally went from Python to the new cool kid: Go. But after only one week of Go, I realised that Go was only half a step forward. It was better suited to my needs than Python, but too far away from the developer experience I was enjoying when doing Elm in the frontend. So I gave Rust a try.

Away from Python,

For backend development, I’ve mainly been using Python 3 for the past three years. From admin scripts to machine learning to Flask/Django applications, I’ve done a lot of Python lately, but at some point, it didn’t feel right anymore. Well, to be honest, it’s not really at “some totally random point” that it started not to feel right anymore, it was when I started to enjoy programming with a strongly typed language: Elm.

I had the famous feeling “when it compiles it works”, and once you’ve experienced that, there is no way back. You try stuff, you follow the friendly compiler error messages, you fix things, and then tada, it works!

Ok so, at this point I knew what I wanted from the “perfect” backend language:

Static and strong typing
Most of the stuff checked at compile time (please, no exceptions!)
No null
No mutability
Handle concurrency nicely

I see you coming: “hey, this is Haskell”! Yeah indeed, but for whatever reason, I’ve never managed to get anything done with Haskell (and I’ve tried a lot). This is maybe only me, but to an outsider, the Haskell mindset seems elitist, the documentation and practical examples are lacking and it’s hardly accessible to a beginner. Learn you a Haskell for great good is awesome but very long to read and too abstract for me (you don’t build anything for real during the book).

“Hey, and what about Scala?!”. What do you mean by Scala? The better Java? The functional programming language with Scalaz? The object-oriented functional programming language that may or may not fail at runtime with a java.lang.NullPointerException and needs a 4GB JVM running? I tried it some years ago and definitely, this is a no go for me.

After discussing with a few people, I decided to give Go a try. It has a compiler, no exceptions, no null (but nil and zero values) and can handle concurrency nicely.

Into Go,

I decided to rewrite an internal project that was already done in Python using Go. Just to get a feeling of the differences between the two.

First feeling: learning Go was so easy. In one evening, I was able to compile a Proof Of Concept of the project with basic features developed and some tests written. This was a very pleasant feeling, I was adding features very fast. The compiler messages were helpful, everything was fine.

And at some point, the tragedy started. I needed to add a field to some struct, so I just modified the struct and was ready to analyze the compiler messages to know where this struct was used in order to add the field where it was needed.

I compiled the code and … no error message. Everything went fine. But?! I just added a field to a struct, the compiler should say that my code is not good anymore because I’m not initializing the value where it should be!

The problem is that not providing a value for a struct field is not a problem in Go. The field will default to its zero value and everything will compile. This was the show stopper for me. I realized that I couldn’t rely on the compiler to have my back when I was making mistakes. At this point, I was wondering: why should I bother learning Go if the compiler can’t do much better than Python and mypy? Of course concurrency is much better with Go, but the downside of not being able to rely on the compiler was too much for me.
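For comparison, here is a minimal Python sketch of the same refactoring, adding a field to a record type. The `User` class and its fields are hypothetical and not taken from the project. With a dataclass, a call site that omits a required field is rejected when the object is constructed (and a static checker such as mypy would normally flag the same call), whereas the equivalent Go struct literal quietly receives a zero value.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    email: str  # hypothetical field added during the refactoring


def try_old_call_site() -> str:
    """Simulate a call site written before `email` was added."""
    try:
        User(name="alice")  # the required `email` argument is missing
    except TypeError as exc:
        return f"rejected: {exc}"
    return "accepted"


outcome = try_old_call_site()
```

The difference is only about when the mistake surfaces: here it is caught at runtime (or earlier by a type checker), not silently defaulted.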

Don’t get me wrong, I still think that Go is progress compared to Python and I would definitely recommend people to learn Go instead of Python if they had to pick one of the two. But for my personal case, as someone who already knew Python and wanted something a lot safer, Go didn’t bring enough to the table in that specific domain.

Into Rust.

So Go was not an option anymore as I realized that what I really needed was a useful compiler: a compiler that should not rely on the fact that I know how to code (as it has been proven to be false a lot of times). That’s why I took a look at Rust.

Rust was not my first choice because it advertises itself as a “system language”, and I’m more of a web developer than a system one. But it had some very compelling selling points:

No null values but an Option type (checked at compile time)
No exceptions but a Result type (checked at compile time)
Variables are immutable by default
Designed with concurrency in mind
Memory safe by design, no garbage collector

I decided to rewrite the same program as the one I did in Python and Go. The onboarding was a lot harder than with Go. As I did with Go, I tried to go head first, but it was too hard: I needed some new concepts specific to Rust like ownership or lifetimes to understand the code I was seeing on StackOverflow. So I had no choice but to read the Rust Book, and it took me two weeks before I could start writing some code (remember that with Go it took me one evening).

But after this steep initial learning curve, I was enjoying writing Rust code, and I’m still enjoying it. With Rust, I don’t have to trust myself, I just have to follow the compiler and if I do so, it will most likely work if it compiles. In the end, this is the main feeling I was looking for when searching for a new backend language.

Of course, Rust has a lot of downsides:

It’s pretty new and things are moving very fast. I’m using futures-rs and hyper.rs in my project, and finding good documentation was really hard (kudos to the people on irc.mozilla.org#rust-beginners for the help).
It forces you to think of things you’re not used to when coming from more high-level languages: how memory is managed (with lifetimes and ownership).
Compiler messages are not always straightforward to understand, especially when you’re combining futures and their strange long types.
Mutability is allowed, so you can get smashed with side effects

But, it also has a lot of upsides:

It’s amazingly fast
Tooling is good (cargo, rustfmt)
Most of the things are checked at compile time
You can potentially do whatever you want with it, from a browser, to a web app, to some game.
Community is welcoming
It’s backed by Mozilla

Wrapping up

Go is cool but doesn’t provide enough type safety for me. I would rather stick with Python and its ecosystem than risk rewriting stuff in Go if I don’t need concurrency. If I need concurrency I would still not use Go, as its lack of type safety will surely come back to bite me at some point.

Rust is the perfect candidate for concurrency and safety, even if the futures-rs crate (this is what libraries are called in Rust) is still early stage. I suspect that Rust could become the de facto standard for a lot of backend needs in the future.

For a more in-depth blog post discussing the differences between Go and Rust, be sure to check this amazing post by Ralph Caraveo (@deckarep): Paradigms of Rust for the Go developer.

At the very least, I think that I’ve found in Rust my new favorite language for the backend.

Brace yourself data selection, industrialization is coming!

2018-03-22T08:00:00+00:00

Industrialization is one of the most challenging problems for a start-up like ours. The research world doesn’t have the same priorities concerning time and cost optimization, whereas industry is limited by these factors. This struck us when we thought about building language models (referred to as LMs in the following) massively, especially regarding data selection.

Historically, our LMs were crafted one by one with love, with a nice cup of human intervention in between. Meaning that we had to experiment to find the best system empirically. And this is so not compatible with automation.

What is data selection, you may ask? First things first.

ASR: Automatic Speech Recognition Prelude

In ASR, we usually consider building two independent modules which will be mixed together later on. Each module in itself is very dependent on the language we want to recognize.

The first one is called the acoustic model. It represents the way of speaking: which sounds can be put together to form a word or a sentence. In fact, we represent the language as a series of phonemes. If we look in Wiktionary it’s clearer, isn’t it? Don’t confuse phonemes with syllables, though. Let’s take an example: the word ‘through’ (1 syllable) consists of three sounds, three phonemes: ‘th’ ‘r’ ‘oo’. So the goal is to model the sounds as a sequence of phonemes.
The second one is the language model, which we want to build automatically. It models the distribution of words for a given language. Thanks to those probabilities, the LM helps in picking the best correspondence between a sequence of phonemes and the words/sentences.

In order to build these modules, we need data, and the more we have and the more relevant they are, the better! That’s why we need data selection: we need a means to retrieve relevant data adapted to the context of the recognition. In fact a lawyer and a baker don’t speak the same language: they don’t use the same lexicon. Data selection is picking the right data that match the domain from millions of examples, using various automatic algorithms.

How we used to do it

As we discussed above, data selection is a very important step when building a system. Like many, we used the Moore-Lewis method, which was also adapted for bilingual use (like translation) by Axelrod et al. in Domain Adaptation via Pseudo In-Domain Data Selection. These are very effective ways to select data using two corpora (in-domain and out-of-domain) by comparing cross-entropies. In-domain means that the corpus contains specific data that are relevant to the context, the domain of recognition, as explained before with the lawyer/baker example. Out-of-domain, on the other hand, is just a pool of random data, meaning there is both relevant and irrelevant data in it! As for cross-entropy, it’s a measure that helps choose data that is well matched to the desired output. Starting from some relevant segments, we compare each segment in the pool of data to retrieve the ones closest to the initial data.
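To make the cross-entropy comparison concrete, here is a toy Python sketch of the cross-entropy difference used by Moore-Lewis style selection. It is only an illustration under strong assumptions: real systems use proper n-gram language models, while this sketch uses add-one smoothed unigram models, and the two tiny corpora and all function names are invented for the example. Each pool sentence is scored by its per-word cross-entropy under the in-domain model minus its cross-entropy under the pool model; lower scores mean "looks like the domain, unlike the general pool".

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words; returns (counts, total number of tokens)."""
    counts = Counter(word for s in sentences for word in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy (in bits) of a sentence under a unigram model."""
    counts, total = model
    words = sentence.split()
    # Add-one smoothing so unseen words do not get a zero probability.
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / len(words)


def cross_entropy_difference(sentence, in_model, out_model, vocab_size):
    """Lower is better: likely in-domain, unlikely in the general pool."""
    return (cross_entropy(sentence, in_model, vocab_size)
            - cross_entropy(sentence, out_model, vocab_size))


# Toy corpora (invented for illustration): a "lawyer" domain and a mixed pool.
in_domain = ["the court heard the appeal", "the judge dismissed the appeal"]
pool = [
    "the judge heard the case",
    "knead the dough and bake the bread",
    "the bread is in the oven",
]

vocab_size = len({w for s in in_domain + pool for w in s.split()})
in_model = train_unigram(in_domain)
out_model = train_unigram(pool)

# Best candidates first; a threshold on the score still has to be chosen by hand.
ranked = sorted(
    pool,
    key=lambda s: cross_entropy_difference(s, in_model, out_model, vocab_size),
)
```

Note that the ranking alone does not say where to cut the list: the cut-off has to be tuned empirically.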

However, selecting with cross-entropy is not really scalable: the algorithm can’t decide on its own when to stop, and it has an annoying tendency to promote short and very short sentences, meaning the resulting corpus isn’t really representative of conversations. Moreover, something hit us hard: this paper turned eight this year and we had never looked for another method. So we asked ourselves: has any new work been done in data selection since this paper? And is there any relevant work ready for a more industrialized turn?

Searching…. Finding!

After browsing 178 papers citing the Moore-Lewis one, a title caught our eye: Cynical Selection of Language Model Training Data. The name was so catchy, we had to explore it. It was written by Amittai Axelrod (remember, we mentioned him above), and we decided to give it a shot because the paper was full of good promises … and seemed compatible with industrialization! Unlike the previous methods, the algorithm stops by itself when it has the (supposed) optimal selection, letting us continue our road toward automation.

How does it work? How did we make it work?

The goal is to select data from our out-of-domain corpora that can extend our in-domain data. Suppose you have a small in-domain corpus that you are a hundred percent sure is representative. The algorithm takes this corpus and a more generic one, where you don’t know what’s relevant and what isn’t. It then selects the sentences that match the specific one, using an implementation of Axelrod’s paper cited above. The script accepts arguments that are detailed in its header. It only requires the two corpora to work:

./cynical-selection.py --task inDomainFile.txt --unadapted outDomainFile.txt

and returns a list of sentences along with their scores in a ‘.jaded’ file, where each line holds the following fields, in order:

model score
sentence score (penalty + gain)
length penalty
sentence gain
sentence id (in the selection)
sentence id (in the unadapted corpus)
best word
word gain
sentence

For example:

2.659289425334946 2.659289425334946 5.71042701737487 -3.0511375920399235 1 1 vous -0.12597986190092164 merci à vous tous

5.318578850669892 2.659289425334946 5.71042701737487 -3.0511375920399235 2 26978 vous -0.12597986190092164 et vous avez maintenant

7.9778682760048385 2.659289425334946 5.71042701737487 -3.0511375920399235 3 26979 vous -0.12597986190092164 puisque vous avez des
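As an illustration of that layout, here is a small Python sketch that parses one ‘.jaded’ line. It assumes the fields are separated by whitespace and that the best word is a single token, which matches the sample lines above but is not guaranteed by the script; the `JadedLine` type and `parse_jaded_line` function are hypothetical names, not part of the script.

```python
import math
from typing import NamedTuple


class JadedLine(NamedTuple):
    model_score: float
    sentence_score: float
    length_penalty: float
    sentence_gain: float
    selection_id: int
    unadapted_id: int
    best_word: str
    word_gain: float
    sentence: str


def parse_jaded_line(line: str) -> JadedLine:
    # Split on whitespace at most 8 times so the sentence keeps its spaces.
    parts = line.strip().split(None, 8)
    return JadedLine(
        float(parts[0]), float(parts[1]), float(parts[2]), float(parts[3]),
        int(parts[4]), int(parts[5]), parts[6], float(parts[7]), parts[8],
    )


example = (
    "5.318578850669892 2.659289425334946 5.71042701737487 "
    "-3.0511375920399235 2 26978 vous -0.12597986190092164 "
    "et vous avez maintenant"
)
row = parse_jaded_line(example)

# The sentence score is the length penalty plus the sentence gain.
score_matches = math.isclose(
    row.sentence_score, row.length_penalty + row.sentence_gain
)
```

In the sample lines, the model score also appears to grow by the sentence score at each step, which is a handy sanity check when reading the file.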

In the end, we didn’t lose any performance using this method; we even gained accuracy most of the time. But the important part is that it allowed us to automate this step, taking us one step closer to industrialization.

To conclude

This method allows us to focus on other parts of our systems, making us more productive and more serene about building language models. So it’s a success, captain!

Chaining HTTP requests in Elm

2018-02-05T08:00:00+00:00

Preliminary note: in this article we’ll use Elm decoders, tasks, results and leverage the Elm Architecture. If you’re not comfortable with these concepts, you may want to check their respective documentation.

Sometimes in Elm you struggle with the most basic things.

Especially when you come from a JavaScript background, where chaining HTTP requests is relatively easy thanks to Promises. Here’s a real-world example leveraging the Github public API, where we fetch a list of Github events, pick the first one and query some user information from its unique identifier.

The first request uses the https://api.github.com/events endpoint, and the retrieved JSON looks like this:

[
    {
        "id": "987654321",
        "type": "ForkEvent",
        "actor": {
            "id": 1234567,
            "login": "foobar"
        }
    }
]

I’m purposely omitting a lot of other properties from the records here, for brevity.

The second request we need to do is on the https://api.github.com/users/{login} endpoint, and its body looks like this:

{
    "id": 1234567,
    "login": "foobar",
    "name": "Foo Bar"
}

Again, I’m just displaying a few fields from the actual JSON body here.

So we basically want:

from a list of events, to pick the first one if any,
then pick its actor.login property,
query the user details endpoint using this value,
extract the user real name for that account.

Using JavaScript, that would look like this:

fetch("https://api.github.com/events")
    .then(responseA => {
        return responseA.json()
    })
    .then(events => {
        if (events.length == 0) {
            throw "No events."
        }
        const { actor : { login } } = events[0]
        return fetch(`https://api.github.com/users/${login}`)
    })
    .then(responseB => {
        return responseB.json()
    })
    .then(user => {
        if (!user.name) {
            console.log("unspecified")
        } else {
            console.log(user.name)
        }
    })
    .catch(err => {
        console.error(err)
    })

It would get a little fancier using async/await:

try {
    const responseA = await fetch("https://api.github.com/events")
    const events = await responseA.json()
    if (events.length == 0) {
        throw "No events."
    }
    const { actor: { login } } = events[0]
    const responseB = await fetch(`https://api.github.com/users/${login}`)
    const user = await responseB.json()
    if (!user.name) {
        console.log("unspecified")
    } else {
        console.log(user.name)
    }
} catch (err) {
    console.error(err)
}

This is already complicated code to read and understand, and it’s tricky to do using Elm as well. Let’s see how to achieve the same, understanding exactly what we’re doing (we’ve all blindly copied and pasted code in the past, don’t deny it).

First, let’s write the two requests we need; one for fetching the list of events, the second to obtain a given user’s details from her login:

import Http
import Json.Decode as Decode

eventsRequest : Http.Request (List String)
eventsRequest =
    Http.get "https://api.github.com/events"
        (Decode.list (Decode.at [ "actor", "login" ] Decode.string))

nameRequest : String -> Http.Request String
nameRequest login =
    Http.get ("https://api.github.com/users/" ++ login)
        (Decode.at [ "name" ]
            (Decode.oneOf
                [ Decode.string
                , Decode.null "unspecified"
                ]
            )
        )

These two functions return Http.Request with the type of data they’ll retrieve and decode from the JSON body of their respective responses. nameRequest handles the case where Github users haven’t entered their full name yet, so the name field might be null; as with the JavaScript version, we then default to "unspecified".

That’s good but now we need to execute and chain these two requests, the second one depending on the result of the first one, where we retrieve the actor.login value of the event object.

Elm is a pure language, meaning you can’t have side effects in your functions (a side effect is when functions alter things outside of their scope and use these things: an HTTP request is a huge side effect). So your functions must return something that represents a given side effect, instead of executing it within the function scope itself. The Elm runtime will be in charge of actually performing the side effect, using a Command.

In Elm, you’re usually going to use a Task to describe side effects. Tasks may succeed or fail (like Promises do in JavaScript), but they need to be turned into an Elm command to be actually executed.

To quote this excellent post on Tasks:

I find it helpful to think of tasks as if they were shopping lists. A shopping list contains detailed instructions of what should be fetched from the grocery store, but that doesn’t mean the shopping is done. I need to use the list while at the grocery store in order to get an end result

But why do we need to convert a Task into a command you may ask? Because a command can execute a single thing at a time, so if you need to execute multiple side effects at once, you’ll need a single task that represents all these side effects.

So basically:

We first craft Http.Requests,
We turn them into Tasks we can chain,
We turn the resulting Task into a command,
This command is executed by the runtime, and we get a result

The Http package provides Http.toTask to map an Http.Request into a Task. Let’s use that here:

fetchEvents : Task Http.Error (List String)
fetchEvents =
    eventsRequest |> Http.toTask

fetchName : String -> Task Http.Error String
fetchName login =
    nameRequest login |> Http.toTask

I created these two simple functions mostly to focus on their return types; a Task must define an error type and a result type. For example, fetchEvents being an HTTP task, it will receive an Http.Error when the task fails, and a list of strings when the task succeeds.

But dealing with HTTP errors in a granular way being out of scope of this blog post, and in order to keep things as simple and concise as possible, I’m gonna use Task.mapError to turn complex HTTP errors into their string representations:

toHttpTask : Http.Request a -> Task String a
toHttpTask request =
    request
        |> Http.toTask
        |> Task.mapError toString

fetchEvents : Task String (List String)
fetchEvents =
    toHttpTask eventsRequest

fetchName : String -> Task String String
fetchName login =
    toHttpTask (nameRequest login)

Here, toHttpTask is a helper turning an Http.Request into a Task, transforming the Http.Error complex type into a serialized, purely textual version of it: a String.

We’ll also need a function to extract the very first element of a list, if any, as we did in JavaScript using events[0]. Such a function is built into the List core module as List.head. And let’s make this function a Task too, as that will ease chaining everything together and allow us to expose an error message when the list is empty:

pickFirst : List String -> Task String String
pickFirst logins =
    case List.head logins of
        Just login ->
            Task.succeed login

        Nothing ->
            Task.fail "No events."

Note the use of Task.succeed and Task.fail, which are approximately the Elm equivalents of Promise.resolve and Promise.reject: this is how you create tasks that succeed or fail immediately.

So in order to chain all the pieces we have so far, we obviously need glue. And this glue is the Task.andThen function, which can chain our tasks this fancy way:

fetchEvents
    |> Task.andThen pickFirst
    |> Task.andThen fetchName

Neat. But wait. As we mentioned previously, Tasks are descriptions of side effects, not their actual execution. The Task.attempt function will help us do that, by turning a Task into a Command, provided we define a Msg that will be responsible for dealing with the received result:

type Msg
    = Name (Result String String)

Result String String reflects the result of the HTTP request and shares the same type definitions for both the error (a String) and the value (the user full name, a String too). Let’s use this Msg with Task.attempt:

fetchEvents
    |> Task.andThen pickFirst
    |> Task.andThen fetchName
    |> Task.attempt Name

Here:

We start by fetching all the events,
Then if the Task succeeds, we pick the first event,
Then if we have one, we fetch the event’s user full name,
And we map the future result of this task to the Name message.

The cool thing here is that if anything fails along the chain, the chain stops and the error will be propagated down to the Name handler. No need to check errors for each operation! Yes, that looks a lot like how JavaScript Promises’ .catch works.
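For readers more at home in Python, the short-circuiting behaviour of the chain can be sketched with a tiny Result-like type. This is only an analogy under stated assumptions: `Ok`, `Err` and `and_then` are hypothetical helpers written for this example, `fetch_name` is a stand-in that performs no HTTP request, and the sketch runs eagerly, so it models how errors skip the remaining steps but not the deferred execution of Elm Tasks.

```python
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass
class Ok:
    value: Any


@dataclass
class Err:
    message: str


Result = Union[Ok, Err]


def and_then(step: Callable[[Any], Result], result: Result) -> Result:
    """Run `step` on a success; pass a failure through untouched."""
    if isinstance(result, Err):
        return result
    return step(result.value)


def pick_first(logins: list) -> Result:
    return Ok(logins[0]) if logins else Err("No events.")


def fetch_name(login: str) -> Result:
    # Stand-in for the second HTTP request: no network call is made here.
    return Ok(f"name of {login}")


def run(logins: list) -> Result:
    # Same shape as: fetchEvents |> andThen pickFirst |> andThen fetchName
    return and_then(fetch_name, and_then(pick_first, Ok(logins)))


success = run(["foobar", "other"])
failure = run([])
```

When the list is empty, `fetch_name` is never called: the `Err` produced by `pick_first` travels straight to the end of the chain, which is the behaviour described above.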

Now, how are we going to execute the resulting command and process the result? We need to set up the Elm Architecture and its good old update function:

module Main exposing (main)

import Html exposing (..)
import Http
import Json.Decode as Decode
import Task exposing (Task)


type alias Model =
    { name : Maybe String
    , error : String
    }

type Msg
    = Name (Result String String)

eventsRequest : Http.Request (List String)
eventsRequest =
    Http.get "https://api.github.com/events"
        (Decode.list (Decode.at [ "actor", "login" ] Decode.string))

nameRequest : String -> Http.Request String
nameRequest login =
    Http.get ("https://api.github.com/users/" ++ login)
        (Decode.at [ "name" ]
            (Decode.oneOf
                [ Decode.string
                , Decode.null "unspecified"
                ]
            )
        )

toHttpTask : Http.Request a -> Task String a
toHttpTask request =
    request
        |> Http.toTask
        |> Task.mapError toString

fetchEvents : Task String (List String)
fetchEvents =
    toHttpTask eventsRequest

fetchName : String -> Task String String
fetchName login =
    toHttpTask (nameRequest login)

pickFirst : List String -> Task String String
pickFirst events =
    case List.head events of
        Just event ->
            Task.succeed event

        Nothing ->
            Task.fail "No events."

init : ( Model, Cmd Msg )
init =
    { name = Nothing, error = "" }
        ! [ fetchEvents
                |> Task.andThen pickFirst
                |> Task.andThen fetchName
                |> Task.attempt Name
          ]

update : Msg -> Model -> ( Model, Cmd Msg )
update msg model =
    case msg of
        Name (Ok name) ->
            { model | name = Just name } ! []

        Name (Err error) ->
            { model | error = error } ! []

view : Model -> Html Msg
view model =
    div []
        [ if model.error /= "" then
            div []
                [ h4 [] [ text "Error encountered" ]
                , pre [] [ text model.error ]
                ]
          else
            text ""
        , p [] [ text <| Maybe.withDefault "Fetching..." model.name ]
        ]

main =
    Html.program
        { init = init
        , update = update
        , subscriptions = always Sub.none
        , view = view
        }

That’s for sure more code than with the JavaScript example, but don’t forget that the Elm version renders HTML, not just logs in the console, and that the JavaScript code could be refactored to look a lot like the Elm version. Also the Elm version is fully typed and safeguarded against unforeseen problems, which makes a huge difference when your application grows.

As always, an Ellie is publicly available so you can play around with the code.

Simple disk encryption tutorial with archlinux

2018-02-01T07:00:00+00:00

We all love archlinux, or if we don’t, we’re using Fedora or Debian, and trolling is (almost) out of the scope of this article.

But let’s be honest, even if the wiki is great, it can be intimidating sometimes. That’s what happened to me yesterday. Here at AlloMedia, for security reasons, we’re encrypting every laptop disk by default. As I’m using archlinux, I went to the wiki to follow how to “just” encrypt my disk. And well, the page is a little bit overcrowded, at the very least.

You first have to read about 10 pages of documentation, to learn that you now have to choose between 6 methods (Loop-AES, dm-crypt +/- LUKS, Truecrypt, eCryptfs, EncFS) and read every *#! page to understand which one you may want to choose. I’ve chosen for you.

LVM on LUKS

This is shipped with the kernel and seems to be the “default” on other distributions. It totally fits my needs: encrypt the whole system, swap included, and decrypt the system on boot using a passphrase.

If that’s what you want to do too, follow the white rabbit, Neo.

Following the rabbit

We will assume that you can erase your disk and start with a fresh install, if it’s not the case, this article may not be for you. For the sake of this article, we will use /dev/nvme0n1 as the main disk of the laptop. You may have something different like /dev/sda, that’s fine, just replace /dev/nvme0n1 by /dev/sda in the rest of the article.

First, follow the Archlinux installation guide to the point just before Format the partitions, where they are telling you to modify the partition tables using fdisk or parted. Here, you will need to erase all your partitions and create what’s needed for the encryption.

Clean and safely erase your disk

First, use fdisk or gdisk (if you’re using UEFI) to wipe out what’s on your disk, i.e. removing all existing partitions (of course, this will delete all the data on your disk…).

For example, for gdisk:

gdisk /dev/nvme0n1

GPT fdisk (gdisk) version 1.0.3

Partition table scan:
    MBR: protective
    BSD: not present
    APM: not present
    GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help):

Use p to print your partition schema, and d to delete partitions. Once it’s done, use w to write your changes to the disk (that is to say, again, deleting all the data on your disk) and quit gdisk.

Every page on the archlinux wiki says you should first be sure that no previous data will still be readable on your disk (if you have a new computer with nothing on it, this doesn’t apply to you).

So we will put random stuff on our disk to be sure to overwrite everything that may still be on it. You can read the wiki page or just run the following command:

dd if=/dev/urandom of=/dev/nvme0n1 bs=4M status=progress

Partitioning

We now have a clean disk, let’s create what’s needed for our encrypted system, that is to say 2 partitions: a partition for /boot (that will not be encrypted) and another one for our encrypted volumes (where we will later put / and our swap).

Here is what we want to have (output of my gdisk with the p command):

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         1050623   512.0 MiB   EF00  EFI System
   2         1050624      1000215182   476.4 GiB   8E00  Linux LVM
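The sizes in this table can be cross-checked from the sector numbers. The following Python sketch assumes 512-byte logical sectors, which is common but not universal, so check what gdisk reports for your own disk; `partition_bytes` is a hypothetical helper written for this example.

```python
SECTOR_BYTES = 512  # assumed logical sector size; verify it for your disk


def partition_bytes(start_sector: int, end_sector: int) -> int:
    # Both bounds are inclusive in gdisk's output, hence the +1.
    return (end_sector - start_sector + 1) * SECTOR_BYTES


# Partition 1 (EFI System) and partition 2 (Linux LVM) from the table above.
efi_mib = partition_bytes(2048, 1050623) / 2**20
lvm_gib = partition_bytes(1050624, 1000215182) / 2**30
```

Under that assumption, the first partition comes out at exactly 512 MiB and the second at about 476.4 GiB, matching the Size column.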

First, create the partition where /boot will be mounted, of type EF00 as in the table above (512 MiB is a good size), following the archlinux wiki. I’m assuming you’re using a system compatible with UEFI; if that’s not the case, you may want to read up a little more on the wiki. Format the partition using FAT32.

mkfs.fat -F32 /dev/nvme0n1p1

Create the other partition of code 8E00 using the remaining space.

You should now have only 2 partitions, one for /boot that will not be encrypted, and another one that you will first encrypt, and then put your volumes on (/ and swap). In my case, the first partition that will be used for /boot is named /dev/nvme0n1p1, and the other one /dev/nvme0n1p2. You may have something like /dev/sda1 and /dev/sda2 if your partition naming scheme is not the same as mine.

You can then follow the LVM on LUKS section of the wiki (https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#LVM_on_LUKS).

I don’t like having separate partitions for / and /home. Every time I’ve done that, I always regretted the amount of space I allocated for each. So now, I’m only creating one / partition with everything inside.

In short, below are the commands you should be running for your encrypted volumes (I’m creating an 8 GiB swap partition).

Encrypt the partition and open it with your passphrase:

cryptsetup luksFormat --type luks2 /dev/nvme0n1p2
cryptsetup open /dev/nvme0n1p2 cryptolvm

Create the LVM volumes on it (swap and root):

pvcreate /dev/mapper/cryptolvm
vgcreate MyVol /dev/mapper/cryptolvm
lvcreate -L 8G MyVol -n swap
lvcreate -l 100%FREE MyVol -n root

Format the root and swap volumes:

mkfs.ext4 /dev/mapper/MyVol-root
mkswap /dev/mapper/MyVol-swap

Mount the file systems:

mount /dev/mapper/MyVol-root /mnt
swapon /dev/mapper/MyVol-swap

The arch wiki tells you to format your boot partition using ext2, but for me this was a bad idea, as I want the UEFI manager of my Dell XPS 9550 to be able to boot on my /boot partition. So, as I said above, I formatted this partition using FAT32.

Mount the /boot partition:

mkdir /mnt/boot
mount /dev/nvme0n1p1 /mnt/boot

You can then follow the mkinitcpio part of the archlinux wiki (https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Configuring_mkinitcpio_2).

Be sure to have something like this in your mkinitcpio.conf file:

HOOKS=(... keyboard keymap block encrypt lvm2 ... filesystems ...)

Then continue to install your system normally. Of course, be sure to configure GRUB according to your encrypted setup by following the wiki.

For the record, here is my /etc/default/grub file (it’s used to generate the /boot/grub/grub.cfg file by using grub-mkconfig -o /boot/grub/grub.cfg):

# GRUB boot loader configuration

GRUB_DEFAULT=0
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="Arch"
GRUB_CMDLINE_LINUX_DEFAULT="resume=/dev/mapper/MyVol-swap nouveau.modeset=0 i915.preliminary_hw_support=1 acpi_backlight=vendor acpi_osi=Linux"
#GRUB_CMDLINE_LINUX_DEFAULT=""
#GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX="cryptdevice=/dev/nvme0n1p2:cryptolvm"

GRUB_ENABLE_CRYPTODISK=y

# Preload both GPT and MBR modules so that they are not missed
GRUB_PRELOAD_MODULES="part_gpt part_msdos"

# Uncomment to enable booting from LUKS encrypted devices
#GRUB_ENABLE_CRYPTODISK=y

# Uncomment to enable Hidden Menu, and optionally hide the timeout count
#GRUB_HIDDEN_TIMEOUT=5
#GRUB_HIDDEN_TIMEOUT_QUIET=true

# Uncomment to use basic console
GRUB_TERMINAL_INPUT=console

# Uncomment to disable graphical terminal
#GRUB_TERMINAL_OUTPUT=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
GRUB_GFXMODE=auto

# Uncomment to allow the kernel use the same resolution used by grub
GRUB_GFXPAYLOAD_LINUX=keep

# Uncomment if you want GRUB to pass to the Linux kernel the old parameter
# format "root=/dev/xxx" instead of "root=/dev/disk/by-uuid/xxx"
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
GRUB_DISABLE_RECOVERY=true

# Uncomment and set to the desired menu colors.  Used by normal and wallpaper
# modes only.  Entries specified as foreground/background.
#GRUB_COLOR_NORMAL="light-blue/black"
#GRUB_COLOR_HIGHLIGHT="light-cyan/blue"

# Uncomment one of them for the gfx desired, a image background or a gfxtheme
#GRUB_BACKGROUND="/path/to/wallpaper"
#GRUB_THEME="/path/to/gfxtheme"

# Uncomment to get a beep at GRUB start
#GRUB_INIT_TUNE="480 440 1"

# Uncomment to make GRUB remember the last selection. This requires to
# set 'GRUB_DEFAULT=saved' above.
#GRUB_SAVEDEFAULT="true"

Enjoy your encrypted archlinux!