Over the past five years I’ve helped architect and develop Ansible automation for several large software projects. These projects have required the ability to deploy large suites of software, written in multiple languages, across a diverse OS baseline, and Ansible has been up to the task for all of them. However, I’ve also run into my fair share of snags when deploying Ansible at scale, and learned a few things that might save you a few late nights on your project.

General Best Practices

01. Minimize the number of variable assignment locations.

A lot of Ansible’s flexibility comes from being able to define multiple sets of variables in your configuration and overriding them based on node type. However, with great power comes great responsibility, and I have found that there are three locations that I use the most out of all of the possible options.

  • Role Defaults - Used for defining role interfaces as well as providing a sane starting point for your code. If you intend a value to be passed into a role you should define some sort of default for it and if necessary use the ansible.builtin.assert module to check for correct values.
  • Inventory Group Variables - The various files in group_vars are where most of your actual project and environment configuration is located. A best practice here is to try to define your variables at the most general group that does not require you to override them again later. This way when you need to change the value you only need to do it once and it propagates to all of your sub-groups. Also, remember that you can define group variables as a single YAML file with the same name as the group or as a directory named after the group within the group_vars directory. When using the sub-directory layout, Ansible will load any YAML files it finds in that directory as part of the same group configuration.
  • --extra-vars Parameters - These variables have the highest precedence but only apply to the current execution of the playbook, so I use them for passing control flags. It is best to write your roles and playbooks so that they define the desired final state of your hosts, and to use control flags passed on the CLI to handle any intermediate states you have to move through first.
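
As a rough sketch of the control-flag idea, the play below describes the final state and only performs an intermediate clean-up step when a flag is passed on the command line. The flag name foo_force_reinstall, the package name foo, and the path /opt/foo are hypothetical, chosen purely for illustration.

```yaml
# playbook.yml (hypothetical): the play defines the desired final state, and
# an optional CLI flag triggers a one-off intermediate step.
- name: Configure the foo service.
  hosts: all
  tasks:
    - name: Remove the old install first, only when explicitly requested.
      ansible.builtin.file:
        path: /opt/foo
        state: absent
      # "| bool" converts the string "true" passed on the CLI to a boolean.
      when: foo_force_reinstall | default(false) | bool

    - name: Ensure foo is installed.
      ansible.builtin.package:
        name: foo
        state: present
```

A run that needs the intermediate step could then be started with ansible-playbook playbook.yml --extra-vars "foo_force_reinstall=true", while a normal run simply omits the flag.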

02. Define values in a single location.

In a similar vein to placing group_vars where they won’t need to be overridden, it is best practice to have a single source of truth for each piece of information and to reduce the levels of indirection the user has to go through to get to a concrete value. For example, it is far better to define your DNS servers like this:

# group_vars/all.yml
dns_servers:
  - 8.8.8.8
  - 8.8.4.4

# role_a/defaults/main.yml
role_a_dns_servers: "{{ dns_servers | join(' ') }}"

# role_b/defaults/main.yml
role_b_dns_servers: "{{ dns_servers | join(',') }}"

Than it is to define them like this:

# group_vars/all.yml
dns_servers:
  - 8.8.8.8
  - 8.8.4.4
role_a_dns_servers: 8.8.8.8 8.8.4.4
role_b_dns_servers: 8.8.8.8,8.8.4.4

This usually means you need to consider more what the variable is (a single value, a list of values, a group of values that only make sense as a group) and pick an appropriate YAML construct accordingly. Do not necessarily define your variables based on their most common representation in the final configuration. You have an entire templating language handy to render the values in their final form, but that same system is far less flexible with massaging data from one format into another.

03. Reduce instances of variable renaming and indirection.

To reiterate, variable indirection and renaming can make debugging very difficult. Only rename a variable when you need to modify a global or shared value to be acceptable to your role (as in the DNS example above), or when you want a shared variable to be the default value for your role while still letting users override it. Think of it like default function parameters: if it’s something the user might want to change, provide a parameter with a default. If it’s a value that should always mirror the global state, just reference the global variable name directly.

04. Avoid complex key-value data structures.

Lastly, nested key-value data structures like the one shown below can end up being more effort to implement than they are worth.

site_config:
  dns:
    core:
      - 192.168.1.1
    upstream:
      - 8.8.8.8

With these structures you usually end up forcing your users to define a bunch of dummy keys, or you need to write additional playbook logic to handle the cases where structures within your object aren’t defined. In the example above it would be easier in the long run to define that structure as two descriptively named top-level lists. That way tests like is defined can check whether they exist, and the user doesn’t need to remember the exact data structure format. Generally, nested structures are only worth the hassle when all or most of the fields they contain are required by the role they’re used in and there is no easy way to separate the structure into multiple calls to the same role, for example:

nginx_sites:
  - name: site_a
    type: static
    root: /var/www/a
  - name: site_b
    type: proxy_pass
    proxy_host: http://localhost:8080

05. Write code that is idempotent.

Many of Ansible’s core and community modules are written so that if a file or service is already in the desired state then no action is taken. This is known as idempotency and it is a huge advantage when writing Ansible code. It means that if a playbook fails due to a temporary issue it is possible to just restart from the beginning and you’ll just pick up where you left off. Much nicer than needing to destroy servers and start over. Some modules like ansible.builtin.command and ansible.builtin.shell are not idempotent by default and require the user to implement this helpful property themselves.
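
One way to restore idempotency for command tasks is sketched below, using the creates parameter and the changed_when keyword. The foo binary, its flags, and the marker file path are hypothetical.

```yaml
- name: Initialise the foo database only once.
  ansible.builtin.command:
    cmd: /usr/local/bin/foo --init
    # The task is skipped when this file already exists on the host.
    creates: /var/lib/foo/.initialised

- name: Query the current state without ever reporting a change.
  ansible.builtin.command:
    cmd: /usr/local/bin/foo --status
  register: foo_status
  # A read-only command never modifies the host, so it never counts as changed.
  changed_when: false
```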

06. Leverage community provided code.

In a similar spirit to the previous tip, I would strongly recommend leveraging the huge number of Ansible modules available in both the core distribution and in Collections on Ansible Galaxy. This code is better integrated with Ansible than custom calls to ansible.builtin.command and has been peer reviewed by the community for things like task idempotency. This can be particularly helpful when dealing with operating systems that aren’t a core competency (e.g. Windows), since the modules in the ansible.windows and community.windows collections provide a standard Ansible API, so a deep knowledge of PowerShell is not necessarily required.

Just remember when using modules defined in a Collection you have to use the “fully qualified module name” to call them (e.g. ansible.windows.win_command). This lets you be specific about which implementation of that module name you actually want to execute. While using the fully qualified name isn’t required as of Ansible 2.9 for modules in the ansible.builtin collection, it is good practice to use this fully qualified name for all of your module calls to future proof your code for when this syntax becomes required.
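
A minimal sketch of what this looks like in practice follows, assuming the collection is installed from a requirements.yml file. The file names and the registered variable name are illustrative.

```yaml
# requirements.yml
# Install with: ansible-galaxy collection install -r requirements.yml
collections:
  - name: ansible.windows

# tasks/main.yml
- name: Run a command on a Windows host using the fully qualified name.
  ansible.windows.win_command: whoami
  register: whoami_result
  changed_when: false
```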

07. Do not suppress error messages.

While Ansible supports suppressing errors through directives like ignore_errors: true or failed_when: false, it is rarely the right thing to do. By ignoring errors you can end up separating cause and effect: a dependency silently fails to complete, and the malfunction only surfaces later. A better choice is to write a failed_when condition that matches only the states that are actual errors, or to use the block and rescue keywords to handle errors.

- name: This command has non-zero error codes that are not fatal.
  ansible.builtin.command: >
    /usr/local/bin/foo
  register: foo_result
  failed_when: foo_result.rc not in [0, 101]

- name: This command can be rescued and complete successfully.
  block:
    - name: This is the command that might fail.
      ansible.builtin.command: >
        /usr/local/bin/bar
  rescue:
    - name: This is how to get the host to the correct state if bar fails.
      ansible.builtin.command: >
        /usr/local/bin/bar-fix

Using these techniques makes your playbook more stable, and it also makes debugging easier: when things fail, execution stops instead of continuing on to fail at a dependent location later.

08. Write your tasks defensively.

Temporary hiccups like network instability happen no matter how well we try to plan for them. Unfortunately, these kinds of issues can cause your playbooks to fail even when your configuration is correct. The solution is defensive coding and building in tolerances for this class of intermittent problem. For example, when installing packages with a Linux package manager you can add something similar to the following to seamlessly handle network instability without risking an actual missing package slipping by.

- name: Install a large package from an unstable repository.
  ansible.builtin.package:
    name: baz
    state: present
  register: baz_install_result
  until: baz_install_result is success
  delay: 5
  retries: 10

09. Use the right templating tool for the job.

While ansible.builtin.lineinfile is great for small changes to a file, it isn’t the best way to build a complete configuration file or make major edits. If you find yourself manipulating the same file with lineinfile more than a few times, it may be best to just bite the bullet and use ansible.builtin.template instead.
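
As an illustration, a single template task can replace a string of lineinfile edits. The template name, destination path, and the two foo_* variables are hypothetical.

```yaml
# templates/foo.conf.j2 (hypothetical) would hold the whole file, e.g.:
#   listen_port = {{ foo_listen_port }}
#   log_level = {{ foo_log_level }}

- name: Render the complete foo configuration file.
  ansible.builtin.template:
    src: foo.conf.j2
    dest: /etc/foo/foo.conf
    owner: root
    group: root
    mode: "0644"
```

Because the whole file is rendered from one source, the task reports a change only when the rendered content actually differs from what is on the host.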

10. Do not be “clever” with Jinja filters.

Jinja is the templating library that underpins all of Ansible, and it is very powerful. Be careful about being “clever” with Jinja expressions and filters in your code to avoid debugging nightmares down the road. At its core it is a templating language, not a fully featured programming or data manipulation language. If you are faced with using a complex series of filters to massage data into the format you need, take a step back and see if there’s another way to approach the problem or structure the input data that avoids the issue.

11. Use ansible-lint to check your code style.

The ansible-lint utility allows you to check that your playbooks and roles conform to the Ansible community’s code style. This makes your code more maintainable both by encouraging best practices and promoting a common style that will be familiar to new developers who may know Ansible but not your project’s codebase. If you need to suppress a warning, a comment in the code as to why it was necessary is helpful for the future maintenance programmer. This tool also integrates well with CI/CD utilities like Jenkins and can emit JUnit style test results for tracking and gate checking.
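
A suppressed warning with its justification might look like the sketch below. It assumes a recent ansible-lint release, where rules are skipped with a trailing noqa comment and the rule is named no-changed-when; older releases used different rule identifiers. The foo command is hypothetical.

```yaml
- name: Reload the foo cache.
  # The foo CLI has no way to report whether anything changed, so the
  # no-changed-when rule is suppressed for this task only.
  ansible.builtin.command: /usr/local/bin/foo --reload-cache  # noqa: no-changed-when
```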

Project Structure

12. Use Collections to organize your codebase.

In Ansible 2.9 and beyond a new organizational object has been added to the Ansible universe: the Collection. These collections allow you to package groups of related roles, modules, and files into a single distributable bundle. This can be very useful for packaging your work for external consumption, as you can have things like a foo.core and foo.utilities collection to help developers know what code you intend for their consumption. By keeping the API of roles in foo.utilities consistent you can encourage external developers and consumers of your code to use and contribute back to what already exists instead of creating their own version of existing functionality. This reduces the maintenance burden for your entire organization and promotes code reuse.
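
For reference, the metadata file at the root of a collection such as foo.utilities could be sketched as follows. The version, author, and description values are placeholders.

```yaml
# foo/utilities/galaxy.yml
namespace: foo
name: utilities
version: 1.0.0
readme: README.md
authors:
  - Example Author <author@example.com>
description: Shared utility roles intended for reuse by other teams.
```

Roles placed under the collection's roles directory can then be called by their fully qualified name, for example foo.utilities.some_role, where some_role is again a hypothetical name.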

13. Package playbooks in Collections where appropriate.

While it is not yet an official part of the Ansible Collection specification, I like adding a playbooks directory into the collection to help organize my playbooks. This can be particularly helpful when you have a core set of configuration playbooks and another set of utility and maintenance playbooks which you want to keep separate. If you aren’t comfortable straying from the Collection specification this can also be accomplished with simple directory structures at the top level of the project.

14. Use a shared variables role instead of repeating static project configuration.

A common problem in large Ansible projects is that you need to share bits of data between multiple roles as well as multiple deploy environments. One way to handle this is to copy large blocks of shared values in group_vars from environment to environment. However, this method requires that when you want to make a global change to all environments you have to touch every configuration set to make the update.

A better way to implement this is through a shared variable role containing the default shared values as its role defaults. These defaults can be split across multiple files to logically separate different groups of variables and improve readability. Since they are loaded at the second lowest precedence, they can easily be overridden in any environment that needs different values, without repeating the unchanged ones. This role can then be referenced as a dependency of roles that use the global values directly, or loaded in a playbook where the values are needed, with its variables exposed to subsequent tasks.
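
A minimal sketch of this pattern is shown below. The role name shared_vars, the variable name, and the server addresses are all hypothetical.

```yaml
# roles/shared_vars/defaults/main.yml
shared_vars_ntp_servers:
  - 0.pool.ntp.org
  - 1.pool.ntp.org

# roles/role_a/meta/main.yml
dependencies:
  - role: shared_vars

# group_vars/staging.yml: override only what differs in this environment.
shared_vars_ntp_servers:
  - ntp.staging.example.com
```

Environments that are happy with the defaults define nothing at all, so a global change means touching only the shared role.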

Inventory Management

15. Use the YAML format for static inventories instead of INI.

A lot of Ansible examples show the INI format for static inventory files but I think the format causes a lot of problems. Unlike the YAML format for inventory files the INI format has issues resolving groups and dependencies based on the order in which they are declared in the file.

# This will fail to parse with two unknown groups for `bar` and `baz`
[foo]
[foo:children]
bar
baz

[bar]

[baz]

The YAML format avoids these problems and makes the syntax for adding group or host variables in the inventory much simpler as well.

# This will parse just fine.
all:
  children:
    foo:
      children:
        bar:
        baz:
    bar:
    baz:

16. Utilize dynamic inventories with care.

When using Ansible alongside other tools like Terraform, dynamic inventories are very useful. An integrated Ansible inventory plugin provides much finer control over how hosts are parsed and added to the inventory, without the overhead of argument parsing that a script-based solution requires. You do need to be careful, though: variables attached to hosts by a plugin may be added as host variables, and thus take precedence over your group variables, causing unexpected behaviour.
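
To illustrate the precedence issue, here is a small static YAML inventory standing in for what a plugin might produce. The group web, the host web01, and the variable http_port are all hypothetical.

```yaml
all:
  children:
    web:
      vars:
        # The value you intended every web host to use.
        http_port: 8080
      hosts:
        web01:
          # A host-level variable, such as one attached by a plugin, wins
          # over the group value, so web01 resolves http_port to 80.
          http_port: 80
        web02:
```

Running ansible-inventory --host web01 against the real inventory is a quick way to see which value a host actually ends up with.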

Roles As Reusable Components

17. Have each role do one thing well.

Ansible roles are a large part of what makes it so flexible, with the ability to package resources and tasks together to distribute a single unit of functionality. Key to doing this well is making sure that your roles are self-contained and do one thing well. This could be configuring a particular shared file, or a service, or a set of services which are always deployed together in your project. Think of your roles like black-box functions where only the variables explicitly passed in affect the configured functionality that is passed out. Additionally, do not try to make one role that does everything. The less branching and runtime logic that is used, the easier the role will be to reason about and debug. Better to have multiple roles, each with its own dedicated purpose, than a super role that tries to be everything for every host.

18. Do not use inheritance in roles.

Along with the idea of having each role do one thing well, it is important to realize that Ansible is not a fully featured programming language and does not support concepts like class inheritance well. If you implement a role, do not branch into more specific sub-roles to implement the functionality for different host types. Ansible already has the ability to run certain roles on certain hosts in the form of host selectors in playbooks. It is better to have a play that runs the common configuration code and then a separate play for each sub-group that calls the specific configuration roles there instead.
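
The play layout described above could be sketched like this. The group names web and db and the role names common, web_server, and database are hypothetical.

```yaml
- name: Apply configuration common to every host.
  hosts: all
  tasks:
    - name: Run the common role.
      ansible.builtin.include_role:
        name: common

- name: Apply configuration specific to web servers.
  hosts: web
  tasks:
    - name: Run the web server role.
      ansible.builtin.include_role:
        name: web_server

- name: Apply configuration specific to database servers.
  hosts: db
  tasks:
    - name: Run the database role.
      ansible.builtin.include_role:
        name: database
```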

This isn’t to say that calling roles from within others is a bad practice, but rather that they should be used to bring in pieces of functionality that are useful outside the scope of the calling role.

19. Define variables used in the role in the role defaults file.

When writing a role, all the variables that you plan on using should be defined in the role’s default variables file. This includes values that need to be provided by the user, which can be defaulted to null. If a user-provided variable is required, a call to ansible.builtin.assert at the start of the role’s tasks can provide a descriptive error message if something is missing.

# foo/defaults/main.yml
foo_required: ~

# foo/tasks/main.yml
- name: Assert that all required inputs have been provided.
  ansible.builtin.assert:
    that:
      - foo_required is not none

20. Prefix role variables with the full role name.

When defining variables in a role, they should be prefixed with the full role name to make it easier to figure out where different variables originate. This is in part a holdover from earlier versions of Ansible, which tended to leak variables to the global scope and cause runtime name collisions. While more recent versions of Ansible are better about keeping variables in the correct scope, this practice helps with debugging when things go wrong.

21. Use named loop variables for loops inside of a role.

One place that Ansible is still not good about variable name collisions is with loops. When Ansible loops, by default it puts the current element in the item special variable. This is all well and good until you start nesting loops or looping over roles that themselves contain loops. To avoid this, roles should always use a prefixed loop variable so that they do not accidentally overwrite an external value midway through execution.

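# foo/tasks/main.yml
- name: This is a safe loop in a role.
  ansible.builtin.command: >
    echo {{ foo_command }}
  # foo_commands is a hypothetical list name; define it in the role defaults.
  loop: "{{ foo_commands }}"
  loop_control:
    loop_var: foo_command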
22. Split role defaults across multiple files for easier reading.

Similar to how Ansible allows you to split group variables across multiple files, role defaults can be split as well to make them more manageable. If you split your variables based on task or function within the role, especially if there are a lot of them, it can make it easier to read and include appropriate comments on them as well as simplifying merge requests.
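
One possible layout is sketched below, assuming an Ansible version that loads every YAML file found in a defaults/main directory. The file names and variables are hypothetical.

```yaml
# foo/defaults/main/install.yml
foo_version: "1.2.3"
foo_install_dir: /opt/foo

# foo/defaults/main/service.yml
foo_service_enabled: true
foo_listen_port: 8080
```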

23. Avoid sharing responsibility for a file between roles.

If at all possible, multiple roles should not be manipulating the same file on a single host. This can result in roles “fighting” over the state or contents of the file and breaking idempotency by making or reporting changes even when they are not required.

Putting It All Together With Playbooks

24. Avoid using task delegation.

Task delegation through the delegate_to keyword can be very powerful but frequently causes trouble in large codebases. Figuring out at runtime where the code is actually going to execute and what variables are going to be in scope makes code called with this keyword very hard to debug. Whenever possible, use a separate play in the playbook instead of delegate_to if you need to perform tasks on another machine. It’s better to have one play to configure the first part of a target, a second to set up a dependency, and a third to finish the configuration than to jam the dependency into the initial play with delegation.
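
The three-play approach might look something like the following sketch. The groups app and db, the package and service name foo, and the role foo_database are hypothetical.

```yaml
- name: Install the application on the app servers.
  hosts: app
  tasks:
    - name: Install the app package.
      ansible.builtin.package:
        name: foo
        state: present

- name: Set up the database the application depends on.
  hosts: db
  tasks:
    - name: Run the database role.
      ansible.builtin.include_role:
        name: foo_database

- name: Finish configuring the application.
  hosts: app
  tasks:
    - name: Start the app service.
      ansible.builtin.service:
        name: foo
        state: started
        enabled: true
```

Each play states its own hosts, so there is never any doubt about where a task runs or whose variables are in scope.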

25. Use the tasks keyword for all tasks in a play.

Historically, Ansible didn’t support including a role directly as a task and thus utilized multiple play level keywords to organize the flow of execution.

- name: This is an old style play.
  hosts: localhost
  pre_tasks:
    - name: This executes after fact gathering.
      foo:
  roles:
    - role_a
  tasks:
    - name: This executes after all roles have been executed in order.
      bar:
  post_tasks:
    - name: This executes after tasks.
      baz:

However, in recent versions of Ansible roles can be called directly as a task, which has rendered these top-level keys fairly obsolete. Instead of using them, just call all your tasks in order under the tasks key as seen below.

- name: This is a new style play.
  hosts: localhost
  tasks:
    - name: The first task.
      foo:
    - name: We then import our role.
      ansible.builtin.include_role:
        name: role_a
    - name: The second task.
      bar:
    - name: The third task.
      baz:

This has the same execution flow as the previous version but is much easier to follow as it does not require you to memorize the execution sequence of the top level keywords.

Conclusion

While none of these tips are hard and fast rules for writing your Ansible automation, they do help when it comes to scaling up your project’s complexity. Generally use them as a guide and be sure to document for future developers why you did things differently. Trust me, both they and your future self will thank you!