CIS Compliance Benchmark for AWS

Keeping cloud infrastructure secure is a critical requirement these days to reassure users, but the process is complicated and time-consuming. Finding every vulnerable point is difficult in the first place, and although security authorities such as CIS publish benchmarks, applying them to our own cloud infrastructure is not easy. At least, that is what I thought until today.

I discovered a Steampipe mod that runs the CIS compliance benchmark against public cloud infrastructure such as AWS, and I found it helpful for revealing the vulnerabilities our service may suffer from.

turbot/steampipe-mod-aws-compliance

How To Use

We need to install the AWS plugin and the mod in addition to Steampipe itself.

# Install steampipe
$ brew tap turbot/tap
$ brew install steampipe

# Install aws plugin
$ steampipe plugin install aws

Get the mod for the compliance benchmark.

$ git clone git@github.com:turbot/steampipe-mod-aws-compliance
$ cd steampipe-mod-aws-compliance

$ steampipe check all

That’s it. It runs hundreds of benchmark controls against your environment, and you will definitely see a lot of red messages in the console :).
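
If you only care about a specific benchmark, steampipe check also accepts an individual benchmark name instead of all. The exact name depends on the mod version, so treat the one below as an assumption and check the mod’s documentation for the current list.

$ steampipe check benchmark.cis_v140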

Major Dataflow Analysis Patterns

Dataflow analysis is a technique for collecting information about the possible states a program can take at each point in its control flow. It lets us know, for instance, which variables are live at a specific point of the program or how many times a variable is used.

Many compilers use this technique for fundamental transformations such as register allocation and optimization. However, until now I had thought dataflow analysis was complicated and messy to learn: there are many types of analysis, and each one seems to need its own specific algorithm.

But this time I have learned a categorization of the four major types of dataflow analysis, and with this categorization we can write down the code for dataflow analysis almost mechanically. It is always fun to understand things that look tough at first. If you are interested in how the major dataflow analyses work and are eager to write the code for them, this article is for you.

Control Flow Graph

A control flow graph (CFG) provides the basis for this type of analysis. We can perform most kinds of dataflow analysis on this data structure by iteratively walking through the basic blocks in the graph. For example, take the following C/C++ program.

int x = 5;
int y = 1;
while (x != 1) {
  y = x * y;
  x -= 1;
}

We can abstractly represent this program by using the CFG as follows.

(Figure: CFG of the example program)

Each blue block in the figure represents a line of code, and a basic block is a run of those blocks that continues until it reaches the end of the program or a branch condition.
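
To make the later algorithms concrete, here is a minimal Python sketch of how we might encode this CFG. The node numbering is my own: one node per statement plus an exit node.

# Hypothetical encoding of the CFG above; edges follow the control flow.
successors = {
    1: [2],     # int x = 5;
    2: [3],     # int y = 1;
    3: [4, 6],  # while (x != 1)  -> loop body or exit
    4: [5],     # y = x * y;
    5: [3],     # x -= 1;         -> back to the loop condition
    6: [],      # exit
}

# The predecessor map used by forward analyses can be derived from it.
predecessors = {n: [] for n in successors}
for node, succs in successors.items():
    for s in succs:
        predecessors[s].append(node)

print(predecessors)  # e.g. node 3 has predecessors [2, 5]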

We can accomplish the following four types of dataflow analysis using this CFG: reaching definitions, available expressions, live variables, and very busy expressions.

Let’s take a look at how reaching definitions analysis works as an example.

Reaching Definitions

Reaching definitions analysis determines, for each program point, which assignments (definitions) have been made and not yet overwritten.

Program Point

At program point P1, for example, the assignment x = 5 reaches: the value assigned by x = 5 is still alive at that point. On the other hand, that definition is no longer active at P2 because x -= 1 overwrites the value of x. This analysis helps us find uses of uninitialized variables in the program.

We can easily see at a glance which definitions reach a given point. But how can we accomplish the same thing programmatically? Here comes the iterative algorithm.

Iterative Algorithm

We can describe the pseudo-code for the algorithm to complete this type of analysis as follows.

for n in nodes:
  IN[n]  = {}   # start with the empty set
  OUT[n] = {}

while any IN[n] or OUT[n] changed in the last pass:
  for n in nodes:
    IN[n]  = union of OUT[n'] for all predecessors n' of n
    OUT[n] = (IN[n] - KILL[n]) ∪ GEN[n]

IN[] and OUT[] are sets collecting the facts at each program point: IN[n] holds the facts at the entry of control flow graph node n, and OUT[n] holds the facts at its exit. For example, if the definition x = 5 reaches the entry of node 4, IN[4] should contain x = 5.

KILL[n] is the set of definitions overwritten by node n, and GEN[n] contains the definitions created by that node. If we run this loop until it converges, we get the collection of reaching definitions for every node in the graph.
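
Here is a minimal, runnable Python sketch of the algorithm for the example CFG above. The node numbering and the GEN/KILL sets are my own encoding: definitions are labeled by the node that creates them (d1: x = 5, d2: y = 1, d4: y = x * y, d5: x -= 1).

# Reaching definitions by iteration until a fixed point (a sketch).
preds = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [4], 6: [3]}

GEN  = {1: {"d1"}, 2: {"d2"}, 3: set(), 4: {"d4"}, 5: {"d5"}, 6: set()}
KILL = {1: {"d5"}, 2: {"d4"}, 3: set(), 4: {"d2"}, 5: {"d1"}, 6: set()}

IN  = {n: set() for n in preds}
OUT = {n: set() for n in preds}

changed = True
while changed:
    changed = False
    for n in preds:
        IN[n] = set().union(*[OUT[p] for p in preds[n]])
        new_out = (IN[n] - KILL[n]) | GEN[n]
        if new_out != OUT[n]:
            OUT[n] = new_out
            changed = True

for n in sorted(preds):
    print(n, sorted(IN[n]), sorted(OUT[n]))

Running it shows, for instance, that both d1 and d5 reach the entry of the loop condition (node 3), while d1 no longer reaches the exit of node 5 because x -= 1 kills it.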

Formal Operation

You may notice that the algorithm’s core consists of only two lines: collecting the union of the predecessors’ OUT sets, and computing the OUT set for each node. These operations can be described mathematically.

$ \text{IN}[n] = \bigcup_{n' \in \text{pred}(n)} \text{OUT}[n'] $

$ \text{OUT}[n] = (\text{IN}[n] - \text{KILL}[n]) \cup \text{GEN}[n] $

These formulas show that the iteration goes forward, from the predecessors to the node: we calculate a node’s input from its predecessors’ output sets, and the node’s output from its input and the kind of assignment the node itself makes. This type of algorithm is categorized as a FORWARD analysis.

You can also see that the node’s input is the union of its predecessors’ outputs: a fact that holds in any one of the predecessors may also hold at the node. This is called a MAY analysis because it does not require the fact to hold in all predecessors.

As you may already have noticed, this suggests that other types of dataflow analysis exist. Can we construct BACKWARD and MUST analyses in the same manner?

Yes, we can.

Four Patterns of Major Dataflow Analysis

The following table shows how the four dataflow analyses introduced at the beginning of this post fall into these categories.

             MAY                    MUST
FORWARD      Reaching Definitions   Available Expressions
BACKWARD     Live Variables         Very Busy Expressions

Although we omit the details and meaning of each dataflow analysis here, you can see what the algorithm for each looks like. For instance, we can write the formulas for available expressions systematically:

$ \text{IN}[n] = \bigcap_{n' \in \text{pred}(n)} \text{OUT}[n'] $

$ \text{OUT}[n] = (\text{IN}[n] - \text{KILL}[n]) \cup \text{GEN}[n] $

In short, we can swap parts of the formula according to the category the analysis falls into: MAY uses the union operator (\(\bigcup\)) while MUST uses the intersection operator (\(\bigcap\)), and a forward analysis computes a node’s input from its predecessors’ outputs while a backward analysis computes a node’s output from its successors’ inputs. That’s it. The four algorithms differ very little.
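
As a rough Python sketch (with my own naming), the whole table collapses into one solver parameterized by the direction and the meet operator. One caveat the table does not capture: MUST analyses are usually initialized with the full universe of facts rather than the empty set, which is omitted here for brevity.

# A generic iterative solver (sketch).
# meet: set.union for MAY analyses, set.intersection for MUST analyses.
def solve(nodes, preds, succs, GEN, KILL, forward, meet):
    IN  = {n: set() for n in nodes}
    OUT = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if forward:
                neighbors = [OUT[p] for p in preds[n]]
                IN[n] = meet(*neighbors) if neighbors else set()
                new = (IN[n] - KILL[n]) | GEN[n]
                changed |= new != OUT[n]
                OUT[n] = new
            else:  # backward: swap IN/OUT and walk the successors instead
                neighbors = [IN[s] for s in succs[n]]
                OUT[n] = meet(*neighbors) if neighbors else set()
                new = (OUT[n] - KILL[n]) | GEN[n]
                changed |= new != IN[n]
                IN[n] = new
    return IN, OUT

Reaching definitions is then solve(nodes, preds, succs, GEN, KILL, forward=True, meet=set.union), and live variables would flip forward to False.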

Dataflow analysis seems complicated at first glance, but once you install this table in your brain, you can quickly write down the algorithm mechanically.

Terraformer accelerates the Terraform migration process

Although infrastructure as code is an excellent practice we should all follow, migrating an existing infrastructure setup to written code is always challenging. The resulting code must describe infrastructure identical to the existing setup to avoid a wide range of catastrophic failures, but manually writing the Terraform code (or code for any other infrastructure automation tool) is troublesome and error-prone. It often forces us to write and check a lot of code to make sure it is consistent with the current infrastructure setup.

I found Terraformer helpful for this purpose. It automatically loads the current state and writes the initial HCL code so that we can get started from it.

Terraformer

Terraformer provides two commands, import and plan. import imports the current state and writes the HCL code for us, while plan only loads the state and shows what kinds of resources it is going to create. Just as with Terraform itself, we can run plan first to check that the command will work as expected.

$ terraformer --help
Usage:
   [command]

Available Commands:
  help        Help about any command
  import      Import current state to Terraform configuration
  plan        Plan to import current state to Terraform configuration
  version     Print the version number of Terraformer

Flags:
  -h, --help      help for this command
  -v, --version   version for this command

Use " [command] --help" for more information about a command.

The following command plans the import of your AWS infrastructure.

$ terraformer plan aws

If you want to use a specific AWS profile, there is a --profile option.

$ terraformer plan aws --profile <Your Profile>

Terraformer dumps its output into a directory named generated by default. If the plan looks okay, you can then generate the corresponding HCL code with the import command.

$ terraformer import aws --profile <Your Profile>
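
If you do not want to import the whole account, Terraformer also accepts filters such as --resources and --regions (check terraformer import aws --help for the exact flags supported by your version); the resource types below are only illustrative.

$ terraformer import aws --resources=vpc,subnet --regions=us-east-1 --profile <Your Profile>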

One note is that the generated code describes a brand-new set of infrastructure rather than modifying the existing one, so please adjust the resource names as you like; they all carry the tfer-- prefix and auto-generated identifiers.

What is an Induction Variable

I found an unfamiliar method, getInductionVariable, when I looked into the LLVM documentation. I could see a similar name in the MLIR scf (Structured Control Flow) dialect as well.

The operation defines an SSA value for its induction variable.

What is an induction variable?

Wikipedia gave me a clear description of what an induction variable is.

In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop or is a linear function of another induction variable

That is not limited to the loop counter that a for loop updates incrementally; any variable that is updated by a fixed amount on every iteration, or that is a linear function of another induction variable, can be seen as an induction variable.

for (i = 0; i < 10; ++i) {
    j = 17 * i;
}

i and j are both induction variables in the previous case.

Induction variables are represented as region arguments in MLIR’s mlir::scf::ForOp. Hence they are assumed to be passed in from outside the region.

Essential ways to make your Rails faster

There is no reason to keep an application slow. If we have room to improve our application’s performance without any drawback, we should do so. It makes our users happier, keeps the application attractive, and prevents people from leaving for alternatives because of performance issues.

But what should we do? Ideally, we should carefully profile the application at runtime and find the bottleneck to improve. As Donald Knuth said, premature optimization is the root of all evil; we must not optimize blindly. That is the fundamental principle.

Still, I understand the situation: we sometimes want to quickly know the essential tips that apply to every kind of web application, general methods usable regardless of the application type. Here is a short list of tips every Rails developer should know to keep an application performant. Please keep them in mind every time you write your Rails application.

Data Schema

Before we begin the journey, let’s first define the database schema we will use as our example.

CREATE TABLE `states` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `cities` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `state_id` bigint(20) DEFAULT NULL,
  `name` varchar(255) NOT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `index_cities_on_state_id` (`state_id`),
  CONSTRAINT `fk_rails_cc74ecd368` FOREIGN KEY (`state_id`) REFERENCES `states` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `offices` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `city_id` bigint(20) DEFAULT NULL,
  `name` varchar(255) NOT NULL,
  `address` varchar(255) DEFAULT NULL,
  `zip_code` varchar(255) DEFAULT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `index_offices_on_city_id` (`city_id`),
  CONSTRAINT `fk_rails_52308f6f48` FOREIGN KEY (`city_id`) REFERENCES `cities` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Our company has several offices across the states of the country. An office is located in a city, and a city belongs to a state, so we have associations between offices and cities, and between cities and states. Let’s consider the case where we want to generate a list of all office entities together with the name of the city and state where each one is located.

Avoid N+1 Query

We can naively do so as follows.

Office.all.each do |office|
  puts office.city.name
end

This code does exactly what we want, but it is slow because it issues redundant queries to the backend database system. Office.all fetches the list of all office entities with one SQL query. Let’s say we get 100 offices here. Unfortunately, 100 more queries follow, one to fetch each city entity, so in total we run 100 + 1 queries. If we add code like office.city.state.name, the number grows even further, as you can imagine. This is the N+1 query problem.
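
The query log would look roughly like the following sketch (the exact SQL depends on your schema and adapter):

SELECT offices.* FROM offices
SELECT cities.* FROM cities WHERE (cities.id = 1)
SELECT cities.* FROM cities WHERE (cities.id = 2)
-- ... and so on, once per office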

But no worries, we have a quick way to fix it: includes.

Office.includes(:city).each do |office|
  puts office.city.name
end

This code issues a single query to fetch all the city entities associated with the offices instead of fetching them one by one. That means we run only 1 + 1 queries in total: one to load the offices and one like the following to load their cities.

SELECT cities.* FROM cities
  WHERE (cities.id IN (1,2,3,4,5,6,7,8,9,10))

Use pluck (and joins)

Another likely performance problem is a slowdown caused by fetching too many columns. If an entity has many attributes, fetching all of those columns is time-consuming, and the cost of instantiating an ActiveRecord object for every row matters. When we only print the result (e.g., to generate a CSV), we may want to pick up just the specific columns without instantiating an ActiveRecord object for every record.

pluck lets us issue SQL that fetches only the specified columns from the underlying table; in short, we can avoid SELECT *-style queries. We also need to use joins instead of includes to reference columns in the joined tables (i.e., cities). The following code issues just one query that joins every necessary table and selects only the needed columns. It is pretty much the bare minimum query to accomplish what we want.

columns = [
  'offices.name',
  'cities.name'
]
# pluck returns plain arrays of the selected values, not Office objects
Office.joins(:city).pluck(*columns).each do |office_name, city_name|
  puts "#{office_name}, #{city_name}"
end
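
Under the hood, this issues roughly one query of the following shape, joining the tables and selecting only the needed columns:

SELECT offices.name, cities.name FROM offices
  INNER JOIN cities ON cities.id = offices.city_id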
