Extractors

Created: May 14, 2021, Updated: November 28, 2022

As stated in the chapter Project Design, extractors are components that download data from your data source. In this chapter, we will update our Bizzflow project so that it knows how to extract data from the Relational Dataset Repository maintained by the Faculty of Information Technology at the Czech Technical University in Prague. The dataset we are going to use is ClassicModels.

Right now, this is what our Airflow UI looks like:

Airflow UI
Airflow UI

Our goal in this chapter is to add an extractor configuration, so that our Airflow UI looks like this:

Airflow UI with extractor DAG
Airflow UI with extractor DAG

Putting the information together

In the previous chapter Project Design, we said that we need at least three pieces of information about our data source in order to make Bizzflow extract data from it: the data source type, credentials, and the source (source tables). Let’s take a look at how we can find each of these.

Data source type

First, we need to know what the database backend is. From the description of the data source, we see that the database lives within a MySQL (MariaDB) server. Let’s look up MySQL within the Bizzflow extractors repository.

Search for MySQL
Search for MySQL

There is only a single result - the MySQL extractor.

Search result
Search result

Open the repository and take note of the repository name. You can find it either at the end of the URL or next to the branch name (as seen in the picture below).

Component name
Component name

For the MySQL extractor it is ex-mysql. We will need it to tell Bizzflow which component to use when connecting to our data source.

Credentials

Luckily, we do not have to go to Hell and back for this, as the maintainers of the Relational Dataset Repository put the credentials at the bottom of the dataset page.

Dataset credentials
Dataset credentials

Source tables

For the purpose of this demo, let’s download every single table: orderdetails, orders, payments, products, customers, productlines, employees, and offices.

Creating the configuration

We have all the information we need, so let’s put it all together.

Create a new empty file within the extractors/ directory in the Bizzflow project repository and name it classicmodels.json. The structure of extractor configuration JSON files is fairly simple:


extractors/classicmodels.json

{
  "type": "{component id}",
  "config": {
    /* Component-specific extractor configuration */
  }
}

We already know the component id from the MySQL Extractor component repository - ex-mysql - so we may as well fill it in:


extractors/classicmodels.json

{
  "type": "ex-mysql",
  "config": {}
}

This way, Bizzflow will know to use MySQL Extractor when processing this data source.

For the component-specific config we need to take a look at the component repository again and scroll down to the example configuration:

MySQL extractor component configuration example
MySQL extractor component configuration example

This is the object we will need to “nest” underneath the config key in our extractor configuration:

{
  "$schema": "https://gitlab.com/bizzflow-extractors/ex-mysql/-/raw/master/config.schema.json",
  "user": "patrik",
  "password": "12345",
  "host": "mysql-db.cz",
  "database": "customers",
  "query": {
    "table_name": "SELECT * FROM table_name"
  }
}

All we need to do is fill in the real credentials we obtained in one of the previous steps; the configuration will then look like this:

{
  "$schema": "https://gitlab.com/bizzflow-extractors/ex-mysql/-/raw/master/config.schema.json",
  "user": "guest",
  "password": "relational",
  "host": "relational.fit.cvut.cz",
  "database": "classicmodels",
  "query": {
    "orderdetails": "SELECT * FROM `orderdetails`",
    "orders": "SELECT * FROM `orders`",
    "payments": "SELECT * FROM `payments`",
    "products": "SELECT * FROM `products`",
    "customers": "SELECT * FROM `customers`",
    "productlines": "SELECT * FROM `productlines`",
    "employees": "SELECT * FROM `employees`",
    "offices": "SELECT * FROM `offices`"
  }
}
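Since every table is extracted with a plain `SELECT *` statement, you do not have to type eight near-identical lines by hand - the `query` object can be generated. A small sketch (the table list is the only input; the output format matches the `query` key shown above):

```python
import json

# Tables to extract from the classicmodels database
tables = [
    "orderdetails", "orders", "payments", "products",
    "customers", "productlines", "employees", "offices",
]

# Map each table to a plain `SELECT *` statement - the shape
# expected by the "query" key of the ex-mysql configuration
query = {table: f"SELECT * FROM `{table}`" for table in tables}

print(json.dumps(query, indent=2))
```

Paste the printed object into the configuration, or adjust individual statements where you need to filter columns or rows.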

But wait. It is not a good idea to store a password (even a public one) in a file we are about to commit to a repository (even if the repository may not be public). The question remains: how do we tell Bizzflow how to connect to the data source if we may not tell it the password?

Luckily, Bizzflow comes armed with a security measure just for this use case. Open your Flow UI and select Vault.

Click ➕ Create new..., fill in ID and Value so that the ID is something distinctive and easily related to the extractor configuration and the Value is our password "relational", then save it.

Flow UI vault with credentials filled
Flow UI Vault Credentials

The way we tell Bizzflow to look the password up in the Vault instead of using it verbatim is a special hash string #!#:{connection id}; in our case this will be #!#:classicmodels.
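Conceptually, the substitution is a simple lookup: whenever a configuration value carries the `#!#:` prefix, it is replaced by the matching Vault entry before the extractor runs. The sketch below is purely illustrative - the `resolve_secret` helper and the in-memory `vault` dict are hypothetical stand-ins for the real Vault managed by Flow UI:

```python
# Hypothetical illustration of resolving a "#!#:{connection id}"
# reference; the real lookup happens inside Bizzflow, backed by
# the Flow UI Vault rather than a plain dict.
VAULT_PREFIX = "#!#:"

vault = {"classicmodels": "relational"}  # connection id -> secret value

def resolve_secret(value: str, vault: dict) -> str:
    """Return the Vault entry if `value` is a Vault reference, else the value itself."""
    if value.startswith(VAULT_PREFIX):
        return vault[value[len(VAULT_PREFIX):]]
    return value

print(resolve_secret("#!#:classicmodels", vault))  # -> relational
print(resolve_secret("guest", vault))              # -> guest
```

Values without the prefix (like the `user` field) pass through unchanged, which is why only the password needs a Vault entry here.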

If you now paste your configuration into the config key within the JSON file, it should look like this:


extractors/classicmodels.json

{
  "type": "ex-mysql",
  "config": {
    "$schema": "https://gitlab.com/bizzflow-extractors/ex-mysql/-/raw/master/config.schema.json",
    "user": "guest",
    "password": "#!#:classicmodels",
    "host": "relational.fit.cvut.cz",
    "database": "classicmodels",
    "query": {
      "orderdetails": "SELECT * FROM `orderdetails`",
      "orders": "SELECT * FROM `orders`",
      "payments": "SELECT * FROM `payments`",
      "products": "SELECT * FROM `products`",
      "customers": "SELECT * FROM `customers`",
      "productlines": "SELECT * FROM `productlines`",
      "employees": "SELECT * FROM `employees`",
      "offices": "SELECT * FROM `offices`"
    }
  }
}
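Before committing, it can pay off to check that the file is valid JSON and that no plain-text password slipped in. A minimal sketch, assuming the structure shown above (the `check_extractor_config` helper is ours, not part of Bizzflow):

```python
import json
from pathlib import Path

def check_extractor_config(path):
    """Sanity-check an extractor configuration file before committing it."""
    config = json.loads(Path(path).read_text())
    assert "type" in config, "missing component id under 'type'"
    assert "config" in config, "missing component-specific 'config'"
    # Make sure the password is a Vault reference, not a plain-text secret
    password = config["config"].get("password", "")
    assert password.startswith("#!#:"), "password should be a Vault reference"
    return config

# Usage (run from the project root):
# check_extractor_config("extractors/classicmodels.json")
```

A malformed file would otherwise only surface later, when the project DAG tries to pick it up.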

If you commit the changes and merge them to the master branch of your Bizzflow project repository, you should be ready to head over to your Airflow UI.

Running the task

Wait. It still looks like this, doesn’t it?

Airflow UI
Airflow UI

We wanted to add the extractor configuration - how is it possible that the DAG is not there yet?

That’s right, we need to run the 90_update_project DAG first.

Click the play button (Trigger DAG) on the line of the 90_update_project DAG and confirm by clicking the Trigger button.

You should see a bright green circle on the 90_update_project line:

Airflow - Running task
Airflow - Running task

Try refreshing the page after a while and you should now see the extractor DAG like this:

Airflow with extractor DAG
Airflow with extractor DAG

Again, the DAG will probably appear as Off; switch it to On now.

Nothing should prevent us from running the component now. Click the play button (Trigger DAG) for our new 20_Extractor_ex-mysql_classicmodels DAG and confirm again by clicking Trigger. The extractor should now be running. You can either check it via Consoles -> Latest tasks or on the main Airflow UI screen by the light green circle. Once the circle disappears (or the task status in Latest tasks turns to success), everything should be done.

Airflow running task
Airflow running task
Airflow completed task
Airflow completed task

See the results

Go to Flow UI’s Storage page and click Refresh Kexes. The kex in_ex_mysql_classicmodels should appear. When you click the kex name, you should see the tables it contains.

Flow UI storage with extracted tables
Flow UI storage with extracted tables

Well done

We have our extractor configuration, let’s take a deep breath and transform the 💩 out of our data!