Environment

This chapter sums up everything you should have set up by now. This guide does not cover installation and setup. If you need to install everything first, please refer to our Getting started checklist.

Database access tool

You should have some kind of database tool installed. The most commonly used are DBeaver and DataGrip. This guide shows how to set up your sandbox connection using DBeaver, but everything should work just as well with DataGrip and any other tool that supports Google BigQuery.

Bizzflow project repository

During installation, a Bizzflow project repository was created for you, or maybe you are using your own. Either way, the repository should look the same.

[Image: Bizzflow project repository]

Above is what the repository will look like in GitLab. If you are using GitHub, Bitbucket or any other git host and your structure looks the same, you are good to go!

Apache Airflow UI access

Apache Airflow is the heart of Bizzflow. It manages scheduling of tasks and provides us with a nifty UI we will use to control what happens in our project.

You should be able to access your Airflow web interface. If you had someone else install Bizzflow for you, they should be able to let you know how to access it. It will look something like this:

[Image: Airflow]

Main layout

The main layout consists of a navigation bar at the top of the page with various links we will go through in some of the next steps in this guide. The part we are interested in right now is the list of DAGs.

Right after installation, you should see two DAGs: 90_update_project and 90_update_toolkit. The number prefix only serves to sort the DAGs in a sensible order, so you can safely ignore it.
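To illustrate how the number prefix affects ordering, here is a minimal Python sketch. It assumes the DAG list is ordered alphabetically by DAG name; the two 90_ names come from this guide, while the 10_ and 20_ names are made up purely for illustration.

```python
# DAG names as they might appear in the Airflow list. The two "90_" names
# come from this guide; the "10_" and "20_" names are hypothetical.
dag_ids = [
    "90_update_toolkit",
    "20_transformation_example",  # hypothetical
    "90_update_project",
    "10_extractor_example",       # hypothetical
]


def split_prefix(dag_id):
    """Split '90_update_project' into ('90', 'update_project')."""
    prefix, _, name = dag_id.partition("_")
    return prefix, name


# Plain alphabetical sorting puts lower prefixes first, so the
# maintenance DAGs with the "90_" prefix end up at the bottom.
sorted_ids = sorted(dag_ids)

for dag_id in sorted_ids:
    prefix, name = split_prefix(dag_id)
    print(prefix, name)
```

The prefix carries no meaning beyond ordering, which is why it can be ignored when reading the list.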

On / Off button [1]

This button serves to either enable or disable a DAG.

DAG [2]

This is the DAG’s name, which helps you tell the DAGs apart (soon there will be many more than just those two).

Schedule [3]

Once you decide to let Airflow run your tasks periodically based on a schedule, you will see the settings here.

Links [4]

You will find useful links here, such as the list of the latest DAG runs and so on.

What we are interested in right now is the first one, the little play button. This makes it possible to run your DAG at any time. But more on that later.

Consoles [5]

This is a Bizzflow-exclusive navigation bar extension with useful links.

[Image: Airflow consoles]
  • Cooltivator
    • A useful application to help you clean your data. See here.
  • Latest Tasks
    • A link to the list of tasks sorted by their execution date. You will need this a lot when working with Airflow.
  • Flow UI
    • A UI created to make your life easier. Read more here.

What the heck is a Kex

You may be asking yourself, what the heck is a Kex?

As you may have seen in Key Concept in the Bizzflow wiki, your storage (data warehouse) is split horizontally into stages (raw, input, transform, output and datamart). For better clarity, Bizzflow also splits the stages vertically, meaning a single stage may contain several related units of data. For example, the raw stage stores raw data from your data sources, but since there may be more than one data source, it makes sense to store them separately. This is what Kexes are all about. Say your raw data consists of accounts and transactions from your database and a balance sheet from an ERP. This would result in three tables across two Kexes:

  • raw_database.accounts
  • raw_database.transactions
  • raw_erp.balance_sheet

If you still have no clue what we are talking about, please check the ETL Process Structure chapter of the Bizzflow wiki, which covers everything you may possibly need to know about Kexes.

Behind the scenes, Kexes are nothing but database schemas, but since that term does not fully capture how the underlying data relate to each other, we decided to call them Kexes.
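The split of the three example tables into two Kexes can be sketched in a few lines of Python. This is only an illustration, assuming every table is referred to as kex.table as in the list above; the helper name group_by_kex is hypothetical and not part of Bizzflow.

```python
from collections import defaultdict

# Qualified table names taken from the example list in this chapter.
tables = [
    "raw_database.accounts",
    "raw_database.transactions",
    "raw_erp.balance_sheet",
]


def group_by_kex(qualified_names):
    """Map each Kex (schema) name to the list of tables it holds.

    Hypothetical helper for illustration only.
    """
    kexes = defaultdict(list)
    for qualified in qualified_names:
        # Everything before the first dot is the Kex, the rest is the table.
        kex, _, table = qualified.partition(".")
        kexes[kex].append(table)
    return dict(kexes)


kexes = group_by_kex(tables)
for kex, kex_tables in kexes.items():
    print(kex, kex_tables)
```

Both Kexes here belong to the raw stage; they only differ in the data source they hold.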

Good to go

If you have checked that you have easy access to everything listed above, you are good to go!