Extractors
Created: May 20, 2021, Updated: June 20, 2022
Extractors are components within your data pipeline that copy the data from your data source to Bizzflow's data storage (analytical warehouse).
Extractors' configuration is maintained in the /extractors directory within the project repository’s root. Each extractor’s task is described in a separate JSON (or YAML) file that needs to follow the structure described below.
For a list of available extractor components, see Data Sources.
Extractor Configuration
Every extractor configuration needs a separate file. The length of filenames is limited; for details, see Naming.
The configuration inside the file needs two keys: type and config. type tells Bizzflow which component to use, and config is then passed to the component when it runs.
/extractors/example.json
{
  "type": "ex-mysql",
  "config": {
    // mysql component-specific configuration
  }
}
Or the same example in YAML:
/extractors/example.yaml
type: ex-mysql
config:
  # mysql component-specific configuration
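The two-key requirement can be checked before a configuration file is committed. The following is a minimal sketch and not part of Bizzflow itself: the function name check_extractor_config is hypothetical, and it handles only JSON files without comments (Python's standard json module rejects // comments such as the placeholder in the example above).

```python
import json

# The two top-level keys this page says every extractor file needs.
REQUIRED_KEYS = ("type", "config")


def check_extractor_config(text: str) -> list[str]:
    """Return a list of problems found in a JSON extractor configuration.

    Hypothetical helper for illustration; Bizzflow's own validation may differ.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error.msg}"]
    if not isinstance(data, dict):
        return ["top level must be an object"]
    # Report every required key that is absent from the top-level object.
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
```

An empty list means both keys are present; it says nothing about whether the component-specific part under config is correct.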
How to find out the component-specific configuration
Every officially supported extractor has a description and a configuration sample in the README.md file of the component’s repository. In most cases we also include a JSON schema in a separate file.
Storing credentials and sensitive data
You should never store credentials or sensitive data in your git repository. Bizzflow comes prepared for this.
Anytime you would need to input a password into the configuration file, you can instead refer to an encrypted Airflow Connection using #!#:connection_id. The #!#: prefix tells Bizzflow not to interpret the following string literally, but instead to search your Airflow Connections for a connection with the id connection_id and use its password. See the Basic Tutorial for more.
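To make the substitution rule concrete, here is a minimal sketch of how such references could be resolved. It is an illustration only, not Bizzflow's implementation: the function resolve_secrets is hypothetical, and a plain dictionary of passwords stands in for the Airflow Connections lookup.

```python
# Prefix that marks a value as a reference to a connection id.
PREFIX = "#!#:"


def resolve_secrets(value, passwords):
    """Recursively replace '#!#:<connection_id>' strings with passwords.

    `passwords` is a stand-in (assumption) for Airflow Connections:
    a mapping from connection id to that connection's password.
    """
    if isinstance(value, dict):
        return {key: resolve_secrets(item, passwords) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_secrets(item, passwords) for item in value]
    if isinstance(value, str) and value.startswith(PREFIX):
        connection_id = value[len(PREFIX):]
        if connection_id not in passwords:
            raise KeyError(f"no connection with id {connection_id!r}")
        return passwords[connection_id]
    # Anything else (plain strings, numbers, booleans) is taken literally.
    return value


config = {"user": "mario", "password": "#!#:maindata"}
resolved = resolve_secrets(config, {"maindata": "s3cret"})
```

Only the file with the #!#: reference is committed to git; the resolved value exists only at run time.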
Custom Extractor Component Configuration
If you want to use your own component instead of one of Bizzflow’s public components, see Component configuration.
Example: Setting up a MySQL extractor
Let’s say, for the sake of our example, that our database maindata is running on a server named supermysqldb.com. The following is an example of how to tell Bizzflow to extract the tables users and invoices from the database. We created an Airflow Connection with the id maindata containing our password to the database.
The MySQL extractor’s repository contains an example, so specifying the extractor in a JSON file should be fairly simple.
/extractors/superdb.json
{
  "type": "ex-mysql",
  "config": {
    "user": "mario",
    "password": "#!#:maindata",
    "host": "supermysqldb.com",
    "database": "maindata",
    "query": {
      "users": "SELECT * FROM `users`",
      "invoices": "SELECT * FROM `invoices`"
    }
  }
}
Or using YAML:
/extractors/superdb.yaml
type: ex-mysql
config:
  user: mario
  password: "#!#:maindata"
  host: supermysqldb.com
  database: maindata
  query:
    users: SELECT * FROM `users`
    invoices: SELECT * FROM `invoices`
We recommend not using * in SELECT statements. If the structure of a table changes suddenly (a column goes missing or an extra column is added), you want the pipeline either to fail as soon as possible (before extracting data) or not to fail at all (skipping the new column because it is not named in the SELECT statement). We only use * in examples so that they are more readable.
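One way to catch leftover * projections is to scan the query section of a configuration before it is deployed. The sketch below is a simple heuristic and not a Bizzflow feature: find_star_queries is a hypothetical helper, it only detects a * that directly follows SELECT, and the column names id and amount are invented for the illustration.

```python
import re

# Heuristic: a star immediately after the SELECT keyword, case-insensitive.
STAR_PATTERN = re.compile(r"\bselect\s+\*", re.IGNORECASE)


def find_star_queries(queries: dict[str, str]) -> list[str]:
    """Return the names of queries whose text contains 'SELECT *'."""
    return [name for name, sql in queries.items() if STAR_PATTERN.search(sql)]


queries = {
    "users": "SELECT * FROM `users`",
    # Explicit column list; `id` and `amount` are hypothetical columns.
    "invoices": "SELECT `id`, `amount` FROM `invoices`",
}
flagged = find_star_queries(queries)
```

Here only the users query is flagged; the invoices query names its columns, so a newly added column would simply be skipped.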