Data Automation – Why Is It Important? Different Processes & Tools
Why is data automation important?
Every 60 seconds, roughly 1.8 TB of data is created!
All this data goes through different stages of data automation: data collection (including tracking setup), data storing, data analysis, data transformation, and data visualization. When companies and professionals talk about big data, personalization, algorithms, machine learning, artificial intelligence, and so on, all of it relies on this data.

In the past, most decisions and logic were based on manual calculations over the data combined with human instinct. Once the data grows beyond that scale, managing it manually is no longer feasible, and this is where automation comes into the picture. In recent years, advances in data processing have automated most of these stages, allowing us to scale and extract insights more easily. This data automation is achieved through custom logic for gathering information and through different automation systems.
“The world’s most valuable resource is no longer oil, but data” – The Economist
Reference Architecture of Data Systems
In big enterprise companies, the data team is split into several divisions, each taking care of a different stage of the data. The reference architecture below can help you understand the stages data goes through, from collection to visualization. In small and medium-sized companies, most of these stages are handled behind the scenes by ready-made tools.
You may not need to understand each piece of this architecture; what you do need to understand is the different stages in data systems, and that each stage can be automated according to your business needs. We will go through all of these stages with a simple example, and at the end of the article we will cover a few reference architectures.
Simple use case of different stages in Data Systems / Automation
STORY: Kamal runs a big wholesale rice and wheat business with offices in Chennai (rice) and Bangalore (wheat). The image below shows his data: the Chennai office has all the rice sales, and the Bangalore office has all the wheat sales.
Database systems are made up of different tables. In this example:
1. The first division is one table (Chennai).
2. The second division is one table (Bangalore).
3. The third division is one table: the final cleaned-up table used for visualization or analysis.
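The two office tables can be sketched in code. The customer names follow the story, but the quantities below are illustrative assumptions, since the figures in the original image are not reproduced here:

```python
# Hypothetical recreation of Kamal's two office tables.
# Customer names come from the story; the quantities are made up.
chennai_rice = {      # table 1: Chennai office, rice sales (kg)
    "JK": 0,          # JK buys no rice
    "XYX": 500,
    "Babu": 200,
}
bangalore_wheat = {   # table 2: Bangalore office, wheat sales (kg)
    "JK": 300,
    "XYX": 100,
    "Baabu": 150,     # note the misspelling of "Babu" (cleaned up later)
}
```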
1. Data Source:
The Chennai data is collected from the Chennai office; it is data source 1. The Bangalore data is collected from the Bangalore office; it is data source 2.
2. Data Extract and Loading:
For an offline business, data extraction is usually the process of importing data into a computer system for processing. Here, getting the data from the notebook into a digital system is the extraction step.

According to Kamal, the data can reach him in a box sent by courier (a local server), or via email or an online transfer (cloud servers).

For an online business, extraction means pulling the data from the initial data source and loading it into a data hub platform, where it will be processed and analyzed.
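The extract-and-load step can be sketched in a few lines. Here the CSV text stands in for a file exported from the office notebook, and an in-memory SQLite database stands in for the data hub; the column names and figures are assumptions for illustration:

```python
import csv
import io
import sqlite3

# Extract: a CSV export of the Chennai notebook (illustrative data).
csv_export = io.StringIO("company,rice_kg\nJK,0\nXYX,500\nBabu,200\n")
rows = [(r["company"], int(r["rice_kg"])) for r in csv.DictReader(csv_export)]

# Load: insert the extracted rows into the "data hub" (here, in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chennai_sales (company TEXT, rice_kg INTEGER)")
conn.executemany("INSERT INTO chennai_sales VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM chennai_sales").fetchone()[0])  # → 3
```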
3. Data Processing:
In these tables, the company name is the PRIMARY KEY used to merge the two tables (in the backend, each company's record carries a unique ID).

– This process is called data merging or data combining.

In the second table, notice the circled entry: the name ‘Babu’ is written wrongly as ‘Baabu’. We need to clean up this data so that the tables can be merged on the same names; otherwise, there will be a duplicate customer called ‘Baabu’.

– This process is called data cleaning.
This overall process of Extract, Transform, and Load, in any mode, is called an ETL operation.
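The cleaning and merging steps above can be sketched as follows, using the illustrative tables (the quantities are assumed, not taken from the original image):

```python
# Illustrative tables: Chennai rice sales and Bangalore wheat sales (kg).
rice  = {"JK": 0, "XYX": 500, "Babu": 200}
wheat = {"JK": 300, "XYX": 100, "Baabu": 150}   # "Baabu" is misspelled

# Clean: fix the misspelled key before merging, so "Baabu" does not
# become a duplicate customer.
corrections = {"Baabu": "Babu"}
wheat = {corrections.get(name, name): qty for name, qty in wheat.items()}

# Merge: the company name acts as the primary key joining both tables.
merged = {
    name: {"rice": rice.get(name, 0), "wheat": wheat.get(name, 0)}
    for name in rice.keys() | wheat.keys()
}
print(merged["Babu"])   # → {'rice': 200, 'wheat': 150}
```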
4. Data Analysis:
A) Manual Learning: Using the two tables, Kamal found that JK buys only wheat, so next time he will RECOMMEND that JK buy MORE WHEAT. XYX buys more rice and less wheat, so he will RECOMMEND a discount or an add-on service to SELL more WHEAT, thus improving his profits and growing his business 🙂
B) Automated Machine Learning:
So how is MACHINE LEARNING used here? When the business expands to many cities, Kamal simply passes this formula (ALGORITHM) to his other employees to find which product to recommend to each buyer.
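Kamal's manual rule can be written down as a small function, so any employee (or machine) in any city can apply it. The exact thresholds and wording are illustrative assumptions:

```python
def recommend(rice_kg: int, wheat_kg: int) -> str:
    """Kamal's recommendation rule, codified as an algorithm."""
    if rice_kg == 0 and wheat_kg > 0:
        return "recommend MORE WHEAT"          # wheat-only buyer, like JK
    if rice_kg > wheat_kg:
        return "offer wheat discount/add-on"   # rice-heavy buyer, like XYX
    return "no special offer"

print(recommend(0, 300))     # JK  → recommend MORE WHEAT
print(recommend(500, 100))   # XYX → offer wheat discount/add-on
```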
Where does data automation come into the picture?
In all of this, there is manual work at every step: collecting from the data sources, extracting the data and storing it in a digital system, processing the data (merging and cleaning), and finally analyzing it to find insights.
In a typical Online Retail shop, all of this has been automated.
1. Data Source: Using tracking tools (like Google Tag Manager, Tealium, or Adobe DTM), you collect the data only once and send it to N number of systems, instead of doing it manually for each source one by one. This saves time. The tag management system also collects a unique User_ID from one system and passes it to the N systems, which makes merging easy later.
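Conceptually, a tag manager does a fan-out: one tracked event, carrying the shared user ID, is delivered to N downstream systems. The system names and event shape below are assumptions for illustration, not any particular tool's API:

```python
def fan_out(event: dict, systems: list) -> dict:
    """Deliver one collected event to every downstream system."""
    received = {}
    for system in systems:
        # Every system gets the same payload, including the shared
        # user_id that later makes merging across systems easy.
        received[system] = dict(event)
    return received

event = {"user_id": "U123", "action": "purchase", "product": "wheat"}
out = fan_out(event, ["analytics", "crm", "warehouse"])
print(out["crm"]["user_id"])   # → U123
```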
2. Data Extraction and Loading: Instead of going into every system one by one each time to extract data, we automate all of the extraction through one pipeline. Find here different ETL pipelines.
3. Data Processing: Once the pipeline is built, the data is automatically combined, because we have passed a unique ID to every system. Some manual work is still needed initially for data cleansing: providing a few inputs at the start allows the system to update the data automatically in the long term (for example, writing a rule like “Babu = Baabu; Babu = Babuu”).
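The “Babu = Baabu; Babu = Babuu” logic can be captured as a reusable cleansing rule: once written, every future record flowing through the pipeline is corrected automatically. The record shape is an assumption for illustration:

```python
# One-time cleansing rules: known misspellings mapped to the canonical name.
ALIASES = {"Baabu": "Babu", "Babuu": "Babu"}

def cleanse(record: dict) -> dict:
    """Normalize the company name on an incoming record."""
    record = dict(record)  # copy, so the original record is untouched
    record["company"] = ALIASES.get(record["company"], record["company"])
    return record

print(cleanse({"company": "Babuu", "wheat_kg": 40})["company"])  # → Babu
```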
4. Data Analysis: Instead of manually finding insights every time, we let algorithms built into the machines learn by themselves, so the machine finds the patterns in the data and tells us the insights.
5. Data Visualization: The data is automatically passed through the pipeline to visualization tools like Tableau, or sent to an API where the end receiver can visualize it.