Training‎ > ‎

Data Cleansing and Reconciliation with Pentaho (October 2018)

Introduction

Enterprise spends around 85% of their time on data preparation (capture, cleansing and reconciliation) before producing a comprehensive management report or having a good dashboard.

Without an understanding and adoption of methodology and right tool, this process will become more difficult and time consuming along with the data growth.

Kettle or Pentaho Data Integration is an open source ETL tool that runs in multi-platform environment. It has more than 140 built-in modules to handle most of data sources available today.

With its rich features and speedy development by open source community, Kettle is now a favorite tool among developers to build enterprise data warehouse. It is already proven and used by hundreds companies in Indonesia.

PHI-Integration offers a 3-day course on how to manage and deliver good data with Kettle. In this course, you will learn the basic concept Pentaho from basic to advanced topic, plus bonus of our plugins for free for 10 users.

II. Who Should Attend?

This course is designed for business user or developer who need to have manage data alongside with Excel.

III. Objectives

At the completion of this course, attendee should be able to do the following.
  • Use Kettle to build simple jobs and transformations. 
  • Use PHI Tool to do data cleansing.
  • Consume data from several data sources (Excel, Word, PDF, Facebook, Twitter, web site). 
  • Be able to automate ETL process. 

IV. Course Duration

3 days (24 hours)

V. Bonus

  • PHI-Integration Data Tools package (licensed for 10 users). 

VI. Course Fee

Call

VII. Course Prerequisites

  • Have a basic understanding of Microsoft Excel. 
  • Have a basic understanding of Structured Query Language (SQL). 
  • Participants do not need to have prior knowledge of Pentaho. 

VIII. Course Requirements

  • PC or Laptop with the following minimum requirement: 
    • 2GHz CPU (4 cores) 
    • 8 GB of RAM
    • DVD Drive or USB Port 
    • 40 GB of available hard disk space. 

Softwares

  • Microsoft Windows 10. 
  • Java Runtime Environment (JRE) 1.8. 
  • Apache Tomcat. 
  • MySQL 5.0 Database Server. 
  • Pentaho Data Integration 5. 

IX. Course Outline

  1. Pentaho Data Integration Introduction 
    • What is Pentaho Data Integration (PDI) or Kettle? 
    • Benefit that PDI give for organization. 
    • PDI architecture and requirements. 
    • Installation and setup. 
  2. Spoon – Graphical User Interface for Kettle 
    • Setting File and RDBMS Repository. 
    • Design and execute "Hello World" Job!
    • Email automation using job. 
    • Transformation: Reading and Writing Excel Files. 
    • How to pivot data using Pentaho. 
    • More on job and transformation steps. 
  3. Formula and Calculation.
    • Introduction to Calculator Step.
    • How to convert various Excel formula into Pentaho formula.
    • Grouping with sum, average, mean, mod, etc. 
  4. Text / String Handling 
    • How to tidying text with various format using string operations. 
    • Advanced text handling using Replace in Strings. 
    • Split teks into fields (telephone area code and number). 
    • Split teks into rows (product names list). 
  5. Short Introduction to MariaDB / MySQL 
    • MySQL or MariaDB as most popular database? 
    • Preparing first database. 
    • Insert and query data using SQL. 
    • Importing PHI-Minimart database. 
  6. Advanced Topics 
    • Reconcile two Excel files. 
    • Using Pentaho Marketplace. 
    • Data cleansing – standardize (clean phone number, zip code, province name, university name, etc). 
    • Data cleansing – find duplicate based on Product Data (product name, brand, sku, etc) and Business Entity Name (vendor name, address, etc). 
    • Data mining using Pentaho. 
  7. Reading Other Sources (Office and Social Media) 
    • Reading Microsoft Word and convert into structured data. 
    • Reading PDF File and convert into structured data. 
    • Crawl website data. 
    • Reading Facebook data. 
    • Reading Twitter data. 
  8. Automation 
    • Using Windows Task Scheduler to schedule ETL running job and transformation. 
    • Email automation. 
    • Web crawling automation. 
    • Monitoring the automation. 
  9. Bonus: Displaying Data using Open Source Dashboard 
    • Introduction to plot.ly dash.
    • Build your first dashboard. 
    • Adding filter. 
    • Export the plots / charts into powerpoint or web site. 

To do registration for this course, please email to info@phi-integration.com.
Comments