A Beginner's Guide to Pentaho DI

Pentaho Data Integration (PDI) is a flexible tool that lets you collect data from disparate sources such as databases, files, and applications, and turn that data into a unified format that is accessible and relevant to end users. Pentaho Data Integration provides the Extraction, Transformation, and Loading (ETL) engine that handles the process of capturing the right data, cleansing it, and storing it in a uniform and consistent format.
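To make the capture, cleanse, and store sequence concrete, here is a minimal sketch in plain Python. It is not PDI code: the two sources, the `customer` table, and all function names are assumptions made for illustration, and an in-memory SQLite database stands in for the target.

```python
import csv
import io
import sqlite3

# Two "disparate sources" (both hypothetical): a CSV export and application records.
CSV_SOURCE = "id,name,country\n1, alice ,us\n2,Bob,DE\n"
APP_SOURCE = [
    {"id": 3, "name": "carol", "country": "fr"},
    {"id": 4, "name": "", "country": "us"},  # incomplete record: no name
]


def extract():
    """Yield raw records from both sources as dictionaries."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from APP_SOURCE


def transform(records):
    """Cleanse records: trim whitespace, normalize case, drop rows without a name."""
    for rec in records:
        name = str(rec["name"]).strip().title()
        if not name:
            continue  # fails the basic quality check
        yield int(rec["id"]), name, str(rec["country"]).strip().upper()


def load(rows, conn):
    """Store the cleansed rows in one uniform target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract -> transform -> load and return the stored rows."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    return conn.execute(
        "SELECT id, name, country FROM customer ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

In PDI, each of these functions would roughly correspond to one or more steps in a transformation designed graphically rather than written by hand.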

Pentaho Data Integration supports slowly changing dimensions and surrogate keys for data warehousing, permits data migration between databases and applications, is flexible enough to load very large datasets, and can take full advantage of cloud, clustered, and massively parallel processing environments.

Common uses of Pentaho Data Integration include:

• Data migration between different databases and applications

• Loading huge data sets into databases, taking full advantage of cloud, clustered, and massively parallel processing environments

• Data cleansing, with steps ranging from very simple to very complex transformations

• Data integration, including the ability to leverage real-time ETL as a data source for Pentaho Reporting

• Data warehouse population with built-in support for slowly changing dimensions and surrogate key creation.
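As a rough illustration of what slowly changing dimensions and surrogate keys mean, the sketch below keeps history for a customer dimension in plain Python (a Type 2 approach, where a change closes the old row and adds a new one). It shows the concept only, not how PDI implements it. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Optional


@dataclass
class CustomerRow:
    surrogate_key: int               # warehouse-generated key, independent of the source
    customer_id: str                 # natural (business) key from the source system
    city: str                        # attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class CustomerDimension:
    """In-memory stand-in for a dimension table that keeps history."""

    def __init__(self):
        self.rows = []
        self._next_key = count(1)    # simple surrogate key generator

    def current(self, customer_id):
        """Return the current version of a customer, or None if unknown."""
        for row in self.rows:
            if row.customer_id == customer_id and row.valid_to is None:
                return row
        return None

    def upsert(self, customer_id, city, as_of):
        """Insert a new version only when the tracked attribute changed."""
        cur = self.current(customer_id)
        if cur is not None and cur.city == city:
            return cur               # no change, keep the existing row
        if cur is not None:
            cur.valid_to = as_of     # close the old version
        new_row = CustomerRow(next(self._next_key), customer_id, city, as_of)
        self.rows.append(new_row)
        return new_row


if __name__ == "__main__":
    dim = CustomerDimension()
    dim.upsert("C1", "Berlin", date(2020, 1, 1))
    dim.upsert("C1", "Paris", date(2020, 3, 1))
    for row in dim.rows:
        print(row)
```

Because fact rows reference the surrogate key rather than the business key, older facts keep pointing at the version of the customer that was valid when they were loaded.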

AGENDA

COMPONENTS OF PENTAHO

FEATURES

PENTAHO SERVERS & STACKS

ARCHITECTURE

PENTAHO VISUALIZATION

REPOSITORY

ADVANTAGES & DISADVANTAGES

PENTAHO DATA INTEGRATION COMPONENTS

Spoon: Spoon is a desktop application that provides a graphical interface and editor for transformations and jobs. It gives you a way to create complex ETL jobs without having to read or write code. When you think of Pentaho Data Integration as a product, Spoon is what comes to mind because, as a database developer, this is the application in which you will spend most of your time. Any time you author, edit, run, or debug a transformation or job, you will be using Spoon.

Pan: A standalone command-line tool that executes transformations you created in Spoon. Pan is the data transformation engine: it reads data from and writes data to various data sources, and it also lets you manipulate data.

Kitchen: A standalone command-line tool that executes jobs designed in the Spoon graphical interface, stored either as XML files or in a database repository. Jobs are generally scheduled to run in batch mode at regular intervals.

Carte: A lightweight web container that lets you set up a dedicated, remote ETL server. It provides remote execution capabilities similar to those of the Data Integration Server but does not provide scheduling, security integration, or a content management system.
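Because Pan and Kitchen are command-line tools, they are easy to call from a scheduler or a script. The sketch below builds such command lines in Python using the Linux-style options `-file`, `-level`, and `-param`. The install directory, file paths, and parameter names are assumptions for illustration, so adjust them to your environment (on Windows the scripts are `Pan.bat` and `Kitchen.bat`).

```python
import subprocess

# Hypothetical install location; change to wherever PDI is unpacked.
PDI_HOME = "/opt/pentaho/data-integration"


def build_command(tool, file_path, params=None, log_level="Basic"):
    """Build the argument list for pan.sh (.ktr files) or kitchen.sh (.kjb files)."""
    if tool not in ("pan", "kitchen"):
        raise ValueError("tool must be 'pan' or 'kitchen'")
    cmd = [f"{PDI_HOME}/{tool}.sh", f"-file={file_path}", f"-level={log_level}"]
    # Named parameters are passed as -param:NAME=value
    for name, value in (params or {}).items():
        cmd.append(f"-param:{name}={value}")
    return cmd


def run(cmd):
    """Run the command and return its exit code (0 means it ran without problems)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical transformation and job files.
    print(build_command("pan", "/etl/load_customers.ktr", {"LOAD_DATE": "2020-07-02"}))
    print(build_command("kitchen", "/etl/nightly.kjb"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths or parameter values contain spaces.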

FEATURES OF PENTAHO

  • Report Designer − An advanced report creation tool that helps you create a complete data-driven document. It is more scalable and flexible than ad hoc reporting and is used to generate detailed, pixel-perfect reports from virtually any data source.
  • Metadata Editor − Lets you add a user-friendly metadata domain to a data source.
  • Design Studio − Used for fine-tuning reports and ad hoc reporting.
  • Pentaho user console web interface − Used for easily managing reports and analysis views.
  • Ad hoc reporting interface − Offers a step-by-step wizard for designing simple reports. Output formats include PDF, RTF, HTML, and XLS.
  • A complex scheduling sub-system − Allows users to execute reports at given intervals.
  • Mailing − Users can email a published report to other users.
  • Connectivity − Connects the reporting tools to the BI server, so that content can be published directly to the BI server.

PENTAHO SERVERS AND STACKS

The Pentaho server comes in several versions, such as open source, professional standard, professional premium, and enterprise. The stack has three layers. The presentation layer covers reporting, analysis, dashboards, and process management. Beneath it, the Business Intelligence platform provides security, administration, business logic, and the repository. The Data and Application Integration layer provides ETL, metadata, and EII. The stack can be built on top of third-party applications and data such as CRM, legacy data, OLAP, other applications, and local data.

Pentaho has a presence in all three layers, with products for the data layer, the server layer, and the client layer. The server layer was recently rebranded from BI (Business Intelligence) to BA (Business Analytics) and is now known as Pentaho Business Analytics. It can be extended through commercial as well as open-source plug-ins. Data can be published to the server, and users can run any kind of report on it. Dashboards can also be designed and displayed there. Pentaho Analyzer is meant for ad hoc reporting. The server runs by default on Apache Tomcat but can be embedded in any Java-based application server. Scheduling and monitoring are used to schedule reports, monitor them, and send them to business users. The server comes in two flavors: Community Edition (CE) and Enterprise Edition (EE).

PENTAHO DATA INTEGRATION ARCHITECTURE

The Data Integration Server is a dedicated ETL server whose main features are:

1 Execution: Executes ETL jobs and transformations using the Pentaho Data Integration engine

2 Security: Allows you to manage users and roles (default security) or integrate security with your existing security provider, such as LDAP or Active Directory

3 Content Management: Provides a centralized repository that lets you manage your ETL jobs and transformations. This includes a full revision history on content and features such as sharing and locking for collaborative development environments.

4 Scheduling: Provides the services that let you schedule and monitor activities on the Data Integration Server from within the Spoon design environment.

The Enterprise Console provides a thin client for managing deployments of Pentaho Data Integration Enterprise Edition which includes management of Enterprise Edition licenses, monitoring and controlling activity on a remote Pentaho Data Integration server, and analyzing performance trends of registered jobs and transformations.

PENTAHO VISUALIZATION

The Visualization API offers a unified way to visualize data throughout the Pentaho suite, which includes PDI, Analyzer, and CDF. It allows safe, isolated interaction between third-party applications, business logic, and visualizations.

The Visualization API is built on top of the following JavaScript APIs:

Data API: Provides integration with data sources in the Pentaho platform and with client-side component frameworks.

Type API: Offers features such as validation, metadata support, inheritance, and serialization.

Core API: Consists of core features such as theming and services, registration, consumption, and localization.

The Visualization API is used to create, deploy, and configure a visualization.

PENTAHO REPOSITORY

If your team wants a collaborative ETL (Extract, Transform, and Load) environment, we suggest using a Pentaho Repository. In addition to storing and managing your jobs and transformations, the Pentaho Repository provides a full revision history for you to track changes, compare revisions, and revert to previous versions when necessary. These features, along with organization security and content locking, make the Pentaho Repository an ideal platform for collaboration.

Connecting to a Pentaho Repository

Once a repository is created, a menu appears next to the Connect link. You can use this menu to connect to the repository.

If you are in the process of creating your first repository, selecting Connect Now will automatically take you to Step 2.

1 Select a repository in the Connect menu.

2 Log on to the repository by entering your User Name and Password credentials.

For example, User Name = admin, Password = password.

3 Click OK to exit the Repository Configuration dialog box. Your user name and the repository display name will appear in the top right corner of the PDI client toolbar.
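The same repository credentials can also be used outside the PDI client. As a hedged sketch, the Python snippet below assembles a Kitchen command line that runs a job stored in a repository, using the `-rep`, `-user`, `-pass`, `-dir`, and `-job` options. The repository name, directory, job name, install path, and the `PDI_PASSWORD` environment variable are all hypothetical.

```python
import os

# Hypothetical install path; adjust to your PDI installation.
KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"


def build_repo_job_command(repo, user, password, directory, job):
    """Build a Kitchen argument list for a job stored in a repository."""
    return [
        KITCHEN,
        f"-rep={repo}",       # repository name as defined in the PDI client
        f"-user={user}",
        f"-pass={password}",
        f"-dir={directory}",  # repository folder that contains the job
        f"-job={job}",        # job name, without a file extension
    ]


if __name__ == "__main__":
    # Read the password from the environment instead of hard-coding it.
    cmd = build_repo_job_command(
        "MyRepo", "admin", os.environ.get("PDI_PASSWORD", "password"),
        "/home/admin", "nightly_load",
    )
    # Mask the password before printing or logging the command.
    print(" ".join("-pass=***" if a.startswith("-pass=") else a for a in cmd))
```

The resulting list can be passed to `subprocess.run` when the job should actually be executed.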

ADVANTAGES OF USING PENTAHO

  • Pentaho BI is a very intuitive tool. Once you know a few basic concepts, you can work with it.
  • Simple and easy-to-use Business Intelligence tool
  • Offers a broad range of BI capabilities, including reporting, dashboards, interactive analysis, data integration, data mining, etc.
  • Comes with a user-friendly interface and offers various tools to retrieve data from multiple data sources
  • Offers a single package for working with data
  • Has a Community Edition with many contributors, alongside the Enterprise Edition.
  • Can run on a Hadoop cluster
  • JavaScript code written in the step components can be reused in other components.

DISADVANTAGES OF USING PENTAHO

  • The design of the interface can be weak, and there is no unified interface for all components.
  • Tool evolution is much slower than in other BI tools.
  • Pentaho Business Analytics offers a limited number of components.
  • Poor community support. If a component does not work, you have to wait until the next version is released.
July 2, 2020
GoLogica Technologies Private Limited  © 2019. All rights reserved.