Canadian National

Big Data Solutions / Enterprise Architect

The Canadian National Railway (reporting mark CN) is a Canadian Class I freight railway headquartered in Montreal, Quebec, that serves Canada and the Midwestern and Southern United States.

CN launched a project to create a data hub and centralize data from many business applications. A draft architecture of this data hub had already been prepared by an external company.

In this context, I was asked to complete the draft architecture and define the access security levels for this data hub.

The platform is based on the Cloudera distribution, with NiFi, Kafka, Hadoop, Spark, Spark Streaming, Hive, MongoDB and PostgreSQL.

I was in charge of:

  • Design the target architecture of the data hub platform.
  • Define and design the security strategy and user access control levels, based on roles and tags (Apache Atlas, Ranger, Knox and Kerberos).
  • Define the strategy and the roadmap to secure the Kafka platform cluster (see the client configuration sketch after this list).
  • Act as a Big Data technology advisor.
  • Implement a POC on AWS to validate the security policy.
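
As an illustration of the client-side configuration covered by this Kafka security roadmap, here is a minimal Scala sketch of a producer connecting to a Kerberos-secured cluster over TLS. The broker address and topic name are hypothetical placeholders, and the Kerberos principal and keytab are assumed to be supplied through the client's standard JAAS configuration.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SecuredProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Hypothetical secured listener of the Kafka cluster
    props.put("bootstrap.servers", "kafka-broker-1.internal:9093")
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    // Kerberos authentication (GSSAPI) over TLS
    props.put("security.protocol", "SASL_SSL")
    props.put("sasl.mechanism", "GSSAPI")
    props.put("sasl.kerberos.service.name", "kafka")

    val producer = new KafkaProducer[String, String](props)
    try {
      // "datahub.events" is a placeholder topic name
      producer.send(new ProducerRecord[String, String]("datahub.events", "event-1", "payload"))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```

On the cluster side, Ranger can then be used to restrict which authenticated principals may produce to or consume from each topic.
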
Methodology

A development platform was already in place. My job was to design the target architecture for the production environment.

In this context, I defined several milestones, such as storage and ingestion, tied to business needs and priorities.

The last milestone defined the final target architecture as well as the platform access security strategy.

Key Details

Role: Enterprise / Solutions Big Data Architect

Project Date: 2019

Project duration: 6 months

Location: Montréal, Canada

Technologies: MS Azure, Cloudera (Hortonworks), Spark (Scala), Kafka, NiFi, PostgreSQL, Atlas, Ranger, Hadoop, Hive

Main Steps

Defining access and security strategy

Definition of the data access strategy: use of Apache Atlas and Apache Ranger to define access through Role-Based Access Control (RBAC) and Tag-Based Access Control (TBAC).
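
To make the role-based part concrete, below is a minimal sketch of how such a policy could be registered programmatically, assuming Ranger's public v2 REST API. The Ranger Admin host, credentials, Hive service name ("cm_hive"), database, table and group names are all hypothetical placeholders; tag-based policies driven by Atlas classifications follow the same pattern against a tag service.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

object RangerPolicySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Ranger Admin endpoint and credentials
    val rangerAdmin = "http://ranger-admin.internal:6080"
    val auth = Base64.getEncoder.encodeToString("admin:********".getBytes("UTF-8"))

    // A role-based policy: members of the "analysts" group may run SELECT
    // on every column of the hypothetical datahub.customers Hive table.
    val policy =
      """{
        |  "service": "cm_hive",
        |  "name": "datahub_customers_read",
        |  "isEnabled": true,
        |  "resources": {
        |    "database": { "values": ["datahub"] },
        |    "table":    { "values": ["customers"] },
        |    "column":   { "values": ["*"] }
        |  },
        |  "policyItems": [
        |    { "groups": ["analysts"],
        |      "accesses": [ { "type": "select", "isAllowed": true } ] }
        |  ]
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$rangerAdmin/service/public/v2/api/policy"))
      .header("Authorization", s"Basic $auth")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(policy))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"HTTP ${response.statusCode()}: ${response.body()}")
  }
}
```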

Implementation of a POC on Amazon Web Services to validate the concepts and security strategy.

Preparing the environment on MS Azure
  • Definition of the target production architecture of the data platform.
  • Data flow ingestion using NiFi and Kafka (see the streaming sketch after this list).
  • Data analysis using Apache Spark.
  • Structured data storage using PostgreSQL.
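
To show how these three building blocks fit together, here is a minimal Spark Structured Streaming sketch in Scala that consumes the Kafka topic fed by NiFi and appends each micro-batch to PostgreSQL over JDBC. The broker address, topic, table, checkpoint path and connection details are hypothetical placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataHubIngestionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("datahub-ingestion-sketch")
      .getOrCreate()

    // Raw events published by NiFi onto a Kafka topic (placeholder names)
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker-1.internal:9093")
      .option("subscribe", "datahub.events")
      .load()
      .selectExpr("CAST(key AS STRING) AS event_key", "CAST(value AS STRING) AS payload")

    // Append each micro-batch to a PostgreSQL table over JDBC
    // (the PostgreSQL JDBC driver must be on the Spark classpath).
    val writeToPostgres = (batch: DataFrame, batchId: Long) => {
      batch.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://pg-host.internal:5432/datahub")
        .option("dbtable", "raw_events")
        .option("user", "datahub_app")
        .option("password", sys.env.getOrElse("PG_PASSWORD", ""))
        .mode("append")
        .save()
    }

    events.writeStream
      .foreachBatch(writeToPostgres)
      .option("checkpointLocation", "/tmp/checkpoints/datahub-ingestion")
      .start()
      .awaitTermination()
  }
}
```
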
POC Infrastructure Preparation

Review of the data hub technical environment (based on the Cloudera CDH distribution) and validation of the related business needs.