Course: Data Science

Course Fee: Rs 24,000/-

Basics of Business Analytics

Business analytics is the practice of iterative, methodical exploration of an organization's data, with an emphasis on statistical analysis. Business analytics is used by companies committed to data-driven decision-making. It is about using your data to derive information, insights, knowledge, and recommendations. Businesses use business analytics to improve effectiveness and efficiency of their solutions.

In this module, I will talk about how analytics has progressed from simple descriptive analytics to predictive and prescriptive analytics. I will walk through multiple examples to illustrate each of these and discuss various industry use cases. I will also introduce multiple components of big data analysis, including data mining, machine learning, web mining, natural language processing, social network analysis, and visualization. Lastly, I will provide some tips to help learners of data science learn it and apply it successfully in their projects.

  • Descriptive analytics, predictive analytics, prescriptive analytics
  • Brief Introduction about Components of Big Data Analysis
  • Introduction to Hadoop and Big Data Infrastructure
  • Introduction to Data Mining
  • Introduction to Machine Learning
  • Introduction to Natural Language Processing
  • Introduction to Information Retrieval
  • Introduction to Web Mining
  • Introduction to Social Network Analytics
  • Introduction to IoT
  • Introduction to Visualization
  • Applications of Big Data Analytics
  • Challenges in Applying Analytics to Business Problems
  • Tips on Career in Data Science

Python for Data Science

Python and R are currently the two most popular programming languages among data scientists. Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably through its use of significant whitespace. Python is open source, has strong community support, is easy to learn, and works well for quick scripting, production deployments, and web development.

In this module, I will start with the basics of the Python language. We will intermix theory with hands-on exercises, and I will use Jupyter notebooks for the hands-on parts. I will also discuss in detail topics like control flow, input/output, data structures, functions, regular expressions, and object orientation in Python. Closer to data science, I will discuss popular Python libraries like NumPy, Pandas, SciPy, Matplotlib, Scikit-Learn, and NLTK.

  • Why Python
  • Python Installation
  • Python 2.7 Vs 3.x
  • Introduction to Essential Python Libraries
  • Introduction to IPython and Jupyter Notebooks
  • Python Language Basics: Indentation, Comments, Function Calls, Variables and Argument Passing
  • Python Language Basics: Types, Duck-Typing, Import
  • Python Language Basics: Binary operators, Comparisons, Mutable
  • Python Language Basics: Standard Data types in Python
  • Python Language Basics: Command Line Arguments
  • Loops: for, while
  • Conditional Execution
  • Input, output, Eval, Print
  • repr, str, zfill
  • File IO
  • JSON I/O with Python Dictionary
  • JSON I/O with Generic objects
  • JSON I/O Serialization and Deserialization
  • JSON I/O File
  • Introduction to Pickle
  • cPickle
  • Pickle and Multi-Processing
  • Tuples
  • List
  • Sorting, Searching, Slicing
  • Built-In Functions-Enumerate, Sort, Zip, Reversed
  • Dictionary
  • Sets
  • Lists, Sets and Dict Comprehensions
  • Introduction to Functions and Variable Length Argument
  • Namespace, Scope, Local Functions, Local vs Global Variables
  • Returning multiple values, Pass by Reference
  • Functions are objects
  • Recursive functions, Anonymous (Lambda) Functions
  • Currying, Generators
  • Itertools Module
  • Errors and Exception Handling
  • Python Modules and Packages
  • Object-Oriented Nature of Python
  • Class Inheritance, overriding, overloading, Data Hiding
  • Searching for patterns, matching groups
  • Regular expression flags
  • split, findall, finditer
  • Repetition syntax
  • Character sets, Exclusion, Character Ranges, Escape Codes
  • Substitution
  • Greedy vs non-greedy matching
  • Backreferences and anchors
  • Capturing parts of pattern match
  • split and zero-width assertions
  • Look-arounds
  • Introduction to Numpy and ndarrays
  • Datatypes of ndarrays
  • Arithmetic operations, Indexing, Slicing
  • Boolean and fancy indexing
  • Basic ndarray operations
  • Array-oriented programming with arrays
  • Conditional, Statistical and Boolean operation
  • Sorting and set operation
  • File IO with NumPy
  • Linear Algebra for Numpy
  • Reshaping, Concatenating and Splitting Arrays
  • Broadcasting
  • Series Data Structures
  • DataFrame
  • Index objects
  • Reindexing
  • Dropping entries from an axis
  • Indexing, Selection and Filtering
  • Arithmetic and Data Alignment
  • Operations between DataFrame and Series
  • Function Application and Mapping
  • Sorting and Ranking
  • Axis indexes with duplicate labels
  • Computing Descriptive Statistics
  • pct_change(), Correlation and Covariance, Unique values, Value counts and membership
  • Introduction to Matplotlib
  • Colours, Markers and line styles
  • Customization of Matplotlib
  • Plotting with Pandas
  • Barplots, Histograms plots, Density Plots
  • Introduction to Seaborn, Style Management
  • Controlling figure aesthetics
  • Colour Palettes
  • Plotting univariate Distribution
  • Plotting bivariate Distribution
  • Visualizing pairwise relationship in pairplots
  • Plotting with Categorical Data
  • Visualizing Linear Relationships
  • Plotting on Data-aware grids
  • Other Python Visualization tools
  • Linear Algebra in SciPy
  • Sparse Matrices in SciPy
  • Constants, Cluster and FFT Packages
  • Integration using SciPy
  • Interpolation in SciPy
  • SciPy I/O, SciPy ndimage
  • Optimization and root finding
  • SciPy.Stats
  • Introduction to Scikit-Learn and Machine Learning
  • Sample Dataset in Scikit-Learn
  • Train/Test Split using Scikit-Learn
  • Iris Classification using Decision Trees
  • Holdout Validation, K-fold Cross Validation
  • Cross Validation using Scikit-Learn
  • K-means Clustering in Scikit-Learn
  • Introduction to the Natural Language Toolkit (NLTK)
  • Tokenization, Lower casing and removing stop words, Lemmatization, Stemming
  • ngrams, Sentence tokenization, Part of speech tagging
  • Chunking, Named Entity Recognition
  • Introduction to WordNet, and word sense disambiguation
  • Word ladders game
  • Data Analysis and Prediction using the Loan Prediction Dataset

R for Data Science

While Python was used by many programmers even before they were introduced to data science, R focuses mainly on statistics, data analysis, and graphical models, and is meant mainly for data science. Just like Python, R also has very good community support. Python is good for beginners, while R is good for experienced data scientists. R provides some of the most comprehensive statistical analysis packages.

In this module, I will again cover both theory and hands-on work on various aspects of R. I will use RStudio for the hands-on parts. I will discuss basic programming aspects of R as well as visualization using R. Then, I will talk about how to use R for exploratory data analysis, for data wrangling, and for building models on labeled data. Overall, I will cover what you need to do good data science using R.

    • R Vs Python
    • Basics of R
    • Data Exploration in R
    • Customizations for ggplot in R
    • Common Problems, Facets, Geoms
    • Statistical Transformation
    • Position Adjustments
    • Coordinate Systems
    • Introduction to RStudio
    • RStudio Editor
    • Keyboard shortcuts
    • RStudio Diagnostics
    • Introduction to dplyr
    • dplyr-filter
    • dplyr-arrange, select
    • dplyr-mutate
    • dplyr-summarize
    • dplyr-Grouping and Ungrouping
    • Introduction to Exploratory Data Analysis
    • Variation
    • Covariation
    • Introduction to Data Wrangling and Tibbles
    • Tibbles Vs Data Frames
    • Introduction to readr and read_csv
    • Parsing a Vector
    • Parsing a file using readr
    • Writing to files
    • Introduction to tidy data
    • Spreading and Gathering
    • Separating and Uniting
    • Missing Values
    • Relational Data and Keys
    • Mutating joins in dplyr
    • Filtering joins and Set operations
    • Introduction to Strings and Combining Strings
    • Regular Expressions
    • Creating Factors using forcats
    • Visualization and reordering of categorical variables
    • Creating Date/Time objects
    • Date/Time Components
    • Time Spans
    • Details about Pipe operator
    • Tools in magrittr
    • Functions in R
    • Conditional execution and function arguments
    • Variable Arguments in R
    • Return values in R
    • Basics of vector in R
    • Basics of Atomic vectors
    • Coercion, Test functions and Recycling rules
    • Naming and subset
    • Lists
    • Augmented vectors
    • For loop and variations
    • Passing functions as arguments
    • Map Functions
    • Dealing with failure
    • Advanced purrr
    • Other patterns of for loops
    • Introduction to modeling
    • Building your first simple model in R
    • Visualizing models in R
    • Modeling with categorical variables
    • Modeling with mix of categorical variables
    • Data Analysis using R: Why Are Low-Quality Diamonds More Expensive?

Probability and Statistics

Probability and statistics help in understanding whether data is meaningful. They include inference, testing, and other methods for analyzing patterns in data and using them to predict, understand, and improve results.

We live in an uncertain and complex world, yet we continually have to make decisions in the present with uncertain future outcomes. To study, or not to study? To invest, or not to invest? To marry, or not to marry? This uncertainty is what is captured mathematically using the notion of probability. Statistics, on the other hand, helps us analyze data sets and correctly interpret results to make solid, evidence-based decisions.

In this module, I will discuss some very fundamental terms and concepts from probability and statistics that come up often in the literature on Machine Learning and AI. Key topics include quantifying uncertainty with probability, descriptive statistics, point and interval estimation of means, the central limit theorem, and the basics of hypothesis testing.
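As a small illustration of two of the key topics named above (the central limit theorem and interval estimation of a mean), here is a minimal Python sketch using only the standard library. The function names, the exponential population, the sample sizes, and the fixed value z = 1.96 (a large-sample normal approximation for a 95% interval) are all assumptions chosen for illustration, not material taken from the course.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean.

    Uses the normal approximation (z = 1.96), which is an assumption that
    is only reasonable for fairly large samples.
    """
    m = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * standard_error, m + z * standard_error


# A deliberately skewed (exponential) population, generated for illustration.
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Central limit theorem: the means of many samples cluster around the
# population mean, with spread close to sigma / sqrt(sample_size).
means = sample_means(population, sample_size=50, n_samples=2_000)
print("population mean:      ", round(pop_mean, 3))
print("mean of sample means: ", round(statistics.mean(means), 3))
print("std of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):      ", round(statistics.pstdev(population) / 50 ** 0.5, 3))

# Interval estimate of the mean from a single sample of 200 values.
print("approx. 95% CI:", mean_confidence_interval(population[:200]))
```

For small samples with an unknown population standard deviation, a t-based interval would be the more appropriate choice than the fixed z value used in this sketch.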

    • Introduction to Probability
    • Events, Sample space, Simple Probability, Joint Probability
    • Mutually Exclusive events, collectively exhaustive events, marginal probability
    • Addition Rule
    • Conditional Probability
    • Multiplication Rule
    • Bayes theorem
    • Counting rules (caution: advanced material)
    • What are probability distributions
    • Poisson Probability Distribution
    • Normal Probability Distribution
    • Binomial Probability Distribution
    • Central Limit Theorem
    • CLT Example
    • CLT Using R-code
    • Confidence Intervals of Mean
    • Confidence Intervals of Mean Examples
    • Confidence interval of mean in details
    • Confidence interval for the mean with population standard deviation unknown
    • Confidence interval using Python
    • What do confidence intervals actually mean
    • Confidence intervals for population mean with unknown population standard deviation using Python
    • What is hypothesis testing? Null and alternative hypotheses
    • Hypothesis testing for population mean: Type I and Type II errors
    • 1-tailed hypothesis testing (known sigma)
    • 2-tailed hypothesis testing (known sigma)
    • Hypothesis testing (unknown sigma)
    • 2-sample tests
    • Independent 2-sample t-tests
    • Paired 2-sample t-tests
    • Chi-squared tests of independence
    • Descriptive Vs Inferential statistics
    • Central Tendency (mean, median, mode)
    • Measures of dispersion (Range, IQR, std dev, variance)
    • Five Number summary and skew
    • Graphic displays of basic statistical descriptions
    • Correlation Analysis

Machine Learning

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine Learning is a first-class ticket to the most exciting careers in data science. As data sources proliferate along with the computing power to process them, automated predictions have become much more accurate and dependable. Machine learning brings together computer science and statistics to harness that predictive power. It’s a must-have skill for all aspiring data analysts and data scientists, or anyone else who wants to wrestle all that raw data into refined trends and predictions.

In this module, I will talk broadly about supervised as well as unsupervised learning. We will talk about multiple types of classifiers like Naïve Bayes, KNN, decision trees, SVMs, artificial neural networks, logistic regression, and ensemble learning. Further, we will also talk about linear regression analysis and sequence labeling using HMMs. As part of unsupervised learning, I will discuss clustering as well as dimensionality reduction. Finally, we will also briefly discuss semi-supervised learning, multi-task learning, architecting ML solutions, and a few ML case studies.

    • Introduction to machine learning
    • Supervised, semisupervised, unsupervised machine learning
    • Types of data sets
    • data() in R
    • Introduction to classification
    • Introduction to Decision tree
    • Hunt's algorithm for learning a decision tree
    • Details of tree induction
    • GINI index computation
    • ID3, Entropy and information gain
    • ID3 Example
    • C4.5
    • Pruning
    • Metrics for performance Evaluation
    • Iris Decision Tree Example
    • Introduction to KNN algorithm
    • Decision boundary KNN Vs Decision tree
    • What is the best K
    • KNN Problems
    • Feature selection using KNNs
    • Wilson Editing
    • KNN Imputation
    • Speeding up KNN using KMeans
    • Coding up KNN from scratch in Python
    • KNN using sklearn
    • Digits classification using KNN in Python
    • Examples of few text classification problems
    • Classification for text using bag of words
    • Naïve Bayes for text classification
    • Multinomial Naïve Bayes
    • Multinomial Naïve Bayes Example
    • Naïve Bayes for Hand-written digit recognition
    • Naïve Bayes for weather data
    • Numeric stability issue with Naïve Bayes
    • Gaussian Naïve Bayes from scratch in Python
    • Naïve Bayes using sklearn
    • Multinomial Naïve Bayes
    • Linear Classifiers
    • Margin of SVMs
    • SVM optimization
    • SVM for Data which is not linearly separable
    • Learning non-linear patterns
    • Kernel Trick
    • SVM Parameter Tuning
    • Handling class imbalance in SVMs
    • SVMs: pros and cons and summary
    • Linear SVM using Python
    • SVM with RBF kernel with Python
    • Learning SVM with noisy data in Python
    • Introduction to Ensemble learning
    • Why Ensemble learning
    • Independently constructed ensembles for classification: Majority voting
    • Independently constructed ensembles for classification: Bagging
    • Independently constructed ensembles for classification: Random forests
    • Independently constructed ensembles for classification: Error correcting output codes
    • Sequentially constructed ensembles for classification: Boosting
    • Sequentially constructed ensembles for classification: Boosting example
    • Sequentially constructed ensembles for classification: Stacking
    • Introduction to gradient boosted machines (GBM)
    • Relation between GBM and Gradient Descent
    • GBM regression with squared loss
    • Bagging in Python
    • Random forests in Python
    • Boosting in Python
    • Feature importance using ensemble classifiers
    • XGBoost in Python
    • Parameter tuning for GBMs
    • Voting classifier using sklearn
    • Motivation for Artificial Neural Network
    • Mimicking a single neuron, Integration Function, Activation Function
    • Perceptron Algorithm
    • Perceptron Algorithm Example
    • Decision Boundary for a single Neuron
    • Learning Non-Linear Patterns
    • Introduction to Deep Learning
    • What can we achieve using a single hidden layer
    • MLPs with Sigmoid activation Function
    • Layers are transformations into a new space
    • Playing at the TensorFlow Playground
    • Cost function, Loss function, Error Surface
    • How to learn Weights
    • Stochastic Gradient descent, Minibatch SGD, Momentum
    • Choosing a learning Rate
    • Updaters
    • Back Propagation
    • Softmax and Binary/Multi-class cross entropy loss
    • Overfitting and Regularization
    • Practical Advice on using Neural Networks
    • Autonomous Vehicles
    • Automated Feature Learning using Neural Networks
    • Deep Learning Architectures and Libraries
    • Applications of Artificial Neural Networks
    • History of Artificial Neural Networks and Revival
    • Python Code: Basic Introduction to TensorFlow: Constants, Placeholders and Variables
    • Python Code: Learning the first TensorFlow model: Linear Regression using TensorFlow
    • Python Code: MLP for Hand-written digit recognition with no hidden layer with 10 output neurons
    • Python Code: MLP for Hand-written digit recognition with two hidden layers
    • Python Code: Fashion Multi-class classification using MLP in Keras
    • Introduction to Linear Regression
    • Understanding the real meaning of Linear Regression
    • R^2: Coefficient of Determination
    • Multiple Linear Regression and Non-linear Regression
    • Assumptions for Linear Regression
    • Using Residual to Verify the Assumptions for Linear Regression
    • Deriving Linear Regression Formulas using Ordinary Least Squares Method
    • Multiple Linear Regression
    • Underfitting, Overfitting, Bias and Variance
    • Ridge Regularization
    • Lasso Regularization, Elastic Net Regularization
    • Metrics and Practical Considerations for Regression
    • Python code: Simple Linear Regression using sklearn
    • Python code: Example to code up regression using ordinary least squares method
    • Python code: Multiple Linear Regression using Gradient Descent based approach
    • Python code: Multiple Linear Regression using sklearn
    • Python code: Ridge and Lasso Regression
    • Logistic regression vs Linear Regression
    • Can we use Regression Mechanism for Classification?
    • Logistic Regression – Deriving the Formula
    • Logistic Regression for Multi-class Classification
    • Logistic Regression Decision Boundary
    • Python Code: Logistic regression on the Titanic dataset - Part 1
    • Python Code: Logistic regression on the Titanic dataset - Part 2
    • Python Code: Logistic regression on the Titanic dataset - Part 3
    • Python Code: Logistic regression on the Titanic dataset - Part 4
    • Python Code: Visualizing a logistic regression model
    • What is feature selection? Why feature selection?
    • Feature selection vs feature extraction
    • Feature subset selection using Filter based methods
    • More Filter based methods for feature selection
    • Wrapper Methods and their Comparison with Filter Methods
    • Wrapper Methods
    • Embedded Methods
    • Model based machine learning with regularization
    • Regularization using L2
    • Regularization using L1
    • Python Code: Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
    • Python Code: Recursive Feature Elimination -- wrapper
    • Python Code: Choosing important features (feature importance)
    • Python Code: Feature Selection using Variance Threshold
    • Introduction to Sequence Learning
    • Sequence Labeling as Classification
    • Probabilistic Sequence Models
    • Hidden Markov Model
    • Details about HMMs
    • Dishonest Casino Example of an HMM
    • Three Problems of an HMM
    • Decoding Problem of an HMM and the Viterbi Algorithm
    • Evaluation Problem of an HMM
    • The Forward Algorithm
    • The Backward Algorithm and the Posterior Decoding
    • The Learning Problem of an HMM, The Baum Welch Algorithm
    • Conditional Random Fields (CRFs)
    • Why prefer CRFs over HMMs?
    • Python code: Creating a simple Gaussian HMM
    • Python code: Learning a Gaussian HMM
    • Python code: Sampling from HMM
    • Python Code: Use CoNLL 2002 data to build a NER system: Understand the dataset
    • Python Code: Use CoNLL 2002 data to build a NER system: Define features
    • Python Code: Use CoNLL 2002 data to build a NER system: Learn and evaluate the CRF
    • Python Code: Use CoNLL 2002 data to build a NER system: Hyper-parameter Optimization
    • Python Code: Use CoNLL 2002 data to build a NER system: Feature Importances
    • Applications of Clustering
    • Understanding Distance
    • Basics of Clustering
    • Hierarchical (Agglomerative) clustering Part 1
    • Hierarchical (Agglomerative) clustering Part 2
    • K-means Algorithm example
    • K-means Algorithm details
    • Problems with K-means
    • Evaluation of cluster quality
    • Engineering issues with clustering
    • Soft clustering and EM algorithm example
    • Clustering summary
    • Python code: K-means Example
    • Python code: K-means on digits Example
    • Python code: Clustering for color compression
    • Mini Batch KMeans
    • Python code: Agglomerative Hierarchical Clustering
    • Ensemble Methods for Clustering: Problem Definition
    • Ensemble Methods for Clustering: Image Segmentation
    • Ensemble Methods for Clustering: Broad Approach
    • Ensemble Methods for Clustering: Finding Corresponding Clusters
    • Ensemble Methods for Clustering: Combining Corresponding Clusters
    • Why PCA?
    • PCA: A Layman's Introduction
    • Understanding Matrix Transformations and Definition of Eigenvectors
    • How is PCA Computed?
    • PCA Examples
    • Relationship between PCA, Curve Fitting and Entropy
    • Eigenfaces in OpenCV
    • Kernel PCA
    • Python Code: Compute PCA and show components
    • Python Code: PCA as dimensionality reduction
    • Python Code: PCA for visualization: Hand-written digits
    • Python Code: Eigenfaces
    • LDA
    • PCA vs LDA
    • 2 class LDA
    • 2 class LDA: Computing within and Between Class Scatter
    • 2 class LDA Full Example
    • LDA for C classes
    • Limitations of LDA
    • Python Code: LDA on Wine dataset
    • Python Code: LDA from Scikit Learn on Iris dataset
    • Python Code: LDA on Iris dataset from scratch
    • Machine Learning Process
    • Qualities of a Classifier
    • Technical Practical Issues in ML
    • Non-Technical Practical Issues in ML
    • Machine Learning for Healthcare – Part 1
    • Machine Learning for Healthcare – Part 2
    • Machine Learning for Internet Service Providers
    • Machine Learning for People Analytics
    • Machine Learning for Retail and Telecom – Part 1
    • Machine Learning for Retail and Telecom – Part 2
    • Machine Learning for Supply Chain Management
    • Machine Learning for Agriculture
    • Machine Learning for Education
    • Machine Learning for Transportation and self-driving cars
    • Machine Learning for Connected Cars
    • Machine Learning for Legal Domain – Part 1
    • Machine Learning for Legal Domain – Part 2
    • Machine Learning for Oil Industry
    • Machine Learning for Banking Domain – Part 1
    • Machine Learning for Banking Domain – Part 2
    • Machine Learning for Insurance
    • Machine Learning for Project Management
    • Machine Learning for Fashion Industry
    • Other use-cases of Machine Learning
    • Learning various classifiers on Iris dataset
    • MLP for hand-written digit recognition
    • Logistic regression on the Titanic dataset
    • Use CoNLL 2002 data to build a NER system

Data Mining

The area of Data Mining specifically deals with topics like pattern mining, OLAP, data cubes, and outlier detection. Frequent pattern mining deals with mining frequent subsets, subsequences, or subgraphs from transactional, sequence, or graph datasets respectively. These are very useful for basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. OLAP enables users to quickly analyze information that has been summarized into multidimensional views and hierarchies. By summarizing predicted queries into multidimensional views prior to run time, OLAP tools provide the benefit of increased performance over traditional database access tools. Outlier analysis has numerous applications in a wide variety of domains such as the financial industry, quality control, fault diagnosis, intrusion detection, web analytics, and medical diagnosis.

In this module, I will cover basic methods for pattern mining like Apriori and FP growth. I will also cover basic concepts in OLAP and in outlier detection.
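To make the idea of mining frequent subsets from transactional data more concrete, the following minimal Python sketch hand-computes support and confidence on a tiny basket dataset. The transactions, function names, and the 0.6 support threshold are made up for illustration. The pair counting is plain brute force; it is not Apriori or FP growth, which avoid enumerating every candidate.

```python
from collections import Counter
from itertools import combinations

# Hypothetical basket data: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Brute-force count of all item pairs; keep those meeting min_support."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


print("support({diapers, beer}):  ", support({"diapers", "beer"}, transactions))
print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}, transactions))
print("frequent pairs:", frequent_pairs(transactions, min_support=0.6))
```

Here three of the five baskets contain both diapers and beer, and four contain diapers, so the rule diapers -> beer has support 3/5 and confidence 3/4.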

    • What is frequent pattern mining? What are the applications?
    • Understanding frequent patterns, association rules, support and confidence
    • Apriori Frequent Pattern Mining Method
    • Improving Apriori Frequent Pattern Mining Method: Fewer scans
    • FP Growth Frequent Pattern Mining Method: Building an FP tree
    • FP Growth Frequent Pattern Mining Method: Creating Conditional Pattern Bases
    • FP Growth Frequent Pattern Mining Method: Extracting Frequent Patterns
    • Comparing Apriori with FP Growth
    • ECLAT: Frequent Pattern Mining with Vertical Data Format
    • Which association rules are interesting? Lift, Chi Square
    • Which association rules are interesting? Null invariance
    • Understanding closed patterns and max patterns
    • Summary of frequent pattern mining
    • Python code: Hand-computing support and confidence
    • Python code: Association Rule Mining
    • Python code: Apriori
    • Python code: Evaluating lift for association rules
    • Python code: Problem on computing association rules with 100% confidence
    • Python code: Orange way of computing association rules and frequent patterns
    • Basic Concepts in Data Warehousing
    • OLTP vs OLAP
    • Data Warehouse Architecture
    • Data Warehouse Modeling: Data Cubes
    • Conceptual Modeling of Data Warehouses
    • Concept Hierarchies and Types of Measures
    • Data Cube Example
    • OLAP Operations
    • Data Warehouse: Design and Usage
    • Data Cube Computation and Query Processing
    • Data Cube Computation: Preliminary Concepts
    • Efficient Data Cube Computation
    • Multi-Way Array Aggregation
    • Bottom-Up Computation (BUC)
    • High-Dimensional OLAP – Part 1
    • High-Dimensional OLAP – Part 2
    • Introduction to Sampling Cube
    • Query Expansion in Sampling Cube
    • Python Code: Introduction to OLAP and OLAP Server API in Python Cubes 1.1
    • Python Code: Loading data, specifying model and building aggregates in Python Cubes 1.1
    • What are outliers? What is outlier analysis?
    • Broad overview of outlier detection Methods
    • Statistical Methods for Outlier Detection
    • Proximity based Methods for Outlier Detection: Distance based outliers
    • Proximity based Methods for Outlier Detection: Density based outliers
    • Clustering based Methods for Outlier Detection
    • Classification based Methods for Outlier Detection
    • Outlier Detection for high dimensional data
    • Python Code: Remove values > 2 std dev from mean
    • Python Code: Percentile based outliers vs median absolute deviation based outliers
    • Python Code: Example of using LOF for outlier detection
    • Python Code: Example of using Cluster-based Local Outlier Factor (CBLOF) for outlier detection
    • Python Code: Example of using one class SVM for outlier detection using pyod
    • Python Code: Example of using PCA for outlier detection
    • Python Code: One class SVM using scikit learn for outlier detection

Text Mining and Analytics

Text mining includes techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making, with an emphasis on statistical approaches that can be generally applied to arbitrary text data in any natural language with no or minimum human effort.

This module introduces the learner to the basics of text mining and text manipulation. The basics of text processing, including regular expressions, are covered in the R and Python modules themselves, and text classification is covered in the machine learning module. In this module, I will talk about further interesting topics in text mining such as n-gram models, Named Entity Recognition, Natural Language Processing, Sentiment Analysis, Summarization, and Topic Modeling.
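As a taste of the n-gram models mentioned above, here is a minimal Python sketch of a bigram (2-gram) model used for next word prediction by picking the most frequent follower. The toy corpus, the whitespace tokenization, and the function names are assumptions made for illustration; a realistic model would use a proper tokenizer, a much larger corpus, and smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    tokens = text.lower().split()  # naive whitespace tokenization
    model = defaultdict(Counter)
    for prev_word, next_word in zip(tokens, tokens[1:]):
        model[prev_word][next_word] += 1
    return model


def bigram_probability(model, prev_word, next_word):
    """Maximum-likelihood estimate of P(next_word | prev_word), unsmoothed."""
    followers = model.get(prev_word)
    if not followers:
        return 0.0
    return followers[next_word] / sum(followers.values())


def predict_next(model, word):
    """Return the most frequent follower of word, or None if word is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, made up for this example.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigrams(corpus)

print(predict_next(model, "the"))               # most frequent word after "the"
print(bigram_probability(model, "the", "cat"))  # count(the cat) / count(the *)
print(predict_next(model, "dog"))               # unseen word, so no prediction
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so "cat" is the predicted next word and its estimated probability is 2/4.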

    • Next Word Prediction
    • Learning n-gram models
    • Text Generation using n-gram models
    • Handling low frequency n-grams
    • Google n-grams
    • Evaluation of n-gram models
    • Information Retrieval using language models
    • Query Likelihood Model
    • Smoothed Query Likelihood Model
    • Laplace Smoothing
    • Jelinek-Mercer Smoothing
    • Dirichlet Smoothing and Two-Stage Smoothing
    • Overall IR Language Model
    • Python code: Building N-Gram models
    • Python code: Next word prediction using 2-gram models (max prob)
    • Python code: Next word prediction using 2-gram models (Weighted random choice based on freq)
    • Python code: Creating Tri-grams and higher n-gram models
    • Python code: Generating text using n-gram models with n>=3
    • Python code: Laplace Smoothed n-grams
    • Python code: Computing perplexity
    • What is NER?
    • Why is NER challenging?
    • Applications of NER
    • Annotation and Evaluation for NER
    • Broad Approaches for NER
    • Rule based Approaches for NER: List lookup approach
    • Rule based Approaches for NER: Shallow parsing approach
    • Rule based Approaches for NER: Shallow parsing approach with context
    • Learning based Approaches for NER
    • Python Code: Read text file, extract sentences and words
    • Python Code: Part of Speech Tagging and NER
    • Python Code: Chunking/NER visualization
    • Python Code: Get complete Person Names and Location Names from any text
    • What is NLP?
    • List of NLP Tasks
    • Why is NLP challenging?
    • Tokenization
    • Lemmatization and Stemming
    • Sentence Segmentation
    • Phrase Identification
    • Word Sense Disambiguation: Part 1
    • Word Sense Disambiguation: Part 2
    • Parsing
    • Python Code: Word Tokenization with nltk
    • Python Code: Stemming and Lemmatization with nltk
    • Python Code: Tokenization, Word Counts, Stop Word removal, and Text Normalization using Italian recipes data
    • Python Code: Text Processing with Conference Abstracts Dataset
    • Python Code: Text Classification for Reuters Dataset using Scikit-Learn
    • Applications of Sentiment Analysis
    • Word Classification based Approach for Sentiment Analysis
    • Naïve Bayes for Sentiment Analysis
    • Challenges in Sentiment Analysis
    • Sentiment Lexicons
    • Learning Sentiment Lexicons: “and” and “but”
    • Learning Phrasal Sentiment Lexicons: Turney’s Algorithm
    • Learning Sentiment Lexicons: WordNet approach
    • Learning Sentiment Lexicons: Domain specific
    • Python Code: Basic Sentiment Analysis using Naive Bayes and sentiment dictionaries
    • Python Code: Sentiment Analysis on Movie Reviews Dataset
    • Python Code: Sentiment analysis on Twitter Data obtained via Tweepy
    • What is Summarization? What are its applications?
    • Genres and Types of Summaries
    • Position-based, cue phrase-based and word frequency-based approaches for extractive summarization
    • LexRank
    • Problems with Extractive Summarization Methods
    • Cohesion-based Methods
    • Lexical Chains Method for Extractive Summarization
    • Information Extraction based Method for Extractive Summarization
    • Interpretation Methods for Summarization
    • Multi-document Summarization
    • Evaluating Summaries – Extrinsic vs Intrinsic
    • Evaluating Summaries – ROUGE and BLEU
    • Python Code: Write a Simple Summarizer in Python from Scratch
    • Python Code: Text Summarization using Gensim (uses TextRank based summarization)
    • Python Code: Text Summarization using sumy (LSA, Word freq method, cue phrase method)
    • Python Code: LexRank using sumy
    • Python Code: Summarization using PyTeaser
    • Python Code: TextRank using summa
    • What are topic models? Why do you need them?
    • Plate diagrams, unigram models, mixture of unigrams
    • Application of topic modeling to matrices with high dimensionality
    • Singular Value Decomposition
    • Latent Semantic Indexing/Analysis (LSI/LSA) as an application of SVD
    • Latent Semantic Indexing/Analysis (LSI/LSA): Examples, Advantages and Drawbacks
    • Probabilistic Latent Semantic Analysis (PLSA)
    • Comparison between LSI and PLSA/PLSI
    • Motivation for LDA
    • Dirichlet Distributions
    • LDA Model Details
    • Comparison between various topic models: unigrams, mixture of unigrams, PLSI, LDA
    • LDA Hyper-parameters
    • Other Topic Models
    • Python Code: LDA using gensim
    • Python Code: LDA using scikit-learn
    • Mini Project: Topic Modeling with Gensim - Loading data
    • Mini Project: Topic Modeling with Gensim - Pre-processing
    • Mini Project: Topic Modeling with Gensim - Building LDA Model
    • Mini Project: Topic Modeling with Gensim - Visualization
    • Mini Project: Topic Modeling with Gensim - Mallet and Hyper-parameter Tuning
    • Mini Project: Topic Modeling with Gensim - LDA Model analysis
    • What are word representations? Where can you use word vectors?
    • Neural Network Language Model (NNLM)
    • Word2Vec
    • CBOW and Skip-gram
    • GloVe (Global vectors for word representation)
    • Python Code: Using gensim to train your first Word2Vec model
    • Python Code: Finding similar words using gensim Word2Vec model
    • Python Code: More with Word2Vec models: find the odd one out, compute accuracy, get the actual vector, and save the model
    • Python Code: Another gensim model example using Text8 corpus
    • Python Code: GloVe Example
    • Python Code: Using Stanford’s GloVe Embedding
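
    To give a flavor of the text-processing topics listed above (tokenization, stop word removal, and word counts), here is a minimal sketch using only the Python standard library. It is not the course notebook: the tiny stop-word list, the sample sentence, and the function names are assumptions made for illustration, and the course exercises use nltk and real datasets instead.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text, stop_words=STOP_WORDS):
    """Count the tokens that are not stop words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


# Hypothetical sample sentence, standing in for a recipe dataset.
sample = "The pasta is boiled, and the sauce is added to the pasta."
counts = word_counts(sample)
print(counts.most_common(2))
```

    A regular expression is the simplest possible tokenizer; it drops digits and punctuation, which is usually acceptable for word counts but not for tasks such as NER.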

    Web Mining

    Web Mining deals with analytics on web-related data. How do search engines return relevant results so quickly for various queries? How do these search engines work? How does Amazon recommend products to its users? How are social networks formed and how do they grow? How do people influence each other on social networks? How do search engines make money through ads? How can you use the wisdom of the crowds to generate useful and credible information?

    The course will take participants through basic information retrieval concepts, web mining concepts, the architecture of search engines, and applications. This module aims to provide a conceptual and practical understanding of various aspects of web mining, from the basics of web search to recent topics studied in the World Wide Web community. Topics covered will include crawling, indexing, ranking, analysis of social networks, recommendation systems, and the basics of computational advertising.

    • Term-Document Incidence Matrices
    • Inverted Indexes
    • Inverted Index Construction
    • Sorting for Inverted Index Construction
    • Query Processing using Inverted Indexes
    • Query Optimization for Inverted Indexes
    • Phrase Query Processing using Bi-word Indexes
    • Phrase Query Processing using Positional Indexes
    • Heaps' Law
    • Zipf's Law
    • Motivation for Compression of Inverted Indexes
    • Dictionary Compression using Fixed-width terms or a single string
    • Dictionary Compression using blocking and front coding
    • Dictionary Compression using B-Trees and Tries
    • Postings Compression by coding gaps
    • Variable Length Encoding for Postings Compression
    • Unary and Gamma codes for Postings Compression
    • What is Lucene?
    • Java code: Indexing Shakespeare's plays.
    • Java code: Searching Shakespeare's plays.
    • Fields in Lucene
    • Analyzers in Lucene
    • QueryParsers and Scoring in Lucene
    • Basics of Crawling
    • What must/should any crawler do?
    • URL frontier, politeness, robots.txt
    • Processing Steps in Crawling
    • Webpage and Web Graph processing
    • Using Nutch for Crawling
    • Need for Relevance Ranking
    • Jaccard Similarity for Relevance Ranking
    • TF and IDF
    • Vector Space Model, Cosine Similarity, and Okapi BM25
    • Efficient Cosine Ranking
    • Parametric, zone and tiered indexes
    • Evaluating Search Engine Quality: Factors, NDCG
    • Evaluating Search Engine Quality: Kappa Measure, AB testing
    • Python code: TFIDF Computation from scratch
    • Python code: TFIDF computation using scikit-learn
    • Python code: TFIDF computation using gensim
    • Link-based Ranking of Web Pages
    • Power Iterations Method
    • Random Walk Interpretation
    • Spider traps and dead-ends
    • Problems with PageRank
    • Topic Sensitive PageRank
    • HITS (Hypertext-Induced Topic Selection)
    • Web Spam
    • TrustRank to Handle Link Spam
    • Python code: PageRank and HITS using networkx
    • Python code: PageRank from Scratch
    • Introduction to Recommender Systems
    • User-based Collaborative Filtering
    • Problems with User-based Collaborative Filtering
    • Item-based Collaborative Filtering
    • Hybrid Recommendation Methods
    • Recommendation System Case Studies: Video and Software Items
    • Tag Recommendations
    • People Recommendations within an enterprise
    • Friend Recommendation on Twitter
    • Recommendations for Groups
    • Cold Start Problem
    • Explanations for Recommendations
    • Evaluation of Recommendation Systems: Offline Evaluation
    • Evaluation of Recommendation Systems: User Studies and Online Evaluation
    • Python Code: User-user collaborative filtering and item-based CF from scratch
    • Python Code: Introduction to User Article Interaction Dataset
    • Python Code: Pre-processing dataset before building recommendation models
    • Python Code: Defining recommendation evaluation measure
    • Python Code: Popularity based recommender
    • Python Code: Content based recommender
    • Python Code: Collaborative Filtering based recommender
    • Python Code: Simple Hybrid Recommender
    • Python Code: Comparison across multiple types of recommenders
    • Python Code: Obtaining recommendations for a person
    • Introduction to Social Network Analysis
    • Erdős-Rényi Model
    • Small World Model
    • Kleinberg’s Model
    • Power Laws
    • Preferential Attachment Model
    • Copying Model
    • Forest Fire Model
    • Model with Network Components
    • Summary of Various Network Generation Models
    • Python code: Generate Graphs, Traverse Nodes and edges, Save and Load Graphs using snap
    • Python code: Graph Manipulation using snap
    • Python code: Computing Structural Properties using snap
    • Python code: Plot graphs and their degree distributions
    • What is Social Influence?
    • Does Social Influence really matter?
    • Examples of Social Influence
    • Measuring Social Influence: RCT test
    • Measuring Social Influence: Shuffle test and reverse test
    • Measuring Social Influence: Reachability and action-based methods
    • Social Theories: Structural Balance and Social Status
    • Models for Social Influence Analysis: Linear Threshold Model
    • Models for Social Influence Analysis: Independent Cascade Model
    • Influence Maximization Problem
    • Solutions for Influence Maximization Problem
    • Applications of Social Influence Analysis
    • Python Code: Independent Cascade Model on Facebook Social Circles Dataset
    • Python Code: Influence maximization heuristics on wiki-Vote data
    • Twitter data characteristics and challenges
    • Burstiness to detect events from Twitter
    • Detecting Events using Graph Community Analysis
    • Detecting Events using CRFs
    • Detecting Events using Tag Correlations
    • Detecting Events by Label Propagation from News
    • Finding best phrase to describe an event
    • Finding event types
    • Finding event timespans
    • Detecting sporting events
    • Detecting local festivals
    • Detecting drug related adverse events
    • Detecting emerging controversial events
    • Python Code: Retrieving trends from Twitter
    • Python Code: Collecting search results and extracting text, screen names and hashtags from tweets
    • Python Code: Lexical analysis of tweets
    • Python Code: Analysis of retweets
    • Introduction to Information Extraction
    • What can be extracted?
    • Wrapper Induction: Why and what
    • Extraction rules for Wrapper Induction
    • Learning Extraction rules for Wrapper Induction
    • Wrapper Maintenance
    • Extracting Tables from the web
    • Extracting Tables from the web: Recovering relations from raw HTML tables
    • Extracting Tables from the web: Applications
    • Python Code: Get list of all Presidents of India with related information from Wikipedia page using just pandas!
    • Python Code: Understanding Basics of BeautifulSoup
    • Python Code: Scraping weather forecasts using BeautifulSoup
    • Python Code: Scraping apartment information from apartments.com using BeautifulSoup
    • OpenIE and Tagme
    • Introduction to Computational Advertising
    • Computational Ads Basic Concepts: Stakeholders and Revenue Models
    • Display Ads: Problems and Methods
    • Introduction to Textual Ads
    • Selection of Textual Ads: Part 1
    • Selection of Textual Ads: Part 2
    • Sponsored Search
    • Introduction to Game Theory and Nash Equilibrium
    • Game Theory for Ads
    • Vickrey Auction
    • VCG Auction
    • Auctions for Sponsored Search
    • Generalized First Price Auction
    • Generalized Second Price Auction
    • Comparison between GFP, GSP, VCG
    • Python Code: Introduction to the Ad Click Through Rate (CTR) Prediction Problem
    • Python Code: Exploratory Data Analytics for Ad CTR Prediction: Part 1
    • Python Code: Exploratory Data Analytics for Ad CTR Prediction: Part 2
    • Python Code: Developing Logistic Regression Prediction model for Ad CTR Prediction
    • Python Code: Developing Gradient Boosting Prediction Models for Ad CTR Prediction
    • Introduction to Crowdsourcing
    • Applications of Crowdsourcing: Part 1
    • Applications of Crowdsourcing: Part 2
    • Cons of Crowdsourcing
    • Quality and Incentives Control in Crowdsourcing
    • Managing Complex tasks in Crowdsourcing
    • Security Challenges in Crowdsourcing
    • Fake reviews and social network sybils in Crowdsourcing
    • Managing Quality of Annotations
    • Weighted voting to get final labels
    • Gold testing with bad worker quality and unbalanced datasets
    • Integrating crowdsourcing with machine learning
    • Tips for Iterative HitApp Design
    • Introduction to the Amazon Mechanical Turk Platform
    • An Example of a crowdsourcing project using Mechanical Turk
    • Entities and Knowledge Bases
    • The Entity Resolution Problem
    • Examples of Entity Resolution
    • Similarity Function for Entity Resolution
    • Entity Resolution Workflow
    • Standard Blocking and the Sorted Neighborhood Method
    • Canopy Clustering and Token Blocking
    • Attribute Clustering Blocking
    • ZenCrowd Blocking
    • Prefix-Infix(-Suffix) Blocking
    • Block Post-Processing
    • Meta-Blocking
    • Python Code: Link two datasets using the recordlinkage Python package
    • Python Code: Data deduplication using recordlinkage Python package
    • Python Code: Classification Algorithms for Record Linkage
    • Using the dedupe package in Python
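
    As an illustration of the link-based ranking topics listed above (power iterations, the random walk interpretation, and dead ends), here is a minimal sketch of PageRank by power iteration in plain Python. It is one common formulation, not necessarily the one used in the course notebooks: the damping factor of 0.85, the even redistribution of rank from dead ends, the function name, and the four-page toy graph are all assumptions.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    graph: dict mapping each node to the list of nodes it links to.
    Returns a dict mapping each node to its score (scores sum to 1).
    """
    nodes = sorted(set(graph) | {v for outs in graph.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # start from the uniform distribution

    for _ in range(max_iter):
        # Rank held by dead ends (no out-links) is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}

        # Each page shares its rank equally among the pages it links to.
        for u in nodes:
            outs = graph.get(u, [])
            for v in outs:
                new_rank[v] += damping * rank[u] / len(outs)

        change = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy web: D links to C, but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

    In this toy graph, page D receives only the teleportation share (1 - 0.85) / 4, because no page links to it, while C collects rank from three pages.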

    Data Collection

    Data scientist has been called the sexiest job of the 21st century. In data science work, a lot of time is spent collecting useful data and pre-processing it. Poor-quality data leads to poor-quality models, so it is very important to understand how to collect good-quality data and to know the various ways in which data can be collected.

    In this module, I will discuss different aspects of data collection. I will begin with the decisions to make while collecting data, data collection rules and approaches, and ways of performing data collection. Data can also be collected from the web by scraping, so we will learn how to perform basic scraping. Lastly, we will briefly discuss collecting graph data as well as data collection using IoT sensors.

    • What is data collection?
    • Data collection decisions, rules and approaches
    • Data collection tools: Surveys and Questionnaires
    • Data collection tools: Interviews
    • Data Collection Planning
    • What is web scraping?
    • Techniques for web scraping
    • Techniques to prevent web scraping
    • Scraping Amazon reviews using bash script
    • Scraping using scrapy: redditbot example
    • Scraping using scrapy: shopclues example
    • Scraping using scrapy: techcrunch example
    • Data collection APIs Examples
    • Calling APIs using Python
    • Using flask to create Python APIs
    • What information to collect and boundary specification
    • Sources of Graph Data and Krackhardt CSS
    • Graph Data Repositories
    • What is IoT?
    • RFID and other sensors
    • IoT Applications: Smart Grid and Intelligent Transportation
    • IoT Applications: ANPR and Quantified Self
    • Arduino and Proteus
    • Blinking LED with Arduino+Proteus
    • Arduino Input Output
    • Using Temperature Sensors to collect temperature data
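
    To make the web scraping topics listed above concrete, here is a minimal sketch that extracts links and their anchor text from an HTML page using only the standard library's html.parser module. The course examples use scrapy and live sites; this sketch deliberately parses an inline, made-up HTML string so that it runs without network access, and the class name and page content are assumptions.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content, standing in for a downloaded product listing.
page = """
<html><body>
  <h1>Reviews</h1>
  <a href="/item/1">First item</a>
  <a href="/item/2">Second <b>item</b></a>
  <a name="top">no link here</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(page)
parser.close()
links = parser.links
print(links)
```

    Before scraping a real site, check its robots.txt and terms of use, and rate-limit your requests.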

    Deep Learning

    Deep learning, a rapidly growing area of machine learning, has gained great momentum in the last few years, and research in the field is progressing very fast. Machine learning has seen numerous successes, but applying learning algorithms today often means spending a long time hand-engineering the input feature representation. This is true for many problems in vision, audio, NLP, robotics, and other areas. To address this, researchers have developed deep learning algorithms that automatically learn a good representation for the input. These algorithms are today enabling many groups to achieve ground-breaking results in vision, speech, language, robotics, and other areas.

    I cover the basics of artificial neural networks in the machine learning module. In this module, I will focus on other popular deep learning architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks.

    • ImageNet and visual recognition problems
    • Biological inspiration for CNNs
    • Applications of CNNs
    • Why not just use MLPs for images?
    • CONV layer of a CNN
    • Details of CONV layer of a CNN
    • Stride and Pad for CONV layers of a CNN
    • Neuron view of the convolution layer
    • RELU in CNNs
    • Pooling and fully connected layers in CNNs
    • AlexNet and Hyper-parameter Optimization
    • Python Code: CNNs for Hand-written digit recognition using TensorFlow
    • Python Code: CNNs for Hand-written digit recognition using Keras
    • Python Code: Simple image classification with Inception Model
    • Motivation for sequence learning models
    • Neural Language Model using MLPs
    • Introduction to Recurrent Neural Networks
    • Back-propagation for RNNs
    • RNN design options
    • CNN-RNN architecture for image captioning
    • Deep Bidirectional RNNs for opinion mining
    • Sequence Learning for machine translation using RNNs
    • Drawbacks of RNNs
    • Solutions for the exploding gradient problem
    • Memory based models: Gated Recurrent Units (GRUs)
    • Long Short-Term Memory Networks (LSTMs)
    • LSTM Variants
    • LSTM Hyperparameter tuning
    • Applications of RNNs and LSTMs: Video analytics, Hate Speech Detection, Extractive Summarization
    • Applications of RNNs and LSTMs: Translation Quality Estimation, Text Segmentation, Recommendation Systems
    • Applications of RNNs and LSTMs: Medical Social Media Analysis
    • Python Code: Classify movie reviews (binary classification) using Keras
    • Python Code: RNNs for Hand-written digit recognition using TensorFlow
    • Python Code: Bi-directional RNNs for Hand-written digit recognition using TensorFlow
    • Python Code: Next word prediction using RNNs
    • Encoder and decoder in auto-encoders
    • Learning an auto-encoder
    • Denoising auto-encoders
    • Stacked Denoising auto-encoders
    • Deep auto-encoders for document clustering
    • Python Code: Build a 2-layer auto-encoder with TensorFlow to compress images
    • Python Code: Simplest Auto-encoder in Keras
    • Python Code: Sparse auto-encoders using Keras
    • Python Code: Deep auto-encoder using Keras
    • Python Code: Image Denoising using a convolutional auto-encoder in Keras
    • Python Code: Scalars, graphs, distributions and histograms using TensorBoard
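
    To illustrate the stride and pad topics for CONV layers listed above, here is a minimal sketch in plain Python. It uses the standard output-size formula (W - F + 2P) / S + 1 for input width W, filter size F, padding P, and stride S, together with a naive single-channel convolution. The function names and the 4x4 example are assumptions, and a square filter is assumed; the course itself uses TensorFlow and Keras, which are far faster.

```python
def conv_output_size(width, kernel, stride=1, pad=0):
    """Spatial output size of a CONV layer: (W - F + 2P) // S + 1."""
    return (width - kernel + 2 * pad) // stride + 1


def conv2d(image, kernel, stride=1, pad=0):
    """Naive single-channel 2D convolution with zero padding.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)  # square kernel assumed

    # Build a zero-padded copy of the image.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = float(image[i][j])

    out_h = conv_output_size(h, k, stride, pad)
    out_w = conv_output_size(w, k, stride, pad)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4x4 input holding the values 1..16, summed over 2x2 windows.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ones = [[1, 1], [1, 1]]
out = conv2d(image, ones, stride=2)
print(out)
print(conv_output_size(32, 5, stride=1, pad=2))
```

    With a 5x5 filter, stride 1, and padding 2, a 32-pixel-wide input keeps its width, which is why that padding is a common choice for 5x5 filters.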

    Visualization

    A good data science story depends on good visualization. Visualizations help us understand data and insights much better.

    I cover the basics of visualization in R and Python in their respective modules. In this module, I will talk about innovative ways of visualizing complex and large data.

    • Why data visualizations?
    • Guidelines for good plots: Part 1
    • Guidelines for good plots: Part 2
    • Guidelines for good plots: Part 3
    • Maintain integrity when plotting data: Avoid misleading graphs
    • Web-based visualization libraries
    • Data Analysis/Business Intelligence and Visualization Software
    • Plotting pitfalls with large data
    • Python Code: Plotting sample of NYC taxi data using bokeh
    • Python Code: Interactive Plotting of NYC taxi data using datashader and bokeh
    • Python Code: Plotting US Census data using datashader
    • Graph visualization: Why?
    • Graph visualization: Challenges
    • Graph visualization: Aesthetics
    • Graph visualization: Common Layout Algorithms
    • Graph visualization: Large graphs
    • Introduction to Gephi

    Target Audience


    The course content and teaching methodology are built to cater to students at various levels of expertise and with varied backgrounds, skills, and competencies.

    Learn to excel. We teach from the basics, so all you need is a very basic knowledge of programming, a strong determination to learn, and the willingness to put in the time and effort.

    Here is a list of aspirants who would benefit from our course:

    • Undergraduate (BS/BTech/BE) students in Engineering, Technology and Science.
    • Post Graduate (MS/MTech/ME/MCA) students in Engineering, Technology and Science.
    • Working Professionals: Software Engineers, Business Analysts, Product & Program Managers, Enthusiasts involved in building ML Products & Services.

    COURSE FEATURES

  • Duration 200+ Hrs
  • Quizzes Yes
  • Assignments Yes
  • Projects Yes

    Disclaimer


    Please note that the videos are not downloadable. Sharing your access, or trying to sell or distribute the videos, is a legally punishable offence. People caught doing this in the past faced legal action and a heavy penalty.

