Business analytics is the practice of iterative, methodical exploration of an organization's data, with an emphasis on statistical analysis. Business analytics is used by companies committed to data-driven decision-making. It is about using your data to derive information, insights, knowledge, and recommendations. Businesses use business analytics to improve effectiveness and efficiency of their solutions.
In this module, I will talk about how analytics has progressed from simple descriptive analytics to predictive and prescriptive analytics. I will walk through multiple examples to make these ideas concrete, and discuss various industry use cases. I will also introduce multiple components of big data analysis, including data mining, machine learning, web mining, natural language processing, social network analysis, and visualization. Lastly, I will provide some tips to help learners of data science succeed in learning data science and applying it to their projects.
Challenges in Applying Analytics to Business Problems
Tips on Career in Data Science
Python for Data Science
Python and R are currently the two most popular programming languages for data scientists. Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. Python is open source and has excellent community support; it is easy to learn, well suited for quick scripting as well as for production deployments, and widely used for web development too.
In this module, I will start with the basics of the Python language, intermixing theory with hands-on exercises. I will use Jupyter notebooks for the hands-on parts. I will discuss in detail topics like control flow, input/output, data structures, functions, regular expressions, and object orientation in Python. Closer to data science, I will discuss popular Python libraries like NumPy, Pandas, SciPy, Matplotlib, Scikit-Learn, and NLTK.
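As a small taste of the hands-on style, here is a minimal sketch of Python's built-in re module for regular expressions; the log line and the date pattern are purely illustrative:

```python
import re

# Extract ISO-style dates like "2021-03-15" from free text (illustrative string).
log_line = "Order placed on 2021-03-15, shipped on 2021-03-18."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log_line)
print(dates)  # ['2021-03-15', '2021-03-18']

# Named groups make the individual parts of a match easy to access.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", log_line)
print(m.group("year"))  # 2021
```

The same `re` functions (`findall`, `search`, named groups) cover most day-to-day text-cleaning needs before the dedicated NLP libraries come in.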
Data Analysis and Prediction using the Loan Prediction Dataset
R for Data Science
While Python has been used by many programmers even before they were introduced to data science, R's main focus is on statistics, data analysis, and graphical models; it is meant primarily for data science. Just like Python, R also has very good community support. Python is often considered more approachable for beginners, while R is popular with experienced data scientists and provides some of the most comprehensive statistical analysis packages available.
In this module, I will again cover both theory and hands-on work for various aspects of R, using RStudio for the hands-on parts. I will discuss basic programming aspects of R as well as visualization using R. Then, I will talk about how to use R for exploratory data analysis, for data wrangling, and for building models on labeled data. Overall, I will cover what you need to do good data science using R.
Data Analysis using R: Why Are Low-Quality Diamonds More Expensive?
Probability and Statistics
Probability and statistics help in understanding whether data is meaningful. They include inference, testing, and other methods for analyzing patterns in data and using them to predict, understand, and improve results.
We live in an uncertain and complex world, yet we continually have to make decisions in the present with uncertain future outcomes. To study, or not to study? To invest, or not to invest? To marry, or not to marry? This is what is captured mathematically using the notion of probability. Statistics on the other hand, helps us analyze data sets, and correctly interpret results to make solid, evidence-based decisions.
In this module, I will discuss some very fundamental terms and concepts related to probability and statistics that one often comes across in literature related to Machine Learning and AI. Key topics include quantifying uncertainty with probability, descriptive statistics, point and interval estimation of means, the central limit theorem, and the basics of hypothesis testing.
Measures of dispersion (Range, IQR, standard deviation, variance)
Five Number summary and skew
Graphic displays of basic statistical descriptions
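The descriptive measures listed above can be computed directly with Python's standard-library statistics module; the sample values below are purely illustrative:

```python
import statistics

# Toy sample of ten observations (illustrative values).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Five-number summary: min, Q1, median, Q3, max.
q1, med, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
summary = (min(data), q1, med, q3, max(data))
print(summary)                      # (1, 2.75, 5.5, 8.25, 10)

# Measures of dispersion.
print(max(data) - min(data))        # range: 9
print(q3 - q1)                      # interquartile range: 5.5
print(statistics.pvariance(data))   # population variance: 8.25
print(statistics.pstdev(data))      # population standard deviation: ~2.87
```

Note that `statistics.quantiles` supports both 'exclusive' and 'inclusive' methods, which can give slightly different quartiles on small samples.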
Machine Learning
Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine Learning is a first-class ticket to the most exciting careers in data science. As data sources proliferate along with the computing power to process them, automated predictions have become much more accurate and dependable. Machine learning brings together computer science and statistics to harness that predictive power. It’s a must-have skill for all aspiring data analysts and data scientists, or anyone else who wants to wrestle all that raw data into refined trends and predictions.
In this module, I will talk broadly about supervised as well as unsupervised learning. We will cover multiple types of classifiers, like Naïve Bayes, KNN, decision trees, SVMs, artificial neural networks, logistic regression, and ensemble learning. Further, we will also talk about linear regression analysis and sequence labeling using HMMs. As part of unsupervised learning, I will discuss clustering as well as dimensionality reduction. Finally, we will also briefly discuss semi-supervised learning, multi-task learning, architecting ML solutions, and a few ML case studies.
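To give a flavor of one of these classifiers, here is a minimal k-nearest-neighbours sketch in plain Python on toy 2-D data; a real project would typically use a library implementation such as scikit-learn's KNeighborsClassifier:

```python
from collections import Counter
import math

def knn_predict(train, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    # train: list of ((x, y), label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters (illustrative values).
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))  # A
print(knn_predict(train, (5.5, 5.5)))  # B
```

The same structure — fit on labeled examples, predict by comparing a new point against them — carries over to every supervised classifier in the module, only the decision rule changes.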
Data Mining
The area of Data Mining specifically deals with topics like pattern mining, OLAP, data cubes, and outlier detection. Frequent pattern mining deals with mining frequent subsets, subsequences, or subgraphs from transactional, sequence, or graph datasets respectively. These are very useful for basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, and DNA sequence analysis. OLAP enables users to quickly analyze information that has been summarized into multidimensional views and hierarchies. By summarizing predicted queries into multidimensional views prior to run time, OLAP tools provide the benefit of increased performance over traditional database access tools. Outlier analysis has numerous applications in a wide variety of domains such as the financial industry, quality control, fault diagnosis, intrusion detection, web analytics, and medical diagnosis.
In this module, I will cover basic methods for pattern mining like Apriori and FP growth. I will also cover basic concepts in OLAP and in outlier detection.
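To illustrate the support-counting step at the heart of Apriori, here is a minimal sketch in plain Python; the transactions and minimum-support threshold are purely illustrative:

```python
from itertools import combinations
from collections import Counter

# Toy transaction database (basket data).
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
]

def frequent_itemsets(transactions, size, min_support):
    """Count every itemset of the given size across all transactions and
    keep those meeting min_support — the counting step Apriori repeats
    for growing itemset sizes, pruning as it goes."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

print(frequent_itemsets(transactions, 2, min_support=2))
```

Full Apriori adds candidate pruning (a size-k itemset can only be frequent if all its size-(k-1) subsets are), which is what keeps the search tractable on real basket data.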
Proximity based Methods for Outlier Detection: Distance based outliers
Proximity based Methods for Outlier Detection: Density based outliers
Clustering based Methods for Outlier Detection
Classification based Methods for Outlier Detection
Outlier Detection for high dimensional data
Python Code: Remove values > 2 std dev from mean
Python Code: Percentile based outliers vs median absolute deviation based outliers
Python Code: Example of using LOF for outlier detection
Python Code: Example of using Cluster-based Local Outlier Factor (CBLOF) for outlier detection
Python Code: Example of using one class SVM for outlier detection using pyod
Python Code: Example of using PCA for outlier detection
Python Code: One class SVM using scikit learn for outlier detection
Text Mining and Analytics
Text mining includes techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making, with an emphasis on statistical approaches that can be applied generally to arbitrary text data in any natural language with minimal or no human effort.
This module will introduce the learner to text mining and text manipulation basics. We covered the basics of text processing, including regular expressions, in the R and Python modules themselves, and I discussed text classification in the machine learning module. In this module, I will move on to further interesting topics in text mining such as n-gram models, Named Entity Recognition, Natural Language Processing, Sentiment Analysis, and Summarization.
What are word representations? Where can you use word vectors?
Neural Network Language Model (NNLM)
CBOW and Skip-gram
GloVe (Global vectors for word representation)
Python Code: Using gensim to train your first Word2Vec model
Python Code: Finding similar words using gensim Word2Vec model
Python Code: More with Word2Vec models: find the odd one out, compute accuracy, get the actual vector, and save the model
Python Code: Another gensim model example using Text8 corpus
Python Code: GloVe Example
Python Code: Using Stanford’s GloVe Embedding
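To make the skip-gram idea concrete, here is a minimal sketch of how (center, context) training pairs are generated from a sentence; libraries like gensim do this internally when training Word2Vec, and the sentence and window size below are purely illustrative:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by the
    skip-gram variant of Word2Vec: each word predicts its neighbours
    within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW simply reverses the direction: the context words jointly predict the center word, using the same windowed pairs.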
Web Mining
Web Mining deals with analytics on web-related data. How do search engines return relevant results so quickly for various queries? How do these search engines work? How does Amazon recommend products to its users? How are social networks formed and how do they grow? How do people influence each other on social networks? How do search engines make money through ads? How can you use the wisdom of the crowds to generate useful and credible information?
This module will take participants through basic information retrieval concepts, web mining concepts, search engine architecture, and applications. It aims to provide a conceptual and practical understanding of various aspects of web mining, starting with the basics of web search and moving to recent topics studied in the World Wide Web community. Topics covered include: crawling, indexing, ranking, analysis of social networks, recommendation systems, and the basics of computational advertising.
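As a concrete flavor of ranking, here is a minimal PageRank sketch in plain Python on a toy three-page web graph; the graph, damping factor, and iteration count are purely illustrative:

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterative PageRank on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy web graph: page C is linked to by both A and B.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)  # C, with two in-links, ends up with the highest rank
```

The intuition matches how search engines rank pages: a page is important if important pages link to it, and the damping factor models a surfer who occasionally jumps to a random page.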
Standard Blocking and the Sorted Neighborhood Method
Canopy Clustering and Token Blocking
Attribute Clustering Blocking
Python Code: Link two datasets using the recordlinkage Python package
Python Code: Data deduplication using recordlinkage Python package
Python Code: Classification Algorithms for Record Linkage
Using the dedupe package in Python
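The Sorted Neighborhood Method listed above can be sketched in a few lines of plain Python; the records, blocking key, and window size below are purely illustrative, and packages like recordlinkage provide production implementations:

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sorted Neighborhood Method: sort records by a blocking key, then
    compare each record only with the next (window - 1) records, instead
    of all O(n^2) pairs."""
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs

# Toy records: the first four characters act as the blocking key.
records = ["smith, john", "smyth, jon", "jones, mary", "johnes, marie"]
cands = sorted_neighborhood_pairs(records, key=lambda r: r[:4], window=2)
print(cands)  # likely duplicates end up adjacent after sorting
```

With window=2 only adjacent records are compared, so the probable duplicates ("smith, john" / "smyth, jon") become a candidate pair while most non-matches are never compared at all.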
Data Collection
Data scientist has been called the sexiest job of the 21st century. When performing data science, a lot of time is spent collecting useful data and pre-processing it. If the collected data is of poor quality, it can lead to poor-quality models. Hence, it is very important to understand how to collect good-quality data. It is also important to understand the various ways in which data can be collected.
In this module I will discuss different aspects of data collection. I will begin with discussions around decisions to make while doing data collection, data collection rules and approaches, and ways of performing data collection. Further, data can be collected from the web by scraping, so we will learn how to perform basic scraping. Lastly, we will briefly discuss collecting graph data as well as data collection using IoT sensors.
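As a minimal sketch of basic scraping, here is a link extractor built on Python's standard-library html.parser; the HTML string is inlined for illustration, whereas a real scraper would fetch it from the web (e.g. via urllib.request.urlopen):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags — the first step of a crawler."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Illustrative page; a real scraper would download this from a URL.
html = '<html><body><a href="/about">About</a> <a href="https://example.com">Ex</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/about', 'https://example.com']
```

For serious scraping, libraries like BeautifulSoup or Scrapy handle messy real-world HTML far more robustly, but the extract-links-then-follow pattern is the same.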
IoT Applications: Smart Grid and Intelligent Transportation
IoT Applications: ANPR and Quantified Self
Arduino and Proteus
Blinking LED with Arduino+Proteus
Arduino Input Output
Using Temperature Sensors to collect temperature data
Deep Learning
Deep learning has gained great momentum in the last few years, and research in this rapidly growing area of machine learning is progressing amazingly fast. Machine learning has seen numerous successes, but applying learning algorithms today often means spending a long time hand-engineering the input feature representation. This is true for many problems in vision, audio, NLP, robotics, and other areas. To address this, researchers have developed deep learning algorithms that automatically learn a good representation for the input. These algorithms are today enabling many groups to achieve ground-breaking results in vision, speech, language, robotics, and other areas.
I already discussed the basics of artificial neural networks in the machine learning module. Further, in this module, I will focus on other popular deep learning architectures like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks.
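To give a concrete feel for how a network computes its output, here is a minimal forward pass through a tiny MLP in NumPy with MNIST-like shapes; the weights and inputs are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2):
    """One forward pass: input -> ReLU hidden layer -> softmax over classes."""
    h = np.maximum(0, x @ W1 + b1)               # hidden layer with ReLU
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

# MNIST-like shapes: 784 input pixels -> 32 hidden units -> 10 digit classes.
W1, b1 = rng.normal(0, 0.1, (784, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, (32, 10)), np.zeros(10)

batch = rng.normal(size=(5, 784))                # 5 fake "images"
probs = forward(batch, W1, b1, W2, b2)
print(probs.shape)                               # (5, 10)
print(probs.sum(axis=1))                         # each row sums to 1
```

Training consists of repeating this pass, comparing `probs` to the true labels with a loss, and pushing gradients back through the same two layers — which frameworks like TensorFlow automate.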
Churn Prediction
A big international bank is facing a very high rate of customers leaving the bank and wants to understand why. The dataset contains 10,000 records, and we use it to investigate and predict which customers are most likely to leave the bank soon. The approach here is supervised classification: a classification model is built on historical data and then used to predict the classes of current customers to identify churn. The dataset contains 13 features plus the label column (Exited or not). The best accuracy was obtained with the Naïve Bayes model (83.29%). Such churn prediction models could be very useful for applications such as churn prediction in the telecom sector, to identify customers who are switching away from their current network, and for churn prediction in subscription services.
Exploratory data analysis and Pre-processing for churn prediction
Modeling techniques for churn prediction
Churn Prediction Results
Concluding the Churn Prediction Project
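As an illustrative sketch of the kind of model used in this project (not the project's actual pipeline), a compact Gaussian Naive Bayes can be written directly in NumPy; the two-feature toy data below is invented for the example:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(X, params):
    """Pick the class maximizing log prior + per-feature Gaussian log likelihoods."""
    preds = []
    for x in X:
        scores = {}
        for c, (prior, mu, var) in params.items():
            ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            scores[c] = np.log(prior) + ll
        preds.append(max(scores, key=scores.get))
    return np.array(preds)

# Toy "churn" data: two features, two classes (0 = stays, 1 = exits).
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([[1.1, 2.0], [5.1, 6.0]]), model))  # [0 1]
```

The "naive" part is the per-feature independence assumption — each feature contributes its likelihood separately — which is what keeps the model fast on a 13-feature table like this one.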
Twitter Sentiment Analysis
Analysis of open data from internet-based expressions and opinions can yield fascinating results and insights into popular sentiment about any product, service, or personality. The explosion of Web 2.0 has led to increased activity in podcasting, blogging, tagging, contributing to RSS, social bookmarking, and social networking. Consequently, there has been a surge of interest in mining these tremendous sources of data for opinions. Sentiment analysis, or opinion mining, is the mining of sentiment polarities from online social media. In this project we discuss a procedure that permits the use and understanding of Twitter data for sentiment analysis. We perform several text pre-processing steps, and then experiment with multiple classification mechanisms. Using a dataset of 50,000 tweets and TF-IDF features, we compare the accuracy obtained using various classifiers for this task. We find that linear SVMs provide the best accuracy among the classifiers tried. Such a sentiment analysis classifier could be useful for many applications, like market analysis of different features of a new product, or public opinion about a new movie or a speech by a political candidate.
Basic ML and NLP techniques and tools needed for Twitter Sentiment Analysis
Exploratory data analysis and Pre-processing for Twitter Sentiment Analysis
Twitter Sentiment Analysis Results
Concluding the Twitter Sentiment Analysis Project
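To make the TF-IDF features used in this project concrete, here is a minimal sketch in plain Python on three invented toy "tweets"; real pipelines would typically use a library vectorizer such as scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

# Three illustrative toy documents.
docs = [
    "great movie loved the acting",
    "terrible movie wasted my time",
    "loved it great fun",
]

def tfidf(docs):
    """Per-document TF-IDF weights: tf(t, d) * log(N / df(t)).
    Words appearing in many documents get down-weighted."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

w = tfidf(docs)
# "movie" appears in 2 of 3 docs, so it is down-weighted versus "acting" (1 of 3).
print(w[0]["movie"] < w[0]["acting"])  # True
```

These per-document weight vectors are exactly what gets fed to the classifiers (linear SVMs and others) compared in the project.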
The course content and teaching methodology are built to cater to the needs of students at various levels of expertise and with varied backgrounds, skills, and competencies.
Learn to excel. You have to put in time and effort to learn from this course. We teach from the basics, and all you need is a very basic knowledge of programming and a strong determination to LEARN.
Here is a list of aspirants who would benefit from our course:
Undergraduate (BS/BTech/BE) students in Engineering, Technology and Science.
Post Graduate (MS/MTech/ME/MCA) students in Engineering, Technology and Science.
Working Professionals: Software Engineers, Business Analysts, Product & Program Managers, Enthusiasts involved in building ML Products & Services.
Please note that the videos are not downloadable. Sharing your access, or trying to sell or distribute the videos, is a legally punishable offence. We have caught people doing this in the past; they faced legal action and a heavy penalty was imposed on them.
A Word Ladder is a word game invented by Lewis Carroll in which players find paths between words by switching one letter at a time. For example, one can link “ape” and “man” in the following way: ape->apt->ait->bit->big->bag->mag->man. Note that each step involves changing just one letter of the word. This is just one possible path from “ape” to “man”, but is it the shortest possible path? This project is about how to solve this game using Python.
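A minimal breadth-first-search solver for this game can be sketched as follows; the word list here is a tiny illustrative dictionary, not the one used in the project:

```python
from collections import deque
import string

def word_ladder(start, end, words):
    """Breadth-first search guarantees the shortest ladder from start to end."""
    words = set(words) | {end}
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        # Try changing each letter position to every letter of the alphabet.
        for i in range(len(start)):
            for c in string.ascii_lowercase:
                nxt = path[-1][:i] + c + path[-1][i + 1:]
                if nxt in words and nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None  # no ladder exists

# Tiny illustrative dictionary.
print(word_ladder("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"]))
# ['hit', 'hot', 'dot', 'dog', 'cog']
```

Because BFS explores ladders in order of length, the first path that reaches the end word is provably a shortest one, answering the "is it the shortest possible path?" question directly.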
Data Analysis and Prediction using the Loan Prediction Dataset
The Loan Prediction dataset consists of various attributes like gender, marital status, number of dependents, education, applicant income, co-applicant income, etc. The goal is to predict whether a loan will be approved for a person or not. In this project, we first perform dataset exploration for both numeric and categorical fields. Further, we smartly handle missing values and extreme values. Finally, we build a predictive model using Python.
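The missing-value and extreme-value handling described above can be sketched minimally as follows; the income column here is invented for illustration, and the project itself works on the full dataset (typically with pandas):

```python
import statistics

# Toy slice of an income column; None marks a missing value.
incomes = [4500, 3000, None, 6600, None, 5800]

# Impute missing values with the column median (robust to extreme values).
observed = [v for v in incomes if v is not None]
med = statistics.median(observed)                       # 5150
filled = [v if v is not None else med for v in incomes]
print(filled)   # [4500, 3000, 5150, 6600, 5150, 5800]

# Handle extreme values by capping (winsorizing) at a chosen threshold.
capped = [min(v, 6000) for v in filled]
print(capped)   # [4500, 3000, 5150, 6000, 5150, 5800]
```

Median imputation is preferred over mean imputation exactly when the column has extreme values, since a few very large incomes would drag the mean upward.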
Data Analysis using R: Why Are Low-Quality Diamonds More Expensive?
In this project, we first perform exploratory analysis on the diamonds dataset using R. We perform outlier detection to figure out anomalous values, and build various visualizations. We answer interesting questions such as: Why are there more diamonds at whole carats and common fractions of carats? Why are there more diamonds slightly to the right of each peak than slightly to the left? Why are there no diamonds bigger than 3 carats? Finally, we build machine learning models and understand why low-quality diamonds are seemingly more expensive.
Learning various classifiers on Iris dataset
In this project, we will understand the iris dataset and then build various classifiers on it. For each classifier, we will try to visualize the learned classifier as well as note the accuracy obtained. We will compare across the following classifiers: decision trees, KNN, Naive Bayes, and ensemble methods. This project will be implemented using Scikit-Learn in Python.
MLP for hand-written digit recognition
In this project, we will talk about multi-layered perceptrons and how they can be used for the hand-written digit recognition task. We will start by introducing the popular MNIST dataset. Then, we will talk about two different MLP architectures: (a) an MLP with no hidden layer and 10 output neurons, and (b) an MLP with two hidden layers. We will compare the accuracy of the two architectures. This project will be implemented in TensorFlow.
Logistic regression on the titanic dataset
In this project, we will analyze the Titanic dataset and fit a logistic regression model to predict passenger survival. In the first part, we will understand the Titanic dataset and perform exploratory data analysis. In Part 2, we will take care of the missing values. In part 3, we will perform further pre-processing on the dataset to handle categorical attributes, and remove highly correlated attributes. Further, we will train a logistic regression model in part 4. Finally, in the last part, we will learn how to visualize the logistic regression model. We will use Scikit Learn in Python for the project implementation.
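As an illustrative sketch of what fitting a logistic regression involves under the hood, here is plain gradient descent on the log loss in NumPy; the features and labels are a toy stand-in, not the actual Titanic data, and the project itself uses Scikit-Learn's implementation:

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression weights by plain gradient descent on log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid gives P(survived)
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy stand-in for two pre-processed features (e.g. scaled fare, is_female).
X = np.array([[0.10, 0], [0.20, 0], [0.90, 1], [0.80, 1], [0.70, 1], [0.15, 0]])
y = np.array([0, 0, 1, 1, 1, 0])  # survived or not
w, b = train_logreg(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)  # matches y on this separable toy data
```

The learned weights are directly interpretable — a positive weight means the feature pushes the predicted survival probability up — which is one reason logistic regression is a good first model for this dataset.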
Use CoNLL 2002 data to build a NER system
In this project, we will build a NER system using CRFs. We will first understand the available training dataset. We will then define appropriate features. Next, we will train a basic CRF model. Further, we will try to vary different hyper-parameters and perform hyperparameter optimization to get the best set of hyper-parameters. With this best set, we learn the most accurate NER model. Lastly, we will also perform feature importance analysis. We will use the sklearn_crfsuite package in Python for project implementation.