Data Science Projects
By Manish Gupta. Manish Gupta is a Principal Applied Researcher at Microsoft India R&D Private Limited in Hyderabad, India.
For 10 popular mobile phones, scrape all reviews about those phones from Amazon. Also, scrape recent tweets about those phones.
Perform sentiment analysis on these reviews and tweets. Rank the mobile phones by popularity, and also by positive sentiment, separately on Twitter and on Amazon.
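The ranking step can be sketched roughly as follows. This is a minimal illustration, not a recommended model: the tiny word lists, the `(phone, source, text)` record layout, and the use of mention count as a stand-in for popularity are all assumptions made for the example. A real project would replace `sentiment_score` with a trained classifier or a full lexicon.

```python
from collections import defaultdict

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "amazing", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow", "broken"}

def sentiment_score(text):
    """Return (#positive words - #negative words) for one review or tweet."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def rank_phones(records):
    """records: iterable of (phone, source, text) tuples.

    Returns {source: [(phone, mention_count, positive_share), ...]},
    with each list sorted by positive share, highest first.
    """
    stats = defaultdict(lambda: [0, 0])  # (source, phone) -> [mentions, positives]
    for phone, source, text in records:
        entry = stats[(source, phone)]
        entry[0] += 1
        entry[1] += sentiment_score(text) > 0
    rankings = defaultdict(list)
    for (source, phone), (count, positives) in stats.items():
        rankings[source].append((phone, count, positives / count))
    for source in rankings:
        rankings[source].sort(key=lambda row: row[2], reverse=True)
    return dict(rankings)

# Made-up sample records standing in for scraped reviews and tweets.
sample = [
    ("Phone A", "amazon", "Great camera, love it!"),
    ("Phone A", "amazon", "Battery is terrible."),
    ("Phone B", "amazon", "Excellent screen and fast."),
    ("Phone A", "twitter", "Phone A is amazing"),
    ("Phone B", "twitter", "Phone B is slow and broken"),
]
rankings = rank_phones(sample)
```

Sorting the same rows by `mention_count` instead of `positive_share` gives the popularity ranking under the assumption stated above.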
Download 1000 business pages from yelp.com. Each webpage is an HTML document containing details about the business. It does not include the email id, but it does include the business's website address, which can be used to find the website's contact-us page and thereby extract the email id.
Your task is to obtain structured data for the business: business name, business phone number, business home page URL, business address, opening hours, Takes Reservations, Delivery, Take-out, Accepts Credit Cards, Accepts Apple Pay, Accepts Android Pay, Accepts Bitcoin, Good For, Parking, Bike Parking, Good for Kids, Good for Groups, Attire, Ambience, Noise Level, Alcohol, Outdoor Seating, Wi-Fi, Has TV, Caters, Gender Neutral Restrooms, contact-us URL for the business, email id for the business.
Try to extract similar structured information for 1000 pages from Zomato.
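The contact-us and email steps described above could be approached along these lines. This is only a sketch using the Python standard library: the "contact" keyword heuristic, the email regular expression, and the sample HTML are assumptions for illustration, and real sites will need more robust handling (obfuscated addresses, JavaScript-rendered pages, and so on).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple email pattern; it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class ContactLinkFinder(HTMLParser):
    """Collect links whose href or link text mentions 'contact'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            if self._current_href and "contact" in self._current_href.lower():
                self._add(self._current_href)

    def handle_data(self, data):
        # Text inside an open <a> tag, e.g. "Contact Us".
        if self._current_href and "contact" in data.lower():
            self._add(self._current_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

    def _add(self, href):
        url = urljoin(self.base_url, href)  # resolve relative links
        if url not in self.links:
            self.links.append(url)

def find_contact_links(html, base_url):
    finder = ContactLinkFinder(base_url)
    finder.feed(html)
    return finder.links

def extract_emails(html):
    return sorted(set(EMAIL_RE.findall(html)))

# Made-up HTML snippets standing in for a business home page and its contact page.
home_html = ('<a href="/menu">Menu</a> <a href="/contact-us">Contact Us</a> '
             '<a href="/about">Get in contact</a>')
contact_html = ('<p>Write to <a href="mailto:info@example.com">info@example.com</a> '
                'or sales@example.com.</p>')

contact_links = find_contact_links(home_html, "https://example.com")
emails = extract_emails(contact_html)
```

In a full pipeline, each URL in `contact_links` would be downloaded and passed to `extract_emails`; the download step is omitted here.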
This project focuses on forecasting the future values of multiple time series, which has long been one of the most challenging problems in the field. More specifically, the project tests state-of-the-art methods designed by the participants on the problem of forecasting future web traffic for approximately 145,000 Wikipedia articles.
The training dataset consists of approximately 145,000 time series. Each time series represents the number of daily views of a different Wikipedia article, from July 1, 2015 to December 31, 2016. Divide the data into train and test sets, and validate your approaches.
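One way to set up the train/test validation is a chronological hold-out with a simple baseline to beat. The sketch below makes several assumptions: the seasonal-naive baseline, the weekly season of 7 days, and SMAPE as the error metric are choices made for illustration, and the view counts are made up.

```python
def chronological_split(series, test_days):
    """Hold out the last `test_days` observations (test_days must be > 0).

    Time series are split by time, never shuffled, so the test period
    always lies after the training period.
    """
    return series[:-test_days], series[-test_days:]

def seasonal_naive_forecast(train, horizon, season=7):
    """Forecast by repeating the last `season` observed values."""
    last_season = train[-season:]
    return [last_season[i % season] for i in range(horizon)]

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    Terms where both actual and forecast are 0 contribute 0.
    """
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        total += 0.0 if denom == 0 else 2.0 * abs(a - f) / denom
    return 100.0 * total / len(actual)

# Four identical made-up weeks of daily views for one article.
views = [10, 12, 11, 13, 15, 30, 28] * 4
train, test = chronological_split(views, test_days=7)
forecast = seasonal_naive_forecast(train, horizon=len(test))
error = smape(test, forecast)
```

Because the sample series repeats exactly every week, the baseline is perfect here; on real traffic it serves as a reference point that any more elaborate model should improve on.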
This dataset describes customers of a large international bank, including whether each customer decided to leave the bank (the Exited field). The bank is investigating a very high rate of customer attrition. The dataset contains 10,000 records and is to be used to investigate and predict which customers are more likely to leave the bank soon.
Use various classifiers and find which one provides the best accuracy. Identify the most important features, and try out various feature selection techniques as well.
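As one example of a feature selection technique, a simple filter method ranks each numeric feature by the absolute Pearson correlation between its values and the Exited label. The sketch below is an illustration under assumptions: apart from `Exited`, the column names and all values are hypothetical, and a correlation filter only captures linear, single-feature relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(rows, target="Exited"):
    """rows: list of dicts with numeric values.

    Returns [(feature, |correlation with target|), ...], strongest first.
    """
    labels = [row[target] for row in rows]
    features = [name for name in rows[0] if name != target]
    scores = [
        (name, abs(pearson([row[name] for row in rows], labels)))
        for name in features
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Made-up records; only the Exited column name comes from the dataset description.
customers = [
    {"Age": 25, "Balance": 0, "NumProducts": 2, "Exited": 0},
    {"Age": 31, "Balance": 500, "NumProducts": 1, "Exited": 0},
    {"Age": 48, "Balance": 900, "NumProducts": 2, "Exited": 1},
    {"Age": 52, "Balance": 100, "NumProducts": 1, "Exited": 1},
    {"Age": 29, "Balance": 700, "NumProducts": 2, "Exited": 0},
    {"Age": 60, "Balance": 300, "NumProducts": 1, "Exited": 1},
]
ranking = rank_features(customers)
```

Wrapper methods (such as recursive feature elimination) and model-based importances are other options worth comparing against this filter.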
Students’ high dropout rate on MOOC platforms has been heavily criticized, and predicting their likelihood of dropout would be useful for maintaining and encouraging students’ learning activities.
In this competition, you are challenged to build a predictor that can predict the chance that a student will drop out of an enrollment after observing his/her early course activities.
In particular, you have access to statistics on each student's course-related activities during the first 10 days after the course's launch, such as working on course assignments, watching course videos, and accessing the course wiki.
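Turning early activity into model inputs might look like the sketch below, which counts events per type and the number of distinct active days within the 10-day window. The `(enrollment_id, event_date, event_type)` log layout, the event type names, and the dates are assumptions for illustration; the actual data format may differ.

```python
from collections import Counter, defaultdict
from datetime import date

def activity_features(events, course_start, window_days=10):
    """events: iterable of (enrollment_id, event_date, event_type).

    Returns {enrollment_id: {event_type: count, ..., "active_days": n}},
    using only events from the first `window_days` days of the course.
    """
    counts = defaultdict(Counter)
    active_days = defaultdict(set)
    for enrollment_id, event_date, event_type in events:
        day_index = (event_date - course_start).days
        if 0 <= day_index < window_days:
            counts[enrollment_id][event_type] += 1
            active_days[enrollment_id].add(day_index)
    return {
        eid: {**counts[eid], "active_days": len(active_days[eid])}
        for eid in counts
    }

# Made-up activity log for two enrollments.
start = date(2024, 1, 1)
events = [
    ("e1", date(2024, 1, 1), "video"),
    ("e1", date(2024, 1, 1), "assignment"),
    ("e1", date(2024, 1, 3), "video"),
    ("e1", date(2024, 1, 15), "wiki"),   # outside the 10-day window, ignored
    ("e2", date(2024, 1, 10), "wiki"),   # day index 9, last day inside the window
]
features = activity_features(events, start)
```

The resulting per-enrollment dictionaries can then be converted into fixed-length feature vectors for whichever classifier is used.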
Further, even when few students drop out overall, their performance could suffer. The second part of the project concerns predicting student performance in secondary education (high school).
Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people. Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina.
By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment. With color fundus photography as input, the goal of this project is to build an automated detection system. You are provided with a large set of high-resolution retina images taken under a variety of imaging conditions.
Left and right fields are provided for every subject. Images are labeled with a subject id and with either left or right (e.g. 1_left.jpeg is the left eye of subject id 1). A clinician has rated the presence of diabetic retinopathy in each image on a scale of 0 to 4: 0 – No DR, 1 – Mild, 2 – Moderate, 3 – Severe, 4 – Proliferative DR. Your task is to create an automated analysis system capable of assigning a score based on this scale.
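The file naming convention and rating scale above can be handled with a small helper like the following sketch. The function names are hypothetical. Grouping images by subject is a suggested design choice rather than a stated requirement: it makes it easy to keep both eyes of one subject in the same train or validation split.

```python
from pathlib import Path

# Rating scale as given in the project description.
DR_GRADES = {0: "No DR", 1: "Mild", 2: "Moderate", 3: "Severe", 4: "Proliferative DR"}

def parse_image_name(filename):
    """Split a name such as '1_left.jpeg' into (subject_id, eye)."""
    stem = Path(filename).stem            # '1_left'
    subject_id, eye = stem.rsplit("_", 1)
    if eye not in ("left", "right"):
        raise ValueError(f"unexpected eye label in {filename!r}")
    return int(subject_id), eye

def group_by_subject(filenames):
    """Return {subject_id: {'left': filename, 'right': filename}}."""
    subjects = {}
    for name in filenames:
        subject_id, eye = parse_image_name(name)
        subjects.setdefault(subject_id, {})[eye] = name
    return subjects

images = ["1_left.jpeg", "1_right.jpeg", "2_left.jpeg"]
by_subject = group_by_subject(images)
```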
He is also an adjunct faculty member at the International Institute of Information Technology, Hyderabad and a visiting faculty member at the Indian School of Business, Hyderabad. He received his Master's in Computer Science from IIT Bombay in 2007 and his Ph.D. from the University of Illinois at Urbana-Champaign in 2013.
Please note that the videos are not downloadable. Sharing your access or trying to sell or distribute videos is a legally punishable offence. Earlier we caught some people doing this and they were punished legally and a huge penalty was imposed on them.