Michael Fuchs Python
/
Recent content on Michael Fuchs Python
Hugo -- gohugo.io
en-us
Tue, 01 Mar 2022 00:00:00 +0000

AutoML for Time Series Analysis
/2022/03/01/automl-for-time-series-analysis/
Tue, 01 Mar 2022 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Functions 3 Import the Data 4 AutoTS 4.1 Compare Models 4.2 Train a single Model 4.3 Compare Models with external variables 5 Merlion 5.1 Prepare the Data 5.2 Default Forecaster Model 5.3 Multiple Models & Ensembles 5.3.1 Model Config & Training 5.3.2 Model Evaluation 6 Conclusion
1 Introduction There are automated machine learning libraries not only for classification or regression but also for time series prediction.

AutoML using PyCaret - Regression
/2022/01/15/automl-using-pycaret-regression/
Sat, 15 Jan 2022 00:00:00 +0000
1 Introduction 2 Loading the Libraries and Data 3 PyCaret - Regression 3.1 Setup 3.2 Compare Models 3.3 Model Evaluation 3.4 Model Training 3.5 Model Optimization 3.5.1 Tune the Model 3.5.2 ensemble_models 3.5.3 blend_models 3.5.4 stack_models 3.5.5 Performance Overview 3.6 Model Evaluation after Training 3.7 Model Predictions 3.8 Model Finalization 3.9 Saving the Pipeline & Model 4 Conclusion
1 Introduction In my last post I introduced PyCaret and showed how to solve a classification problem using this automated machine learning library.

AutoML using PyCaret - Classification
/2022/01/01/automl-using-pycaret-classification/
Sat, 01 Jan 2022 00:00:00 +0000
1 Introduction 2 Loading the Libraries and Data 3 PyCaret - Classification 3.1 Setup 3.2 Compare Models 3.2.1 Comparison of Specific Models 3.2.2 Further Settings 3.3 Model Evaluation 3.4 Model Training 3.5 Model Optimization 3.5.1 Tune the Model 3.5.2 Retrieve the Tuner 3.5.3 Automatically Choose Better 3.5.4 ensemble_models 3.5.5 blend_models 3.5.6 stack_models 3.5.7 Further Methods 3.6 Model Evaluation after Training 3.7 Model Predictions 3.

NLP - Word Embedding with GENSIM for Text-Classification
/2021/09/01/nlp-word-embedding-with-gensim-for-text-classification/
Wed, 01 Sep 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Gensim - Word2Vec 3.1 Instantiation 3.2 Exploration of the calculated Values 3.3 Generation of aggregated Sentence Vectors 3.4 Generation of averaged Sentence Vectors 3.5 Model Training 3.6 Processing of new Input 3.6.1 Load the Word2Vec Model 3.6.2 Load the new Input 3.6.3 Pre-Processing of the new Input 3.6.4 Model Predictions 3.7 Updating the Word2Vec Model 3.

NLP - Text Vectorization
/2021/08/01/nlp-text-vectorization/
Sun, 01 Aug 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Text Vectorization 3.1 Bag-of-Words (BoW) 3.1.2 Functionality 3.1.3 Creation of the final Data Set 3.1.4 Test of a Sample Record 3.2 N-grams 3.2.1 Explanation 3.2.2 Functionality 3.2.2.1 Defining ngram_range 3.2.2.2 Defining max_features 3.2.3 Creation of the final Data Set 3.3 TF-IDF 3.3.1 Explanation 3.3.1.1 Mathematical Formulas 3.3.1.2 Example Calculation 3.3.1.3 TF-IDF using scikit-learn 3.

NLP - Text Pre-Processing - All in One
/2021/06/23/nlp-text-pre-processing-all-in-one/
Wed, 23 Jun 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Text Pre-Processing 3.1 Text Cleaning 3.2 Tokenization 3.3 Stop Words 3.4 Normalization 3.5 Removing Single Characters 3.6 Text Exploration 3.6.1 Most common Words 3.6.1.1 for the whole DF 3.6.1.2 for parts of the DF 3.6.2 Least common Words 3.7 Removing specific Words 3.8 Removing Rare words 3.9 Final Results 4 Conclusion
1 Introduction I have focused heavily on the topic of Text Pre-Processing in my past publications.

NLP - Text Pre-Processing VII (Special Cases)
/2021/06/19/nlp-text-pre-processing-vii-special-cases/
Sat, 19 Jun 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries 3 Definition of required Functions 4 Text Pre-Processing - Special Cases 4.1 Converting Emoticons to Words 4.2 Converting Chat Conversion Words to normal Words 4.3 Converting Numbers to Words 4.3.1 Small Numbers 4.3.2 Larger Numbers 4.3.3 Numbers combined with Words and Punctuation 4.3.4 Limitations 5 Application to a DataFrame 5.1 Loading the Data Set 5.2 Step 1: Converting emoticons into words 5.

NLP - Text Pre-Processing VI (Word Removal)
/2021/06/16/nlp-text-pre-processing-vi-word-removal/
Wed, 16 Jun 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 Text Pre-Processing 4.1 (Text Cleaning) 4.2 (Tokenization) 4.3 (Stop Words) 4.4 (Digression: POS & NER) 4.5 (Normalization) 4.6 (Removing Single Characters) 4.7 (Text Exploration) 4.8 Removing specific Words 4.8.1 Single Word Removal 4.8.2 Multiple Word Removal 4.8.3 Application to the Example String 4.8.3.1 with Single Word Removal 4.8.3.2 with Multiple Word Removal 4.

NLP - Text Pre-Processing V (Text Exploration)
/2021/06/10/nlp-text-pre-processing-v-text-exploration/
Thu, 10 Jun 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 Text Pre-Processing 4.1 (Text Cleaning) 4.2 (Tokenization) 4.3 (Stop Words) 4.4 (Digression: POS & NER) 4.5 (Normalization) 4.6 (Removing Single Characters) 4.7 Text Exploration 4.7.1 Descriptive Statistics 4.7.1.1 Most common Words 4.7.1.2 Least common Words 4.7.2 Text Visualization 4.7.2.1 via Bar Charts 4.7.2.2 via Word Clouds 4.

NLP - Text Pre-Processing IV (Single Character Removal)
/2021/06/05/nlp-text-pre-processing-iv-single-character-removal/
Sat, 05 Jun 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 Text Pre-Processing 4.1 (Text Cleaning) 4.2 (Tokenization) 4.3 (Stop Words) 4.4 (Digression: POS & NER) 4.5 (Normalization) 4.6 Removing Single Characters 4.6.1 Application to the Example String 4.6.2 Application to the DataFrame 4.6.2.1 With Character Length = 1 (default settings) 4.6.2.2 With Character Length = 2 5 Conclusion
1 Introduction Now we come to another sub-area regarding text pre-processing: The removal of individual characters.

NLP - Text Pre-Processing III (POS, NER and Normalization)
/2021/05/31/nlp-text-pre-processing-iii-pos-ner-and-normalization/
Mon, 31 May 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 Text Pre-Processing 4.1 (Text Cleaning) 4.2 (Tokenization) 4.3 (Stop Words) 4.4 Digression: POS & NER 4.4.1 Part of Speech Tagging (POS) 4.4.2 Named Entity Recognition (NER) 4.5 Normalization 4.5.1 Stemming 4.5.2 Lemmatization 4.5.2.1 Wordnet Lemmatizer with specific POS tag 4.5.2.2 Wordnet Lemmatizer with appropriate POS tag 4.

NLP - Text Pre-Processing II (Tokenization and Stop Words)
/2021/05/25/nlp-text-pre-processing-ii-tokenization-and-stop-words/
Tue, 25 May 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 Text Pre-Processing 4.1 (Text Cleaning) 4.2 Tokenization 4.2.1 Word Tokenizer 4.2.2 Sentence Tokenizer 4.2.3 Application to the Example String 4.2.4 Application to the DataFrame 4.3 Stop Words 4.3.1 Application to the Example String 4.3.2 Application to the DataFrame 5 Conclusion
1 Introduction In my last publication, I started the post series on the topic of text pre-processing.

NLP - Text Pre-Processing I (Text Cleaning)
/2021/05/22/nlp-text-pre-processing-i-text-cleaning/
Sat, 22 May 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 Text Pre-Processing 4.1 Text Cleaning 4.1.1 Conversion to Lower Case 4.1.2 Removing HTML-Tags 4.1.3 Removing URLs 4.1.4 Removing Accented Characters 4.1.5 Removing Punctuation 4.1.6 Removing irrelevant Characters (Numbers and Punctuation) 4.1.7 Removing extra Whitespaces 4.1.8 Extra: Count Words 4.1.9 Extra: Expanding Contractions 4.1.10 Application to the Example String 4.1.11 Application to the DataFrame 5 Conclusion
1 Introduction In my last post (NLP - Text Manipulation) I got into the topic of Natural Language Processing.

NLP - Text Manipulation
/2021/05/18/nlp-text-manipulation/
Tue, 18 May 2021 00:00:00 +0000
1 Introduction 1.1 What is NLP? 1.2 Future Perspectives for NLP 1.3 Application Areas of NLP 2 Text Manipulation 2.1 String Variables 2.2 Use of Quotation Marks 2.3 Obtaining specific Information from a String 2.4 String Manipulation 2.5 Arithmetic Operations 2.6 Check String Properties 2.7 Replace certain Characters in Strings 2.8 For Loops with Strings 3 Conclusion
1 Introduction Let’s now move on to another large but very interesting topic area from the field of Data Science: Natural Language Processing.

Machine Learning Pipelines
/2021/05/11/machine-learning-pipelines/
Tue, 11 May 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries and classes 3 Loading the data 4 ML Pipelines 4.1 A simple Pipeline 4.2 Determination of the best Scaler 4.2.1 Creation of the Pipeline 4.2.2 Creation of a Pipeline Dictionary 4.2.3 Fit the Pipeline 4.2.4 Evaluate the Pipeline 4.3 Determination of the best Estimator 4.3.1 Creation of the Pipeline 4.3.2 Creation of a Pipeline Dictionary 4.3.3 Fit the Pipeline 4.

Modified Print Statements
/2021/04/20/modified-print-statements/
Tue, 20 Apr 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries and classes 3 Modified Print Statements 3.1 Print Statements with Variables 3.1.1 String Variables 3.1.2 Numeric Variables 3.2 Print Statements with compound Paths 3.3 Color Print Statements 3.4 Print Statements with if else 4 Conclusion
1 Introduction We often use print statements to get feedback on certain process steps or to present findings. In this post, I want to show how to use print statements cleverly and make them more descriptive.

Visualizations
/2021/04/07/visualizations/
Wed, 07 Apr 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries 3 Line Chart 3.1 Creating the Data 3.2 Simple Line Chart 3.3 Prevention of unwanted Ticks 3.4 Configurations 3.4.1 Rotation of the X-Axis 3.4.2 Labeling of the Chart 3.4.2.1 Add a Subtitle 3.4.2.2 Show bold Labels 3.4.2.3 Add a Legend 3.4.2.4 Add v-Lines 3.5 Storage of the created Charts 4 Conclusion
1 Introduction Visualizations are part of the bread and butter business for any Data Analyst or Scientist.

How to connect Python to a local SQL Server
/2021/03/27/how-to-connect-python-to-a-local-sql-server/
Sat, 27 Mar 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and Preparations 3 Connection to the DB 4 Exploration of the respective DB 5 Loading tables from DB 5.1 Complete Dataframe 5.2 Selected Data 6 Data Manipulation in SQL Server using Python 6.1 Insert Values into SQL Server Table 6.2 Delete Records in SQL Server 7 Inserting a Python Dataframe into SQL Server 8 Conclusion
1 Introduction Recently, I have been dealing with a wide variety of topics in the field of SQL.

SQL
/2021/03/24/sql/
Wed, 24 Mar 2021 00:00:00 +0000
1 Introduction 2 Chapters covered 2.1 Theoretical Background 2.2 Setup and Relationships of Tables 2.3 SQL Server specific Posts 2.4 Data Science 2.5 SQL Tools 3 Conclusion
1 Introduction Long time no hear. But I have not been idle. I have dedicated myself to another project in between: SQL.
I have built a separate homepage for this. You can find it here: Michael Fuchs SQL.

Automated Notifications
/2021/03/13/automated-notifications/
Sat, 13 Mar 2021 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Data Pre-Processing 3.1 Encoding of the Predictors 3.2 Encoding of the Target Variable 3.3 Train-Test Split 4 Get Automated Notifications 4.1 via Notify 4.2 via an Audible Signal 4.3 via Telegram 4.3.1 Set up a Chat Bot 4.3.2 Simple Notification 4.3.3 Notification with DateTime 4.3.4 Notification with DateTime and Processing Time 4.3.5 Notification with DateTime, Processing Time and Evaluation 4.

NN - Artificial Neural Network for Regression Analysis
/2021/03/02/nn-artificial-neural-network-for-regression-analysis/
Tue, 02 Mar 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries 3 Loading the data 4 Data pre-processing 4.1 Determination of the predictors and the criterion 4.2 Train-Validation-Test Split 4.3 Scaling 5 ANN for Regression 5.1 Name Definitions 5.2 Parameter Settings 5.3 Layer Structure 5.4 Configuring the model for training 5.5 Callbacks 5.6 Fitting the model 5.7 Obtaining the best model values 5.8 Storing all necessary metrics 5.9 Validation 5.

NN - Artificial Neural Network for Multi-Class Classification
/2021/02/23/nn-artificial-neural-network-for-multi-class-classfication/
Tue, 23 Feb 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries 3 Loading the data 4 Data pre-processing 4.1 Determination of the predictors and the criterion 4.2 Encoding 4.3 Train-Validation-Test Split 4.4 Check if all classes are included in every split part 5 ANN for Multi-Class Classification 5.1 Name Definitions 5.2 Parameter Settings 5.3 Layer Structure 5.4 Configuring the model for training 5.5 Callbacks 5.6 Fitting the model 5.

NN - Artificial Neural Network for binary Classification
/2021/02/16/nn-artificial-neural-network-for-binary-classification/
Tue, 16 Feb 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries 3 Loading the data 4 Data pre-processing 4.1 Determination of the predictors and the criterion 4.2 Encoding 4.3 Train-Validation-Test Split 5 ANN for binary Classification 5.1 Name Definitions 5.2 Parameter Settings 5.3 Layer Structure 5.4 Configuring the model for training 5.5 Callbacks 5.6 Fitting the model 5.7 Obtaining the best model values 5.8 Obtaining class assignments 5.9 Validation 5.

NN - Multi-layer Perceptron Regressor (MLPRegressor)
/2021/02/10/nn-multi-layer-perceptron-regressor-mlpregressor/
Wed, 10 Feb 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries and data 3 Data pre-processing 4 MLPRegressor 5 Model Evaluation 6 Hyper Parameter Tuning 7 Conclusion
1 Introduction In my last post about Deep Learning with the Multi-layer Perceptron, I showed how to make classifications with this type of neural network.
However, an MLP can also be used to solve regression problems. This will be the content of the following post.

NN - Multi-layer Perceptron Classifier (MLPClassifier)
/2021/02/03/nn-multi-layer-perceptron-classifier-mlpclassifier/
Wed, 03 Feb 2021 00:00:00 +0000
1 Introduction 2 Loading the libraries 3 MLPClassifier for binary Classification 3.1 Loading the data 3.2 Data pre-processing 3.3 MLPClassifier 3.4 Model Evaluation 3.5 Hyper Parameter Tuning 4 MLPClassifier for Multi-Class Classification 4.1 Loading the data 4.2 Data pre-processing 4.3 MLPClassifier 4.4 Model Evaluation 4.5 Hyper Parameter Tuning 5 Conclusion
1 Introduction After getting into the topic of Deep Learning (Computer Vision) with my posts from January, I would now like to write about Neural Networks in a more general post.

Classification of Dog-Breeds using a pre-trained CNN model
/2021/01/27/classification-of-dog-breeds-using-a-pre-trained-cnn-model/
Wed, 27 Jan 2021 00:00:00 +0000
A Udacity Data Science Nanodegree Capstone Project to classify Dog-Breeds using a pre-trained CNN Model.
Introduction The purpose of this project is to use a convolutional neural network (CNN) to predict dog breeds. I created a pipeline that can be used in a web or mobile app to process real images taken by users. Based on a picture of a dog, the algorithm I created can make an assessment about the dog breed.

CV - CNN with TFL and Fine-Tuning for Multi-Class Classification
/2021/01/24/cv-cnn-with-tfl-and-fine-tuning-for-multi-label-classification/
Sun, 24 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Data pre-processing 3.1 Train-Validation-Test Split 3.2 Obtaining the lists of randomly selected images 3.3 Determination of the directories 3.4 Obtain the total number of training, validation and test images 4 Feature Extraction with Data Augmentation 4.1 Name Definitions 4.2 Parameter Settings 4.3 Instantiating the VGG19 convolutional base 4.4 Freezing all layers up to a specific one 4.

CV - CNN with TFL and Fine-Tuning
/2021/01/22/cv-cnn-with-tfl-and-fine-tuning/
Fri, 22 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Data pre-processing 3.1 Train-Validation-Test Split 3.2 Obtaining the lists of randomly selected images 3.3 Determination of the directories 3.4 Obtain the total number of training, validation and test images 4 Feature Extraction with Data Augmentation 4.1 Name Definitions 4.2 Parameter Settings 4.3 Instantiating the VGG19 convolutional base 4.4 Freezing all layers up to a specific one 4.

CV - CNN with Transfer Learning for Multi-Class Classification
/2021/01/19/cv-cnn-with-transfer-learning-for-multi-label-classification/
Tue, 19 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Data pre-processing 3.1 Train-Validation-Test Split 3.2 Obtaining the lists of randomly selected images 3.3 Determination of the directories 3.4 Obtain the total number of training, validation and test images 4 Feature Extraction with Data Augmentation 4.1 Name Definitions 4.2 Parameter Settings 4.3 Instantiating the VGG19 convolutional base 4.4 Instantiating a densely connected classifier 4.4.1 Layer Structure 4.

Computer Vision - CNN with Transfer Learning
/2021/01/17/computer-vision-cnn-with-transfer-learning/
Sun, 17 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Data pre-processing 3.1 Train-Validation-Test Split 3.2 Obtaining the lists of randomly selected images 3.3 Determination of the directories 3.4 Obtain the total number of training, validation and test images 4 Feature Extraction without Data Augmentation 4.1 Name Definitions 4.2 Parameter Settings 4.3 Instantiating the VGG19 convolutional base 4.4 Feature Extraction 4.4.1 Get Output Shape of last Layer 4.

Computer Vision - CNN for Multi-Class Classification
/2021/01/15/computer-vision-cnn-for-multi-label-classification/
Fri, 15 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Data pre-processing 3.1 Train-Validation-Test Split 3.2 Obtaining the lists of randomly selected images 3.3 Determination of the directories 3.4 Obtain the total number of training, validation and test images 4 CNN with Data Augmentation 4.1 Name Definitions 4.2 Parameter Settings 4.3 Instantiating a CNN with Data Augmentation 4.3.1 Layer Structure 4.3.2 Configuring the model for training 4.

Computer Vision - Convolutional Neural Network
/2021/01/08/computer-vision-convolutional-neural-network/
Fri, 08 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Data pre-processing 3.1 Train-Validation-Test Split 3.2 Obtaining the lists of randomly selected images 3.3 Determination of the directories 3.4 Obtain the total number of training, validation and test images 4 Descriptive Statistics 5 Simple CNN 5.1 Name Definitions 5.2 Parameter Settings 5.3 Instantiating a small CNN 5.3.1 Layer Structure 5.3.2 Configuring the model for training 5.

Computer Vision - Automate The Boring Stuff
/2021/01/01/computer-vision-automate-the-boring-stuff/
Fri, 01 Jan 2021 00:00:00 +0000
1 Introduction 2 Import the libraries 3 Definition of required functions 4 Extract Images from .docx (+renaming) 4.1 Extract and rename Cats Images 4.2 Extract and rename Dogs Images 4.3 Current folder structure 5 Convert Images from .png to .jpg 5.1 Convert Images 5.2 Delete .png files 5.3 Rename .jpg’s accordingly 5.4 Copy all new Images to main folders 6 Train-Validation-Test Split 6.

Roadmap for ETL
/2020/11/28/roadmap-for-etl/
Sat, 28 Nov 2020 00:00:00 +0000
1 Introduction 2 Roadmap for ETL 2.1 “Simple Pipeline” 2.2 “Pipeline with join” 2.3 “Pipeline with join2” 2.4 “Pipeline with intermediate storage” 3 Conclusion
1 Introduction In the last articles I have dealt intensively with the topic of ETL.
An introduction to this topic is worthwhile for the following reasons:
- Modularity - better coding
- Flexibility
- Easier for other data scientists to read the code
- Easier error avoidance
- Automation
…
2 Roadmap for ETL At the beginning of the series of lectures I showed basically how to call .

ETL - Pipeline with intermediate storage
/2020/11/27/etl-pipeline-with-intermediate-storage/
Fri, 27 Nov 2020 00:00:00 +0000
1 Introduction 2 Setup 3 ETL Pipeline with intermediate storage 3.1 Extract 3.2 Transform_1 3.3 Transform_2 3.4 Load 4 Create etl_pipeline.py 5 Test etl_pipeline.py 5.1 from jupyter notebook 5.1.1 the very first time 5.1.2 when you changed something within preprocess_data 5.1.3 when you continue with analytics 5.2 from command line 6 Conclusion
1 Introduction So far, we have already got to know several variants of ETL with which a large part of use cases can be covered.

ETL - Pipeline with join2
/2020/11/26/etl-pipeline-with-join2/
Thu, 26 Nov 2020 00:00:00 +0000
1 Introduction 2 Setup 3 ETL Pipeline with join2 3.1 Extract 3.2 Transform 3.2.1 Joining 3.2.2 pre-process countries_metadata for joining 3.2.3 Cleaning 3.2.4 Add further calculations 3.3 Load 4 Create etl_pipeline.py 5 Test etl_pipeline.py 5.1 from jupyter notebook 5.2 from command line 6 Conclusion
1 Introduction Let us come to another variant of ETL. The “last time” I prepared two data sets and then merged them.

ETL - Pipeline with join
/2020/11/25/etl-pipeline-with-join/
Wed, 25 Nov 2020 00:00:00 +0000
1 Introduction 2 Setup 3 ETL Pipeline with join 3.1 Extract 3.2 Transform 3.3 Load 4 Create etl_pipeline.py 5 Test etl_pipeline.py 5.1 from jupyter notebook 5.2 from command line 6 Conclusion
1 Introduction In my last post I showed a “simple ETL”. Now we go one step further and add a join after the data has been processed.
Overview of the ETL steps:

ETL - Simple Pipeline
/2020/11/24/etl-simple-pipeline/
Tue, 24 Nov 2020 00:00:00 +0000
1 Introduction 2 Setup 3 ETL Simple Pipeline 3.1 Extract 3.2 Transform 3.3 Load 4 Create etl_pipeline.py 5 Test etl_pipeline.py 5.1 from jupyter notebook 5.2 from command line 6 Conclusion
1 Introduction Now that we’ve gotten into the subject of ETL and I’ve shown how to call “.py files from different sources”, it’s time to write a simple but profitable ETL for data analysis.

ETL - Read .py from different sources
/2020/11/23/etl-read-py-from-different-sources/
Mon, 23 Nov 2020 00:00:00 +0000
1 Introduction 2 The Setup 3 Run the python scripts 4 Content of the python scripts 5 Conclusion
1 Introduction Looking back, we have already covered an incredible number of data science topics.
We have dealt with a wide range of “topics in the field of machine learning”, and furthermore with how these algorithms can be applied in practice. A rule of thumb says that a data scientist spends 80% of their time on data preparation.

Time Series Analysis - XGBoost for Univariate Time Series
/2020/11/10/time-series-analysis-xgboost-for-univariate-time-series/
Tue, 10 Nov 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Definition of required functions 4 Train Test Split 5 Create Time Series Features 6 Fit the Model 7 Get Feature Importance 8 Forecast And Evaluation 9 Look at Worst and Best Predicted Days 10 Grid Search 11 Conclusion
1 Introduction Now I have written a few posts in the recent past about Time Series and Forecasting. But I didn’t want to deprive you of a very well-known and popular algorithm: XGBoost.

Time Series Analysis - Neural Networks with multiple predictors
/2020/11/04/time-series-analysis-neural-networks-with-multiple-predictors/
Wed, 04 Nov 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Definition of required functions 4 Data pre-processing 4.1 Drop Duplicates 4.2 Feature Encoding 4.3 Check for Feature Importance 4.4 Generate Test Set 4.5 Feature Scaling 4.6 Train-Validation Split 4.7 Prepare training and test data using tf 5 Neural Networks with mult. predictors 5.1 LSTM 5.2 Bidirectional LSTM 5.3 GRU 5.4 Encoder Decoder LSTM 5.5 CNN 6 Get the Best Model 7 Conclusion & Overview
1 Introduction Neural networks can be used not only for “univariate time series”.

Time Series Analysis - Neural Networks for Univariate Time Series
/2020/11/01/time-series-analysis-neural-networks-for-forecasting-univariate-variables/
Sun, 01 Nov 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Definition of required functions 4 Data pre-processing 4.1 Drop Duplicates 4.2 Generate Test Set 4.3 Define Target Variable 4.4 Scaling 4.5 Train-Validation Split 4.5.1 for Single Step Style (sss) 4.5.2 for Horizon Style (hs) 4.6 Prepare training and test data using tf 4.6.1 for Single Step Style (sss) 4.6.2 for Horizon Style (hs) 5 Neural Networks 5.

Time Series Analysis - Regression Extension Techniques for Multivariate Time Series
/2020/10/29/time-series-analysis-regression-extension-techniques-for-forecasting-multivariate-variables/
Thu, 29 Oct 2020 00:00:00 +0000
1 Introduction 2 Import the Libraries and the Data 3 Definition of required Functions 4 EDA 5 Stationarity 5.1 Check for stationarity 5.2 Train Test Split 5.3 Make data stationary 5.4 Check again for stationarity 6 Cointegration Test 7 Regression Extension Techniques for Forecasting Multivariate Variables 7.1 Vector Autoregression (VAR) 7.1.1 Get best AR Terms 7.1.2 Fit VAR 7.1.3 Inverse Transformation 7.1.4 Evaluation of VAR 7.

Time Series Analysis - Regression Extension Techniques for Univariate Time Series
/2020/10/27/time-series-analysis-regression-extension-techniques-for-forecasting-univariate-variables/
Tue, 27 Oct 2020 00:00:00 +0000
1 Introduction 2 Theoretical Background 3 Import the Libraries and Data 4 Definition of required Functions 5 Check for Stationarity 6 ARIMA in Action 7 Seasonal ARIMA (SARIMA) 7.1 Get the final Model 8 SARIMAX 8.1 Get the final Model 9 Conclusion
1 Introduction Now that we are familiar with smoothing methods for predicting time series, we come to so-called regression extension techniques.

Time Series Analysis - Smoothing Methods
/2020/10/23/time-series-analysis-smoothing-methods/
Fri, 23 Oct 2020 00:00:00 +0000
1 Introduction 2 Import libraries and data 3 Definition of required functions 4 Simple Exponential Smoothing 4.1 Searching for best parameters for SES 4.2 Fit SES 4.3 Fit SES with optimized=True 4.4 Plotting the results for SES 5 Double Exponential Smoothing 5.1 Searching for best parameters for DES 5.2 Fit DES 5.3 Fit DES with optimized=True 5.4 Plotting the results for DES 6 Triple Exponential Smoothing 6.

Time Series Analysis - Working with Dates and Times
/2020/10/19/time-series-analysis-working-with-dates-and-times/
Mon, 19 Oct 2020 00:00:00 +0000
1 Introduction 1.1 Stationary Data 1.2 Differencing 1.3 Working with Dates and Times 2 Import the libraries and the data 3 Convert timestamp to DateTime 4 Extract Year, Month and Day 5 Extract Weekday and Week 6 Calculate Quarter 7 Generate YearQuarter 8 Filter for TimeDate 9 Conclusion
1 Introduction Let’s continue our journey through the different Analytics fields. Let’s now move on to the topic of Time Series Analysis.

Recommendation Systems - Metadata-based Recommender
/2020/10/05/recommendation-systems-metadata-based-recommender/
Mon, 05 Oct 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Data pre-processing 3.1 Clean id column of df 3.2 Join the dataframes 3.3 Wrangling crew, cast, keywords and genres 3.4 Sanitize data 3.5 Create a soup of desired metadata 3.6 Create vectors with CountVectorizer 3.7 Compute the pairwise similarity 4 Build the Metadata-based Recommender 5 Test the recommender 6 Conclusion
1 Introduction Now that we have developed a “recommender” based on the film descriptions, we will go a step further in this post and add more metadata.

Recommendation Systems - Plot Description-based Recommender
/2020/10/03/recommendation-systems-plot-description-based-recommender/
Sat, 03 Oct 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Data pre-processing Part I 4 Data pre-processing Part II 4.1 Introduction of the CountVectorizer 4.2 Introduction of the TF-IDFVectorizer 4.3 Create TF-IDF vectors 4.4 Compute the pairwise cosine similarity 5 Build the Plot Description-based Recommender 6 Test the recommender 7 Conclusion
1 Introduction After having developed a simple “Knowledge-based Recommender” we now come to another recommender: the Plot Description-based Recommender.

Recommendation Systems - Knowledge-based Recommender
/2020/10/01/recommendation-systems-knowledged-based-recommender/
Thu, 01 Oct 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Data pre-processing 3.1 Extract the release year 3.2 Convert the genres features 4 Build the Knowledge-based Recommender 5 Conclusion
1 Introduction After Marketing Analytics it is now time to dedicate ourselves to a new field of Analytics. As we have already touched on “recommendations in the marketing context”, it makes sense to continue with the topic of recommendation systems at this point.

Marketing - A/B Testing
/2020/09/29/marketing-a-b-testing/
Tue, 29 Sep 2020 00:00:00 +0000
1 Introduction 2 Import the libraries and the data 3 Descriptive Analytics 4 Significance Tests 5 Conclusion
1 Introduction At this point we have already covered some topics from the field of marketing:
- “Customer Lifetime Value”
- “Market Basket Analysis”
- “Product Analytics and Recommendations”
- “Conversion Rate Analytics”
Now we turn to a smaller but equally important area: A/B Testing. The decisions that are made in the marketing area can be very far-reaching.

Marketing - Customer Lifetime Value
/2020/09/22/marketing-customer-lifetime-value/
Tue, 22 Sep 2020 00:00:00 +0000/2020/09/22/marketing-customer-lifetime-value/1 Introduction 2 Import the libraries and the data 3 Data pre-processing 3.1 Negative Quantity 3.2 Missing Values within CustomerID 3.3 Handling incomplete data 3.4 Total Sales 3.5 Create final dataframe 4 Descriptive Analytics 4.1 Final Dataframe for Descriptive Analytics 4.2 Visualizations 5 Predicting 3-Month CLV 5.1 Final Dataframe for Prediction Models 5.2 Building Sample Set 5.3 Train Test Split 5.4 Linear Regression 5.Marketing - Market Basket Analysis
/2020/09/15/marketing-market-basket-analysis/
Tue, 15 Sep 2020 00:00:00 +0000/2020/09/15/marketing-market-basket-analysis/1 Introduction 2 Market Basket Analysis 3 Import the libraries and the data 4 Data pre-processing 5 Executing the Apriori Algorithm 6 Deriving Association Rules 7 Conclusion 1 Introduction Another exciting topic in marketing analytics is Market Basket Analysis. This is the topic of this publication. At the beginning of this post I will be introducing some key terms and metrics aimed at giving a sense of what “association” in a rule means and some ways to quantify the strength of this association.Marketing - Product Analytics and Recommendations
/2020/09/08/marketing-product-analytics-and-recommendations/
Tue, 08 Sep 2020 00:00:00 +0000/2020/09/08/marketing-product-analytics-and-recommendations/1 Introduction 2 Import the libraries and the data 3 Product Analytics 3.1 Number of Orders over Time 3.2 Revenue over Time 3.3 Repeat Customers over Time 3.4 Repeat Customers Revenue over Time 3.5 Popular Items over Time 4 Product Recommendations 4.1 Collaborative Filtering 4.1.1 User-based Collaborative Filtering 4.1.2 Item-based Collaborative Filtering 5 Conclusion 1 Introduction I entered the field of marketing analytics with the topic conversion rate analysis.Marketing - Conversion Rate Analytics
/2020/09/01/marketing-conversion-rate-analytics/
Tue, 01 Sep 2020 00:00:00 +0000/2020/09/01/marketing-conversion-rate-analytics/1 Introduction 2 Import the libraries 3 Descriptive Analytics (Conversion Rate) 4 Drivers behind Marketing Engagement 4.1 Select Numerical Columns 4.2 Select and Encode Categorical Columns 4.3 Create final Dataframe 4.4 Regression Analysis (Logit) 5 Predicting the Likelihood of Marketing Engagement 5.1 Fit the Model 5.2 Feature Importance 5.3 Model Evaluation 6 Engagement to Conversion 7 Conclusion 1 Introduction After having reported very detailed in numerous posts about the different machine learning areas I will now work on various analytics fields.The Data Science Process (CRISP-DM)
/2020/08/21/the-data-science-process-crisp-dm/
Fri, 21 Aug 2020 00:00:00 +0000/2020/08/21/the-data-science-process-crisp-dm/1 Introduction 2 Import the libraries 3 Import the data 4 Answering Research Questions - Descriptive Statistics 4.1 Data pre-processing 4.1.1 Check for Outliers 4.1.2 Check for Missing Values 4.1.3 Handling Categorical Variables 4.2 Research Question 1 4.3 Research Question 2 4.4 Research Question 3 5 Development of a Machine Learning Model End to End Process 5.1 Data pre-processing 5.1.1 Train Test Split 5.Roadmap for the Machine Learning fields
/2020/08/20/roadmap-for-the-machine-learning-fields/
Thu, 20 Aug 2020 00:00:00 +0000/2020/08/20/roadmap-for-the-machine-learning-fields/1 Introduction 2 Roadmap for the Machine Learning fields 3 Conclusion 1 Introduction As mentioned in my previous post, here is an overview of the Machine Learning fields.
2 Roadmap for the Machine Learning fields Here are the links to the individual Roadmaps:
“Roadmap for Regression Analysis” “Roadmap for Classification Tasks” “Roadmap for Cluster Analysis” “Roadmap for Dimensionality Reduction” 3 Conclusion The overview shows the fields from machine learning.Roadmap for Dimensionality Reduction
/2020/08/18/roadmap-for-dimensionality-reduction/
Tue, 18 Aug 2020 00:00:00 +0000/2020/08/18/roadmap-for-dimensionality-reduction/1 Introduction 2 Roadmap for Dimensionality Reduction 3 Conclusion 1 Introduction In conclusion to all publications on dimension reduction methods, I am giving an overview of the algorithms discussed here.
2 Roadmap for Dimensionality Reduction Annotation: The Kernel-PCA is of course not a linear dimension reduction method.
Here are the links to the individual topics.
PCA in general PCA for speed up ML models PCA for Visualization Randomized PCA Incremental PCA Kernel PCA LDA Manifold Learning 3 Conclusion Now we are finally at the end of the publications on the various topics in the field of data science.Manifold Learning
/2020/08/12/manifold-learning/
Wed, 12 Aug 2020 00:00:00 +0000/2020/08/12/manifold-learning/1 Introduction 2 Loading the libraries 3 Manifold Learning Methods 3.1 Locally Linear Embedding 3.2 Modified Locally Linear Embedding 3.3 Isomap 3.4 Spectral Embedding 3.5 Multi-dimensional Scaling (MDS) 3.6 t-SNE 4 Comparison of the calculation time 5 Conclusion 1 Introduction Curse of Dimensionality
The curse of dimensionality is one of the most important problems in multivariate machine learning. It appears in many different forms, but all of them have the same net form and source: the fact that points in high-dimensional space are highly sparse.Linear Discriminant Analysis (LDA)
/2020/08/07/linear-discriminant-analysis-lda/
Fri, 07 Aug 2020 00:00:00 +0000/2020/08/07/linear-discriminant-analysis-lda/1 Introduction 2 Loading the libraries and data 3 Descriptive statistics 4 Data pre-processing 5 LDA in general 6 PCA vs. LDA 7 LDA as a classifier 8 Conclusion 1 Introduction Now that I have written extensively about the “PCA”, we now come to another dimension reduction algorithm: The Linear Discriminant Analysis.
LDA is a supervised machine learning method that is used to separate two or more classes of objects or events.PCA for speed up ML models
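The supervised nature of LDA described here can be sketched in a few lines with scikit-learn; Iris is used as a stand-in dataset, not the data from the post:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unlike PCA, LDA uses the class labels (supervised) to find the axes
# that best separate the classes; n_components is at most n_classes - 1
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)   # (150, 2)

# LDA can also act directly as a classifier
print(lda.score(X, y))
```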
/2020/07/31/pca-for-speed-up-ml-models/
Fri, 31 Jul 2020 00:00:00 +0000/2020/07/31/pca-for-speed-up-ml-models/1 Introduction 2 Loading the libraries and the dataset 3 LogReg 4 LogReg with PCA 4.1 PCA with 95% variance explanation 4.2 PCA with 80% variance explanation 4.3 Summary 5 Export PCA to use in another program 6 Conclusion 1 Introduction As already announced in post about “PCA”, we now come to the second main application of a PCA: Principal Component Analysis for speed up machine learning models.PCA for Visualization
/2020/07/27/pca-for-visualization/
Mon, 27 Jul 2020 00:00:00 +0000/2020/07/27/pca-for-visualization/1 Introduction 2 Loading the libraries and the dataset 3 Statistics and preprocessing 4 PCA for visualization 4.1 Interpreting Components 4.2 Visualization of the components 5 Conclusion 1 Introduction After I wrote extensively on the subject of “Principal Component Analysis” in my last publication, we now come to one of the two main uses announced: PCA for visualizations.
For this post the dataset Pokemon from the statistic platform “Kaggle” was used.Principal Component Analysis (PCA)
/2020/07/22/principal-component-analysis-pca/
Wed, 22 Jul 2020 00:00:00 +0000/2020/07/22/principal-component-analysis-pca/1 Introduction 2 Loading the libraries 3 Introducing PCA 4 PCA in general 5 Randomized PCA 6 Incremental PCA 7 Kernel PCA 8 Tuning Hyperparameters 9 Conclusion 1 Introduction After the various methods of cluster analysis “Cluster Analysis” have been presented in various publications, we now come to the second category in the area of unsupervised machine learning: Dimensionality Reduction
The areas of application of dimensionality reduction are widely spread within machine learning.Roadmap for Cluster Analysis
/2020/07/14/roadmap-for-cluster-analysis/
Tue, 14 Jul 2020 00:00:00 +0000/2020/07/14/roadmap-for-cluster-analysis/1 Introduction 2 Roadmap for Cluster Analysis 3 Description of the cluster algorithms in a nutshell 4 Conclusion 1 Introduction In my most recent publications, I have dealt extensively with individual topics in the field of cluster analysis. This post should serve as a summary of the topics covered.
2 Roadmap for Cluster Analysis First annotation: The cluster algorithms that are marked with a red star in the graphic do not require an entry of k for the number of clusters.Spectral Clustering
/2020/07/08/spectral-clustering/
Wed, 08 Jul 2020 00:00:00 +0000/2020/07/08/spectral-clustering/1 Introduction 2 Loading the libraries 3 Introducing Spectral Clustering 4 Generating some test data 5 k-Means 6 Spectral Clustering 7 Digression: Feature-Engineering & k-Means 8 Conclusion 1 Introduction My post series from the unsupervised machine learning area about cluster algorithms is slowly coming to an end. However, what cluster algorithm cannot be missing in any case is Spectral Clustering. And this is what the following post is about.Mean Shift Clustering
/2020/07/01/mean-shift-clustering/
Wed, 01 Jul 2020 00:00:00 +0000/2020/07/01/mean-shift-clustering/1 Introduction 2 Loading the libraries 3 Generating some test data 4 Introducing Mean Shift Clustering 5 Mean Shift with scikit-learn 6 Conclusion 1 Introduction Suppose you have been given the task of discovering groups, or clusters, that share certain characteristics within a dataset. There are various unsupervised machine learning algorithms that can be used to do this.
As we’ve seen in past posts, “k-Means Clustering” and “Affinity Propagation” can be used if you have good or easily separable data, respectively.Affinity Propagation
/2020/06/29/affinity-propagation/
Mon, 29 Jun 2020 00:00:00 +0000/2020/06/29/affinity-propagation/1 Introduction 2 Loading the libraries 3 Generating some test data 4 Introducing Affinity Propagation 5 Affinity Propagation with scikit-learn 6 Conclusion 1 Introduction In the past few posts some cluster algorithms were presented. I wrote extensively about “k-Means Clustering”, “Hierarchical Clustering”, “DBSCAN”, “HDBSCAN” and finally about “Gaussian Mixture Models” as well as “Bayesian Gaussian Mixture Models”.
Fortunately, we are not yet through with the most common cluster algorithms.Bayesian Gaussian Mixture Models
/2020/06/26/bayesian-gaussian-mixture-models/
Fri, 26 Jun 2020 00:00:00 +0000/2020/06/26/bayesian-gaussian-mixture-models/1 Introduction 2 Loading the libraries 3 Generating some test data 4 Bayesian Gaussian Mixture Models in action 5 Conclusion 1 Introduction In my last post I reported on “Gaussian Mixture Models”. Now we come to a kind of extension of GMM: the Bayesian Gaussian Mixture Models. As we have seen at “GMM”, we could either only infer the number of clusters by eye or by comparing the theoretical information criteria “AIC” and “BIC” for different k.Gaussian Mixture Models
/2020/06/24/gaussian-mixture-models/
Wed, 24 Jun 2020 00:00:00 +0000/2020/06/24/gaussian-mixture-models/1 Introduction 2 Loading the libraries 3 Generating some test data 4 Weaknesses of k-Means 5 Gaussian Mixture Models 6 Determine the optimal k for GMM 7 GMM for density estimation 8 Conclusion 1 Introduction Let’s come to a further unsupervised learning cluster algorithm: The Gaussian Mixture Models. As simple or good as the K-Means algorithm is, it is often difficult to use in real world situations.HDBSCAN
/2020/06/20/hdbscan/
Sat, 20 Jun 2020 00:00:00 +0000/2020/06/20/hdbscan/1 Introduction 2 Loading the libraries 3 Introducing HDBSCAN 4 Parameter Selection for HDBSCAN 5 HDBSCAN in action 5.1 Functionality of the HDBSCAN algorithm 5.2 Visualization options 5.3 Predictions with HDBSCAN 6 Conclusion 1 Introduction In the series of unsupervised learning cluster algorithms, we have already got to know “hierarchical clustering” and “density-based clustering (DBSCAN)”. Now we come to an expansion of the DBSCAN algorithm in which the hierarchical approach is integrated.DBSCAN
/2020/06/15/dbscan/
Mon, 15 Jun 2020 00:00:00 +0000/2020/06/15/dbscan/1 Introduction 2 Loading the libraries 3 Introducing DBSCAN 4 DBSCAN with Scikit-Learn 4.1 Data preparation 4.2 k-Means 4.3 DBSCAN 5 Conclusion 1 Introduction The next unsupervised machine learning cluster algorithm is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). DBSCAN is a density-based clustering algorithm, which can be used to identify clusters of any shape in a data set containing noise and outliers.Hierarchical Clustering
/2020/06/04/hierarchical-clustering/
Thu, 04 Jun 2020 00:00:00 +0000/2020/06/04/hierarchical-clustering/1 Introduction 2 Loading the libraries 3 Introducing hierarchical clustering 4 Dendrograms explained 5 Hierarchical Clustering with Scikit-Learn 6 Hierarchical clustering on real-world data 7 Conclusion 1 Introduction The second cluster algorithm I would like to present is hierarchical clustering. Hierarchical clustering is also a type of unsupervised machine learning algorithm used to cluster unlabeled data points within a dataset. Like “k-Means Clustering”, hierarchical clustering also groups together the data points with similar characteristics.k-Means Clustering
/2020/05/19/k-means-clustering/
Tue, 19 May 2020 00:00:00 +0000/2020/05/19/k-means-clustering/1 Introduction 2 Loading the libraries 3 Introducing k-Means 4 Preparation of the data record 5 Application of k-Means 6 Determine the optimal k for k-Means 7 Visualization 8 Conclusion 1 Introduction After dealing with supervised machine learning models, we now come to another important data science area: unsupervised machine learning models. One class of the unsupervised machine learning models are the cluster algorithms. Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.Ensemble Modeling - Voting
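The idea described here, learning an optimal division of points from the data alone, can be sketched with scikit-learn's KMeans on synthetic blobs (a toy setup, not the data used in the post):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups and no labels used for fitting
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# k-Means needs k (the number of clusters) up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)
```

Each point is assigned to its nearest centroid, and the centroids are iteratively recomputed until the assignment stabilizes.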
/2020/05/05/ensemble-modeling-voting/
Tue, 05 May 2020 00:00:00 +0000/2020/05/05/ensemble-modeling-voting/1 Introduction 2 Background Information on Voting 3 Loading the libraries and the data 4 Data pre-processing 5 Voting with scikit learn 6 GridSearch 7 Overview of the accuracy scores 8 Conclusion 1 Introduction I have already presented three different Ensemble Methods “Bagging”, “Boosting” and “Stacking”. But there is another one that I would like to report on in this publication: Voting
Voting is an ensemble machine learning model that combines the predictions from multiple other models.Stacking with Scikit-Learn
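Combining the predictions of multiple models, as described here, can be sketched with scikit-learn's VotingClassifier; the base estimators and synthetic data below are illustrative choices, not the post's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Hard voting: each base model casts one vote, the majority class wins
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
voting.fit(X, y)
print(voting.score(X, y))
```

With `voting="soft"` the averaged class probabilities are used instead of majority votes, which often works better when the base models are well calibrated.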
/2020/04/29/stacking-with-scikit-learn/
Wed, 29 Apr 2020 00:00:00 +0000/2020/04/29/stacking-with-scikit-learn/1 Introduction 2 Importing the libraries and the data 3 Data pre-processing 4 Stacking with scikit learn 4.1 Model 1 incl. GridSearch 4.2 Model 2 incl. GridSearch 5 Conclusion 1 Introduction In my previous post I explained the “ensemble modeling method ‘Stacking’”. As it is described there, it is entirely applicable. However, it can be made even easier with the machine learning library scikit learn.Ensemble Modeling - Stacking
/2020/04/24/ensemble-modeling-stacking/
Fri, 24 Apr 2020 00:00:00 +0000/2020/04/24/ensemble-modeling-stacking/1 Introduction 2 Background Information on Stacking 3 Loading the libraries and the data 4 Data pre-processing 4.1 One-hot-encoding 4.2 LabelBinarizer 4.3 Train-Test-Split 4.4 Convert to a numpy array 5 Building a stacked model 5.1 Create a new training set 5.2 Train base models 5.3 Create a new test set 5.4 Fit base models on the complete training set 5.5 Train the stacked model 6 Comparison of the accuracy 7 Conclusion 1 Introduction After “Bagging” and “Boosting” we come to the last type of ensemble method: Stacking.Ensemble Modeling - XGBoost
/2020/04/01/ensemble-modeling-xgboost/
Wed, 01 Apr 2020 00:00:00 +0000/2020/04/01/ensemble-modeling-xgboost/1 Introduction 2 Theoretical Background 3 Import the libraries 4 XGBoost for Classification 4.1 Load the bank dataset 4.2 Pre-process the bank dataset 4.3 Fit the Model 4.4 Evaluate the Model 4.5 Monitor Performance and Early Stopping 4.6 Xgboost Built-in Feature Importance 4.6.1 Get Feature Importance of all Features 4.6.2 Get the feature importance of all the features the model has retained 4.7 Grid Search 5 XGBoost for Regression 5.Ensemble Modeling - Boosting
/2020/03/26/ensemble-modeling-boosting/
Thu, 26 Mar 2020 00:00:00 +0000/2020/03/26/ensemble-modeling-boosting/1 Introduction 2 Background Information on Boosting 3 Loading the libraries and the data 4 Data pre-processing 5 AdaBoost (Adaptive Boosting) 6 Gradient Boosting 7 Conclusion 1 Introduction After “Bagging” we come to another type of ensemble method: Boosting.
For this post the dataset Bank Data from the platform “UCI Machine Learning Repository” was used. You can download it from my “GitHub Repository”.
2 Background Information on Boosting Boosting often considers homogeneous weak learners and learns them sequentially in a very adaptative way (a base model depends on the previous ones) and combines them following a deterministic strategy.Ensemble Modeling - Bagging
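The sequential, adaptive fitting of weak learners described above can be sketched with scikit-learn's AdaBoostClassifier (its default weak learner is a depth-1 decision stump); the synthetic data is illustrative, not the bank dataset from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each new stump is fitted with higher weights on the samples the
# previous stumps misclassified, then all stumps vote with learned weights
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)

print(ada.score(X, y))
```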
/2020/03/07/ensemble-modeling-bagging/
Sat, 07 Mar 2020 00:00:00 +0000/2020/03/07/ensemble-modeling-bagging/1 Introduction 2 Background Information on Bagging 3 Loading the libraries and the data 4 Data pre-processing 5 Decision Tree Classifier 6 Bagging Classifier 7 Random Forest Classifier 7.1 Train the Random Forest Classifier 7.2 Evaluate the Forest Classifier 7.2.1 StratifiedKFold 7.2.2 KFold 7.3 Hyperparameter optimization via Randomized Search 7.4 Determination of feature importance 8 Conclusion 1 Introduction So far we have dealt very intensively with the use of different classification algorithms.Saving machine learning models to disc
/2020/02/29/saving-machine-learning-models-to-disc/
Sat, 29 Feb 2020 00:00:00 +0000/2020/02/29/saving-machine-learning-models-to-disc/1 Introduction 2 Loading the libraries and the data 3 Visualization of the data 4 Model training 5 Safe a model to disc 6 Load a model from disc 7 Conclusion 1 Introduction We have seen how to train and use different types of machine learning models. But how do we proceed when we have developed and trained a model with the desired performance? Due to the fact that the training of large machine learning models can sometimes take many hours, it is a good tip to save your trained models regularly so that you can access them later.Roadmap for Classification Tasks
/2020/02/19/roadmap-for-classification-tasks/
Wed, 19 Feb 2020 00:00:00 +0000/2020/02/19/roadmap-for-classification-tasks/1 Introduction 2 Roadmap for Classification Tasks 2.1 Data pre-processing 2.2 Feature Selection Methods 2.3 Algorithms 2.3.1 Classification Algorithms 2.3.2 Classification with Neural Networks 2.3.3 AutoML 3 Conclusion 1 Introduction Another big chapter from the supervised machine learning area comes to an end. In the past 4 months I wrote in detail about the functionality and use of the most common classification algorithms within data science.Feature selection methods for classification tasks
/2020/01/31/feature-selection-methods-for-classification-tasks/
Fri, 31 Jan 2020 00:00:00 +0000/2020/01/31/feature-selection-methods-for-classification-tasks/1 Introduction 2 Loading the libraries and the data 3 Filter methods 4 Wrapper methods 4.1 SelectKBest 4.2 Step Forward Feature Selection 4.3 Backward Elimination 4.4 Recursive Feature Elimination (RFE) 4.5 Exhaustive Feature Selection 5 Conclusion 1 Introduction I already wrote about feature selection for regression analysis in this “post”. Feature selection is also relevant for classification problems. And that’s what this post is about.Dealing with imbalanced classes
/2020/01/16/dealing-with-imbalanced-classes/
Thu, 16 Jan 2020 00:00:00 +0000/2020/01/16/dealing-with-imbalanced-classes/1 Introduction 2 Loading the libraries and the data 3 Data pre-processing 4 Logistic Regression 5 Resampling methods 5.1 Oversampling 5.2 Undersampling 6 ML Algorithms for imbalanced datasets 6.1 SMOTE (Synthetic Minority Over-sampling Technique) 6.2 NearMiss 7 Penalize Algorithms 8 Tree-Based Algorithms 9 Conclusion 1 Introduction Imbalanced classes are a surprisingly common problem in machine learning (specifically in classification): in datasets with a disproportionate ratio of observations in each class, the validation metric “Accuracy” becomes highly misleading.Introduction to KNN Classifier
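One of the countermeasures listed in that table of contents, penalizing the algorithm, can be sketched with scikit-learn's `class_weight` parameter. The 95/5 class split below is a synthetic stand-in for the post's data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# 95% of the samples in class 0, only 5% in class 1
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], flip_y=0, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Accuracy looks fine either way; recall on the minority class
# shows what the penalized model actually gains
rec_plain = recall_score(y, plain.predict(X))
rec_weighted = recall_score(y, weighted.predict(X))
print(rec_plain, rec_weighted)
```

The `class_weight="balanced"` setting reweights the loss inversely to the class frequencies, so misclassifying a minority sample costs more than misclassifying a majority sample.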
/2019/12/27/introduction-to-knn-classifier/
Fri, 27 Dec 2019 00:00:00 +0000/2019/12/27/introduction-to-knn-classifier/1 Introduction 2 Background information on KNN 3 Loading the libraries and the data 4 KNN - Model Fitting and Evaluation 5 Determination of K and Model Improvement 6 Conclusion 1 Introduction K Nearest Neighbor (KNN) is a very simple supervised classification algorithm which is easy to understand, versatile and one of the topmost machine learning algorithms. The KNN algorithm can be used for both classification (binary and multiple) and regression problems.Introduction to Naive Bayes Classifier
/2019/12/15/introduction-to-naive-bayes-classifier/
Sun, 15 Dec 2019 00:00:00 +0000/2019/12/15/introduction-to-naive-bayes-classifier/1 Introduction 2 Background information on Naive Bayes Classifier 3 Loading the libraries and the data 4 Data pre-processing 5 Naive Bayes in scikit-learn 5.1 Binary Classification 5.1.1 Gaussian Naive Bayes 5.1.2 Bernoulli Naive Bayes 5.2 Multiple Classification 5.2.1 Gaussian Naive Bayes 5.2.2 Multinomial Naive Bayes 6 Conclusion 1 Introduction Now in the series of multiple classifiers we come to a very easy to use probability model: The Naive Bayes Classifier.Introduction to Decision Trees
/2019/11/30/introduction-to-decision-trees/
Sat, 30 Nov 2019 00:00:00 +0000/2019/11/30/introduction-to-decision-trees/1 Introduction 2 Background information on decision trees 3 Loading the libraries and the data 4 Decision Trees with scikit-learn 5 Visualization of the decision tree 5.1 via graphviz 5.2 via scikit-learn 6 Model evaluation 7 Model improvement 7.1 Hyperparameter optimization via Grid Search 7.2 Pruning 8 Conclusion 1 Introduction After “Multinomial logistic regression” we come to a further multiple class classifier: Decision Trees.Multinomial logistic regression
/2019/11/15/multinomial-logistic-regression/
Fri, 15 Nov 2019 00:00:00 +0000/2019/11/15/multinomial-logistic-regression/1 Introduction 2 Loading the libraries and the data 3 Multinomial logistic regression with scikit-learn 3.1 Fit the model 3.2 Model validation 3.3 Calculated probabilities 4 Multinomial Logit with the statsmodel library 5 Conclusion 1 Introduction In my previous posts, I explained how “Logistic Regression” and “Support Vector Machines” works. Short wrap up: we used a logistic regression or a support vector machine to create a binary classification model.Introduction to Perceptron Algorithm
/2019/11/14/introduction-to-perceptron-algorithm/
Thu, 14 Nov 2019 00:00:00 +0000/2019/11/14/introduction-to-perceptron-algorithm/1 Introduction 2 Background information on Perceptron Algorithm 3 Loading the libraries and the data 4 Perceptron - Model Fitting and Evaluation 5 Hyperparameter optimization via Grid Search 6 OvO/OvR with the Perceptron 7 Perceptron with SGD training 8 Conclusion 1 Introduction I already wrote about “Logistic Regression” and “Support Vector Machines”. I also showed how to optimize these linear classifiers using “SGD training” and how to use the “OneVersusRest and OneVersusAll” Classifier to convert binary classifiers to multiple classifiers.OvO and OvR Classifier
/2019/11/13/ovo-and-ovr-classifier/
Wed, 13 Nov 2019 00:00:00 +0000/2019/11/13/ovo-and-ovr-classifier/1 Introduction 2 Background information on OvO and OvR 3 Loading the libraries and the data 4 OvO/OvR with Logistic Regression 4.1 One-vs-Rest 4.2 One-vs-One 4.3 Grid Search 5 OvO/OvR with SVM 5.1 One-vs-Rest 5.2 One-vs-One 5.3 Grid Search 6 Conclusion 1 Introduction We already know from my previous posts how to train a binary classifier using “Logistic Regression” or “Support Vector Machines”.Introduction to SGD Classifier
/2019/11/11/introduction-to-sgd-classifier/
Mon, 11 Nov 2019 00:00:00 +0000/2019/11/11/introduction-to-sgd-classifier/1 Introduction 2 Background information on SGD Classifiers 3 Loading the libraries and the data 4 Data pre-processing 5 SGD-Classifier 5.1 Logistic Regression with SGD training 5.2 Linear SVM with SGD training 6 Model improvement 6.1 Performance comparison of the different linear models 6.2 GridSearch 7 Conclusion 1 Introduction The name Stochastic Gradient Descent - Classifier (SGD-Classifier) might mislead some users into thinking that SGD is a classifier.Introduction to Support Vector Machines
/2019/11/08/introduction-to-support-vector-machines/
Fri, 08 Nov 2019 00:00:00 +0000/2019/11/08/introduction-to-support-vector-machines/1 Introduction 2 Background information on Support Vector Machines 3 Loading the libraries and the data 4 Data pre-processing 5 SVM with scikit-learn 5.1 Model Fitting 5.2 Model evaluation 6 Kernel SVM with Scikit-Learn 6.1 Polynomial Kernel 6.2 Gaussian Kernel 6.3 Sigmoid Kernel 7 Hyperparameter optimization via Grid Search 8 Conclusion 1 Introduction In addition to “Logistic Regression”, there is another very well-known algorithm for binary classifications: the Support Vector Machine (SVM).Randomized Search
/2019/11/06/randomized-search/
Wed, 06 Nov 2019 00:00:00 +0000/2019/11/06/randomized-search/1 Introduction 2 Grid Search vs. Randomized Search 3 Loading the libraries and data 4 Data pre-processing 5 Grid Search 6 Randomized Search 7 Conclusion 1 Introduction In my last publication on “Grid Search” I showed how to do hyperparameter tuning. As you saw in the last chapter (6.3 Grid Search with more than one estimator), these calculations quickly become very computationally intensive. This sometimes leads to very long calculation times.Grid Search
/2019/11/04/grid-search/
Mon, 04 Nov 2019 00:00:00 +0000/2019/11/04/grid-search/1 Introduction 2 Background information on Grid Search 3 Loading the libraries and the data 4 Data pre-processing 5 LogReg 6 Grid Search 6.1 Grid Search with LogReg 6.2 Grid Search with other machine learning algorithms 6.3 Grid Search with more than one estimator 7 Speed up GridSearchCV using parallel processing 8 Parameter Grid 9 Conclusion 1 Introduction Grid Search is the process of performing hyperparameter tuning in order to determine the optimal values for a given model.Introduction to Logistic Regression
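The exhaustive search over parameter combinations described here can be sketched with scikit-learn's GridSearchCV; the parameter grid and synthetic data below are illustrative, not the post's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

After fitting, `grid.best_estimator_` is the refitted model with the winning parameters, ready for predictions.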
/2019/10/31/introduction-to-logistic-regression/
Thu, 31 Oct 2019 00:00:00 +0000/2019/10/31/introduction-to-logistic-regression/1 Introduction 2 Loading the libraries and the data 3 Descriptive statistics 3.1 Mean values of the features 3.2 Description of the target variable 3.3 Description of the predictor variables 4 Data pre-processing 4.1 Conversion of the target variable 4.2 Creation of dummy variables 4.3 Feature Selection 5 Logistic Regression with the statsmodel library 6 Logistic Regression with scikit-learn 6.1 Over-sampling using SMOTE 6.Roadmap for Regression Analysis
/2019/10/14/roadmap-for-regression-analysis/
Mon, 14 Oct 2019 00:00:00 +0000/2019/10/14/roadmap-for-regression-analysis/1 Introduction 2 Roadmap for Regression Analysis 3 Different types of regression models 4 Further Regression Algorithms 5 Regression with Neural Networks 6 Metrics for Regression Analysis 7 Conclusion 1 Introduction In my most recent publications, I have dealt extensively with individual topics in the field of regression analysis. This post should serve as a summary of the topics covered.
2 Roadmap for Regression Analysis Here are the links to the individual topics.Embedded methods
/2019/10/08/embedded-methods/
Tue, 08 Oct 2019 00:00:00 +0000/2019/10/08/embedded-methods/1 Introduction 2 Loading the libraries and the data 3 Embedded methods 3.1 Ridge Regression 3.2 Lasso Regression 3.3 Elastic Net 4 Grid Search 4.1 Grid for Ridge 4.2 Grid for embedded methods 5 Conclusion 1 Introduction Image Source: “Analytics Vidhya”
Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract those features which contribute the most to the training in a particular iteration.Wrapper methods
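This "selection during training" behaviour can be sketched with Lasso regression, one of the embedded methods from the table of contents above; the synthetic data is illustrative, not the post's dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features, but only 3 of them actually carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

# The L1 penalty shrinks uninformative coefficients all the way to zero,
# so feature selection is "embedded" in the model fit itself
lasso = Lasso(alpha=1.0).fit(X, y)
n_kept = int(np.sum(lasso.coef_ != 0))

print(lasso.coef_.round(2))
print(n_kept)
```

Ridge regression (L2) shrinks coefficients without zeroing them, and Elastic Net mixes both penalties; the alpha values are exactly what the post's Grid Search chapters tune.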
/2019/09/27/wrapper-methods/
Fri, 27 Sep 2019 00:00:00 +0000/2019/09/27/wrapper-methods/1 Introduction 2 Loading the libraries and the data 3 Wrap up: Filter methods 4 Wrapper methods 4.1 Data Preparation 4.1.1 Check for missing values 4.1.2 Removing highly correlated features 4.2 Syntax for wrapper methods 4.2.1 SelectKBest 4.2.2 Forward Feature Selection 4.2.3 Backward Elimination 4.2.4 Recursive Feature Elimination (RFE) 5 Conclusion 1 Introduction Feature selection is pretty important in machine learning primarily because it serves as a fundamental technique to direct the use of variables to what’s most efficient and effective for a given machine learning system.Check for normal distribution
/2019/09/13/check-for-normal-distribution/
Fri, 13 Sep 2019 00:00:00 +0000/2019/09/13/check-for-normal-distribution/1 Introduction 2 Loading the libraries 3 Visual Normality Checks 3.1 Quantile-Quantile Plot 3.2 Histogram Plot 4 Statistical Normality Tests 4.1 Shapiro-Wilk Test 4.2 D’Agostino’s K² Test 5 Conclusion 1 Introduction In my previous “post” the question came up of how to check its data on normal distribution. There are several possibilities for this.
2 Loading the libraries import pandas as pd import numpy as np import pylab import scipy.Feature Scaling with Scikit-Learn
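The statistical normality tests listed above can be sketched with scipy; the two synthetic samples below (one normal, one deliberately skewed) are illustrative stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0, scale=1, size=500)
skewed_data = rng.exponential(scale=1, size=500)

# Shapiro-Wilk: H0 = "the sample comes from a normal distribution";
# a small p-value (e.g. < 0.05) rejects normality
stat_n, p_normal = stats.shapiro(normal_data)
stat_s, p_skewed = stats.shapiro(skewed_data)

print(p_normal, p_skewed)
```

D'Agostino's K² test works the same way via `stats.normaltest`, combining skewness and kurtosis into a single statistic.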
/2019/08/31/feature-scaling-with-scikit-learn/
Sat, 31 Aug 2019 00:00:00 +0000/2019/08/31/feature-scaling-with-scikit-learn/1 Introduction 2 Loading the libraries 3 Scaling methods 3.1 Standard Scaler 3.2 Min-Max Scaler 3.3 Robust Scaler 3.4 Comparison of the previously shown scaling methods 4 Inverse Transformation 5 Export Scaler to use in another program 6 Feature Scaling in practice 7 Normalize or Standardize? 8 Conclusion 1 Introduction Feature scaling can be an important part for many machine learning algorithms. It’s a step of data pre-processing which is applied to independent variables or features of data.Dealing with outliers
/2019/08/20/dealing-with-outliers/
Tue, 20 Aug 2019 00:00:00 +0000/2019/08/20/dealing-with-outliers/1 Introduction 2 Loading the libraries 3 Boxplots - Method 4 Z-score method 5 IQR method 5.1 Detect outlier for column ‘age’ 5.2 Detect outlier for column ‘salary’ 5.3 Remove outlier from dataframe 6 Conclusion 1 Introduction Next to “highly correlated” and “constant” features outlier detection is also a central element of data pre-processing.
In statistics, outliers are data points that do not belong to any particular population.Dealing with constant and duplicate features
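The IQR method from the table of contents above can be sketched on a toy salary column (the values are illustrative, not the post's data):

```python
import pandas as pd

# Toy salary column with one obvious outlier
df = pd.DataFrame({"salary": [35, 40, 42, 45, 48, 50, 52, 55, 60, 500]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]
cleaned = df[(df["salary"] >= lower) & (df["salary"] <= upper)]
print(outliers)
```

The Z-score method works analogously, flagging values more than roughly three standard deviations from the mean; it assumes approximately normal data, while the IQR rule does not.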
/2019/08/09/dealing-with-constant-and-duplicate-features/
Fri, 09 Aug 2019 00:00:00 +0000/2019/08/09/dealing-with-constant-and-duplicate-features/1 Introduction 2 Loading the libraries and the data 3 Removing Constant features 4 Removing Quasi-Constant features 5 Removing Duplicate Features 6 Conclusion 1 Introduction In addition to “removing highly correlated features” as one of the data pre processing steps we also have to take care of constant and duplicate features. Constant features have a variance close to zero and duplicate features are too similar to other variables in the record.Dealing with highly correlated features
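Both checks described here, near-zero variance and duplicated columns, can be sketched on a toy frame (the column names are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],   # zero variance, carries no information
    "useful":   [3, 7, 1, 9, 4],
    "dup":      [3, 7, 1, 9, 4],   # exact duplicate of "useful"
})

# 1) Constant features: drop everything with variance <= threshold
selector = VarianceThreshold(threshold=0)
selector.fit(df)
non_constant = df.columns[selector.get_support()].tolist()

# 2) Duplicate features: transpose so columns become rows, then deduplicate
deduped = df[non_constant].T.drop_duplicates().T
print(non_constant, list(deduped.columns))
```

Raising the threshold slightly above zero extends the same idea to quasi-constant features.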
/2019/07/28/dealing-with-highly-correlated-features/
Sun, 28 Jul 2019 00:00:00 +0000/2019/07/28/dealing-with-highly-correlated-features/1 Introduction 2 Loading the libraries and the data 3 Preparation 4 Correlations with the output variable 5 Identification of highly correlated features 6 Removing highly correlated features 6.1 Selecting numerical variables 6.2 Train / Test Split 7 Conclusion 1 Introduction One of the points to remember about data pre-processing for regression analysis is multicollinearity. This post is about finding highly correlated predictors within a dataframe.Further Regression Algorithms
/2019/07/24/further-regression-algorithms/
Wed, 24 Jul 2019 00:00:00 +0000/2019/07/24/further-regression-algorithms/1 Introduction 2 Loading the libraries and the data 3 Linear Regression 4 Decision Tree Regression 5 Support Vector Machines Regression 6 Stochastic Gradient Descent (SGD) Regression 7 KNN Regression 8 Ensemble Modeling 8.1 Bagging Regressor 8.2 Bagging Regressor with Decision Tree Reg as base_estimator 8.3 Random Forest Regressor 8.4 AdaBoost Regressor 8.5 AdaBoost Regressor with Decision Tree Reg as base_estimator 8.6 Gradient Boosting Regressor 8.Non-linear regression analysis
/2019/07/14/non-linear-regression-analysis/
Sun, 14 Jul 2019 00:00:00 +0000/2019/07/14/non-linear-regression-analysis/1 Introduction 2 Loading the libraries and the data 3 Data Preparation 4 Hypothesis: a non-linear relationship between the variables mpg and horsepower 5 Linear model 6 Non linear models 6.1 Quadratic Function 6.2 Exponential Function 6.3 Logarithm Function 6.4 Polynomials Function 7 Conclusion Source 1 Introduction In my previous post “Introduction to regression analysis and predictions” I showed how to create linear regression models.statsmodel.formula.api vs statsmodel.api
/2019/07/02/statsmodel-formula-api-vs-statsmodel-api/
Tue, 02 Jul 2019 00:00:00 +0000/2019/07/02/statsmodel-formula-api-vs-statsmodel-api/1 Introduction 2 Loading the libraries and the data 3 The statsmodel.formula.api 4 The statsmodel.api 5 Conclusion 1 Introduction Image Source: “Statsmodels.org”
In my post “Introduction to regression analysis and predictions” I used the statsmodel library to identify significant features influencing the property price. In this publication I would like to show the difference between the statsmodel.formula.api (smf) and the statsmodel.api (sm).
For this post the dataset House Sales in King County, USA from the statistics platform “Kaggle” was used.Metrics for Regression Analysis
/2019/06/30/metrics-for-regression-analysis/
Sun, 30 Jun 2019 00:00:00 +0000/2019/06/30/metrics-for-regression-analysis/1 Introduction 2 Loading the libraries and the data 3 Data pre-processing 3.1 Train-Test Split 3.2 Scaling 4 Model fitting 5 Model Evaluation 5.1 R² 5.2 Mean Absolute Error (MAE) 5.3 Mean Squared Error (MSE) 5.4 Root Mean Squared Error (RMSE) 5.5 Mean Absolute Percentage Error (MAPE) 5.6 Summary of the Metrics 6 Conclusion 1 Introduction In my post Introduction to regression analysis and predictions I showed how to build regression models and also used evaluation metrics under chapter 4.Introduction to regression analysis and predictions
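The metrics in the table of contents above can all be computed in a few lines of NumPy; the true values and predictions below are invented for illustration:

```python
import numpy as np

# hypothetical true values and model predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

mae  = np.mean(np.abs(y_true - y_pred))                   # Mean Absolute Error
mse  = np.mean((y_true - y_pred) ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                                       # Root Mean Squared Error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # MAPE, in percent
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```

The same values come out of `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`), which is what the post uses.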
/2019/06/28/introduction-to-regression-analysis-and-predictions/
Fri, 28 Jun 2019 00:00:00 +0000/2019/06/28/introduction-to-regression-analysis-and-predictions/1 Introduction 2 Loading the libraries and the data 3 Implementing linear regression with the statsmodel library 3.1 Simple linear Regression 3.2 Multiple Regression 3.3 Model validation 4 Linear Regression with scikit-learn 5 Conclusion Source 1 Introduction Regression analyses are very common and should therefore be mastered by every data scientist.
For this post the dataset House Sales in King County, USA from the statistics platform “Kaggle” was used.Types of Encoder
/2019/06/16/types-of-encoder/
Sun, 16 Jun 2019 00:00:00 +0000/2019/06/16/types-of-encoder/1 Introduction 2 Loading the libraries and the data 3 Encoder for predictor variables 3.1 One Hot Encoder 3.1.1 via scikit-learn 3.1.2 via pandas 3.2 Ordinal Encoder 3.3 MultiLabelBinarizer 4 Encoder for target variables 4.1 Label Binarizer 4.2 Label Encoding 5 Inverse Transformation 6 Export Encoder to use in another program 7 Conclusion 1 Introduction As mentioned in my previous “post”, before you can start modeling, a lot of preparatory work is often necessary when preparing the data.The use of dummy variables
/2019/06/14/the-use-of-dummy-variables/
Fri, 14 Jun 2019 00:00:00 +0000/2019/06/14/the-use-of-dummy-variables/1 Introduction 2 Loading the libraries and the data 3 Preparation of the dataframe 4 How to create dummy variables 5 Use dummy variables in a regression analysis 6 Dummy variables with more than two characteristics 7 How to deal with multiple categorical features in a dataset 8 Conclusion 1 Introduction In a nutshell: a dummy variable is a numeric variable that represents categorical data.The use of the groupby function
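The core idea can be sketched with `pd.get_dummies`; the `gender` column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})

# drop_first avoids the dummy-variable trap (perfect multicollinearity
# between the dummies and the intercept in a regression)
dummies = pd.get_dummies(df['gender'], prefix='gender', drop_first=True)
df = pd.concat([df, dummies], axis=1)
```

With `drop_first=True`, a feature with k categories yields k − 1 dummy columns, which is exactly what a regression model needs.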
/2019/05/30/the-use-of-the-groupby-function/
Thu, 30 May 2019 00:00:00 +0000/2019/05/30/the-use-of-the-groupby-function/1 Introduction 2 Loading the libraries and the data 3 Group by 3.1 with size 3.2 with count 3.2.1 Count Non - Zero Observations 3.3 with sum 3.4 with nunique 3.5 with mean 3.6 with agg. 4 Convert the group_by output to a dataframe 5 Conclusion 1 Introduction Groupby is one of the most frequently used functions in data analysis. Therefore, it is worth taking a closer look at how it works.Random sampling
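Several of the variants above (sum, mean, size) can be combined in one `agg` call; the region/sales frame is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({'region': ['A', 'A', 'B', 'B', 'B'],
                   'sales': [10, 20, 5, 15, 25]})

# named aggregation: one output column per (input column, function) pair;
# reset_index converts the group-by output back to a plain dataframe
by_region = df.groupby('region').agg(
    total=('sales', 'sum'),
    avg=('sales', 'mean'),
    n=('sales', 'size'),
).reset_index()
```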
/2019/05/16/random-sampling/
Thu, 16 May 2019 00:00:00 +0000/2019/05/16/random-sampling/1 Introduction 2 Preparation 3 Split-Methods 3.1 Customer Churn Model 3.2 Train-Test Split via scikit-learn 4 Train-Test-Validation Split 5 Conclusion 1 Introduction Splitting the dataset into a training and a testing part is one operation every Data Scientist has to perform before applying any models. The training dataset is the one on which the model is built, and the testing dataset is used to check the accuracy of the model.Handling long name spaces
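The idea behind a random train-test split — which `sklearn.model_selection.train_test_split` implements for you — can be sketched by hand; the tiny frame and the 80/20 cut are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10), 'y': range(10)})

# shuffle the row positions reproducibly, then cut at 80 %
rng = np.random.default_rng(42)
idx = rng.permutation(len(df))
cut = int(len(df) * 0.8)

train = df.iloc[idx[:cut]]
test = df.iloc[idx[cut:]]
```

Shuffling before cutting is what makes the split random; a fixed seed makes it reproducible, mirroring `random_state` in scikit-learn.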
/2019/05/14/handling-long-name-spaces/
Tue, 14 May 2019 00:00:00 +0000/2019/05/14/handling-long-name-spaces/1 Introduction 2 Import the libraries 3 Generate a customized DataFrame 4 Handling long name spaces 5 Conclusion 1 Introduction Provided data sets sometimes have very long names. Of course, you can rename the data sets and column names used, but sometimes it is necessary to keep meaningful names, even if they have more letters or characters.
In Python, if a line of code gets too long, you usually have the option to continue it on the next line.Safe tables and images to disc
/2019/05/13/safe-tables-and-images-to-disc/
Mon, 13 May 2019 00:00:00 +0000/2019/05/13/safe-tables-and-images-to-disc/1 Introduction 2 Import the libraries 3 Definition of required functions 4 Create a folder and a customized DataFrame 5 Safe tables and images to disc 5.1 Safe tables to disc 5.2 Safe images to disc 6 Conclusion 1 Introduction Often Python is used to create reports. Since most managers like to have the analysis results and graphics presented in PowerPoint or similar, it is important to know how to extract tables and images accordingly.How to create artificial datasets
/2019/05/10/how-to-create-artificial-datasets/
Fri, 10 May 2019 00:00:00 +0000/2019/05/10/how-to-create-artificial-datasets/1 Introduction 2 Import the libraries 3 Definition of required functions 4 Simulated Data 4.1 Make Simulated Data For Regression 4.2 Make Simulated Data For Classification 4.3 Make Simulated Data For Clustering 5 Customized dataset 5.1 Insert a new row to pandas dataframe 5.1.1 In the first place 5.1.2 In the last place 5.1.3 With a defined function 5.1.4 With the append function 5.NumPy. An intuition.
/2019/05/07/numpy-an-intuition/
Tue, 07 May 2019 00:00:00 +0000/2019/05/07/numpy-an-intuition/1 Introduction 2 Attributes of NumPy Arrays 3 Indexing of Arrays 3.1 Access to individual elements 3.2 via Slicing 3.3 Multidimensional subsets of an Array 4 Reshape 5 Concatenate Arrays 6 Split Arrays 7 UFuncs 7.1 Array Arithmetic 7.2 Exponential function 7.3 Logarithm 7.4 Comparison operators 8 Aggregation 8.1 Multi-dimensional aggregation 9 Timing of functions 10 Conclusion 1 Introduction NumPy is a Python library that makes it easy to handle vectors, matrices, and large multidimensional arrays in general.How to use Pandas set_option()
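A few of the operations listed above — reshaping, slicing, ufuncs, and axis-wise aggregation — fit in one short sketch:

```python
import numpy as np

a = np.arange(12)       # 1-D array containing 0..11
m = a.reshape(3, 4)     # reshape into 3 rows x 4 columns

# slicing: first two rows, last two columns
sub = m[:2, 2:]

# ufuncs operate elementwise; aggregation can work per axis
doubled = m * 2
col_sums = m.sum(axis=0)
```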
/2019/05/02/how-to-use-pandas-set-option/
Thu, 02 May 2019 00:00:00 +0000/2019/05/02/how-to-use-pandas-set-option/1 Introduction 2 The use of pandas set_option() 2.1 to determine max_rows 2.2 to determine max_columns 2.3 to determine text length 2.4 to determine float_format 3 Conclusion 1 Introduction In my previous post “How to suppress scientific notation in Pandas” I showed how to use the set_option function of pandas to convert scientifically written numbers into more readable ones. I have taken this as an opportunity to introduce further possibilities of the set_option function here.How to suppress scientific notation in Pandas
/2019/04/28/how-to-suppress-scientific-notation-in-pandas/
Sun, 28 Apr 2019 00:00:00 +0000/2019/04/28/how-to-suppress-scientific-notation-in-pandas/1 Introduction 2 Scientific notations 3 Import the libraries 4 Display Values as Strings 5 Functions 5.1 Use round() 5.2 Use apply() 5.3 Use set_option() 6 Conclusion 1 Introduction Scientific notation isn’t helpful when you are trying to make quick comparisons across your dataset. However, Pandas will use scientific notation by default when the data type is a float. In this post I want to show how to get around this problem.Pivot Tables with Python
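Two of the workarounds listed above, sketched on a made-up column of very small and very large floats:

```python
import pandas as pd

df = pd.DataFrame({'value': [0.0000012, 4560000.0]})

# option 1 (apply): format a single column as strings -- display only,
# the underlying float data is unchanged
df['pretty'] = df['value'].apply(lambda x: f'{x:.7f}')

# option 2 (set_option): change the global display format for all floats
pd.set_option('display.float_format', '{:.7f}'.format)
```

`display.float_format` affects how every float is printed from then on; use `pd.reset_option('display.float_format')` to restore the default.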
/2019/04/24/pivot-tables-with-python/
Wed, 24 Apr 2019 00:00:00 +0000/2019/04/24/pivot-tables-with-python/1 Introduction 2 Getting an overview of our data 3 Categorizing the data by Year and Region 4 Creating a multi-index pivot table 5 Manipulating the data using aggfunc 6 Applying a custom function to remove outlier 7 Categorizing using string manipulation 8 Conclusion 1 Introduction Many people like to work with pivot tables in Excel. This possibility also exists in Python.
For this post the dataset WorldHappinessReport from the statistics platform “Kaggle” was used.Reshape a pandas DataFrame
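The multi-index pivot from the table of contents can be sketched on a tiny invented frame (the Year/Region/Score columns loosely mirror the happiness-report data):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2015, 2015, 2016, 2016],
                   'Region': ['EU', 'EU', 'EU', 'US'],
                   'Score': [7.0, 6.0, 6.5, 7.5]})

# multi-index pivot table: mean score per (Year, Region) combination;
# swap aggfunc for 'sum', 'count', or a custom function as needed
pt = pd.pivot_table(df, values='Score', index=['Year', 'Region'], aggfunc='mean')
```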
/2019/04/20/reshape-a-pandas-dataframe/
Sat, 20 Apr 2019 00:00:00 +0000/2019/04/20/reshape-a-pandas-dataframe/1 Introduction 2 Import the libraries 3 Import the data 4 Reshape a pandas DataFrame 4.1 stack() 4.1.1 Application example 4.2 melt() 4.2.1 Application example 5 Comparison of stack() and melt() 6 Conclusion 1 Introduction After merging data (data management), we now come to the topic of how to reshape DataFrames.
2 Import the libraries import pandas as pd import matplotlib.Data Management
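The two reshaping routes compared in the post, `melt()` and `stack()`, sketched on a hypothetical wide frame of grades:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Ben'],
                   'math': [1, 2],
                   'biology': [3, 4]})

# melt(): wide to long -- one row per (name, subject) pair
long_df = df.melt(id_vars='name', var_name='subject', value_name='grade')

# stack() achieves something similar by moving the columns into the index
stacked = df.set_index('name').stack()
```

`melt()` returns a flat DataFrame with freely named columns, while `stack()` returns a MultiIndex Series — that difference is essentially the comparison in section 5.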
/2019/04/16/data-management/
Tue, 16 Apr 2019 00:00:00 +0000/2019/04/16/data-management/1 Introduction 2 Loading the Libraries and the Data 3 Pandas concat-Function 3.1 Concat along rows 3.2 Concat along columns 4 Types of Joins 4.1 Inner Join 4.2 Left Join 4.2.1 left_on & right_on 4.2.2 Missing Keys 4.3 Right Join 4.4 Outer Join 4.5 Left Excluding Join 4.6 Right Excluding Join 4.7 Outer Excluding Join 4.8 Warning 5 Merge multiple data frames 5.Python's Pipe - Operator
/2019/04/04/python-s-pipe-operator/
Thu, 04 Apr 2019 00:00:00 +0000/2019/04/04/python-s-pipe-operator/1 Introduction 2 Python’s Pipe - Operator like R’s %>% 2.1 Filter and select 2.2 Multiple filter and select 2.3 Sample and sort 2.4 Multiple group by and summarize 2.5 Group by and multiple summarize 3 Conclusion 1 Introduction Anyone who has ever worked with R probably knows the very useful pipe operator %>%. Python also has a similar one, which will be presented in different variants below.String Manipulation. An intuition.
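A filter-then-summarize chain in the `%>%` style can be built with pandas' `pipe()`; the group/value frame is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# chain the steps with pipe(), similar in spirit to R's %>%
result = (df
          .pipe(lambda d: d[d['value'] > 1])                               # filter
          .pipe(lambda d: d.groupby('group', as_index=False)['value'].sum()))  # summarize
```

Each `pipe()` step receives the previous result as its argument, so the chain reads top to bottom like an R pipeline.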
/2019/03/27/string-manipulation-an-intuition/
Wed, 27 Mar 2019 00:00:00 +0000/2019/03/27/string-manipulation-an-intuition/1 Introduction 2 Separate 2.1 via map - function 2.2 via string function 3 Unite 3.1 two columns 3.2 three and more columns 4 add_prefix 5 add_suffix 6 Conclusion 1 Introduction It happens again and again that text variables are filled in a way that is unfavorable for the planned analysis and therefore have to be changed. Here are some useful built-in methods for string manipulation in Python.Dealing with missing values
/2019/03/18/dealing-with-missing-values/
Mon, 18 Mar 2019 00:00:00 +0000/2019/03/18/dealing-with-missing-values/1 Introduction 2 Loading the Libraries and the Data 3 Checking for missing values 4 Dropping of Missing Values 5 Imputations 5.1 for NUMERIC Features 5.1.1 Replace np.NaN with specific values 5.1.2 Replace np.NaN with MEAN 5.1.3 Replace np.NaN with MEDIAN 5.1.4 Replace np.NaN with most_frequent 5.2 for CATEGORICAL Features 5.2.1 Replace np.NaN with most_frequent 5.2.2 Replace np.NaN with specific values 5.3 for specific Values 5.Data Manipulation
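The two most common imputations from the list above — mean for numeric features, most frequent value for categorical ones — sketched on an invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, np.nan, 35.0, 40.0],
                   'city': ['Rome', 'Paris', np.nan, 'Paris']})

# numeric feature: replace NaN with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

# categorical feature: replace NaN with the most frequent value (the mode)
df['city'] = df['city'].fillna(df['city'].mode()[0])
```

Swapping `mean()` for `median()` or a fixed value covers the other imputation variants the post walks through.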
/2019/03/12/data-manipulation/
Tue, 12 Mar 2019 00:00:00 +0000/2019/03/12/data-manipulation/1 Introduction 2 Index 2.1 Resetting index 2.2 Resetting multiindex 2.3 Setting index 3 Modifying Columns 3.1 Rename Columns 3.1.1 add_prefix 3.3 Add columns 3.4 Drop and Delete Columns 3.5 Insert Columns 3.6 Rearrange Columns 4 Modifying Rows 4.1 Round each column 4.2 Round columns differently within a df 4.3 Drop Duplicates 5 Replacing Values 5.1 One by One 5.Data type conversion
/2019/03/10/data-type-conversion/
Sun, 10 Mar 2019 00:00:00 +0000/2019/03/10/data-type-conversion/1 Introduction 2 Loading the libraries and the data 3 Overview of the existing data types 4 Type Conversion 4.1 Conversion of a single variable 4.1.1 float64 to float32 4.1.2 float to int 4.1.3 object to numeric (float and int) 5 Conversion of multiple variables 6 Conversion of date and time variables 7 Conclusion 1 Introduction It will always happen that you have an incorrect or unsuitable data type and you have to change it.Add new columns
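Two conversions from the table of contents above — object to numeric and float64 down to float32 — sketched on a made-up price column:

```python
import pandas as pd

df = pd.DataFrame({'price': ['19.99', '5.00', 'n/a']})

# object -> numeric; errors='coerce' turns unparseable strings into NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# downcast float64 to float32 to halve the memory footprint
df['price'] = df['price'].astype('float32')
```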
/2019/03/06/add-new-columns/
Wed, 06 Mar 2019 00:00:00 +0000/2019/03/06/add-new-columns/1 Introduction 2 Normal Calculation 3 If-else statements 4 Multiple If-else statements 4.1 with conditional output values 4.2 with conditional calculation 5 Row Sum 6 With a defined list 7 Conclusion 1 Introduction There are several ways to generate new variables in Python. Below the most common methods will be shown.
For this post the dataset flight from the statistics platform “Kaggle” was used.Selection of columns per data type
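The if-else patterns above can be vectorized with NumPy; the delay column and the category thresholds are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'delay': [-5, 0, 12, 45]})

# simple if-else per row
df['late'] = np.where(df['delay'] > 0, 'yes', 'no')

# multiple if-else: np.select picks the first condition that matches
conditions = [df['delay'] <= 0, df['delay'] <= 15]
df['category'] = np.select(conditions, ['on time', 'slightly late'],
                           default='very late')
```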
/2019/03/04/selection-of-columns-per-data-type/
Mon, 04 Mar 2019 00:00:00 +0000/2019/03/04/selection-of-columns-per-data-type/1 Introduction 2 Loading the libraries and the data 3 Selection of numeric variables 4 Selection of categorical variables 5 Conclusion 1 Introduction In some situations it is necessary to select all columns of a certain data type. For example if you want to convert all categorical variables into dummy variables in order to be able to calculate a regression.
For this post the dataset Bank Data from the platform “UCI Machine Learning repository” was used.Data Wrangling
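The selection itself is a one-liner per data type with `select_dtypes`; the columns below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'age': [30, 40],
                   'income': [50.0, 60.0],
                   'job': ['admin', 'technician']})

# numeric vs. categorical (object-dtype) columns
num_cols = df.select_dtypes(include='number').columns.tolist()
cat_cols = df.select_dtypes(include='object').columns.tolist()
```

`df[cat_cols]` is then exactly the subset you would hand to `pd.get_dummies` before a regression.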
/2019/03/03/data-wrangling/
Sun, 03 Mar 2019 00:00:00 +0000/2019/03/03/data-wrangling/1 Introduction 2 Loading the libraries and the data 3 Overview of the data 4 Get some statistics 5 Select data 5.1 Easy Selection 5.2 Conditional Selection 5.3 Set option 6 Dropping Values 6.1 Dropping Columns 6.2 Dropping NaN Values 6.3 NaN Values vs. Null Values 7 Filtering Values 7.1 Filter with Lists 7.2 Exclude certain values 8 Working with Lists 8.Read and write to files
/2019/03/01/read-and-write-to-files/
Fri, 01 Mar 2019 00:00:00 +0000/2019/03/01/read-and-write-to-files/1 Introduction 2 Loading the libraries 3 Reading Files 3.1 Reading csv-files 3.1.1 From a Local Source 3.1.2 From GitHub directly 3.2 Reading json files 3.3 Read text files 3.3.1 with a for loop 3.3.2 with read_csv 3.3.2.1 Convert epoch time to DateTime 4 Writing files 4.1 Write to csv 4.2 Write to excel 4.2.1 Writing multiple DFs to one Excel File 5 How to read further data types 6 Conclusion 1 Introduction One function you always need when working with data is importing the records you want to analyze.Getting Started with Anaconda
/2019/01/05/getting-started-with-anaconda/
Sat, 05 Jan 2019 00:00:00 +0000/2019/01/05/getting-started-with-anaconda/1 Introduction 2 Install Anaconda 3 Anaconda Navigator 3.1 What are environments? 3.2 Create a new Environment 3.3 Import Environments 4 Anaconda Powershell Prompt 4.1 Exploring your Environments 4.2 Import Environments 4.3 Adding Libraries 4.3.1 Pip vs Conda 4.3.2 Best Practice 4.3.3 Add Packages via conda & pip 4.4 Installing a specific Version 4.5 Updating Libraries 4.6 Deleting Libraries 4.7 Exporting an existing Environment 4.Tag Archive
/2019/01/01/tag-archive/
Tue, 01 Jan 2019 00:00:00 +0000/2019/01/01/tag-archive/1 Roadmaps 2 Data Science 3 Data pre-processing 3.1 General 3.2 for Regression 3.3 for Classification 4 Machine Learning 4.1 Regression Algorithms 4.2 Classification Algorithms 4.3 Cluster Algorithms 4.4 Dimensionality Reduction Algorithms 5 Some Python Stuff 6 Some Intuitions 7 Analytics Fields 7.1 Marketing Analytics 7.2 Recommendation Systems 8 ETL 9 Time Series Analysis 10 Computer Vision 11 Neural Networks 12 Natural Language Processing (NLP) 13 AutoML IntroductionAbout
/about/
Thu, 05 May 2016 21:48:51 -0700/about/Python and R enthusiast!
Always hungry for more and deeper knowledge.
If you have any questions or comments, just text me: FuchsMichaelAndi1989@gmail.com