Exploratory Data Analysis (EDA) – Credit Card Fraud Detection Case Study

This article was published as a part of the Data Science Blogathon.

Credit card fraud causes large financial losses every year. In response, the financial industry has switched from a posterior investigation approach to an a priori predictive approach, designing fraud detection algorithms to warn and assist fraud investigators.

This case study aims to give you an idea of how to apply Exploratory Data Analysis (EDA) in a real business scenario. Apart from applying various EDA techniques, you will also develop a basic understanding of risk analytics and see how data can be used to minimise the risk of losing money while lending to customers.

Business Problem Understanding

Loan providers find it hard to lend to people with an inadequate or missing credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company that specialises in lending various types of loans to customers. You must use Exploratory Data Analysis (EDA) to analyse the patterns present in the data, so that applicants capable of repaying are not rejected.

When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business for the company.
  • If the applicant is not likely to repay the loan, i.e. he/she is likely to default/fraud, then approving the loan may lead to a financial loss for the company.

The data contains information about the loan application.

When a client applies for a loan, there are four types of decisions that could be taken by the bank/company:

  • Unused offer: The loan has been cancelled by the applicant but at different stages of the process.

In this case study, you will use Exploratory Data Analysis (EDA) to understand how consumer attributes and loan attributes impact the tendency to default.

Business Goal

This case study aims to identify patterns that indicate if an applicant will repay their instalments which may be used for taking further actions such as denying the loan, reducing the amount of loan, lending at a higher interest rate, etc. This will make sure that the applicants capable of repaying the loan are not rejected. Recognition of such aspirants using Exploratory Data Analysis (EDA) techniques is the main focus of this case study.

You can get access to data here.

 Importing Necessary Packages 

Here, we will use two datasets for our analysis as follows,

  • application_data.csv as df1
  • previous_application.csv as df2

Let’s start by reading those files, beginning with df1.

Data Inspection

Before starting the Exploratory Data Analysis (EDA) procedures, we will inspect the data.

Here, passing verbose=True prints the information for every column. Try it and see the results.

With describe(), you get summary statistics for the numeric columns, which gives you an idea of their distribution and outliers.
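
The inspection steps described above could look roughly like the sketch below. It assumes pandas and uses a tiny made-up DataFrame in place of application_data.csv; the values are purely illustrative, and the author's original code is not shown in this article.

```python
import pandas as pd

# Toy stand-in for application_data.csv (made-up values, for illustration only).
df1 = pd.DataFrame({
    "TARGET": [0, 1, 0, 0],
    "AMT_INCOME_TOTAL": [135000.0, 202500.0, 67500.0, 1170000.0],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Working", "State servant"],
})

print(df1.shape)          # (number of rows, number of columns)
df1.info(verbose=True)    # dtype and non-null count for every column
summary = df1.describe()  # count, mean, std, min, quartiles and max of numeric columns
print(summary)
```

Note that describe() skips the text column by default, so NAME_INCOME_TYPE does not appear in the summary.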

Handling Null Values

After inspecting the data, let’s check for null values.

As you can see we are getting lots of null values. Let’s analyse this further.

In theory, 25 to 30% is the maximum share of missing values allowed, beyond which we might want to drop the variable from the analysis. In practice, we sometimes get variables with ~50% missing values that the customer still insists on keeping for analysis; in those cases, we have to treat them accordingly. Here, after observing the columns, we will remove those with more than 35% null values.

Let’s check how many columns have more than 35% null values, and remove them.
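
As a minimal sketch of this step, assuming pandas and a small made-up DataFrame (the category names and values are illustrative), the percentage of nulls per column can be computed and the columns above the 35% threshold dropped like this:

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; only the share of missing values matters here.
df = pd.DataFrame({
    "AMT_ANNUITY": [100.0, np.nan, 300.0, 400.0, 500.0],      # 20% missing
    "EXT_SOURCE_3": [0.5, np.nan, np.nan, 0.7, 0.2],          # 40% missing
    "OCCUPATION_TYPE": ["Laborers", None, "Drivers", "Managers", "Laborers"],  # 20% missing
})

# Percentage of null values per column.
null_pct = df.isnull().mean() * 100
print(null_pct.sort_values(ascending=False))

# Columns above the 35% threshold used in this case study.
cols_to_drop = null_pct[null_pct > 35].index.tolist()
df = df.drop(columns=cols_to_drop)
```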

After removing those columns, check the percentage of null values for each column again.

Let’s handle these missing values by observing them.
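
The article does not show which statistic was used for each column, so the following is only one common approach, under the assumption that numeric columns are filled with the median (robust to outliers) and categorical columns with the mode. The data is made up.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing entry per column.
df = pd.DataFrame({
    "AMT_ANNUITY": [100.0, np.nan, 300.0, 10000.0],
    "NAME_INCOME_TYPE": ["Working", "Working", None, "Pensioner"],
})

# Assumption: median for numeric columns, most frequent value for categorical ones.
df["AMT_ANNUITY"] = df["AMT_ANNUITY"].fillna(df["AMT_ANNUITY"].median())
df["NAME_INCOME_TYPE"] = df["NAME_INCOME_TYPE"].fillna(df["NAME_INCOME_TYPE"].mode()[0])

print(df.isnull().sum())  # remaining nulls per column
```

The median is used here instead of the mean because a single extreme value (10000.0 in the toy data) would otherwise pull the imputed value upwards.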

Check the null values again after imputing.

We didn’t impute OCCUPATION_TYPE because it may contain some useful information, so imputing it with mean or median doesn’t make any sense.

We’ll impute OCCUPATION_TYPE later, after analyzing it.

If you observe the columns carefully, you will find that some columns contain an error. So let’s make some changes.

If you look at the data carefully, you will find that although these columns hold numbers of days, they contain negative values, which is not valid. So let’s make changes accordingly.

As you can see, all these columns start with DAYS, so let’s make a list of the columns we want to change.
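
A short sketch of this fix, assuming pandas and made-up values: collect the columns whose names start with DAYS and replace their values with absolute values.

```python
import pandas as pd

# Toy data (made-up values): day counts stored as negative numbers.
df = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765],
    "DAYS_EMPLOYED": [-637, -1188],
    "AMT_CREDIT": [406597.5, 1293502.5],
})

# Every column whose name starts with "DAYS".
days_cols = [col for col in df.columns if col.startswith("DAYS")]

# Take absolute values so the day counts become positive.
df[days_cols] = df[days_cols].abs()
print(df)
```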

Some columns contain Y/N values; let’s convert them to 1/0 for ease of understanding.
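
One way to do this conversion is with a dictionary mapping. The sketch below assumes that FLAG_OWN_CAR and FLAG_OWN_REALTY are the Y/N columns, which the article does not state explicitly; the values are made up.

```python
import pandas as pd

# Toy data; the choice of these two columns is an assumption.
df = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "N"],
    "FLAG_OWN_REALTY": ["Y", "Y", "N"],
})

# Map Y to 1 and N to 0 in each flag column.
for col in ["FLAG_OWN_CAR", "FLAG_OWN_REALTY"]:
    df[col] = df[col].map({"Y": 1, "N": 0})

print(df)
```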

Let’s check the distribution of the columns with categorical values. After checking all the columns, we find that some contain ‘XNA’ values, which mean null. Let’s impute them accordingly.
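
As an illustrative sketch (toy data, pandas and NumPy assumed), the 'XNA' placeholders can be counted per column and then converted into real nulls so that the usual null-handling tools see them:

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with 'XNA' placeholders.
df = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "XNA"],
    "ORGANIZATION_TYPE": ["XNA", "Medicine", "XNA"],
})

# Count the 'XNA' placeholders in each column.
xna_counts = (df == "XNA").sum()
print(xna_counts)

# Turn the placeholders into proper missing values.
df = df.replace("XNA", np.nan)
print(df.isnull().sum())
```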

Let’s impute these values. First, let’s check whether these values are missing at random or whether there is a pattern in the missing values. You can read more about this here.

Here we observe that null values in the ORGANIZATION_TYPE column occur only where NAME_INCOME_TYPE is Pensioner. Let’s look at the count of Pensioner and then decide whether to impute the null values of ORGANIZATION_TYPE with Pensioner or not.

  • From these data, we can conclude that the count of Pensioner is approximately equal to the number of null values in the ORGANIZATION_TYPE column. So the value is Missing At Random.
  • Similarly, we impute the null values of OCCUPATION_TYPE with Pensioner: comparing the null values of OCCUPATION_TYPE against the income type, we found that “Pensioner” is the most frequent value, accounting for almost 80% of the null values of OCCUPATION_TYPE.
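
The check described above can be sketched as follows, on made-up data: look at the income type of the rows where ORGANIZATION_TYPE is null, compare that with the overall Pensioner count, and impute only if the two line up.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): ORGANIZATION_TYPE is missing only for pensioners.
df = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Pensioner", "Working", "Pensioner", "State servant", "Working"],
    "ORGANIZATION_TYPE": [np.nan, "Medicine", np.nan, "Government", "Self-employed"],
})

# Income types of the rows where ORGANIZATION_TYPE is null.
missing_by_income = df.loc[df["ORGANIZATION_TYPE"].isnull(), "NAME_INCOME_TYPE"].value_counts()
print(missing_by_income)

# Total number of pensioners, for comparison.
pensioner_count = (df["NAME_INCOME_TYPE"] == "Pensioner").sum()
print(pensioner_count)

# The counts match, so the nulls are filled with "Pensioner".
df["ORGANIZATION_TYPE"] = df["ORGANIZATION_TYPE"].fillna("Pensioner")
```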

We have some columns which have nominal categorical values. So let’s impute them accordingly. You can read more about this here.

Let’s convert the ‘DAYS_BIRTH’ column to years and bin it into an “AGE_GROUP” column.
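
A possible sketch of the binning with pd.cut is shown below. The bin edges 19, 25, 35, 60 and 100 follow the age groups mentioned later in this article; the label "Young" for the 25-35 range is an assumption, since the article does not name that group. DAYS_BIRTH is assumed to be positive already, and the values are made up.

```python
import pandas as pd

# Toy data (made-up values); DAYS_BIRTH is assumed to be positive already.
df = pd.DataFrame({"DAYS_BIRTH": [8000, 11000, 16425, 23000]})

# Convert days to whole years.
df["AGE"] = df["DAYS_BIRTH"] // 365

# Bin edges follow the groups named in this article; "Young" is a hypothetical label.
bins = [19, 25, 35, 60, 100]
labels = ["Very Young", "Young", "Middle Age", "Senior Citizen"]
df["AGE_GROUP"] = pd.cut(df["AGE"], bins=bins, labels=labels)

print(df)
```

By default pd.cut uses right-closed intervals, so an age of exactly 19 would fall outside the first bin; passing include_lowest=True would keep it.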

Again check the datatypes for all the columns and change them accordingly.

By checking the data types we found the following columns to change their data types.

After observing all the columns, we found some columns which don’t add any value to our analysis, so simply drop them so that the data looks clear.

Outlier Analysis

Outlier detection is very important in any data science process. Sometimes removing outliers tends to improve our model, while at other times outliers may point you to a very different approach to your analysis.

So let’s make a list of all the numeric columns and plot boxplots to understand the outliers in the data.
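
A boxplot marks as outliers the points lying more than 1.5 times the IQR beyond the quartiles. The sketch below, on made-up data, builds the list of numeric columns and counts such points per column, which is the numeric counterpart of reading the boxplots. The helper iqr_outliers is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Toy data (made-up values) with one extreme value per numeric column.
df = pd.DataFrame({
    "CNT_CHILDREN": [0, 0, 1, 1, 2, 0, 1, 11],
    "AMT_INCOME_TOTAL": [90.0, 100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 5000.0],
    "NAME_INCOME_TYPE": ["Working"] * 8,
})

# List of all numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

def iqr_outliers(s):
    """Return the values of s lying beyond 1.5 * IQR from the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s[(s < lower) | (s > upper)]

outlier_counts = {col: len(iqr_outliers(df[col])) for col in numeric_cols}
print(outlier_counts)
```

With matplotlib installed, the plots themselves could be drawn with something like df[numeric_cols].plot(kind="box", subplots=True).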

You will get a 7×5 boxplot matrix. Let’s have a look at a very small portion.

Boxplot Matrix

Observe the plot and try to draw your own insights.

  • CNT_CHILDREN has outliers: clients with more than 5 children.
  • The IQR for AMT_INCOME_TOTAL is very slim and it has a large number of outliers.
  • The third quartile of AMT_CREDIT is larger than the first quartile, which means that most of the customers’ credit amounts lie in the third quartile. AMT_CREDIT also has a large number of outliers.
  • The third quartile of AMT_ANNUITY is slightly larger than the first quartile and there is a large number of outliers.
  • The third quartile of AMT_GOODS_PRICE, DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE is larger than the first quartile, and all have a large number of outliers.
  • The IQR for DAYS_EMPLOYED is very slim. Most of the outliers are below 25000, and one outlier is present at around 375000.
  • From the boxplot of CNT_FAM_MEMBERS, we can say that most of the clients have 4 family members. There are some outliers present.
  • DAYS_BIRTH, DAYS_ID_PUBLISH, EXT_SOURCE_2 and EXT_SOURCE_3 don’t have any outliers.
  • The boxplots for DAYS_EMPLOYED, OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE, DEF_60_CNT_SOCIAL_CIRCLE, AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT and AMT_REQ_CREDIT_BUREAU_YEAR are very slim and have a large number of outliers.
  • FLAG_OWN_CAR: it doesn’t have first and third quartiles and the values lie within the IQR, so we can conclude that most of the clients own a car.
  • FLAG_OWN_REALTY: it doesn’t have first and third quartiles and the values lie within the IQR, so we can conclude that most of the clients own a house/flat.

Before we start analysing our data, let’s check the data imbalance. It’s a very important step in any machine learning or deep learning process.

The imbalance ratio we get is 11.39.
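
The imbalance ratio is the number of non-defaulters divided by the number of defaulters. A minimal sketch on made-up counts (so the numbers differ from the 11.39 obtained on the real data):

```python
import pandas as pd

# Toy target column: 23 non-defaulters (0) and 2 defaulters (1), made-up counts.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

counts = target.value_counts()
imbalance_ratio = counts.loc[0] / counts.loc[1]

# Share of each class in percent.
pct = target.value_counts(normalize=True) * 100

print(round(imbalance_ratio, 2))
print(pct)
```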

Let’s check the distribution of the target variable visually using a pie chart.

  • The df1 dataframe, i.e. the application data, is highly imbalanced. The defaulted population is 8.1% and the non-defaulted population is 91.9%. The ratio is 11.39.

We will separately analyse the data based on the target variable for a better understanding.
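
Splitting on the target could be sketched like this, assuming pandas and made-up data. The names target0 and target1 are illustrative, not taken from the author's code.

```python
import pandas as pd

# Toy data (made-up values).
df1 = pd.DataFrame({
    "TARGET": [0, 1, 0, 0, 1],
    "CODE_GENDER": ["F", "M", "F", "M", "F"],
})

# Hypothetical names: one frame per target class.
target0 = df1[df1["TARGET"] == 0]   # non-defaulters
target1 = df1[df1["TARGET"] == 1]   # defaulters

# Gender split within each population, in percent.
gender_share_0 = target0["CODE_GENDER"].value_counts(normalize=True) * 100
gender_share_1 = target1["CODE_GENDER"].value_counts(normalize=True) * 100
print(gender_share_0)
print(gender_share_1)
```

Percentages computed this way describe the composition of each population (for example, the share of women among non-defaulters), not the default rate of each gender.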

Insight

  • It seems that more female clients than male clients applied for loans.
  • Among non-defaulters, 66.6% are female clients and 33.4% are male clients.
  • Among defaulters, 57% are female clients and 42% are male clients.

Age Distribution

  • The Middle Age (35-60) group seems to have applied for loans more than any other age group, among both Defaulters and Non-defaulters.
  • The Middle Age group also faces payment difficulties the most.
  • The Senior Citizens (60-100) and Very Young (19-25) age groups face payment difficulties less than the other age groups.

Organization’s Distribution Based on Target 0 and Target 1

  • (Defaulters as well as Non-defaulters) Clients with ORGANIZATION_TYPE Business Entity Type 3, Self-employed, Other, Medicine, Government and Business Entity Type 2 applied for loans the most compared to others.
  • (Defaulters as well as Non-defaulters) Clients with ORGANIZATION_TYPE Industry: type 13, Trade: type 4, Trade: type 5 and Industry: type 8 applied for loans less than others.

Creating a plot for each feature manually becomes too tedious a task. So we will define a function and use a loop to iterate through each categorical column.
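
The author's plotting function is not shown here, so the following is only a sketch of the idea: a hypothetical helper, category_share_by_target, computes the percentage distribution of one categorical column within each target class (the numbers a bar chart would display), and a loop applies it to every categorical column. The data is made up.

```python
import pandas as pd

# Toy data (made-up values).
df1 = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1],
    "CODE_GENDER": ["F", "F", "M", "F", "M", "F"],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Working", "Working", "Working", "State servant"],
})

def category_share_by_target(df, col, target_col="TARGET"):
    """Percentage distribution of `col` within each target class."""
    return pd.crosstab(df[col], df[target_col], normalize="columns") * 100

# Every non-numeric column except the target is treated as categorical.
categorical_cols = [
    col for col in df1.columns
    if col != "TARGET" and not pd.api.types.is_numeric_dtype(df1[col])
]

shares = {col: category_share_by_target(df1, col) for col in categorical_cols}
for col, table in shares.items():
    print(col)
    print(table)
```

Each resulting table has one column per target class summing to 100, so the two populations can be compared side by side even though their sizes differ.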

Insights

Let’s create a list for all categorical columns.

Most of the clients have applied for Cash loans, while a very small proportion have applied for Revolving loans, among both Defaulters and Non-defaulters.
Most of the clients were accompanied while applying for the loan, and for a few clients a family member was accompanying, among both Defaulters and Non-Defaulters. However, who accompanied the client while applying for the loan doesn’t impact the default, and both populations have the same proportions.
Clients earning their income as Working, Commercial associate and Pensioner are more likely to apply for the loan, the highest being the Working category. Businessmen, students and the unemployed are less likely to apply for a loan. The Working category has a high risk of default, while State servant is at minimal risk of default.
Clients with Secondary or Secondary Special education are more likely to apply for the loan, and they also have a higher risk of default. Other education types have minimal risk.
Married clients seem to have applied for the loan the most compared to others, among both Defaulters and Non-Defaulters. Among Defaulters, single clients are less risky and widows show minimal risk.
From the bar chart, it is clear that most of the clients own a house or live in an apartment, among both Defaulters and Non-Defaulters.
Pensioners have applied the most for the loan among both Defaulters and Non-Defaulters. Pensioners, followed by laborers, have the highest risk of default.
There is no considerable difference in days between Defaulters and Non-defaulters.
Clients in the Medium salary range are more likely to apply for the loan, among both Defaulters and Non-defaulters. Clients with low and medium income are at high risk of default.
Most of the clients applied for a Medium credit amount, among both Defaulters and Non-defaulters. Clients applying for high and low credit amounts are at high risk of default.

Univariate Analysis of Numerical Columns W.R.T Target Variable

  • People with Target 1 have largely staggered incomes compared to Target 0. The dist. plots clearly show that the shape of Income total, Annuity, Credit and Goods Price is similar for Target 0 and similar for Target 1.
  • The plots also highlight the people who have difficulty paying back loans with respect to their income, loan amount, price of the goods against which the loan is procured, and annuity.
  • The dist. plots show a curve that is wider for Target 1, compared with Target 0, where it is narrower with well-defined edges.

Bivariate Analysis: Numerical & Categorical W.R.T Target variables

Let’s check the required columns for analysis.

For Target 0

Income Amount Vs Education Status

  • Widowed clients with an Academic degree have very few outliers and no first and third quartiles. Also, clients of every family status with an academic degree have far fewer outliers than those with other types of education.
  • The income of clients of every family status with the remaining education types lies below the first quartile, i.e. 25%.
  • Clients with Higher Education, Incomplete Higher Education, Lower Secondary Education and Secondary/Secondary Special have a higher number of outliers.
  • From the above figure, we can say that some of the clients with Higher Education tend to have the highest income compared to others.
  • Some of the clients who haven’t completed their Higher Education also tend to have higher incomes.
  • Some of the clients with Secondary/Secondary Special Education tend to have higher incomes.

Insights

  • Clients with every Education type except Academic degree have a large number of outliers.
  • For most of the population, the clients’ credit amounts lie below 25%.
  • Widowed clients with an Academic degree tend to take higher credit loans.
  • Some of the clients with Higher Education, Incomplete Higher Education, Lower Secondary Education and Secondary/Secondary Special Education are more likely to take a high amount of credit.

Insights

  • The income amount for Married clients with an academic degree is much lower compared to others.
  • (Defaulter) Clients have relatively less income compared to Non-defaulters.

Insights

  • Married clients with an academic degree applied for higher credit loans, with no outliers. Single clients with academic degrees have a very slim boxplot with no outliers.
  • Some of the clients with Higher Education, Incomplete Higher Education, Lower Secondary Education and Secondary/Secondary Special Education are more likely to take a high amount of credit.

Bivariate Analysis of Categorical-Categorical to Find the Maximum % Clients with Loan-Payment Difficulties

Define a function for bivariate plots
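
The original function is not reproduced in this article. As a rough sketch under that caveat, a hypothetical helper default_rate_by_category can compute the percentage of clients with loan-payment difficulties (TARGET equal to 1) in each category; these are the values such a plot would show. The data is made up and does not reflect the real results.

```python
import pandas as pd

# Toy data (made-up values).
df1 = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Working"] * 4 + ["Pensioner"] * 2 + ["State servant"] * 4,
    "TARGET": [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

def default_rate_by_category(df, col, target_col="TARGET"):
    """Percentage of rows with target 1 in each category, highest first."""
    return (df.groupby(col)[target_col].mean() * 100).sort_values(ascending=False)

rates = default_rate_by_category(df1, "NAME_INCOME_TYPE")
print(rates)
print("Category with maximum % payment difficulties:", rates.index[0])
```

Because TARGET is coded as 0/1, its mean within a group is exactly the share of defaulters in that group.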

Distribution of Amount Income Range and the category with maximum % Loan-Payment Difficulties

Distribution of Type of Income and the category with maximum Loan-Payment Difficulties

Distribution of Contract Type and the category with maximum Loan-Payment Difficulties

Distribution of Education Type and the category with maximum Loan-Payment Difficulties

Distribution of Housing Type and the category with maximum Loan-Payment Difficulties

Distribution of Occupation Type and the category with maximum Loan-Payment Difficulties

You may be wondering why I haven’t attached screenshots here. Well, plot the charts and try to draw insights from them on your own. That’s the best way to learn.

Distribution of CODE_GENDER with respect to AMT_INCOME_RANGE to find maximum % Loan-Payment Difficulties using pivot table
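
A pivot table of this kind could be built as in the sketch below, assuming pandas. The income range labels "Low" and "High" and all values are made up for illustration.

```python
import pandas as pd

# Toy data (made-up values and hypothetical income range labels).
df1 = pd.DataFrame({
    "CODE_GENDER": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "AMT_INCOME_RANGE": ["Low", "Low", "Low", "Low", "High", "High", "High", "High"],
    "TARGET": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Mean of TARGET per (gender, income range) cell, expressed in percent.
pivot = pd.pivot_table(
    df1,
    values="TARGET",
    index="CODE_GENDER",
    columns="AMT_INCOME_RANGE",
    aggfunc="mean",
) * 100
print(pivot)
```

Each cell then holds the percentage of clients with loan-payment difficulties for that combination, so the riskiest combination is simply the largest cell.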

  • Female clients with an Academic degree and a high income have a higher risk of default.
  • Male clients with Secondary/Secondary Special Education, across all salary ranges, have a higher risk of default.
  • Male clients with Incomplete Education and very low salaries have a high risk of default.
  • Male clients with Lower Secondary Education and very low or medium salaries have a high risk of default.

Let’s check correlations in the data visually. For that make a list of all numeric features.

Correlations between numerical variables

Let’s use pairplot to get the required charts.

  • AMT_CREDIT and AMT_GOODS_PRICE are highly correlated for both defaulters and non-defaulters. So as the home price increases, the loan amount also increases.
  • AMT_CREDIT and AMT_ANNUITY (EMI) are highly correlated for both defaulters and non-defaulters. So as the loan amount increases, the EMI amount also increases, which is logical.
  • All three variables AMT_CREDIT, AMT_GOODS_PRICE and AMT_ANNUITY are highly correlated for both defaulters and non-defaulters, so they might not be good indicators for defaulter detection.

Now, let’s check correlations using heatmaps.
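
The matrix behind such a heatmap comes from DataFrame.corr(), which computes pairwise Pearson correlations by default. A minimal sketch on made-up data for one target group (perfectly linear on purpose, so the coefficients are exactly 1 and -1):

```python
import pandas as pd

# Toy data for one target group (made-up, perfectly linear values).
target0 = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0],
    "AMT_GOODS_PRICE": [90.0, 180.0, 270.0, 360.0],
    "CNT_CHILDREN": [3, 2, 1, 0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = target0.corr()
print(corr.round(2))
```

With seaborn installed, sns.heatmap(corr, annot=True) would draw this matrix as a heatmap. A coefficient near 1 or -1 indicates a strong linear relationship; it does not by itself imply proportionality or causation.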

  • AMT_CREDIT is negatively correlated with DAYS_BIRTH: people in the low-age group take higher credit amounts, and vice versa.
  • AMT_CREDIT is negatively correlated with CNT_CHILDREN: the credit amount is higher for clients with fewer children, and vice versa.
  • AMT_INCOME_TOTAL is negatively correlated with CNT_CHILDREN: clients with fewer children have more income, and vice versa.
  • Clients in densely populated areas have fewer children.
  • AMT_CREDIT is higher in densely populated areas.
  • AMT_INCOME_TOTAL is also higher in densely populated areas.

  • The heat map for Target 1 shows much the same observations as for Target 0, but a few points differ. They are listed below.
  • Clients whose permanent address does not match their contact address have fewer children.
  • Clients whose permanent address does not match their work address have fewer children.

This is the analysis of the current application data. We have one more dataset, for the previous applications, which also has to be analysed. Take that data, do the analysis, and try to draw insights.

Now that we have understood and gained insight into the dataset, i.e. performed an Exploratory Data Analysis, try to use ML algorithms to classify fraudulent cases. So let’s summarize what we have learnt in this case study:

  • We have extensively covered the pre-processing steps required to analyze data.
  • We have covered null value imputation methods.
  • We have also covered step-by-step analysis techniques such as univariate analysis, bivariate analysis, multivariate analysis, etc.

Find the link to the source code here.

Hope you enjoyed my article on exploratory data analysis. Thank you for reading!

Read more articles on exploratory data analysis on our blog.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

A Mathematics student turned Data Scientist. I am an aspiring data scientist who aims at learning all the necessary concepts in Data Science in detail. I am passionate about Data Science knowing data manipulation, data visualization, data analysis, EDA, Machine Learning, etc which will help to find valuable insights from the data.

9. Case Studies

Introduction.

In this diverse collection of case studies, the power of Exploratory Data Analysis (EDA) shines as a critical tool for understanding and extracting insights from various datasets across different domains. Each case study focuses on a specific problem domain, ranging from e-commerce customer behavior analysis to predictive maintenance in manufacturing industries. The primary goal of these analyses is to leverage EDA techniques to unravel hidden patterns, relationships, and trends within the data, leading to data-driven decisions and optimized strategies.

Throughout these case studies, diverse datasets play a pivotal role in providing a deep understanding of the subject matter. These datasets encompass online retail transaction records, electronic health records, credit card transaction data, environmental sensor readings, marketing campaign metrics, social media sentiments, GPS data from vehicles, student academic records, agricultural data, and manufacturing equipment sensor readings. Armed with these diverse datasets, analysts embark on an EDA journey, employing tools like Python, R, Pandas, Matplotlib, Seaborn, Plotly, Geopandas, and more.

The EDA process unfolds through several key steps, including data cleaning and preprocessing to ensure data quality, exploration of variables and patterns, and compelling visualizations that bring insights to life. The results of EDA reveal essential facets of each domain, such as best-selling products, customer segmentation, healthcare outcomes, fraudulent transactions, pollution hotspots, successful marketing campaigns, sentiment analysis, optimized transportation routes, student academic performance factors, and predictive equipment maintenance.

These case studies illustrate the indispensable role of Exploratory Data Analysis in empowering decision-makers across industries. By unlocking the valuable insights buried within vast datasets, EDA empowers businesses and organizations to optimize their strategies, enhance customer experiences, improve healthcare quality, prevent fraud, protect the environment, target marketing efforts, and optimize logistics. As a foundational step in the data analysis journey, EDA serves as a powerful bridge between raw data and actionable knowledge, opening up a world of possibilities for data-driven innovation and problem-solving.

Case Studies

E-Commerce Customer Behavior Analysis :

Description : This case study aims to understand customer behavior in an online retail business to improve marketing and product strategies.

Dataset : Online retail dataset containing transactional records, customer IDs, product details, timestamps, and order quantities.

Tools : Python with Pandas for data manipulation, Matplotlib and Seaborn for data visualization.

Steps using EDA :

Data cleaning and preprocessing to handle missing values and remove duplicates.

Exploring product popularity, customer purchase patterns, and customer segmentation.

Visualizing purchase trends, seasonal patterns, and revenue growth.

Results : Identifying best-selling products, peak shopping hours, customer segments, and trends in revenue growth.

Healthcare Patient Outcomes Analysis :

Description : This case study focuses on analyzing patient outcomes based on electronic health records (EHR) to improve healthcare quality.

Dataset : Electronic health records (EHR) with patient demographics, medical history, diagnoses, treatments, and patient outcomes.

Tools : R with dplyr and ggplot2 for data wrangling and visualization.

Steps using EDA :

Data preprocessing and cleaning to handle missing values and outliers.

Exploring patient demographics, disease prevalence, and treatment efficacy.

Visualizing readmission rates, mortality rates, and correlations between variables.

Results : Identifying factors influencing patient outcomes, trends in readmission rates, and potential areas for healthcare improvement.

Financial Fraud Detection :

Description : This case study aims to detect fraudulent transactions in credit card data to enhance fraud prevention systems.

Dataset : Credit card transaction data with details such as transaction amounts, locations, timestamps, and customer IDs.

Tools : Python with Pandas for data preprocessing, Matplotlib and Seaborn for visualization, and machine learning algorithms for fraud detection.

Steps using EDA :

Data cleaning and preprocessing to handle imbalanced classes and outliers.

Exploring transaction patterns, correlations, and frequency of fraud cases.

Visualizing transaction amounts, fraudulent vs. non-fraudulent transactions, and identifying potential fraud hotspots.

Results : Identifying unusual spending patterns, high-risk transactions, and improving fraud detection accuracy.

Environmental Sensor Data Analysis :

Description : This case study involves analyzing environmental sensor data to understand air quality trends and pollution sources .

Dataset : Air quality sensor data with measurements of pollutants like CO2, PM2.5, and ozone at various locations and timestamps.

Tools : Python with Pandas for data cleaning, Plotly for interactive visualizations, and geographical libraries for mapping.

Steps using EDA :

Data preprocessing to handle missing values and outliers in sensor readings.

Exploring pollutant levels, spatial distributions, and temporal trends.

Visualizing pollution hotspots and correlations between pollutants.

Results : Identifying areas with poor air quality, trends in pollutant levels, and potential pollution sources.

Marketing Campaign Performance Analysis :

Description : This case study involves analyzing the performance of marketing campaigns to optimize marketing strategies.

Dataset : Marketing campaign data with details of campaigns, customer responses, conversions, and costs.

Tools : R with tidyverse for data manipulation, ggplot2 for visualization, and A/B testing tools for campaign performance analysis.

Steps using EDA :

Data cleaning and preprocessing to handle missing data and inconsistencies.

Exploring campaign performance metrics, customer response rates, and conversion rates.

Visualizing campaign effectiveness, customer segmentation, and A/B test results.

Results : Identifying successful marketing campaigns, high-converting strategies, and customer segments with the best response rates.

Social Media Sentiment Analysis :

Description : This case study aims to analyze social media data to gauge public sentiment about products, brands, or events.

Dataset : Twitter or Facebook data with text posts, timestamps, and user engagement metrics.

Tools : Python with TextBlob or NLTK for sentiment analysis, WordCloud for word visualization, and Matplotlib for plotting.

Steps using EDA :

Text preprocessing to handle stopwords, special characters, and convert text to lowercase.

Analyzing sentiment scores, word frequencies, and trending topics.

Visualizing word clouds to highlight positive and negative sentiment words.

Results : Identifying overall sentiment towards products or brands, popular topics, and public perception trends.

Transportation and Logistics Optimization :

Description : This case study involves optimizing transportation and logistics operations to improve efficiency and reduce costs.

Dataset : GPS data from vehicles, delivery records, traffic information, and location details.

Tools : Python with Geopandas for geospatial analysis, NetworkX for route optimization, and visualization libraries for maps.

Steps using EDA :

Data preprocessing to handle GPS data, normalize timestamps, and clean location data.

Exploring traffic patterns, congestion points, and delivery routes.

Visualizing optimized routes and delivery efficiency.

Results : Identifying bottlenecks, optimizing delivery schedules, and reducing transportation costs.

Education Performance Analysis :

Description : This case study focuses on analyzing student performance data to understand factors influencing academic outcomes.

Dataset : Student academic records with grades, attendance, test scores, and demographics.

Tools : R with tidyr and dplyr for data tidying, ggplot2 for visualizations, and machine learning models for performance prediction.

Steps using EDA :

Data cleaning and preprocessing to handle missing grades and attendance records.

Exploring student demographics, grade distributions, and attendance patterns.

Visualizing performance trends, correlations between variables, and predicting academic performance.

Results : Identifying factors affecting student academic performance, predicting at-risk students, and designing targeted interventions.

Agricultural Yield Prediction :

Description : This case study aims to predict crop yields based on agricultural data to optimize planting strategies.

Dataset : Agricultural data with historical weather data, soil characteristics, crop details, and yields.

Tools : Python with NumPy and Pandas for data manipulation, Scikit-learn for regression models, and visualization libraries for plotting.

Steps using EDA :

Data preprocessing to handle missing weather data and crop details.

Exploring weather patterns, correlations between weather variables, and crop yields.

Visualizing yield predictions and comparing with actual yields.

Results : Identifying the correlation between weather patterns and crop yields, optimizing planting schedules, and predicting future harvest outcomes.

Predictive Maintenance in Manufacturing :

Description : This case study focuses on predictive maintenance in manufacturing industries to reduce downtime and improve productivity.

Dataset : Sensor data from manufacturing equipment, including temperature, vibration, and other performance indicators.

Tools : Python with Pandas for data preprocessing, Plotly for visualization, and machine learning algorithms for predictive maintenance.

Data cleaning and preprocessing to handle missing sensor readings and outliers.

Exploring sensor data patterns, correlations between sensor variables, and anomalies.

Visualizing predictive maintenance predictions and comparing with actual breakdowns.

Results : Identifying early signs of equipment failure, scheduling maintenance proactively, and minimizing unplanned downtime.

In each case study, the Exploratory Data Analysis (EDA) process plays a crucial role in uncovering insights, trends, and relationships within the data. By using various data cleaning, exploration, and visualization techniques, analysts can gain valuable insights to make data-driven decisions and optimize processes in different domains. The results obtained through EDA inform subsequent analyses, help refine strategies, and lead to improvements in various aspects of the business or domain being studied.

Last updated 1 year ago

Exploratory Data Analysis in Python — A Step-by-Step Process

What exploratory analysis is, how it is structured, and how to apply it in Python with the help of pandas and other data analysis and visualization libraries.

Andrea D'Agostino

Towards Data Science

Article last updated: August 2023

Exploratory data analysis (EDA) is an especially important activity in the routine of a data analyst or scientist.

It enables an in-depth understanding of the dataset, helps define or discard hypotheses, and lets you build predictive models on a solid basis.

It uses data manipulation techniques and several statistical tools to describe and understand the relationship between variables and how these can impact business.

In fact, it’s thanks to EDA that we can ask ourselves meaningful questions that can impact business.

In this article, I will share with you a template for exploratory analysis that I have used over the years and that has proven to be solid for many projects and domains. This is implemented through the use of the Pandas library — an essential tool for any analyst working with Python.

The process consists of several steps:

  • Importing a dataset

EDA: Exploratory Data Analysis with example in Jupyter notebook

By Shivam Solanki, posted Wed February 19, 2020 05:35 PM.

The goal of EDA is to leverage visualization tools, summary tables, and hypothesis testing to:

  • Provide summary level insight into a dataset.
  • Uncover underlying patterns and structures in your data.
  • Identify outliers, missing data, class balance, and other data-related issues.
  • Relate the available data to the business opportunity.

Let’s work with a case study based on the online retail data set, which is available through the UCI Machine Learning Repository. This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

The business scenario here is that the management team expects to spend less time on projection models and gain more accuracy in forecasting revenue. Well-projected numbers are expected to help stabilize staffing and budget projections, which will have a beneficial ripple effect throughout the company.

The business metric can be defined as a function of revenue gained through more accurate predictions.

Steps of the EDA Process:

Loading data into pandas
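The original notebook screenshot is not available, so the following is only a minimal sketch of what loading this kind of data into pandas might look like. The column names follow the UCI Online Retail data set; the three rows and the inline CSV are made up for illustration, and a real run would point read_csv (or read_excel) at the downloaded file instead.

```python
import io

import pandas as pd

# Inline stand-in for the real file. Column names follow the UCI Online Retail
# data set; the rows themselves are invented for this sketch.
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,HANGING HEART HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom\n"
    "536366,22633,HAND WARMER,6,2010-12-01 08:28:00,1.85,17850,United Kingdom\n"
    "536367,84879,BIRD ORNAMENT,32,2011-01-04 10:00:00,1.69,13047,United Kingdom\n"
)

# Parse the timestamp column up front and keep identifier columns as strings.
df = pd.read_csv(
    sample_csv,
    parse_dates=["InvoiceDate"],
    dtype={"InvoiceNo": str, "StockCode": str},
)

print(df.shape)
print(df.dtypes)
```

Parsing dates at load time keeps later time-based grouping simple.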

2. Use tables, text and visualizations to tell the story that relates the business opportunity to the data

Monthly Revenue Calculation

Here, the data is used to calculate the monthly revenue of the online retail store. Since one of the goals of this case study is forecasting revenue, it is important to quantify revenue with a formula that can later be used either in supervised learning or for hypothesis testing.
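The post does not show its revenue formula, so the sketch below assumes that revenue per row is Quantity times UnitPrice and that months are taken from InvoiceDate. The four transactions are invented purely to make the example runnable.

```python
import pandas as pd

# Invented transactions; only the column names mirror the Online Retail data.
transactions = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-15", "2011-01-04", "2011-01-20"]
    ),
    "Quantity": [6, 10, 32, 2],
    "UnitPrice": [2.5, 1.0, 0.5, 4.0],
})

# Assumed formula: revenue of a line item = quantity * unit price.
transactions["Revenue"] = transactions["Quantity"] * transactions["UnitPrice"]

# Group by calendar month and sum the line-item revenue.
month = transactions["InvoiceDate"].dt.to_period("M")
monthly_revenue = transactions.groupby(month)["Revenue"].sum()

print(monthly_revenue)
```

The resulting series, indexed by month, is the kind of target a later forecasting model could be trained on.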

3. Identify a strategy to deal with missing values:

Data integrity issues are often identified during the Exploratory Data Analysis (EDA) process. After extracting data, it is important to include checks for quality assurance even on the first pass through the project workflow. The quality assurance step must implement checks for duplicates and missing values. Missing values are generally dealt with depending on the category of missingness, i.e. MCAR (missing completely at random), MAR (missing at random) and MNAR (missing not at random). If the missing data are not MCAR, then imputing values can result in an increase in bias, and therefore it is very important to have a train/test split.

Data cleaning summary
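As a rough illustration of the quality-assurance checks described above, the sketch below counts exact duplicate rows and missing values per column on a small made-up table. It does not reproduce the post's cleaning summary, and it makes no attempt to classify the missingness as MCAR, MAR or MNAR.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second row duplicates the first, the third has gaps.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "Description": ["HOLDER", "HOLDER", None, "ORNAMENT"],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
})

# Check 1: exact duplicate rows.
n_duplicates = int(raw.duplicated().sum())
deduplicated = raw.drop_duplicates()

# Check 2: missing values per column, after removing duplicates.
missing_per_column = deduplicated.isna().sum()

print(f"duplicate rows removed: {n_duplicates}")
print(missing_per_column)
```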

4. Investigate the data and underlying business scenario with visualizations and hypothesis testing.

For example, the plot_rev() function is called from the Python script data_visualization.py rather than writing plotting code repeatedly in the Jupyter notebook.

EDA charts with plot_rev() function

The Jupyter notebook should be kept as a presentable component with minimal code. It can be used as a data scientist’s PowerPoint to deliver the story of your initial findings on the data.

5. Communicate your findings

There is no single right way to communicate EDA, but a minimum bar is that the data summaries, key findings, investigative process, and conclusions are made clear. Deliverables should be concise and clear.

One important deliverable could be the result of investigating the relationship between the relevant data, the target and the business metric. For example, the revenue calculated in step 2 could be the target variable directly related to the business metric, and a proposal for a supervised learning and/or forecasting model could be substantiated with the EDA deliverable.

Follow this link to access the notebook for EDA

Exploratory Data Analysis Case Study - 7 Million+ Company Dataset

  • About the Dataset
  • Installing all libraries, modules, and downloading the dataset from Kaggle
  • Data pre-processing and cleaning: data preprocessing, data cleaning, merging other datasets
  • Exploratory Data Analysis
  • Future Work

1. About the Dataset

  • The dataset contains data about 7 million companies from different parts of the world.
  • We will use the following columns from the dataset: employee, country, year founded, type of industry, name of company, etc.
  • Later, we will also merge in other datasets with country-level information, such as median age, urban population, and world share, for more detailed analysis.

2. Objective

The objective of this project is to apply the data analysis & visualization skills & techniques learned to a real-world dataset.

Exploratory Data Analysis: a Case Study Example on Classification Task

Muhammad Risqi Firdaus

Exploratory data analysis (EDA) plays the main role in building a machine learning model. Everything you do with your data will be based on your EDA results. I once wrote about EDA in an old blog post; you can read it here.

I will give you a tutorial on finding the edges of the data. Maybe it is not the best tutorial you will read, and it covers only the classification task. If you have time series data or a regression task, you can look for another tutorial.

About The Data

In this article, I use the Spaceship Titanic data from Kaggle. As shown on the Kaggle page, the data contains train, test and sample submission files. This is good data to start your exploration with. The data that I use has been modified to give you a better learning experience.

As you may know, to start EDA after loading the data, you can get a snapshot of the data through the .head() method.

The notebook will return the first five rows of the data in their stored order. You can change the number of rows in the snapshot by passing the number of rows you want as the method parameter.
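Since the screenshot is missing here, this is a small sketch of the call. The column names follow the Spaceship Titanic data, but the rows are invented and are not the author's modified data.

```python
import pandas as pd

# Invented rows with a few Spaceship Titanic column names.
train = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02",
                    "0004_01", "0005_01", "0006_01"],
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa",
                   "Earth", "Earth", "Mars"],
    "Age": [39, 24, 58, 33, 16, 44, 26],
    "FoodCourt": [0, 9, 3576, 1283, 70, 483, 0],
})

first_five = train.head()    # default: first five rows
first_three = train.head(3)  # pass a number to change the snapshot size

print(first_five)
print(first_three)
```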

This method is important. Before analyzing or taking any step forward, you need to get a small picture of your data. In the screenshot I give, you can see there are some odd values in the feature “FoodCourt”. With general knowledge like that, you can take a specific step, such as getting the unique values of the data.

Getting The General Information

After you get the snapshot of the data, you can get general information about the data through the .info() method. This method returns the data type, the non-null count, and, obviously, the name of each feature.

In the figure shown, you can see that most of the features are of object (categorical) type. You can also see that the non-null count differs between features, which means that you have null data.
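A minimal sketch of this step on invented rows follows. The count() call is an addition of mine that returns the same non-null counts as a Series, which is handy if you want to use them programmatically.

```python
import numpy as np
import pandas as pd

# Invented rows with deliberate gaps in HomePlanet and Age.
train = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0004_01"],
    "HomePlanet": ["Europa", "Earth", None, "Mars"],
    "Age": [39.0, 24.0, 58.0, np.nan],
    "Transported": [False, True, False, True],
})

# Prints column names, non-null counts and dtypes.
train.info()

# The same non-null counts as a Series.
non_null_counts = train.count()
print(non_null_counts)
```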

Getting Rid of Duplicate Data

Our data may contain duplicate rows, so you should check for them and drop them before any further step. Pandas has a drop_duplicates() method that can easily drop the duplicate instances of your data. In this case, we don’t have any duplicate data, so we can move forward to the next step.
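As a sketch, the check and the drop could look like this. The rows are invented and, unlike the author's data, deliberately contain one duplicate so the effect is visible.

```python
import pandas as pd

# Invented rows; the third row repeats the second.
passengers = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_01", "0003_01"],
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars"],
})

# Count fully identical rows, then drop them (the first occurrence is kept).
n_duplicates = int(passengers.duplicated().sum())
unique_rows = passengers.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(unique_rows)
```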

Finding The Null Value

Exploring the null values of each feature can lead you to the cleaning process. If the null percentage is too big, you can use an imputer, or maybe choose the feature with the lowest null percentage as the baseline for dropping rows.

As we can see, the null percentage for each feature in our data is very low, so we can drop rows based on the feature with the most nulls. However, in the first phase you can see that the amount of training data is scant, so maybe you can arrange an experiment comparing two training setups: data with and without nulls.
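One possible way to compute the null percentage per feature, and to build the "without nulls" variant for such an experiment, is sketched below on invented rows.

```python
import numpy as np
import pandas as pd

# Invented rows with gaps in HomePlanet and Age.
train = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", None, "Mars", "Earth"],
    "Age": [39.0, np.nan, 58.0, np.nan, 16.0],
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0, 70.0],
})

# Share of nulls per feature, as a percentage, highest first.
null_pct = train.isna().mean().mul(100).sort_values(ascending=False)

# Variant of the data with every incomplete row dropped.
rows_without_nulls = train.dropna()

print(null_pct)
print(len(train), len(rows_without_nulls))
```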

Figuring The Unique Value

To notice odd values in your data, you can start by figuring out the unique values of your features. Getting the count of unique values plays the main role.

Furthermore, you may need to check each feature. You can use the .unique() method on each feature series.
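A short sketch of both calls, on invented rows where the Age column has been read as text:

```python
import pandas as pd

# Invented rows; Age is stored as strings and contains an odd value.
train = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars", "Earth"],
    "Age": ["39", "24", "one", "0", "24"],
})

# Number of distinct values per feature.
unique_counts = train.nunique()

# The distinct values of one feature, in order of first appearance.
age_values = train["Age"].unique()

print(unique_counts)
print(age_values)
```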

By checking particular features, we may find some odd values, like “one” inside a numerical feature. Do you think it is possible to have an age of 0? The question may be answered as you explore your data.

Following the statement above, we can replace the odd age values and convert the type to integer.
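The article does not show which replacement it used, so the sketch below assumes that the text value "one" should simply become 1. It uses the nullable Int64 dtype so that rows with a missing age survive the conversion.

```python
import pandas as pd

# Invented Age column read as text, with one odd value and one missing value.
ages = pd.Series(["27", "one", "33", None, "0"], name="Age")

# Assumption: "one" is a spelled-out 1.
cleaned = ages.replace({"one": "1"})

# Convert to numbers, then to a nullable integer type that tolerates missing values.
age_int = pd.to_numeric(cleaned, errors="coerce").astype("Int64")

print(age_int)
```

With errors="coerce", any other non-numeric text would turn into a missing value rather than raising an error, so it is worth re-checking the null count afterwards.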

Exploring The Data Distribution

In classification, it is important to know and adjust the distribution of the data, especially on target data. The figure below shows you the distribution of target data.

We can see that there is no significant difference in data (target) distribution. We can continue our investigation on feature correlation.
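Without the figure, a numeric check of the target balance can serve the same purpose. The sketch below uses an invented Transported column, so the shares are illustrative only.

```python
import pandas as pd

# Invented target column.
target = pd.Series(
    [True, False, True, True, False, False, True, False], name="Transported"
)

# Share of each class; values near 1 / n_classes indicate a balanced target.
class_share = target.value_counts(normalize=True)
imbalance_gap = class_share.max() - class_share.min()

print(class_share)
print(imbalance_gap)
```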

For numerical data, you can investigate the distribution through a histogram or line plot. If you are handling categorical data, besides a heatmap for each class, you can use a bar plot to investigate the distribution.

We can see that if the passenger comes from “Earth”, the passenger is most likely to be transported, whereas the other HomePlanet values lead the passenger not to be transported. From the graph, we could also conclude that most of the data falls in the “Earth” category, which can lead to bias.
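The numbers behind such a bar plot can be sketched with a cross-tabulation. The rows below are invented and do not reproduce the author's figure; they only show the mechanics.

```python
import pandas as pd

# Invented rows; not the author's data.
passengers = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa",
                   "Europa", "Mars", "Mars", "Earth"],
    "Transported": [True, False, True, False, True, False, True, True],
})

# How many rows fall in each category (the height of each bar).
category_counts = passengers["HomePlanet"].value_counts()

# Share of each target class within each category (each row sums to 1).
transported_rate = pd.crosstab(
    passengers["HomePlanet"], passengers["Transported"], normalize="index"
)

print(category_counts)
print(transported_rate)
```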

Which one is the best plot? No one knows; there is no silver bullet. The best-fitting plot depends on your data. However, if we want to encode more precisely, we can investigate the correlation of each category based on its class.

Is It Enough?

The answer is, no one knows. If you have numeric interval data, maybe you can investigate the existence of outliers. This can be done by plotting your numerical data using a boxplot. Before going too far, we have to agree on the same definition of the term outlier.

In this context, an outlier is defined as a value that is beyond the inner fence of the data (more than 1.5 IQR below the first quartile or above the third quartile; you can google it later). The boxplot shows the outliers as points located outside the whiskers of the plot.
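The inner-fence rule can also be computed directly, without a plot. This is a small worked sketch on invented values, using pandas' default (linear) quantile interpolation.

```python
import pandas as pd

# Invented values with one obvious extreme.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Inner fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```

Here the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5 and only the value 100 falls outside them.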

Is Your Feature Correlated Well?

First things first, you have to understand that correlation doesn’t imply causality. If you see a correlation between seemingly unrelated data, check for causality first.

Checking the correlation of categorical data can be kind of tricky. If you use Pearson correlation, you will only get the correlation of numerical data. If you want to investigate correlation on non-numerical data, you can use phi_k correlation (you can google it later).

If you are curious about the exact value of the correlation, you can turn on the annotation of the plot. Determining the correlation of each feature with the target feature could lead us to feature selection.
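The sketch below covers only the Pearson part, on invented rows. It selects the numerical columns explicitly, which makes it visible that the categorical column is left out; phi_k is not shown here.

```python
import pandas as pd

# Invented rows: one categorical and three numerical features.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth", "Europa"],
    "Age": [20, 30, 40, 50, 60],
    "FoodCourt": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Spa": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Pearson correlation is only defined for the numerical columns.
numeric = df.select_dtypes(include="number")
corr = numeric.corr(method="pearson")

print(corr.round(2))
```

The resulting matrix is what a heatmap would display; printing it gives the exact values that plot annotations would show.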

The Most Important Feature

If you want to look at feature importance, you need to build a base model first. If you change your encoding or model, the feature importance may change as well. In this case, I use the simplest base model, a decision tree trained on non-null data, to identify the feature importance of my initial data. Note that I separated the column Cabin into three columns and encoded it before training the base model.
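A minimal sketch of such a base model is shown below, assuming scikit-learn is available. The rows are invented, one-hot encoding is my own choice (the article does not say which encoding it used), and the Cabin split is left out, so the importances are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented rows; one row has a missing Age and is dropped below.
data = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth",
                   "Europa", "Mars", "Earth", "Europa"],
    "Age": [24.0, 39.0, np.nan, 16.0, 58.0, 33.0, 44.0, 26.0],
    "FoodCourt": [9.0, 0.0, 70.0, 483.0, 3576.0, 0.0, 1283.0, 5.0],
    "Transported": [True, False, True, True, False, False, True, False],
}).dropna()

# Assumed encoding: one-hot encode the categorical feature.
X = pd.get_dummies(data[["HomePlanet", "Age", "FoodCourt"]], columns=["HomePlanet"])
y = data["Transported"]

# Simple base model; random_state fixes tie-breaking between equally good splits.
model = DecisionTreeClassifier(random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```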

You have learned some methods for EDA in a classification task. In time, you will learn that the methods for EDA in other tasks may be a little bit different. In another article, maybe I will explain feature engineering, data cleaning, or EDA on another task. Hopefully this article will be beneficial for you. See ya later!!!

Exploratory Research | Definition, Guide, & Examples

Published on December 6, 2021 by Tegan George. Revised on November 20, 2023.

Exploratory research is a methodological approach that investigates research questions that have not previously been studied in depth.

Exploratory research is often qualitative and primary in nature. However, a study with a large sample conducted in an exploratory manner can be quantitative as well. It is also often referred to as interpretive research or a grounded theory approach due to its flexible and open-ended nature.

Table of contents

  • When to use exploratory research
  • Exploratory research questions
  • Exploratory research data collection
  • Step-by-step example of exploratory research
  • Exploratory vs. explanatory research
  • Advantages and disadvantages of exploratory research
  • Other interesting articles
  • Frequently asked questions about exploratory research

Exploratory research is often used when the issue you’re studying is new or when the data collection process is challenging for some reason.

You can use this type of research if you have a general idea or a specific question that you want to study but there is no preexisting knowledge or paradigm with which to study it.

Exploratory research questions are designed to help you understand more about a particular topic of interest. They can help you connect ideas to understand the groundwork of your analysis without adding any preconceived notions or assumptions yet.

Here are some examples:

  • What effect does using a digital notebook have on the attention span of middle schoolers?
  • What factors influence mental health in undergraduates?
  • What outcomes are associated with an authoritative parenting style?
  • In what ways does the presence of a non-native accent affect intelligibility?
  • How can the use of a grocery delivery service reduce food waste in single-person households?

Collecting information on a previously unexplored topic can be challenging. Exploratory research can help you narrow down your topic and formulate a clear hypothesis and problem statement, as well as giving you the “lay of the land” on your topic.

Data collection using exploratory research is often divided into primary and secondary research methods, with data analysis following the same model.

Primary research

In primary research, your data is collected directly from primary sources: your participants. There is a variety of ways to collect primary data.

Some examples include:

  • Survey methodology: Sending a survey out to the student body asking them if they would eat vegan meals
  • Focus groups: Compiling groups of 8–10 students and discussing what they think of vegan options for dining hall food
  • Interviews: Interviewing students entering and exiting the dining hall, asking if they would eat vegan meals

Secondary research

In secondary research, your data is collected from preexisting primary research, such as experiments or surveys.

Some other examples include:

  • Case studies: Health of an all-vegan diet
  • Literature reviews: Preexisting research about students’ eating habits and how they have changed over time
  • Online polls, surveys, blog posts, or interviews; social media: Have other schools done something similar?

For some subjects, it’s possible to use large-n government data, such as the decennial census or yearly American Community Survey (ACS) open-source data.

How you proceed with your exploratory research design depends on the research method you choose to collect your data. In most cases, you will follow five steps.

We’ll walk you through the steps using the following example.

Therefore, you would like to focus on improving intelligibility instead of reducing the learner’s accent.

Step 1: Identify your problem

The first step in conducting exploratory research is identifying what the problem is and whether this type of research is the right avenue for you to pursue. Remember that exploratory research is most advantageous when you are investigating a previously unexplored problem.

Step 2: Hypothesize a solution

The next step is to come up with a solution to the problem you’re investigating. Formulate a hypothetical statement to guide your research.

Step 3: Design your methodology

Next, conceptualize your data collection and data analysis methods and write them up in a research design.

Step 4: Collect and analyze data

Next, you proceed with collecting and analyzing your data so you can determine whether your preliminary results are in line with your hypothesis.

In most types of research, you should formulate your hypotheses a priori and refrain from changing them due to the increased risk of Type I errors and data integrity issues. However, in exploratory research, you are allowed to change your hypothesis based on your findings, since you are exploring a previously unexplained phenomenon that could have many explanations.

Step 5: Avenues for future research

Decide if you would like to continue studying your topic. If so, it is likely that you will need to change to another type of research. As exploratory research is often qualitative in nature, you may need to conduct quantitative research with a larger sample size to achieve more generalizable results.

It can be easy to confuse exploratory research with explanatory research. To understand the relationship, it can help to remember that exploratory research lays the groundwork for later explanatory research.

Exploratory research investigates research questions that have not been studied in depth. The preliminary results often lay the groundwork for future analysis.

Explanatory research questions tend to start with “why” or “how”, and the goal is to explain why or how a previously studied phenomenon takes place.

Exploratory vs explanatory research

Like any other research design, exploratory studies have their trade-offs: they provide a unique set of benefits but also come with downsides.

Advantages

  • It can be very helpful in narrowing down a challenging or nebulous problem that has not been previously studied.
  • It can serve as a great guide for future research, whether your own or another researcher’s. With new and challenging research problems, adding to the body of research in the early stages can be very fulfilling.
  • It is very flexible, cost-effective, and open-ended. You are free to proceed however you think is best.

Disadvantages

  • It usually lacks conclusive results, and results can be biased or subjective due to a lack of preexisting knowledge on your topic.
  • It’s typically not externally valid and generalizable, and it suffers from many of the challenges of qualitative research.
  • Since you are not operating within an existing research paradigm, this type of research can be very labor-intensive.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Exploratory research is a methodological approach that explores research questions that have not previously been studied in depth. It is often used when the issue you’re studying is new, or the data collection process is challenging in some way.

Exploratory research aims to explore the main aspects of an under-researched problem, while explanatory research aims to explain the causes and consequences of a well-defined problem.

You can use exploratory research if you have a general idea or a specific question that you want to study but there is no preexisting knowledge or paradigm with which to study it.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses. Qualitative methods allow you to explore concepts and experiences in more detail.

Cite this Scribbr article

George, T. (2023, November 20). Exploratory Research | Definition, Guide, & Examples. Scribbr. Retrieved August 27, 2024, from https://www.scribbr.com/methodology/exploratory-research/

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points closest to a particular centroid will be clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression.
  • Predictive models, such as linear regression, use statistics and data to predict outcomes.
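To make the K-means description in the list above concrete, here is a minimal NumPy sketch of the usual assign-then-update loop (often called Lloyd's algorithm). The function name, the random initialisation and the six sample points are all illustrative assumptions; in practice a library implementation would normally be used.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Illustrative K-means: assign points to the nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the group is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two invented, well-separated groups of points.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],
])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group ends up labelled 0 or 1 depends on the random starting points, so only the grouping itself is meaningful.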

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Common types of univariate graphics include:
      • Stem-and-leaf plots, which show all data values and the shape of the distribution.
      • Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
      • Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical. Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical. Multivariate data uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.

Other common types of multivariate graphics include:

  • Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are depicted by color.
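The two non-graphical types described above can be sketched in a few lines of pandas. The table and its column names are invented for illustration: the five-number summary is an example of univariate non-graphical EDA, and the cross-tabulation is an example of multivariate non-graphical EDA.

```python
import pandas as pd

# Invented orders table.
orders = pd.DataFrame({
    "segment": ["retail", "wholesale", "retail", "retail", "wholesale", "retail"],
    "channel": ["web", "web", "store", "web", "store", "store"],
    "order_value": [20.0, 150.0, 35.0, 25.0, 210.0, 30.0],
})

# Univariate non-graphical: five-number summary of a single variable.
five_number = orders["order_value"].describe()[["min", "25%", "50%", "75%", "max"]]

# Multivariate non-graphical: cross-tabulation of two categorical variables.
segment_by_channel = pd.crosstab(orders["segment"], orders["channel"])

print(five_number)
print(segment_by_channel)
```

The five-number summary holds the same values a box plot would draw, and the cross-tabulation holds the counts behind a grouped bar chart.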

Some of the most common data science tools used to create an EDA include:

  • Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning.
  • R: An open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science in developing statistical observations and data analysis.

For a deep dive into the differences between these approaches, check out " Python vs. R: What's the Difference? "

Exploratory Data Analysis (EDA) – Retail Case Study Example (Part 3)

Exploratory data analysis for Soccer - by Roopam

For the last couple of weeks we have been working on a marketing analytics case study example (read Part 1 and Part 2). In the last part (Part 2) we defined a couple of advanced analytics objectives based on the business problem at an online retail company called DresSmart Inc. In this part, we will perform some exploratory data analysis as a part of the same case study example. But before that let’s explore the power of exploratory data analysis (EDA) to reveal hidden facts about the greatest game on the planet – soccer or football.

Soccer – Exploratory Data Analysis

Soccer is undoubtedly the most popular game on the planet with over 200 nations having their official soccer teams. No other game has such a universal appeal with millions of hardcore followers.  Every detail of soccer is analyzed by the players, the coaches and the support staff. Despite this, a careful exploratory data analysis of the game could unravel match-winning secrets about the greatest game, as you will see in the next two example case studies.

Penalty Kicks

Let’s relive the first knockout (pre-quarterfinal) match of the Soccer World Cup 2014 between Brazil and Chile. The scores were level at 1-1 at the end of the allotted 90 minutes. Even the extra half an hour could not conclude the match, with the scoreboard still reading 1-1. This led the match to a penalty shoot-out to break the tie. After the Brazilian player Neymar scored with the penultimate penalty kick, Brazil were 3-2 ahead in the shoot-out. Chile still had a penalty kick left, from Gonzalo Jara, and the opportunity to extend the tie further, but if he missed, Chile’s campaign in the competition would be over. What should Gonzalo Jara do to extend the tie?

Gonzalo Jara’s Kick – Source: irishtimes.com

On average, around 75% of penalty kicks at this level convert to goals. The odds, by this measure, are highly in favor of Gonzalo Jara. Where should he kick the ball to improve his odds further? All the fans, coaches, and players will say: kick the ball into either corner, away from the goalkeeper who is standing in the center of the goal. They will also advise never to shoot the ball dead center, towards the goalkeeper. A group of researchers asked the same question and did an exploratory data analysis of penalty kicks at the elite level of soccer. Goalkeepers usually go by their instincts when the ball is kicked at them at an unreadable pace. They either jump towards their left (57% of the time) or their right (41% of the time). This leaves them at the center just 2% of the time to stop a ball hit straight towards them. Hence, a kick hit dead towards the center of the goal has a significantly higher chance of converting to a goal than a kick to either corner at the same height.
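
As a rough illustration of this reasoning, the dive frequencies quoted above can be turned into the chance that the goalkeeper is out of position for each aim point. This is a minimal Python sketch under a simplifying assumption that does not come from the study: only the keeper's movement matters, and shot accuracy, height and pace are ignored. The dictionary keys and the function name are illustrative.

```python
# Goalkeeper movement frequencies quoted in the text (fractions of penalty kicks).
keeper_moves = {"keeper_left": 0.57, "keeper_right": 0.41, "center": 0.02}

def keeper_out_of_position(aim: str) -> float:
    """Chance the keeper does NOT move to where the ball is aimed.

    Simplifying assumption (not from the study): only the keeper's
    direction matters; shot accuracy, height and pace are ignored.
    """
    return 1.0 - keeper_moves[aim]

for aim in keeper_moves:
    share = keeper_out_of_position(aim)
    print(f"aim {aim:>12}: keeper elsewhere {share:.0%} of the time")
```

Under this simplification the center is the aim point the keeper vacates most often, which is the counter-intuitive finding the paragraph describes.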

Back to Gonzalo Jara: he hits the ball towards his right, in the direction of the diving goalkeeper, as shown in the picture above. He misses the shot; the ball hits the goal post and ricochets away from the goal. As a result, Chile got knocked out of the World Cup and Brazil advanced to the next stage. In Gonzalo Jara’s defense, the conversion rate for crucial penalty kicks like this one (to avoid elimination) drops to 44%. Yes, pressure is another beast to which even the best succumb.

Corner Kicks

In another case, a few years ago Manchester City’s soccer team was struggling with corner kicks and decided to do some exploratory data analysis to differentiate effective corner kicks from ineffective ones. The team of analysts analyzed hundreds of videos of corner kicks from the Premier League. They found that in-swinging kicks towards the goal were far more effective and dangerous than out-swinging kicks. They took their findings to Roberto Mancini, the coach of Manchester City at that time. Mancini, who has played and followed the game since his childhood, rejected the findings outright. He recalled all those memorable, picture-perfect goals scored by great headers from out-swingers. Clumsy goals from in-swingers, on the other hand, hardly left a lasting impression on spectators’ minds. Mancini, it turned out, was wrong. All that looks great and memorable is not always optimal. This is a great case for how simple but sincere exploratory data analysis can challenge deeply ingrained beliefs developed over centuries (yes, soccer is a really old game).

Exploratory Data Analysis – Retail Case Study Example

Back to our case study example (read Part 1 and Part 2), in which you are the chief analytics officer & business strategy head at an online shopping store called DresSMart Inc. You are helping the CMO of the company to improve the results of the company’s campaigns. For the last few days, you have been playing around with data as a part of exploratory data analysis. The following is one of several interesting results and patterns you have noticed in the data. When you analyzed the distribution of customers across the number of product categories (men’s shirt, casual trousers, formal skirts etc.) purchased by each customer, you found the following pattern.

Exploratory data analysis – marketing analytics case study (retail)

The above distribution looks more or less as expected. However, there is an interesting peak for customers purchasing more than 50 product-categories. Who are these customers? Why are they buying so many product categories for their usage? You further analyzed this small set of customers and found that they are growing at a faster rate than the other set of customers. Since the inception of the company 7 years ago, the percentage of customers purchasing 50+ product categories in a year has exponentially gone up (currently at 2.1%). This set of customers also contributes to about 23% of all the sales for DresSMart Inc. The following graphs are part of your above analysis.
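
The distribution described here comes from counting the distinct product categories each customer has bought and then tallying customers by that count. The following is a minimal Python sketch of that step; the order lines, customer IDs and the threshold passed to `share_at_or_above` are hypothetical and are not DresSMart data.

```python
from collections import Counter, defaultdict

# Hypothetical order lines: (customer_id, product_category). Not real data.
orders = [
    ("c1", "men's shirt"), ("c1", "casual trousers"), ("c1", "men's shirt"),
    ("c2", "formal skirts"),
    ("c3", "men's shirt"), ("c3", "formal skirts"), ("c3", "casual trousers"),
]

# Collect the set of distinct categories each customer has bought.
categories_by_customer = defaultdict(set)
for customer_id, category in orders:
    categories_by_customer[customer_id].add(category)

# Distribution: how many customers bought exactly k distinct categories.
distribution = Counter(len(cats) for cats in categories_by_customer.values())

def share_at_or_above(threshold: int) -> float:
    """Fraction of customers with at least `threshold` distinct categories."""
    heavy = sum(n for k, n in distribution.items() if k >= threshold)
    return heavy / len(categories_by_customer)

print(sorted(distribution.items()))
print(share_at_or_above(3))
```

With real data the same tally, plotted as a histogram, would surface an unexpected peak at the high end like the one discussed above.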

Exploratory data analysis

So, what is going on here? You further analyzed the patterns and sizes of clothes these customers are buying and noticed they are buying the same style in different sizes. Aha! Now you know them: these are small neighborhood retailers using DresSMart Inc as a wholesaler. The following is what you concluded from the above analysis:

  • There is no point sending these retailers the same retail product catalog and campaign as to retail customers
  • There is an opportunity to strengthen business ties with these mom-&-pop retailers and in turn, improve profitability of your company through a separate business program

Additionally, your further analysis revealed that order fulfillment or delivery patterns (delivery quantity, charges etc.) for these retailers are similar to those of other customers. Your company is incurring additional delivery cost for these customers. You could plan the overall supply chain much better by keeping these small retailers in the equation. This exploratory data analysis has given you ideas for more low-hanging fruit to improve the company’s profitability.

Sign-off Note

Exploratory data analysis is a powerful tool. A diligent EDA is an absolute must to put your advanced business analytics in the right direction. EDA provides a great opportunity to test your simple business hypotheses and hunches before jumping into a rigorous model building. Coming back to soccer, we are approaching the final stages of the World Cup. Enjoy the last few games and may the best team lift the prized trophy.

8 thoughts on “ Exploratory Data Analysis (EDA) – Retail Case Study Example (Part 3) ”

Roopam, Excellent way to kick-start the core of the case study. Having spent a fair bit of time in Marketing Analytics (not core modeling, but a lot of EDA and A/B scenarios), I kind of have a hunch where this is going to go – excellent work. But, just out of curiosity, where do you pick up these case studies from? The reason I ask is because, the data though very interesting, is very case specific and may not apply to situations most of us may encounter in real life. As you said, EDA is the key to analysis before jumping right in, and sometimes it’s very painful and tedious – because there are no obvious trends or insights. I have sometimes spent hours slicing and dicing data before I could really form a hypothesis and test it (which by the way was less painful). Any tips / tricks to your readers like me, who could really save some time on EDA ?

P.S. I have to admit, all of your case studies do seem real and may well be so, but I would be wary to admit if any of us could directly find same or similar trends in our data (that would seem too good to be true ;-))

Thanks Kisalay for your kind words! All these cases in some form or other come from the work I have done at various stages of my career. Of course, I take a lot of creative liberty to completely modify information, trends, storylines, scenarios, and conclusions to protect confidentiality. Additionally, I also try to make the cases easy to understand for the readers. However, for most of these cases the general principles of analysis and the logical flow are preserved to a great extent.

I agree EDA is a tedious exercise but it also makes one feel like a detective 🙂 . Let me share my strategy for EDA. I never touch data before having a plan of action. As in a detective investigation, you might destroy evidence if you go in without a plan. I usually prefer to have a mental map of my analysis and logical flow before I start slicing and dicing data. It makes me feel much more in control. I also prefer to have a reasonably defined hypothesis based on a business hunch before analysis. Also, when you get completely stuck, take a long break away from your computer – fresh air usually helps.

In case you have to mine completely unknown data, use machine learning algorithms like decision trees, apriori etc. to slice and dice your data. At times you may have to create your own modified algorithms specific to your requirements. Machines, I am completely sure, are any day better than us humans at this task.

I have gone through your blogs with a keen interest in developing data analytics skills from scratch.

As you said in one of your articles, “The best way to develop analytics skill is to have a project in your existing job itself”.

I want to have a project in my existing job. I am working for a furniture manufacturing company in the sales department. This company manufactures household and office furniture at massive scale in central India.

I am an Industrial Engineering graduate (class of 2015); my 10th-grade score in maths was 147/150.

I want to have a career in data science. Please guide me on the learning path.

Kapil, to create an analytics opportunity in your company I suggest you answer these questions:

– Is there an analytics team in your company? If yes, what kind of business questions does this team usually work on?

– If the answer to the above question is no, are there IT systems (ERP, MIS etc.) available in your company? What kind of data fields could be retrieved from these systems? Talk to your IT team to learn more about it.

– What are some of the quick business questions you could answer using the above data? Focus on important questions but simple analysis to begin with.

Since you are in sales with experience in industrial engineering, I suggest you build your analytical skills on top of your core skills. Supply chain analytics is a major area of growth with lots of opportunities. If you could deliver a few simple yet successful projects in your company it will make your CV really powerful. All the best.

So far there is no analytics team in the company, but it needs someone who can do analytics, and I want to fill this gap.

Following your guidance, I discussed with my IT team the systems available and the kind of data fields that can be retrieved so that we can answer some basic questions, but we are not trained enough to think strategically. Here I want to do a simple analysis project.

For more clarity, I also prepared three columns in Excel titled IT Systems, Data Fields and Quick Business Questions.

I would be very fortunate if you could guide me on how to improve my analytical skills and carry out a simple data analysis project at the same time.

Hi, can you share the data for this case study, so that I can get practical exposure to this problem? Your blogs are very intuitive and easy to understand.

I got some rough data with 20 columns and 5,000 rows. There are no details about what this data is and no nomenclature for it.

It is an open-ended problem in which no problem has been defined.

So, can we do data analytics on such data?

Nice post regarding Data Analysis.

Johns Hopkins University

  • Exploratory Data Analysis

This course is part of multiple programs. Learn more

Financial aid available

178,942 already enrolled

(6,065 reviews)

What you'll learn

Understand analytic graphics and the base plotting system in R

Use advanced graphing systems such as the Lattice system

Make graphical displays of very high dimensional data

Apply cluster analysis techniques to locate patterns in data

Skills you'll gain

  • Cluster Analysis
  • R Programming

There are 4 modules in this course

This course covers the essential exploratory techniques for summarizing data. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.

This week covers the basics of analytic graphics and the base plotting system in R. We've also included some background material to help you install R if you haven't done so already.

What's included

15 videos 6 readings 1 quiz 5 programming assignments 1 peer review

15 videos • Total 109 minutes

  • Introduction • 1 minute • Preview module
  • Installing R on Windows (3.2.1) • 3 minutes
  • Installing R on a Mac (3.2.1) • 1 minute
  • Installing R Studio (Mac) • 3 minutes
  • Setting Your Working Directory (Windows) • 7 minutes
  • Setting Your Working Directory (Mac) • 7 minutes
  • Principles of Analytic Graphics • 12 minutes
  • Exploratory Graphs (part 1) • 9 minutes
  • Exploratory Graphs (part 2) • 5 minutes
  • Plotting Systems in R • 9 minutes
  • Base Plotting System (part 1) • 11 minutes
  • Base Plotting System (part 2) • 6 minutes
  • Base Plotting Demonstration • 16 minutes
  • Graphics Devices in R (part 1) • 5 minutes
  • Graphics Devices in R (part 2) • 7 minutes

6 readings • Total 60 minutes

  • Welcome to Exploratory Data Analysis • 10 minutes
  • Syllabus • 10 minutes
  • Pre-Course Survey • 10 minutes
  • Exploratory Data Analysis with R Book • 10 minutes
  • The Art of Data Science • 10 minutes
  • Practical R Exercises in swirl Part 1 • 10 minutes

1 quiz • Total 30 minutes

  • Week 1 Quiz • 30 minutes

5 programming assignments • Total 900 minutes

  • swirl Lesson 1: Principles of Analytic Graphs • 180 minutes
  • swirl Lesson 2: Exploratory Graphs • 180 minutes
  • swirl Lesson 3: Graphics Devices in R • 180 minutes
  • swirl Lesson 4: Plotting Systems • 180 minutes
  • swirl Lesson 5: Base Plotting System • 180 minutes

1 peer review • Total 60 minutes

  • Course Project 1 • 60 minutes

Welcome to Week 2 of Exploratory Data Analysis. This week covers some of the more advanced graphing systems available in R: the Lattice system and the ggplot2 system. While the base graphics system provides many important tools for visualizing data, it was part of the original R system and lacks many features that may be desirable in a plotting system, particularly when visualizing high dimensional data. The Lattice and ggplot2 systems also simplify the laying out of plots making it a much less tedious process.

7 videos 1 reading 1 quiz 5 programming assignments

7 videos • Total 61 minutes

  • Lattice Plotting System (part 1) • 6 minutes • Preview module
  • Lattice Plotting System (part 2) • 6 minutes
  • ggplot2 (part 1) • 6 minutes
  • ggplot2 (part 2) • 13 minutes
  • ggplot2 (part 3) • 9 minutes
  • ggplot2 (part 4) • 10 minutes
  • ggplot2 (part 5) • 8 minutes

1 reading • Total 10 minutes

  • Practical R Exercises in swirl Part 2 • 10 minutes

1 quiz • Total 30 minutes

  • Week 2 Quiz • 30 minutes

5 programming assignments • Total 900 minutes

  • swirl Lesson 1: Lattice Plotting System • 180 minutes
  • swirl Lesson 2: Working with Colors • 180 minutes
  • swirl Lesson 3: GGPlot2 Part1 • 180 minutes
  • swirl Lesson 4: GGPlot2 Part2 • 180 minutes
  • swirl Lesson 5: GGPlot2 Extras • 180 minutes

Welcome to Week 3 of Exploratory Data Analysis. This week covers some of the workhorse statistical methods for exploratory analysis. These methods include clustering and dimension reduction techniques that allow you to make graphical displays of very high dimensional data (many many variables). We also cover novel ways to specify colors in R so that you can use color as an important and useful dimension when making data graphics. All of this material is covered in chapters 9-12 of my book Exploratory Data Analysis with R.

12 videos 1 reading 4 programming assignments

12 videos • Total 76 minutes

  • Hierarchical Clustering (part 1) • 7 minutes • Preview module
  • Hierarchical Clustering (part 2) • 5 minutes
  • Hierarchical Clustering (part 3) • 7 minutes
  • K-Means Clustering (part 1) • 5 minutes
  • K-Means Clustering (part 2) • 4 minutes
  • Dimension Reduction (part 1) • 7 minutes
  • Dimension Reduction (part 2) • 9 minutes
  • Dimension Reduction (part 3) • 6 minutes
  • Working with Color in R Plots (part 1) • 4 minutes
  • Working with Color in R Plots (part 2) • 7 minutes
  • Working with Color in R Plots (part 3) • 6 minutes
  • Working with Color in R Plots (part 4) • 3 minutes

1 reading • Total 10 minutes

  • Practical R Exercises in swirl Part 3 • 10 minutes

4 programming assignments • Total 720 minutes

  • swirl Lesson 1: Hierarchical Clustering • 180 minutes
  • swirl Lesson 2: K Means Clustering • 180 minutes
  • swirl Lesson 3: Dimension Reduction • 180 minutes
  • swirl Lesson 4: Clustering Example • 180 minutes

This week, we'll look at two case studies in exploratory data analysis. The first involves the use of cluster analysis techniques, and the second is a more involved analysis of some air pollution data. How one goes about doing EDA is often personal, but I'm providing these videos to give you a sense of how you might proceed with a specific type of dataset.

2 videos 2 readings 1 programming assignment 1 peer review

2 videos • Total 55 minutes

  • Clustering Case Study • 14 minutes • Preview module
  • Air Pollution Case Study • 40 minutes

2 readings • Total 20 minutes

  • Practical R Exercises in swirl Part 4 • 10 minutes
  • Post-Course Survey • 10 minutes

1 programming assignment • Total 180 minutes

  • swirl Lesson 1: CaseStudy • 180 minutes

1 peer review • Total 60 minutes

  • Course Project 2 • 60 minutes

Instructors

Roger D. Peng, PhD

The mission of The Johns Hopkins University is to educate its students and cultivate their capacity for life-long learning, to foster independent and original research, and to bring the benefits of discovery to the world.

Learner reviews

Reviewed on Jun 5, 2020

Awesome course that expands on your R knowledge. Only nitpick is that some of the links don't work and the videos need an overhaul as there seem to be little to no updates since 2015/2016.

Reviewed on Mar 8, 2017

When it comes to hierarchical and K-means clustering, the theory wasn't explained clearly. When do we use U and V for what purpose? How does D come in? I'm left confused after this.

Reviewed on Jan 17, 2016

Very nice course, plotting data to explore and understand various features and their relationship is the key in any research domain, and this course teaches the skill required to achieve this.

  • Case Study: Exploratory Data Analysis in R
  • by Daniel Pinedo
  • Last updated over 3 years ago
Exploratory Data Analysis with R

16 Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.

This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agency’s freely available national monitoring data. The purpose of the chapter is simply to show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.

Watch a video of this chapter. Note that the video differs slightly from this chapter in the code that is implemented. In particular, the video version focuses on using base graphics plots. However, the general analysis is the same.

16.1 Synopsis

In this chapter we aim to describe the changes in fine particle (PM2.5) outdoor air pollution in the United States between the years 1999 and 2012. Our overall hypothesis is that outdoor PM2.5 has decreased on average across the U.S. due to nationwide regulatory requirements arising from the Clean Air Act. To investigate this hypothesis, we obtained PM2.5 data from the U.S. Environmental Protection Agency, which is collected from monitors sited across the U.S. We specifically obtained data for the years 1999 and 2012 (the most recent complete year available). From these data, we found that, on average across the U.S., levels of PM2.5 have decreased between 1999 and 2012. At one individual monitor, we found that levels have decreased and that the variability of PM2.5 has decreased. Most individual states also experienced decreases in PM2.5, although some states saw increases.

16.2 Loading and Processing the Raw Data

From the EPA Air Quality System we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.

16.2.1 Reading in the 1999 data

We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file where fields are delimited with the | character and missing values are coded as blank fields. We skip some commented lines at the beginning of the file and initially we do not read the header data.
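
The chapter's own code is written in R. As a language-neutral illustration of the reading step, here is a minimal Python sketch using only the standard library. The inline sample text, its column names and the `read_pipe_delimited` helper are hypothetical stand-ins; the real EPA file has many more columns.

```python
import csv
import io

# Hypothetical stand-in for the raw file: '#' comment lines, '|' delimiters,
# and blank fields for missing values. The real EPA file has many more columns.
raw_text = """# RD|Action Code|State Code|Sample Value
# comment line
RD|I|01|8.841
RD|I|01|
RD|I|36|14.92
"""

def read_pipe_delimited(handle):
    """Return rows as lists, skipping '#' comment lines; blank fields become None."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="|")
    return [[field if field != "" else None for field in row] for row in reader]

rows = read_pipe_delimited(io.StringIO(raw_text))
print(len(rows), rows[1])
```

Values are left as strings here; converting the measurement column to numbers would be a separate step.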

After reading in the 1999 data we check the first few rows (there are 117,421 rows in this dataset).

We then attach the column headers to the dataset and make sure that they are properly formatted for R data frames.

The column we are interested in is the Sample.Value column which contains the PM2.5 measurements. Here we extract that column and print a brief summary.

Missing values are a common problem with environmental data and so we check to see what proportion of the observations are missing (i.e. coded as NA).

Because the proportion of missing values is relatively low (0.1125608), we choose to ignore missing values for now.
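
The missing-value check is just the share of entries that are missing. A minimal Python sketch of that calculation follows; the readings are hypothetical and `None` stands in for R's `NA`.

```python
import math

# Hypothetical PM2.5 readings; None marks a missing measurement.
sample_values = [8.8, None, 14.9, 3.2, None, 11.0, 7.5, 9.1]

def fraction_missing(values):
    """Share of entries that are missing (None or NaN)."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)

print(fraction_missing(sample_values))
```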

16.2.2 Reading in the 2012 data

We then read in the 2012 data in the same manner in which we read the 1999 data (the data files are in the same format).

We also set the column names (they are the same as the 1999 dataset) and extract the Sample.Value column from this dataset.

Since we will be comparing the two years of data, it makes sense to combine them into a single data frame and create a factor variable indicating which year the data comes from. We also rename the Sample.Value variable to a more sensible PM.

16.3 Results

16.3.1 Entire U.S. Analysis

In order to show aggregate changes in PM across the entire monitoring network, we can make boxplots of all monitor values in 1999 and 2012. Here, we take the log of the PM values to adjust for the skew in the data.

Figure 16.1: Boxplot of PM values in 1999 and 2012

From the raw boxplot, it seems that on average, the levels of PM in 2012 are lower than they were in 1999. Interestingly, there also appears to be much greater variation in PM in 2012 than there was in 1999.
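
A boxplot of logged values is essentially a picture of the quartiles of the log-transformed data. The Python sketch below computes those quartiles for two small hypothetical samples. The base-2 logarithm and the decision to drop non-positive readings (which have no real logarithm) are assumptions of this sketch, not something the text specifies.

```python
import math
import statistics

# Hypothetical PM values for the two years (not the EPA data).
pm_by_year = {
    "1999": [4.0, 8.0, 12.0, 16.0, 25.0, 40.0],
    "2012": [-1.0, 2.0, 5.0, 8.0, 11.0, 30.0],
}

def log_quartiles(values):
    """Quartiles of log2(value), using positive values only.

    Non-positive readings have no real logarithm, so they are dropped
    here; that filtering is a choice of this sketch.
    """
    logs = [math.log2(v) for v in values if v is not None and v > 0]
    return statistics.quantiles(logs, n=4)  # [Q1, median, Q3]

for year, values in pm_by_year.items():
    q1, median, q3 = log_quartiles(values)
    print(year, round(q1, 2), round(median, 2), round(q3, 2))
```

Comparing the medians gives the "on average" reading of the plot, and the distance between Q1 and Q3 gives a sense of the spread.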

We can make some summaries of the two years’ worth of data to get at actual numbers.

Interestingly, from the summary of 2012 it appears there are some negative values of PM, which in general should not occur. We can investigate that somewhat to see if there is anything we should worry about.

There is a relatively small proportion of values that are negative, which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R’s Date format for easier manipulation.

We can then extract the month from each of the dates with negative values and attempt to identify when negative values occur most often.

From the table above it appears that the bulk of the negative values occur in the first four months of the year (January–April). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now.
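
The month tabulation described above can be sketched in a few lines of Python: parse each date string, keep the rows with negative values, and count them by month. The records and the YYYYMMDD date layout are assumptions of this sketch; the text only says the dates are stored as character strings.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (date string, PM value) pairs. The YYYYMMDD layout is an
# assumption of this sketch; adjust the format string to the real file.
records = [
    ("20120115", -0.4), ("20120203", 6.1), ("20120210", -1.2),
    ("20120322", -0.3), ("20120704", 9.8), ("20121130", -0.1),
]

def negative_months(rows, date_format="%Y%m%d"):
    """Count negative readings per calendar month (1 = January)."""
    counts = Counter()
    for date_text, value in rows:
        if value is not None and value < 0:
            month = datetime.strptime(date_text, date_format).month
            counts[month] += 1
    return counts

print(sorted(negative_months(records).items()))
```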

16.3.2 Changes in PM levels at an individual monitor

So far we have examined the change in PM levels on average across the country. One issue with the previous analysis is that the monitoring network could have changed in the time period between 1999 and 2012. So if for some reason in 2012 there are more monitors concentrated in cleaner parts of the country than there were in 1999, it might appear the PM levels decreased when in fact they didn’t. In this section we will focus on a single monitor in New York State to see if PM levels at that monitor decreased from 1999 to 2012.

Our first task is to identify a monitor in New York State that has data in 1999 and 2012 (not all monitors operated during both time periods). First we subset the data frames to only include data from New York ( State.Code == 36 ) and only include the County.Code and the Site.ID (i.e. monitor number) variables.

Then we create a new variable that combines the county code and the site ID into a single string.

Finally, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods.
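
These three steps (subset to one state, build a combined county and site key, intersect the two years) map naturally onto set operations. Here is a minimal Python sketch with hypothetical monitor triples; the "." separator in the key is an illustrative choice.

```python
# Hypothetical (state_code, county_code, site_id) triples for each year.
sites_1999 = [(36, 1, 5), (36, 1, 12), (36, 63, 2008), (6, 37, 2)]
sites_2012 = [(36, 1, 5), (36, 63, 2008), (36, 101, 3), (6, 37, 2)]

NEW_YORK = 36  # State.Code for New York, as given in the text

def monitor_keys(sites, state_code):
    """Set of 'county.site' strings for monitors in one state."""
    return {
        f"{county}.{site}"
        for state, county, site in sites
        if state == state_code
    }

both_years = monitor_keys(sites_1999, NEW_YORK) & monitor_keys(sites_2012, NEW_YORK)
print(sorted(both_years))
```

Using a set removes duplicate keys automatically, which matters because each monitor appears once per reading in the raw data.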

Here (above) we can see that there are 10 monitors that were operating in both time periods. However, rather than choose one at random, it might be best to choose one that had a reasonable amount of data in each year.

Now that we have subsetted the original data frames to only include the data from the monitors that overlap between 1999 and 2012, we can count the number of observations at each monitor to see which ones have the most observations.
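
Counting observations per monitor in each year is a simple tally. The Python sketch below uses hypothetical monitor keys. Ranking by the smaller of the two yearly counts is one possible way to favor monitors with a reasonable amount of data in both years; it is a choice of this sketch, not the chapter's method.

```python
from collections import Counter

# Hypothetical observations: one 'county.site' key per daily reading.
obs_1999 = ["1.5", "1.5", "63.2008", "63.2008", "63.2008", "1.12"]
obs_2012 = ["1.5", "63.2008", "63.2008", "101.3"]

def counts_in_both(first, second):
    """Map each monitor present in both years to (count_first, count_second)."""
    c1, c2 = Counter(first), Counter(second)
    return {key: (c1[key], c2[key]) for key in c1.keys() & c2.keys()}

counts = counts_in_both(obs_1999, obs_2012)
# Rank by the smaller of the two yearly counts, so a monitor with lots of
# data in only one year does not come out on top.
ranked = sorted(counts, key=lambda k: min(counts[k]), reverse=True)
print(ranked, counts)
```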

A number of monitors seem suitable from the output, but we will focus here on County 63 and site ID 2008.

Now we plot the time series data of PM for the monitor in both years.

Figure 6.3: Daily PM for 1999 and 2012

From the plot above, we can see that median levels of PM (horizontal solid line) have decreased a little, from 10.45 in 1999 to 8.29 in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggests that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we’d had full-year data for both years, as there could be some seasonal confounding going on.

16.3.3 Changes in state-wide PM levels

Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not “in attainment” have to develop a plan to reduce PM so that they are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.

What we do here is calculate the mean of PM for each state in 1999 and 2012.
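A minimal sketch of the state-level summary, assuming the data for each year are available as (state, value) pairs. The state codes and PM values below are invented, and the helper name `state_means` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def state_means(rows):
    """Average the PM values for each state in a list of (state, pm) pairs."""
    by_state = defaultdict(list)
    for state, pm in rows:
        by_state[state].append(pm)
    return {state: mean(values) for state, values in by_state.items()}

# Hypothetical observations for two states in each year.
rows_1999 = [("01", 20.0), ("01", 18.0), ("02", 7.0)]
rows_2012 = [("01", 10.0), ("01", 12.0), ("02", 8.0)]

means_1999 = state_means(rows_1999)
means_2012 = state_means(rows_2012)

# Change from 1999 to 2012 for states present in both years.
change = {s: means_2012[s] - means_1999[s] for s in means_1999 if s in means_2012}
print(change)
```

A negative change means the state-wide mean fell between the two years, which is the downward slope one would look for when connecting the two means with a line.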

Now we make a plot that shows the 1999 state-wide means in one “column” and the 2012 state-wide means in another column. We then draw a line connecting the means for each year in the same state to highlight the trend.

Figure 6.5: Change in mean PM levels from 1999 to 2012 by state

This plot needs a bit of work still. But we can see that many states have decreased the average PM levels from 1999 to 2012 (although a few states actually increased their levels).

  • Open access
  • Published: 29 August 2024

Dispensed prescription medications and short-term risk of pulmonary embolism in Norway and Sweden

  • Dagfinn Aune 1,2,3,
  • Ioannis Vardaxis 4,
  • Bo Henry Lindqvist 4,
  • Ben Michael Brumpton 5,6,7,
  • Linn Beate Strand 8,
  • Jens Wilhelm Horn 8,9,
  • Inger Johanne Bakken 10,
  • Pål Richard Romundstad 8,
  • Kenneth J. Mukamal 11,
  • Rickard Ljung 12,
  • Imre Janszky 8,13 &
  • Abhijit Sen 8,14

Scientific Reports volume 14, Article number: 20054 (2024)

  • Drug discovery

Scandinavian electronic health-care registers provide a unique setting to investigate potential unidentified side effects of drugs. We analysed the association between prescription drugs dispensed in Norway and Sweden and the short-term risk of developing pulmonary embolism. A total of 12,104 pulmonary embolism cases were identified from patient- and cause-of-death registries in Norway (2004–2014) and 36,088 in Sweden (2005–2014). A case-crossover design was used to compare individual drugs dispensed 1–30 days before the date of pulmonary embolism diagnosis with dispensation in a 61–90 day time-window, while controlling for the receipt of other drugs. A BOLASSO approach was used to select drugs that were associated with short-term risk of pulmonary embolism. Thirty-eight drugs were associated with pulmonary embolism in the combined analysis of the Norwegian and Swedish data. Drugs associated with increased risk of pulmonary embolism included certain proton-pump inhibitors, antibiotics, antithrombotics, vasodilators, furosemide, anti-varicose medications, corticosteroids, immunostimulants (pegfilgrastim), opioids, analgesics, anxiolytics, antidepressants, antiprotozoals, and drugs for cough and colds. Mineral supplements, hydrochlorothiazide and potassium-sparing agents, beta-blockers, angiotensin 2 receptor blockers, statins, and methotrexate were associated with lower risk. Most associations persisted, and several additional drugs were associated, with pulmonary embolism when using a longer time window of 90 days instead of 30 days. These results provide exploratory, pharmacopeia-wide evidence of medications that may increase or decrease the risk of pulmonary embolism. Some of these findings were expected based on the drugs' indications, while others are novel and require further study as potentially modifiable precipitants of pulmonary embolism.

Introduction

Pulmonary embolism (PE) arises when blood clots, often originating in deep veins, obstruct pulmonary arteries 1 . The development of PE encompasses an intricate interplay of pathophysiological, inflammatory, and chemical processes. These processes involve factors such as stasis, endothelial injury, hypercoagulability, inflammatory responses, damage to the endothelium, insults to the coagulation cascade (involving fibrin, thrombin, and tissue factor) as well as additional factors such as immobility and genetic predispositions 1 . The incidence of pulmonary embolism has increased globally over the last decades, particularly in high-income countries, while pulmonary embolism-related in-hospital death rate and age-standardized mortality from pulmonary embolism have decreased or plateaued 2 , 3 , 4 , 5 . Improved survival in patients with conditions predisposing to pulmonary embolism, such as cancer, chronic obstructive pulmonary disease, and autoimmune diseases, may contribute to the increased incidence rates, while a greater proportion of low-risk cases being diagnosed and improvements in the management of pulmonary embolism may have contributed to the reduced mortality rates 2 . Several risk factors have been identified for pulmonary embolism, including age 6 , sex 6 , smoking 7 , obesity 8 , 9 , hypertension 7 , and physical inactivity 10 . In addition, surgery 11 , hospitalization, or trauma that leads to extended immobilization 12 , cancer 12 , 13 , pregnancy 14 , kidney disease 15 , heart failure 12 , or previous history of deep venous thromboembolism or pulmonary embolism are established risk factors. Some medications including oral contraceptives 16 , hormone replacement therapy 17 , tamoxifen 18 , 19 , 20 , bisphosphonates 21 , coagulants 22 , antidepressants 23 , and antipsychotics 24 have been associated with increased risk, while statins have been associated with reduced risk 25 .

In the US, approximately 90% of adults aged 65 years and older use at least one prescription medication 26 . Side effects from use of drugs are a major public health concern and have been estimated to cost around $30 billion in the US 27 . Randomized trials used to test drugs are costly and therefore generally just large enough to detect an effect on the primary outcome of interest, which may be a physiological effect (e.g., blood pressure) rather than a hard clinical outcome. Less common, but important side effects may therefore be missed in many trials. In addition, pre-approval trials have tended to include fewer women, especially in their reproductive age (although some improvement has been observed in later years 28 ), patients with comorbidities, children, and older adults, which may limit the generalizability of the findings from such studies to the general population 29 . Thus, systematic monitoring of the pharmaceutical effects of all approved drugs in clinical practice could advance clinical care and public health.

Our research group has previously conducted systematic examinations of all potential associations between prescribed drugs and short-term risk of acute myocardial infarction 30 and ischemic stroke 31 , and others have used this approach to identify drugs that may improve or impair COVID-19 prognosis 32 , 33 . In the current study, we extended this approach to examine the association between prescribed medications and short-term risk of pulmonary embolism using comprehensive nationwide data in two countries.

Study design

A case-crossover design was used and methods have been described in detail previously 30 , 31 . This design included all cases of pulmonary embolism and applied self-matching by comparing exposure before disease onset with disease-free time in the past as a comparison control. A major advantage of the case-crossover design is that person-specific characteristics that are stable over the typically short-time periods such studies are conducted over (e.g. age, sex, lifestyle, chronic conditions) do not confound the observed associations.

Dispensed medications

We assessed the risk of pulmonary embolism associated with every drug dispensed to patients that had a first-time pulmonary embolism within the study period. Data on dispensed medications prior to the diagnosis or death were extracted from the nation-wide registries of dispensed drugs in both Norway and Sweden. The Norwegian Prescription Database was established on January 1st 2004, and all Norwegian pharmacies are required to supply information on drug prescriptions including type, dose and date of dispensation 34 . Sweden established a similar registry on July 1st 2005, the Swedish National Prescribed Drug Register 35 . National personal identifiers attached to these data were used to link the information on drug prescriptions to other health-related registers existing in these countries. The databases do not include information on drugs purchased over-the-counter or drugs administered to institutionalized patients in nursing homes or hospitals. In Norway, it was possible to exclude participants who, at the time of their pulmonary embolism, were institutionalized and for whom registration of dispensed medications was not available. In Sweden, this information was not available, and therefore we included only those patients to whom at least one drug was dispensed during the year preceding the occurrence of pulmonary embolism. In Sweden, prescriptions are in general valid for one year after date of being issued 36 , and for chronic medication are refilled every third month. A similar system is in place in Norway.

Outcome assessment

The Norwegian Patient Register (established in 2008) 37 , the Swedish National Patient Register (established in 1964) 38 , and the cause of death registries in Norway (established in 1951) 39 and Sweden (established in 1952) 40 were used to identify cases of pulmonary embolism. A total of 12,104 pulmonary embolism cases were identified from patient- and cause-of-death registries in Norway (2004–2014) and 36,088 in Sweden (2005–2014). The quality of the information in the Norwegian and Swedish patient registers has been shown to be very high for other cardiovascular outcomes 38 , 41 , 42 , and acceptable for acute pulmonary embolism in Sweden 43 . In Norway, all patients registered with either a primary hospital discharge diagnosis or underlying cause of death of ICD-10 I26 from 1 January 2008 to 31 December 2014 were included, and in Sweden, the corresponding dates for hospital diagnosis and cause of death were between 1 November 2005 and 31 December 2014. Only the first registered episode of pulmonary embolism was included in the analysis for each participant.

Statistical analysis

For each patient, the occurrence of drug dispensing within 1–30 days before the date of pulmonary embolism occurrence (case period) was compared to a time window of 61–90 days before pulmonary embolism diagnosis (control period) for each drug individually. A 30 day wash-out period between the case- and control-periods was used to minimize the carryover effects of drugs. We calculated odds ratios together with 95% confidence intervals comparing the odds of drug dispensed in the case period to that in the control period using conditional logistic regression. The statistical models were adjusted for every other drug. Adjustments for other drugs were made at the fifth ATC level—so each drug investigated was adjusted for every other drug investigated. Additional analyses were conducted to examine the robustness of the findings where we extended the case-, control- and wash-out periods from 30 to 90 days (case period = 1–90 days before event occurrence, control period 181–270 days before event occurrence) and repeated all analyses. The 30 days 44 and 90 days 44 , 45 , 46 exposure periods were based on prior publications 44 , 45 , 46 .
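To make the window definitions concrete, the sketch below classifies one drug's dispensing dates for each patient into the 1–30 day case period and the 61–90 day control period. It then computes the unadjusted matched-pair odds ratio, which for a single binary exposure with 1:1 self-matching equals the ratio of discordant pairs. This is a simplification: the study adjusted each drug for every other drug, so this ratio is not the estimate the authors report. The function names and patient data are hypothetical.

```python
def exposure_windows(days_before_event):
    """Return (case_exposed, control_exposed) for one patient and one drug.

    days_before_event: days between each dispensing and the event date.
    """
    case = any(1 <= d <= 30 for d in days_before_event)
    control = any(61 <= d <= 90 for d in days_before_event)
    return case, control

def discordant_pair_or(patients):
    """Unadjusted matched-pair odds ratio: case-only / control-only counts."""
    case_only = control_only = 0
    for days in patients:
        case, control = exposure_windows(days)
        if case and not control:
            case_only += 1
        elif control and not case:
            control_only += 1
    if control_only == 0:
        raise ValueError("odds ratio undefined without control-only patients")
    return case_only / control_only

# Hypothetical dispensing histories (days before the event) for six patients.
patients = [[5], [12, 75], [70], [45], [20], [3, 100]]
odds_ratio = discordant_pair_or(patients)
print(odds_ratio)
```

Patients exposed in both windows or in neither, including those whose only dispensing falls in the 31–60 day wash-out, do not contribute to the ratio.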

We assessed the association between all dispensed medications and pulmonary embolism risk. Because we aimed to estimate the most likely effect size for drugs with true associations while accounting for simultaneous prescriptions, we did not use Bonferroni correction or similar conventional methods to address the issue of multiple comparisons, as they may fail to estimate the size of these associations correctly 47 . Instead, we applied a version of the least absolute shrinkage and selection operator (LASSO) regression analysis 47 , 48 , 49 , 50 , 51 called bootstrap-enhanced least absolute shrinkage operator (BOLASSO) 52 . Several bootstrap samples are drawn from the dataset, where each bootstrap sample is generated by sampling N pairs (N is the total number of drugs in the dataset) with replacement. We drew 1000 bootstrap samples in this analysis. Of note, confidence intervals generated via the BOLASSO approach are not optimal, because each bootstrap sample is estimated on different penalty parameters. We include confidence intervals for ease of interpretation. As a result, drugs selected by this approach may include one (i.e., the null) within the confidence intervals. In BOLASSO, we obtain multivariable-adjusted estimates as the effect of each selected drug is controlled for the effects of all other selected drugs. In Supplementary Material, Online Appendix A, we present in detail the background of the method and how we implemented BOLASSO in conditional logistic regression models for case-crossover data.
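The general idea of bootstrap-enhanced LASSO can be sketched as below: fit a LASSO on each bootstrap resample and keep the variables that are selected often enough. This toy version makes several assumptions that differ from the study. It uses an ordinary linear LASSO fitted by coordinate descent instead of conditional logistic regression, a single fixed penalty `lam` instead of a penalty tuned per bootstrap sample, noise-free synthetic data, and a hypothetical `threshold` parameter for the required selection frequency.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(X, y, lam, n_iter=50):
    """Linear LASSO by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta

def bolasso_support(X, y, lam, n_boot=20, threshold=1.0, seed=0):
    """Indices of features selected in at least threshold * n_boot resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    hits = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        beta = lasso([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                hits[j] += 1
    return [j for j in range(p) if hits[j] >= threshold * n_boot]

# Toy data: the outcome depends only on the first feature.
x0 = [1, -1, 2, -2, 1, -2, 2, -1]
x1 = [0.5, 0.3, -0.4, 0.2, -0.5, 0.1, 0.3, -0.2]
x2 = [-0.1, 0.4, 0.5, -0.3, 0.2, -0.5, -0.2, 0.1]
X = [list(row) for row in zip(x0, x1, x2)]
y = [3.0 * a for a in x0]

selected = bolasso_support(X, y, lam=0.1)
print(selected)
```

With `threshold=1.0` a variable must be selected in every resample, which corresponds to intersecting the supports. Lower values give a softer rule.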

Separate analyses were conducted for the Norwegian and Swedish data and we present both country-specific and combined estimates, using meta-analysis to combine the country-specific estimates. Drugs selected by BOLASSO with risk estimates in the same direction for both countries were considered a common hit and were included in the final analysis. A fixed-effects model was used to calculate the combined estimate 53 . All statistical analyses were conducted using R (version 3.2.3; R Foundation for Statistical Computing, Vienna, Austria) and Stata/IC 16 (StataCorp, College Station, Texas, USA).
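A fixed-effects combination of two country-specific odds ratios can be sketched with standard inverse-variance weighting on the log scale. The sketch assumes each estimate comes with a 95% confidence interval from which the standard error is recovered. The input numbers are invented, and the function names are hypothetical rather than taken from the study's code.

```python
import math

def same_direction(or_a, or_b):
    """True when both odds ratios lie on the same side of the null (1.0)."""
    return (or_a > 1.0 and or_b > 1.0) or (or_a < 1.0 and or_b < 1.0)

def fixed_effect_pool(estimates, z=1.96):
    """Inverse-variance fixed-effect pooling of (odds_ratio, ci_lower, ci_upper)."""
    weighted_sum = total_weight = 0.0
    for odds_ratio, lower, upper in estimates:
        # Recover the standard error of log(OR) from the interval width.
        se = (math.log(upper) - math.log(lower)) / (2 * z)
        weight = 1.0 / se ** 2
        weighted_sum += weight * math.log(odds_ratio)
        total_weight += weight
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
    )

# Hypothetical country-specific estimates for one drug.
norway = (1.40, 1.10, 1.78)
sweden = (1.25, 1.08, 1.45)

if same_direction(norway[0], sweden[0]):
    print(fixed_effect_pool([norway, sweden]))
```

The more precise estimate receives the larger weight, so the pooled value lies closer to it.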

Ethical approval

The studies were approved by the Regional Committee for Medical and Health Research Ethics (REC) in Central Norway and Regional Ethical Review Board in Sweden. In addition, Norwegian data was also approved by the Norwegian Data Protection Authority. All methods were performed in accordance with the relevant guidelines and regulations by the respective ethical committees from both Norway and Sweden. Exemption from the requirement of obtaining informed consent from the registered individuals was given by REC. In all data files the personal identifier was replaced with a study allocation number as part of the data preparation by the data providers.

A total of 48,192 pulmonary embolism cases were included across the two countries, 36,088 from Sweden (33,678 identified via the patient register and 2410 via the cause of death register) and 12,104 from Norway (11,947 and 157 identified via the corresponding Norwegian registers). Characteristics of these individuals are presented in Table 1 .

A total of 1100 and 1260 distinct pharmaceutical drugs were dispensed among patients who experienced a subsequent pulmonary embolism in Norway and Sweden, respectively. Out of these, 773 unique drugs were dispensed in either the case or control period in Norway and 1091 in Sweden. Of these, BOLASSO selected 117 unique drugs in Norway (Supplementary Table 1 ) and 191 in Sweden (Supplementary Table 2 ). Finally, a total of 59 drugs were selected in common from both countries (Fig.  1 ). Table 2 presents the country-specific and combined estimates of these dually-selected drugs.

Figure 1: Case-crossover analysis of dispensed prescription medication use and risk of pulmonary embolism. The plot illustrates (A) unique drug types which were selected in Norway, (B) unique drug types which were selected in Sweden, and (C) 59 drugs which were common hits from both countries. The Y-axis displays relative risk on the log scale; the X-axis displays all the prescribed drugs studied, grouped by the Anatomical Therapeutic Chemical (ATC) classification.

Cardiovascular drugs

Several cardiovascular drugs were associated with higher risk of pulmonary embolism, some of which could be due to their indication of use. The antithrombotic agents dalteparin, enoxaparin, and clopidogrel were associated with increased risk, while the angiotensin 2 receptor blocker ‘candesartan and diuretics’ was associated with reduced risk. Diuretics yielded mixed results, with lower risk for hydrochlorothiazide and potassium-sparing agents, while furosemide was associated with higher risk. The vasodilator glyceryl trinitrate was associated with increased risk. An inverse association was observed for the beta-blocking agent carvedilol.

Antibiotics

Several of the antibiotics examined were associated with increased risk. These include nystatin, doxycycline, amoxicillin, pivmecillinam, phenoxymethylpenicillin, trimethoprim, sulfamethoxazole and trimethoprim combined, erythromycin, ciprofloxacin, and nitrofurantoin.

Several opioids were associated with higher pulmonary embolism risk, including morphine, oxycodone, oxycodone and naloxone combined, codeine combinations excluding psycholeptics, fentanyl, buprenorphine, and tramadol.

Antineoplastic and immunomodulating agents

A positive association was observed for pegfilgrastim, while lower risk was observed for methotrexate.

Other medications

Analgesics (paracetamol), antiprotozoals (metronidazole), anti-varicose therapy (organo-heparinoid), anxiolytics (oxazepam, but not diazepam), proton pump inhibitors (pantoprazole, but not omeprazole) and corticosteroids for systemic use (prednisolone), but not corticosteroids for dermatological use (clobetasol), were associated with increased risk of pulmonary embolism. Positive associations were observed for cough and cold preparations, such as acetylcysteine and opium derivatives and expectorants. An inverse association was observed for calcium supplements combined with vitamin D or other drugs/supplements.

Additional analyses with a longer exposure window

In additional analyses, when we extended the case-, control- and wash-out periods from 30 to 90 days, many of the results remained similar, while others were either stronger or weaker than in the primary analysis (Supplementary Table 3 ). Additional positive associations were observed for ACE inhibitors (enalapril, ramipril), adrenergics/drugs for obstructive airway diseases (salbutamol, terbutaline, indacaterol, formoterol and budesonide), angiotensin II receptor blockers (losartan), antianemics (ferrous sulfate), antibiotics (azithromycin), antidepressants (sertraline, mianserin), antiemetics and antinauseants (aprepitant), antifibrinolytics (tranexamic acid), anti-inflammatory and anti-rheumatic agents (etoricoxib), antimycotics (fluconazole), antineoplastic and immunomodulating agents (capecitabine, erlotinib, tamoxifen), antipropulsives (loperamide), antipsychotics (haloperidol, risperidone), anxiolytics (diazepam), cardiac glycosides (digoxin), corticosteroids for systemic use (dexamethasone), drugs for treating constipation (sodium picosulfate), estrogens (estriol), hormonal contraceptives (levonorgestrel and ethinylestradiol, drospirenone and ethinylestradiol), hypnotics (zopiclone, zolpidem), intravaginal ring (with progestogen and estrogen), opioids (tramadol), and proton pump inhibitors (esomeprazole) (Supplementary Table 3 ). In addition, the positive association between prednisolone and pulmonary embolism was substantially strengthened (Supplementary Table 3 ). Additional inverse associations were observed between anti-inflammatory and anti-rheumatic agents (glucosamine), antithrombotics (warfarin), calcium channel blockers (felodipine), and drugs for urinary frequency and incontinence (tolterodine) and pulmonary embolism (Supplementary Table 3 ). Supplementary Tables 4 , 5 show the unique drugs selected from each country using the 90 day time window, but also include hits that were not common.

This study systematically investigated the associations between all drugs requiring a prescription and short-term risk of developing pulmonary embolism. Using nationwide registry data from Norway and Sweden, and employing an exposure-wide approach, we identified 38 drugs that were either associated with higher (31 drugs) or lower (seven drugs) short-term risk of pulmonary embolism. In additional analyses, using a longer exposure period, which may be more appropriate for some drugs, 66 drugs were associated with higher risk and seven with lower risk.

Antibiotics tended to be most consistently associated with higher risk of pulmonary embolism, and we observed higher risk across several types of antibiotics examined. However, these associations are likely to be confounded by underlying infection, which is a strong risk factor for pulmonary embolism 11 , 54 , 55 , as it might lead to increased inflammation and triggering of platelets, resulting in fibrin deposition and thrombus formation 56 . We also cannot exclude the possibility of reverse causation, where premonitory symptoms of pulmonary embolism like dyspnea are empirically presumed to reflect the more common occurrence of respiratory infection.

Prescription of various opioids was also quite consistently associated with higher risk of pulmonary embolism. This may reflect an increased risk of pulmonary embolism during and after surgery 13 , 57 . Almost 25% of all cases of venous thromboembolism occur during or shortly after surgery 12 . Similarly, the positive associations observed for certain antithrombotics also likely reflect confounding by indication, as these drugs are used to reduce the risk of pulmonary embolism during or after surgery.

The positive association between the antidepressant mirtazapine and pulmonary embolism is consistent with the Million Women Study, which reported a positive association between antidepressant use and risk of venous thromboembolism with pulmonary embolism 23 . However, a case-control study reported no association 58 .

We observed a positive association between drugs used to treat cough and cold (e.g., acetylcysteine, and opium derivatives and expectorants) and obstructive airway disease (e.g., salbutamol, formoterol and budesonide) and risk of pulmonary embolism. An acute infection or exacerbation could potentially explain these associations.

Blood pressure-lowering drugs had differing associations with pulmonary embolism risk. While inverse associations were observed for carvedilol, hydrochlorothiazide, and for candesartan in combination with diuretics, furosemide was associated with increased risk. Some large cohort studies have suggested an inverse association between blood pressure and pulmonary embolism risk 59 , 60 . It is possible that the increased pulmonary embolism risk with use of furosemide could be due to activation of the renin–angiotensin–aldosterone system 61 , which has a pro-thrombotic effect 62 . This, on the other hand, could explain why angiotensin-2 receptor blockers reduce the risk of pulmonary embolism, especially in a possible subpopulation that requires a combination of angiotensin-2 receptor blockers and diuretics and may have higher risk of atherosclerotic disease 63 . Beta blockers, and mainly non-selective beta blockers such as carvedilol, can reduce increased sympathetic activity and its prothrombotic effect and thereby reduce pulmonary embolism risk 64 . However, other potential explanations include reverse causation, whereby premonitory dyspnea is presumed to reflect congestive heart failure; confounding by indication, perhaps by kidney disease 65 ; or acute dehydration caused by furosemide, which increases the risk of pulmonary embolism as part of the classical Virchow’s triad 66 .

The inverse association observed between dispensed statins (simvastatin and pravastatin) and pulmonary embolism is consistent with results from a cohort study 25 showing reduced incidence of pulmonary embolism with use of statins. The results are also partly consistent with meta-analyses of randomized controlled trials, showing reduced incidence of venous thromboembolism with use of statins 67 , 68 . However, in these meta-analyses, only one trial specifically focused on pulmonary embolism and reported an imprecise effect estimate (HR = 0.77, 0.41–1.46). The association between serum cholesterol level and risk of pulmonary embolism is not clear 6 , 69 , 70 , 71 , and thus it is possible that the observed benefit with statins is due to other non-specific effects unrelated to lipid-lowering, such as anti-inflammatory effects 72 .

There was a weak inverse association observed between calcium supplementation combined with vitamin D or other drugs and risk of pulmonary embolism (HR = 0.89, 0.81–0.97). The Women’s Health Initiative randomized controlled trial reported a similar, but less precise, association between calcium and vitamin D supplementation in relation to pulmonary embolism risk (HR = 0.92, 95% CI 0.73–1.16) 73 .

The strengths of this study include the large-scale registry-based databases from Norway and Sweden and the comprehensive assessment of a wide range of drugs in relation to short-term risk of pulmonary embolism. These registers are of high quality and near complete. Case-crossover studies are vulnerable to misclassification 74 . However, in our study, we could minimize some of the sources of bias due to misclassification. Both countries have a tax-financed universal healthcare system that is accessible to all residents and reporting to the health registers is obligatory. Recall bias with regard to the exposure was eliminated by relying on recorded data. Moreover, the outcome, i.e. pulmonary embolism, can be identified with a high degree of accuracy due to the modern diagnostic methods and the universal healthcare in Scandinavia 43 , reducing the potential for selection bias. Use of nationwide databases also supports the generalisability of the findings. The large sample size provided sufficient statistical power to detect even modest associations.

Several of the observed findings likely reflect confounding by indication, and we were not able to separate the effect of a drug and its indication in the current study. It is therefore unclear whether the observed associations were due to the drugs themselves or due to the underlying or prodromal conditions the drugs were meant to treat. However, this type of confounding may be less likely to explain the results when observing inverse associations or when different medications within the same group of medications are differentially associated with pulmonary embolism. In such cases, a direct effect of a drug on pulmonary embolism is more likely. Another limitation of the study is that we did not have detailed information on lifestyle-related risk factors or comorbidities that potentially may have acted as effect modifiers. Most previous studies on drugs and pulmonary embolism were observational cohort and case-control studies 23 , 25 , 58 , 75 . We did not identify prior case-crossover studies on medications and pulmonary embolism risk. Thus, our comparison with previous literature is also limited by differences in design and potential latency periods between drug prescription and the development of pulmonary embolism. The current case-crossover design is best suited to identifying acute or triggering effects of drugs on outcomes with a sudden onset, but may misestimate the effects of long-term drug use on risk 76 . It was beyond the scope of this project to assess potential drug interactions, but this may be a point for further investigation in future studies. The databases did not include information on over-the-counter drugs or drug prescriptions in hospitals or nursing homes. However, in Norway institutionalized subjects without registration of dispensed medications were excluded, while in Sweden only subjects with at least one drug dispensed in the year preceding the occurrence of pulmonary embolism were included. Lastly, the prescription databases only contain information on the date of dispensing, and not information on the date of actual administration of drugs or whether the drug was taken, but this likely resulted in non-differential misclassification and bias toward the null.

In conclusion, we found both positive and inverse associations between a wide range of dispensed prescription drugs and short-term risk of pulmonary embolism. Further studies are needed to explore these findings in more detail. This method presents an opportunity for systematic and thorough monitoring of a wide range of drugs that could trigger or prevent the onset of serious health conditions, including pulmonary embolism.

Data availability

The data used in the current analysis are available from the Norwegian Patient Register, Norwegian Prescription Database, Norwegian Cause of Death Registry, and Swedish National Patient Register, Swedish National Prescribed Drug Register, Swedish National Cause of Death Register, but restrictions apply to the availability of these data, which were used under license for the current study, and are therefore not publicly available. Requests for data access should be addressed to Abhijit Sen. Email: [email protected].

Turetz, M., Sideris, A. T., Friedman, O. A., Triphathi, N. & Horowitz, J. M. Epidemiology, pathophysiology, and natural history of pulmonary embolism. Semin. Intervent. Radiol. 35 , 92–98 (2018).

Barco, S. et al. Global reporting of pulmonary embolism-related deaths in the World Health Organization mortality database: Vital registration data from 123 countries. Res. Pract. Thromb. Haemost. 5 , e12520 (2021).

Dentali, F. et al. Time trends and case fatality rate of in-hospital treated pulmonary embolism during 11 years of observation in Northwestern Italy. Thromb. Haemost. 115 , 399–405 (2016).

Arshad, N., Isaksen, T., Hansen, J. B. & Brækkan, S. K. Time trends in incidence rates of venous thromboembolism in a large cohort recruited from the general population. Eur. J. Epidemiol. 32 , 299–305 (2017).

Raptis, D. G., Gourgoulianis, K. I., Daniil, Z. & Malli, F. Time trends for pulmonary embolism incidence in Greece. Thromb. J. 18 , 1 (2020).

Tsai, A. W. et al. Cardiovascular risk factors and venous thromboembolism incidence: The longitudinal investigation of thromboembolism etiology. Arch. Intern. Med. 162 , 1182–1189 (2002).

Goldhaber, S. Z. et al. A prospective study of risk factors for pulmonary embolism in women. JAMA 277 , 642–645 (1997).

Kabrhel, C., Varraso, R., Goldhaber, S. Z., Rimm, E. B. & Camargo, C. A. Prospective study of BMI and the risk of pulmonary embolism in women. Obesity 17 , 2040–2046 (2009).

Rahmani, J. et al. Relationship between body mass index, risk of venous thromboembolism and pulmonary embolism: A systematic review and dose-response meta-analysis of cohort studies among four million participants. Thromb. Res. 192 , 64–72 (2020).

Kabrhel, C., Varraso, R., Goldhaber, S. Z., Rimm, E. & Camargo, C. A. Jr. Physical inactivity and idiopathic pulmonary embolism in women: Prospective study. BMJ 343 , d3867 (2011).

Masrouha, K. Z., Musallam, K. M., Rosendaal, F. R., Hoballah, J. J. & Jamali, F. R. Preoperative pneumonia and postoperative venous thrombosis: A cohort study of 427,656 patients undergoing major general surgery. World J. Surg. 40 , 1288–1294 (2016).

Heit, J. A. et al. Relative impact of risk factors for deep vein thrombosis and pulmonary embolism: A population-based study. Arch. Intern. Med. 162 , 1245–1248 (2002).

Caron, A. et al. Risk of pulmonary embolism more than 6 weeks after surgery among cancer-free middle-aged patients. JAMA Surg. 154 , 1126–1132 (2019).

Dado, C. D., Levinson, A. T. & Bourjeily, G. Pregnancy and pulmonary embolism. Clin. Chest Med. 39 , 525–537 (2018).

Kuo, T. H., Li, H. Y. & Lin, S. H. Acute kidney injury and risk of deep vein thrombosis and pulmonary embolism in Taiwan: A nationwide retrospective cohort study. Thromb. Res. 151 , 29–35 (2017).

Pearce, H. M., Layton, D., Wilton, L. V. & Shakir, S. A. Deep vein thrombosis and pulmonary embolism reported in the prescription event monitoring study of Yasmin. Br. J. Clin. Pharmacol. 60 , 98–102 (2005).

Barlow, D. H. HRT and the risk of deep vein thrombosis. Int. J. Gynaecol. Obstet. 59 (1), S29–S33 (1997).

Hernandez, R. K., Sorensen, H. T., Pedersen, L., Jacobsen, J. & Lash, T. L. Tamoxifen treatment and risk of deep venous thrombosis and pulmonary embolism: A Danish population-based cohort study. Cancer 115 , 4442–4449 (2009).

Lin, H. F. et al. Correlation of the tamoxifen use with the increased risk of deep vein thrombosis and pulmonary embolism in elderly women with breast cancer: A case-control study. Medicine 97 , e12842 (2018).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Sun, R., Chu, Y., Gao, Y., Cheng, W. & Gao, S. Efficacy and safety of endocrine therapy for breast-cancer prevention in high-risk premenopausal or postmenopausal women: A Bayesian network meta-analysis of nine randomized controlled trials. Menopause 28 , 589–600 (2021).

Vestergaard, P., Schwartz, K., Pinholt, E. M., Rejnmark, L. & Mosekilde, L. Use of bisphosphonates and raloxifene and risk of deep venous thromboembolism and pulmonary embolism. Osteoporos. Int. 21 , 1591–1597 (2010).

Myers, S. P. et al. Tranexamic acid administration is associated with an increased risk of posttraumatic venous thromboembolism. J. Trauma Acute Care Surg. 86 , 20–27 (2019).

Parkin, L. et al. Antidepressants, depression, and venous thromboembolism risk: Large prospective study of UK women. J. Am. Heart Assoc. 6 , e005316 (2017).

Arasteh, O. et al. Antipsychotic drugs and risk of developing venous thromboembolism and pulmonary embolism: A systematic review and meta-analysis. Curr. Vasc. Pharmacol. 18 , 632–643 (2020).

Lassila, R., Jula, A., Pitkaniemi, J. & Haukka, J. The association of statin use with reduced incidence of venous thromboembolism: A population-based cohort study. BMJ Open 4 , e005862 (2014).

Charlesworth, C. J., Smit, E., Lee, D. S., Alramadhan, F. & Odden, M. C. Polypharmacy among adults aged 65 years and older in the United States: 1988–2010. J. Gerontol. A Biol. Sci. Med. Sci. 70 , 989–995 (2015).

Sultana, J., Cutroneo, P. & Trifiro, G. Clinical and economic burden of adverse drug reactions. J. Pharmacol. Pharmacother. 4 , S73–S77 (2013).

Dekker, M. J. H. J. et al. Sex proportionality in pre-clinical and clinical trials: An evaluation of 22 marketing authorization application Dossiers submitted to the European medicines agency. Front. Med. 8 , 643028 (2021).

Article   Google Scholar  

Van Spall, H. G., Toren, A., Kiss, A. & Fowler, R. A. Eligibility criteria of randomized controlled trials published in high-impact general medical journals: A systematic sampling review. JAMA 297 , 1233–1240 (2007).

Sen, A. et al. Systematic assessment of prescribed medications and short-term risk of myocardial infarction—A pharmacopeia-wide association study from Norway and Sweden. Sci. Rep. 9 , 8257 (2019).

Article   ADS   PubMed   PubMed Central   Google Scholar  

Janszky, I. et al. Assessing short-term risk of ischemic stroke in relation to all prescribed medications. Sci. Rep. 11 , 21673 (2021).

Article   ADS   CAS   PubMed   PubMed Central   Google Scholar  

Lerner, I. et al. Mining electronic health records for drugs associated with 28 day mortality in COVID-19: Pharmacopoeia-wide association study (PharmWAS). JMIR Med. Inform. 10 , e35190 (2022).

MacFadden, D. R. et al. Screening large population health databases for potential coronavirus disease 2019 therapeutics: A pharmacopeia-wide association study of commonly prescribed medications. Open Forum Infect. Dis. 9 , ofac156 (2022).

Furu, K. Establishment of the nationwide Norwegian prescription database (NorPD)—New opportunities for research in pharmacoepidemiology in Norway. Nor. J. Epidemiol. 18 , 129–136 (2008).

Google Scholar  

Wettermark, B. et al. The new Swedish prescribed drug register–opportunities for pharmacoepidemiological research and experience from the first six months. Pharmacoepidemiol. Drug Saf. 16 , 726–735 (2007).

Bakken, I. J., Ariansen, A. M. S., Knudsen, G. P., Johansen, K. I. & Vollset, S. E. The Norwegian patient registry and the Norwegian registry for primary health care: Research potential of two nationwide health-care registries. Scand. J. Public Health 48 , 49–55 (2020).

Ludvigsson, J. F. et al. External review and validation of the Swedish national inpatient register. BMC Public Health 11 , 450 (2011).

Pedersen, A. G. & Ellingsen, C. L. Data quality in the causes of death registry. Tidsskr. Nor. Laegeforen. 135 , 768–770 (2015).

Brooke, H. L. et al. The Swedish cause of death register. Eur. J. Epidemiol. 32 , 765–773 (2017).

Govatsmark, R. E. S. et al. Completeness and correctness of acute myocardial infarction diagnoses in a medical quality register and an administrative health register. Scand. J. Public Health 48 , 5–13 (2020).

Varmdal, T. et al. Comparison of the validity of stroke diagnoses in a medical quality register and an administrative health register. Scand. J. Public Health 44 , 143–149 (2016).

Andersson, T. et al. Validation of the Swedish national inpatient register for the diagnosis of pulmonary embolism in 2005. Pulm. Circ. 12 , e12037 (2022).

Walker, R. F. et al. Association of testosterone therapy with risk of venous thromboembolism among men with and without hypogonadism. JAMA Intern. Med. 180 , 190–197 (2020).

Grimnes, G. et al. C-reactive protein and risk of venous thromboembolism: Results from a population-based case-crossover study. Haematologica 103 , 1245–1250 (2018).

Grimnes, G., Isaksen, T., Tichelaar, Y. I. G. V., Brækkan, S. K. & Hansen, J. B. Acute infection as a trigger for incident venous thromboembolism: Results from a population-based case-crossover study. Res. Pract. Thromb. Haemost. 2 , 85–92 (2018).

Greenland, S. & Robins, J. M. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology 2 , 244–251 (1991).

Zhao, P. & Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 7 , 2541–2563 (2006).

MathSciNet   Google Scholar  

Nee, M. et al. Prescription medicine use by pedestrians and the risk of injurious road traffic crashes: A case-crossover study. PLoS Med. 14 , e1002347 (2017).

Avalos, M. et al. Prescription-drug-related risk in driving: Comparing conventional and lasso shrinkage logistic regressions. Epidemiology 23 , 706–712 (2012).

Steenland, K., Bray, I., Greenland, S. & Boffetta, P. Empirical Bayes adjustments for multiple results in hypothesis-generating or surveillance studies. Cancer Epidemiol. Biomark. Prev. 9 , 895–903 (2000).

CAS   Google Scholar  

Bach, F. Bolasso: Model consistent Lasso estimation through the bootstrap. In ICML '08: Proceedings of the 25th international conference on Machine learning , 33–40. https://doi.org/10.1145/1390156.1390161 (Ithaca, New York, 2008).

Borenstein, M., Hedges, L. V., Higgins, J. P. & Rothstein, H. R. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res. Synth. Methods 1 , 97–111 (2010).

Clayton, T. C., Gaskin, M. & Meade, T. W. Recent respiratory infection and risk of venous thromboembolism: Case-control study through a general practice database. Int. J. Epidemiol. 40 , 819–827 (2011).

Chen, Y. G. et al. Association between pneumococcal pneumonia and venous thromboembolism in hospitalized patients: A nationwide population-based study. Respirology 20 , 799–804 (2015).

Beristain-Covarrubias, N. et al. Understanding infection-induced thrombosis: Lessons learned from animal models. Front. Immunol. 10 , 2569 (2019).

Sweetland, S. et al. Duration and magnitude of the postoperative risk of venous thromboembolism in middle aged women: Prospective cohort study. BMJ 339 , b4583 (2009).

Lacut, K. et al. Association between antipsychotic drugs, antidepressant drugs and venous thromboembolism: Results from the EDITH case-control study. Fundam. Clin. Pharmacol. 21 , 643–650 (2007).

Gregson, J. et al. Cardiovascular risk factors associated with venous thromboembolism. JAMA Cardiol. 4 , 163–173 (2019).

Nazarzadeh, M. et al. Blood pressure and risk of venous thromboembolism: A cohort analysis of 5.5 million UK adults and Mendelian randomization studies. Cardiovasc. Res. 119 , 835–842 (2022).

Article   PubMed Central   Google Scholar  

Potter, B. M., Ames, M. K., Hess, A. & Poglitsch, M. Comparison between the effects of torsemide and furosemide on the renin–angiotensin–aldosterone system of normal dogs. J. Vet. Cardiol. 26 , 51–62 (2019).

Bekassy, Z., Lopatko Fm, I., Bader, M. & Karpman, D. Crosstalk between the renin–angiotensin, complement and kallikrein–kinin systems in inflammation. Nat. Rev. Immunol. 22 , 411–428 (2022).

Chae, Y. K. et al. Inhibition of renin angiotensin axis may be associated with reduced risk of developing venous thromboembolism in patients with atherosclerotic disease. PLoS One 9 , e87813 (2014).

De Peuter, O. R. et al. Non-selective vs. selective beta-blocker treatment and the risk of thrombo-embolic events in patients with heart failure. Eur. J. Heart Fail. 13 , 220–226 (2011).

Singh, J. et al. Pulmonary embolism in chronic kidney disease and end-stage renal disease hospitalizations: Trends, outcomes, and predictors of mortality in the United States. SAGE Open Med. 9 , 20503121211022996 (2021).

Ku, E., Lee, B. J., Wei, J. & Weir, M. R. Hypertension in CKD: Core curriculum 2019. Am. J. Kidney Dis. 74 , 120–131 (2019).

Li, L., Zhang, P., Tian, J. H. & Yang, K. Statins for primary prevention of venous thromboembolism. Cochrane Database Syst. Rev. 2014 , CD008203 (2014).

PubMed   PubMed Central   Google Scholar  

Kunutsor, S. K., Seidu, S. & Khunti, K. Statins and primary prevention of venous thromboembolism: A systematic review and meta-analysis. Lancet Haematol. 4 , e83–e93 (2017).

Goldhaber, S. Z. et al. Risk factors for pulmonary embolism: The Framingham study. Am. J. Med. 74 , 1023–1028 (1983).

Wattanakit, K. et al. Association between cardiovascular disease risk factors and occurrence of venous thromboembolism. Thromb. Haemost. 108 , 508–515 (2012).

Hu, M., Li, X. & Yang, Y. Causal associations between cardiovascular risk factors and venous thromboembolism. Semin. Thromb. Hemost. 49 , 679–687 (2023).

Rodriguez, A. L. et al. Statins, inflammation and deep vein thrombosis: A systematic review. J. Thromb. Thrombolysis 33 , 371–382 (2012).

Blondon, M. et al. The effect of calcium plus vitamin D supplementation on the risk of venous thromboembolism. From the women’s health initiative randomized controlled trial. Thromb. Haemost. 113 , 999–1009 (2015).

Mittleman, M. A. & Mostofsky, E. Exchangeability in the case-crossover design. Int. J. Epidemiol. 43 , 1645–1655 (2014).

Liu, Y. et al. Current antipsychotic agent use and risk of venous thromboembolism and pulmonary embolism: A systematic review and meta-analysis of observational studies. Ther. Adv. Psychopharmacol. 11 , 2045125320982720 (2021).

Hallas, J., Pottegård, A., Wang, S., Schneeweiss, S. & Gagne, J. J. Persistent user bias in case-crossover studies in pharmacoepidemiology. Am. J. Epidemiol. 184 , 761–769 (2016).

Download references

This work was supported by Central Norway Regional Authority (project number 46060913).

Author information

Authors and affiliations.

Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, St. Mary’s Campus, Norfolk Place, Paddington, London, W2 1PG, UK

Dagfinn Aune

Department of Nutrition, Oslo New University College, Oslo, Norway

Department of Research, Cancer Registry of Norway, Oslo, Norway

Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway

Ioannis Vardaxis & Bo Henry Lindqvist

Department of Thoracic Medicine, St. Olav’s Hospital, Trondheim University Hospital, Trondheim, Norway

Ben Michael Brumpton

K.G. Jebsen Centre for Genetic Epidemiology, Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway

MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK

Department of Public Health and Nursing, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491, Trondheim, Norway

Linn Beate Strand, Jens Wilhelm Horn, Pål Richard Romundstad, Imre Janszky & Abhijit Sen

Department of Internal Medicine, Levanger Hospital, Health Trust Nord-Trøndelag, Levanger, Norway

Jens Wilhelm Horn

Department of Health Registries, Norwegian Directorate of Health, Trondheim, Norway

Inger Johanne Bakken

Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA

Kenneth J. Mukamal

Unit of Epidemiology, Institute of Environmental Medicine, Karolinska Institutet, Solna, Stockholm, Sweden

Rickard Ljung

Regional Center for Health Care Improvement, St. Olav’s Hospital, Trondheim, Norway

Imre Janszky

Center for Oral Health Services and Research (TkMidt), Trondheim, Norway

Abhijit Sen

Contributions

AS had full access to the data, takes responsibility for the integrity of the data and the accuracy of the data analysis, and conducted the statistical analysis. IJ, KJM, RL, and PRR conceived and designed the study. IJ, RL, IJB, PRR, and BMB acquired the data. DA, IV, BHL, BMB, LBS, JWH, IJB, PRR, KJM, RL, IJ, and AS interpreted the data. DA drafted the manuscript. DA, IV, BHL, BMB, LBS, JWH, IJB, PRR, KJM, RL, IJ, and AS critically revised the manuscript for important intellectual content. IJB, BMB, and RL provided administrative, technical or logistic support. IJ obtained funding. IJ, KJM, and RL supervised the study. AS and IJ are guarantors of the Norwegian data and RL is the guarantor of the Swedish data. Exemption from the requirement of obtaining informed consent from the registered individuals was given by REC.

Corresponding author

Correspondence to Dagfinn Aune.

Ethics declarations

Competing interests.

Rickard Ljung is employed at the Swedish Medical Products Agency, Uppsala, Sweden. The views expressed in this paper do not necessarily represent the views of the Government agency. The remaining authors have nothing to disclose. The interpretation and reporting of these data are the sole responsibility of the authors, and no endorsement by the Department of Health Registries is intended nor should be inferred.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary information 1; Supplementary information 2; Supplementary tables.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Aune, D., Vardaxis, I., Lindqvist, B.H. et al. Dispensed prescription medications and short-term risk of pulmonary embolism in Norway and Sweden. Sci Rep 14, 20054 (2024). https://doi.org/10.1038/s41598-024-69637-4

Received: 04 October 2023

Accepted: 07 August 2024

Published: 29 August 2024

DOI : https://doi.org/10.1038/s41598-024-69637-4

  • Medications
  • Pulmonary embolism

exploratory data analysis case study

  • Open access
  • Published: 26 August 2024

Health worker perspectives on barriers and facilitators of tuberculosis investigation coverage among index case contacts in rural Southwestern Uganda: a qualitative study

  • Paddy Mutungi Tukamuhebwa 1 ,
  • Pascalia Munyewende 1 ,
  • Nazarius Mbona Tumwesigye 2 ,
  • Juliet Nabirye 3 &
  • Ntombizodwa Ndlovu 1  

BMC Infectious Diseases volume 24, Article number: 867 (2024)

Abstract

Background

In 2012, the World Health Organization recommended screening and investigation of contacts of index tuberculosis patients as a strategy to accelerate detection of tuberculosis (TB) cases. Nine years after the adoption of this recommendation, coverage of TB contact investigations in Uganda remains low. The objective of this study was to examine health care providers’ perceptions of factors influencing coverage of TB contact investigations in three selected rural health facilities in Mbarara district, southwestern Uganda.

Methods

This study identified provider opinions on the barriers and facilitators to implementation of TB contact investigation using the Consolidated Framework for Implementation Research. Using an exploratory qualitative study design, semi-structured interviews with 19 health workers involved in the TB program at district, health facility and community levels were conducted between April and July 2020. Analysis was conducted inductively using reflexive thematic analysis in six iterative steps: familiarizing with the data, creating initial codes, searching for themes, reviewing themes, developing theme definitions, and writing the report.

Results

Nineteen health care workers participated in this study, which translates to a 100% response rate. These included two district TB and leprosy supervisors, five nurses, five clinical officers, six village health team members and one laboratory technician. The three themes that emerged from the analysis were intervention-related, health system and contextual factors. Health system-related barriers included inadequate or delayed government funding for the TB program, shortage of human resources, insufficient personal protective equipment, and stock-outs of supplies such as Xpert MTB cartridges. Contextual barriers included steep terrain, poverty or low income, and the stigma associated with TB and COVID-19. Facilitators comprised increased knowledge and understanding of the intervention, performance review and on-the-job training of health workers.

Conclusions

This study found that most of the factors affecting TB contact investigations in this rural community were related to health system constraints such as inadequate or delayed funding and human resource shortages. This can be addressed by strengthening the foundational elements of the health system - health financing and human resources - to establish a comprehensive TB control program that will enable the efficient identification of missing TB patients.

Introduction

An estimated 10 million people suffer from active tuberculosis (TB) every year [ 1 ]. The disease continues to be the leading infectious cause of death globally, causing about 1.5 million deaths, 95% of which occur in low- and middle-income countries [ 2 , 3 ]. Although the African region has 9% of the world population, it contributed 25% of all new TB cases in 2019, making it the region with the second-highest number of TB cases after South-East Asia. In Africa, TB is mainly driven by the HIV pandemic: about 50% of TB cases are co-infected with HIV, and TB is the top cause of death among patients with HIV, causing more than 30% of all AIDS-related deaths [ 4 , 5 ].

In 2012, the WHO recommended the screening and evaluation of contacts of persons with infectious TB as an intervention for increasing TB case detection [ 6 ]. The intervention also provides an opportunity to diagnose latent TB and to scale up TB preventive therapy among eligible contacts, such as children under five years of age, HIV-positive patients, and other high-risk groups [ 7 , 8 ]. Five years later, in 2017, the Uganda Ministry of Health (MoH) adopted these WHO recommendations as high-level policy and integrated them into the Manual for Management and Control of TB and Leprosy in Uganda [ 9 ]. Furthermore, in 2019, detailed operational guidelines were developed by the Uganda National Tuberculosis and Leprosy Program (NTLP) to guide and standardize TB contact investigation processes at health facility and community levels [ 8 ].

Despite the WHO policy guidance, coverage of TB contact investigation in many TB high burden countries such as Uganda, Kenya, Lao Republic, Pakistan and Yemen is still low [ 10 ]. A meta-analysis conducted in 2015 by Block et al., showed low TB contact investigation coverage in five countries (2.8% in the Lao Republic, 4.8% in Kenya, 14.9% in Pakistan and 15.1% in Uganda) and high coverage in one country (91.7% in the Democratic Republic of Congo) [ 10 ]. Armstrong et al. (2017), in a prospective multi-center observational study conducted in Kampala, Uganda, reported significant drop-out rates across the steps in the contact investigation cascade [ 11 ]. Among the 338 clients eligible for TB contact investigation, only 61% were scheduled for home visits, and only 50% of them were visited [ 11 ]. Furthermore, among the 131 people who were screened for TB and required definitive evaluation, only 20% were evaluated [ 11 ].
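The drop-off along the contact investigation cascade reported by Armstrong et al. can be made concrete with a short calculation. The sketch below is illustrative only: it converts the percentages quoted above into approximate head-counts, assuming that "50% of them" refers to the clients who were scheduled for a home visit. The derived counts are rounded estimates, not figures taken from the original study, and the function and variable names are hypothetical.

```python
# Approximate head-counts implied by the percentages quoted from Armstrong et al. [11].
# These are derived, rounded estimates for illustration, not reported study figures.

def implied_count(base: int, percent: float) -> int:
    """Return the whole-number count implied by a percentage of a base."""
    return round(base * percent / 100)

eligible = 338                               # clients eligible for contact investigation
scheduled = implied_count(eligible, 61)      # scheduled for a home visit
# Assumption: the 50% applies to those scheduled, not to all eligible clients.
visited = implied_count(scheduled, 50)

needing_evaluation = 131                     # screened and requiring definitive evaluation
evaluated = implied_count(needing_evaluation, 20)

# Share of all eligible clients who were actually visited at home.
visited_share_of_eligible = round(100 * visited / eligible, 1)

for label, value in [
    ("Eligible", eligible),
    ("Scheduled (approx.)", scheduled),
    ("Visited (approx.)", visited),
    ("Evaluated of 131 (approx.)", evaluated),
]:
    print(f"{label:28s} {value}")
print(f"Visited as % of eligible     {visited_share_of_eligible}")
```

Under these assumptions, fewer than one in three eligible clients received a home visit, which is the kind of cumulative loss that the step-by-step percentages can obscure.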

In rural Uganda, the coverage of TB contact investigation is much lower (15.1%) than that in urban areas such as Kampala (20%), and yet many of the missing TB cases are in such hard-to-reach and underserved rural areas [ 10 , 11 ]. This low coverage increases the number of undiagnosed and untreated TB patients, thus perpetuating the TB pandemic. Furthermore, without TB contact investigation, many TB patients might infect other people in the household and the community, or die from TB-related complications [ 12 ]. The low contact investigation coverage contributes to a high number of missed diagnoses in Uganda (400,000 in 2014) and high TB transmission rates, which hamper progress towards achievement of the third United Nations Sustainable Development Goal of ending the TB epidemic by 2030 [ 13 ].

Implementation research helps to connect research and practice by speeding up the development and provision of public health interventions [ 14 ]. Given that urban settings have been the primary focus of the majority of implementation research studies in Africa and that the burden of TB differs between urban and rural areas [ 7 , 15 , 16 ], this study used implementation research tools to investigate the barriers to and enablers of TB contact investigation coverage in rural southwestern Uganda [ 3 ]. Although 82% of the Ugandan population lives in rural areas, there is inadequate information about TB contact investigation coverage, and its barriers and facilitators, in rural settings [ 17 ]. The purpose of this study was to investigate the barriers to and facilitators of investigation coverage among contacts of TB patients in rural Uganda.

The Consolidated Framework for Implementation Research (CFIR) by Damschroder et al. was used to explore barriers and facilitators of implementation in this study [ 18 ]. The framework consists of 39 constructs and five domains: characteristics of the intervention, inner setting, outer setting, individuals involved and implementation process [ 18 ]. The framework has been widely used across the globe to identify the barriers and facilitators of implementation outcomes in various healthcare settings, for example, evaluation of the online frailty tool in primary health care in Canada, integration of hypertension-HIV management in three Ugandan HIV clinics, examining the task shifting strategy for hypertension control at 32 hospitals and community health centers in Ghana and evaluating the implementation context of a quality improvement program for increasing vaccination coverage in Nigeria [ 19 , 20 , 21 , 22 ].

Study setting

This study was conducted in the rural district of Mbarara, located in the southwestern region of Uganda, about 270 km southwest of the capital city, Kampala. According to the 2014 population and housing census, the district had a population of 472,629 (land area 1,785.6 km²), of which 59% resided in rural areas [ 23 ]. In total, the district had 87 health facilities: 48 government-owned facilities, 26 private clinics and 13 nonprofit health facilities [ 24 ]. No data on TB contact investigation were available at district level. Health Centres (HCs) in Uganda are ranked II, III or IV based on the administrative zone served by the health facility, with level II serving a parish, level III serving a sub-county and level IV serving a county [ 25 ]. An HC IV is expected to serve a population of at least 100,000 people. The services offered included general outpatient clinics (including TB and HIV care), immunization, antenatal care, maternity services, inpatient care, laboratory services, emergency surgery and blood transfusion [ 25 ].

The Ugandan health system operates on a referral basis, with the lowest level of health care provided by community health workers called Village Health Teams (VHTs) and the highest level of care offered at highly specialized hospitals called National Referral Hospitals. Levels of health care increase with complexity in terms of the packages of services offered, staffing levels, and the size of the population served. Three health facilities where the study was conducted were purposively selected due to their rural location, level of care (IV), and significant volume of patients compared to lower levels (II and III).

Coordination of TB services in the district was done by the District TB and Leprosy Supervisor (DTLS), who is responsible for 26 TB diagnostic and treatment centers. Regional coordination of TB activities is done by the Zonal TB and Leprosy Supervisor (ZTLS), while national level coordination and policy formulation is done by the National TB and Leprosy Program (NTLP) [ 15 ].

Study design and study population

A qualitative, exploratory study was conducted between April and July 2020 to identify barriers and facilitators to implementing TB contact investigations. Semi-structured interviews were conducted with 19 health workers, who were purposively selected based on their direct participation in the implementation of TB interventions, since they were likely to have the most knowledge and experience with TB contact investigations. These included TB focal persons at the health facilities, clinical officers, nurses, laboratory staff, VHTs, and District TB and Leprosy Supervisors. Health workers who were not in the health facility during the data collection period were excluded from the study. The Consolidated criteria for reporting qualitative studies (COREQ) were applied to comply with the reporting standards (Table S2) [ 26 ].

Data collection

Semi-structured interview guides were developed and included background information about study participants and questions developed according to the five domains of the CFIR. The VHT interview guides were translated into the regional dialect and put through a pilot test to ensure that the questions were understood and to gauge how long the interviews would take. Two health facilities that provided comparable research sites in terms of staffing levels and services were used for the pilot testing.

In-person interviews with the study participants were conducted by the lead researcher (PT) in either English or Runyankore, and each interview was tape recorded while a trained research assistant took field notes. Data collection for each category of study participants continued until saturation was reached [ 27 ]. Since data collection took place during the first wave of the COVID-19 pandemic, precautions were taken to prevent COVID-19 cross-infection between the researcher and the participants. Interviews were conducted at the selected health facilities in well-ventilated spaces, with the interviewer wearing an N-95 respirator and the participant wearing a surgical mask. Each interview lasted between 30 and 45 min and no repeat interviews were conducted.

Data management and analysis

Data were transcribed verbatim by the research team and the lead researcher listened to each audio recording while reading through the transcripts to correct errors in transcription and familiarize himself with the data. Transcripts were not given back to the participants for review or comments because evidence suggests that interviewee transcript review does not add value to the quality and rigor of qualitative research [ 28 ]. PT and JN reviewed the transcripts and made initial notes of interesting features or potential codes and themes in the data. The transcripts were then uploaded into MAXQDA 2020, and analyzed using reflexive thematic analysis in six iterative and recursive steps as described by Braun and Clarke [ 29 ]. The six steps included (1) familiarization with the data, (2) coding, (3) searching for themes, (4) reviewing the themes, (5) naming and defining the themes, and (6) writing the report [ 29 ]. The first step of the analysis was to look at the participants’ own words and expressions, without preconceived notions or classifications. The researchers then examined the language used by each participant in relation to the five domains of the CFIR. To ensure the reliability and credibility of the research analysis, both researchers PT and JN developed the themes by reading the transcripts independently to establish inter-coder agreement [ 30 ]. After the initial coding, the two-member team met to discuss the independently developed codes and themes and to reach an agreement on the themes. The transcribed texts and quotes were then grouped into themes, and the lead researcher used a reflexive approach to identify similarities or differences among CFIR domains and constructs. This iterative and recursive process provided space for reflexivity and ensured the credibility of the research findings. Themes were then defined and further refined to reflect the challenges and enablers of contact investigation coverage.

The research team and reflexivity

The field research team consisted of the principal investigator (PT), a male master’s student at the University of the Witwatersrand, and a female research assistant (GA), who is trained in population studies and monitoring and evaluation and was not employed at the time of this study. The principal investigator is a medical doctor who has training and experience in TB care and is familiar with WHO TB guidelines for contact investigations. He was not affiliated with the District Health Department or the Ministry of Health NTLP and is therefore unlikely to have influenced participant responses. Prior to the study, the principal investigator received training in qualitative research methods at the University of the Witwatersrand, so he was aware of how a researcher’s background, location, and assumptions can influence a qualitative study. The research team did not know the participants beforehand and was not directly involved in patient care in a way that would have influenced participants’ responses.

Ethical considerations

This study was cleared by the Human Research Ethics Committees (Medical) at the University of the Witwatersrand (M200101), and Mbarara University of Science and Technology (MUREC 1/7). The Uganda National Council for Science and Technology granted permission to conduct the study in Uganda (HS569ES). Administrative approval was obtained from the District Health Officer, and the health facility managers of the respective study sites. Information about the study was shared with the participants before the interviews and written informed consent for participation and audio recording was obtained from each participant. To preserve participant privacy, interviews were conducted in a private space within the outpatient units, with only the researchers and the participants present.

Characteristics of study participants

Nineteen participants took part in semi-structured interviews, a response rate of 100%; four of them (21.1%) were male (Table 1). The sample comprised five clinical officers (26.3%), five nurses (26.3%), six VHT members (31.6%), one laboratory technician (5.3%), and two DTLSs (10.5%). Eight of the participants (42.1%) had over three years’ experience in offering TB care. Clinical officers were paramedics with a diploma in clinical medicine, whereas nurses had a bachelor’s degree in nursing, a diploma, or a nursing certificate. VHTs were lay health workers based in the community to aid with TB interventions in the local population. Laboratory technicians had a diploma in laboratory sciences, whereas DTLSs had a diploma in nursing or clinical medicine.

Barriers and facilitators of TB contact investigation coverage

A reflexive thematic analysis of the data gave rise to three themes: health system, contextual, and intervention-related factors. The barriers and facilitators identified under each of the three themes are summarized in Table S3. Based on the WHO’s health system building blocks, the factors affecting the health system emerged under six sub-themes: human resources, commodities, service delivery, leadership and coordination, funding, and health information systems. Contextual factors were further categorized into geographic, social and cultural, economic, and policy-related factors. Issues affecting TB contact investigations that were linked to the intervention itself were covered by the final theme (intervention-related factors).

Barriers and facilitators

Domain 1: characteristics of the intervention

The intervention-related factors reported by the participants fell under three constructs: evidence base, intervention complexity, and implementation cost.

Evidence-base

Out of the 19 healthcare workers involved in this research, 16 were aware of the intervention and its effectiveness in detecting, treating, and stopping the spread of tuberculosis in the community. Some of them had even engaged in relevant programs at the district, health facility, and community levels to improve uptake, such as support supervision, enlisting household contacts, home visits, health education, screening, and sputum sample collection. The DTLSs reported that training and regular orientation on several aspects of TB management, including TB contact investigation, provided easy access to knowledge and information. The district provided training on TB contact investigation to health workers through different platforms, including quarterly performance review meetings. As a result, health workers had the information, abilities, and confidence needed to carry out contact investigation tasks.

“Even in meetings, we talk about contact tracing and investigation. Because for us we do meetings quarterly, all those meetings we…include a training in contact tracing and investigation” (Respondent 1, Nurse).

Intervention complexity

Three VHTs reported that TB contact investigations had multiple processes and therefore required a team to go for community visits, which interfered with other ongoing interventions at the health facility, such as TB screening at outpatient clinics, linking positive patients to treatment, providing community-based DOTs for patients on treatment, and following up with clients who defaulted on treatment. They also assisted with other medical services, such as immunizations, prenatal care, and providing ART refills to stable HIV patients. Therefore, during contact investigations, VHTs were mostly involved in community activities, leaving some of the basic facility-based interventions unattended.

“…it interferes with other programs… Now I am here working at the health facility, collecting sputum, screening and… I have many patients attending immunization, antenatal, ART (HIV clinic), and I am the one who works on them too. And after that, I want to go and do contact tracing… Sometimes I ignore some of the facility activities so that I spare some time to go and do contact tracing in the community” (Respondent 4, VHT).

Cost of the intervention

During TB contact investigations, health workers often need to phone many patients or contacts, particularly those who have appointments but do not show up at the health facility. Healthcare workers found it challenging to make these calls because of the large amount of airtime required and its cost.

“…some of these contacts need to be contacted on the phone several times because someone tells you he is coming tomorrow; and he doesn’t come. And the person keeps giving appointments without coming. And we do not have all that airtime…” (Respondent 5, Clinical Officer).

Domain 2: outer setting

Funding from external entities: inadequate funding

Multiple funding-related challenges were reported at national, district and health facility levels. Funding for TB contact investigation was provided through the Primary Health Care grants released from the Ministry of Health to public health facilities. Additional funds for contact investigation came from USAID through the Regional Health Integration to Enhance Services in Southwestern Uganda, a program for scaling up access to comprehensive HIV, TB and reproductive health services in the region.

Health workers believed that TB was not considered a priority by the Ministry of Health, which led to underfunding of the NTLP and, eventually, of TB work at district, health facility and community levels. TB interventions were not integrated into the annual budgeting processes like other interventions. For example, malaria and sanitation interventions received funds, while TB had remained unfunded since 2014. The DTLS reported that the sanitation program was prioritized and funded better than the TB program because of advocacy by the sanitation program.

“…I think if the government says, ‘let us fight this disease’, they need to put in (funds). Let them consider TB across the board. Let them budget for it like the way they budget for other conditions. Malaria is budgeted for, sanitation…receives money every quarter. But it is like six years (since 2014) when there was money for TB…and it was for only one quarter” (Respondent 1, Nurse).

The DTLSs reported insufficient funds for TB support supervision at the district level, which limited the amount of time the district TB supervisor could spend in each health facility during supervision visits. Consequently, the quality of the supervision was compromised because teams did not have sufficient resources to train, mentor and supervise health facility teams.

“Because of the funds being little, we are forced, like in a day, to move to about four facilities. Remember, in TB, there are six indicators that you need to focus on and get to understand what the problem is. So, you find we do not have sufficient time to spend in the facility and support it.” (Respondent 2, Clinical Officer).

Funding challenges at health facility level included delayed reimbursement of funds and inadequate funds for home visits. In some cases, health facilities relied on NGOs for extra funds to conduct contact investigations because funds from the Primary Health Care (PHC) grant were insufficient.

“…but when you do not have that NGO, things are challenging because you know that PHC money cannot be enough. You find that the PHC money is for only two patients, yet you have like six of them (to follow-up). So, when you do not have that money from NGOs, you cannot do it smoothly.” (Respondent 2, Clinical Officer).

Some participants reported that they used their own money to trace contacts of index TB patients; however, this money took a long time to be refunded. Some participants had waited about five months for reimbursement, which lowered their morale to continue with community visits.

“Most of the cases, we use our own money… you want to do your job, but transport facilitation (is missing)! Even…when they decide to refund it (money), it takes so long…for example, since January we have never got that transport (money). We did contact tracing in January, February, March, April and May; we gave them reports, and they see that we are working, but we do not see our transport (refund)” (Respondent 16, VHT).

Critical incidents: COVID-19 pandemic related factors

This study was conducted during the first wave of the COVID-19 pandemic, when the government implemented a lockdown policy. The lockdown included the suspension of public and private transportation, so some health workers, TB patients and their contacts were unable to access health facilities. These restrictions affected the mobility of health workers and patients to the health facility and undermined TB contact investigation efforts. Besides lockdown measures, the COVID-19 pandemic was also associated with stigma among patients and health workers. Some TB contacts were afraid to report cough, for fear of being suspected of having COVID-19 and having to be quarantined for 14 days as per the MOH recommendations at the time. COVID-19 heightened the stigma associated with TB, because the two conditions have similar symptoms. Health workers could not tell who had COVID-19 or TB and therefore avoided anyone presenting with cough, because they feared it might be COVID-19. Some laboratory personnel declined to examine sputum samples because they were concerned that the samples might contain the virus that causes COVID-19 and increase their risk of infection.

“Now with corona (COVID-19), we would come here and not find any patient or health worker because they did not have transport means during the lockdown. Most of our people stayed at home. Even if you had your own motorcycle, they would not allow you to ride it…” (Respondent 13, Clinical Officer).

Partnerships and connections: collaboration with NGOs and community-based organizations

Health workers and VHTs reported that the district and health facilities were networked with NGOs and community-based organizations that supported the implementation of TB contact investigation and other health interventions. The primary implementing partner was Regional Health Integration to Enhance Services in Southwestern Uganda (RHITES-SW), which supported the district with transportation and materials for household visits.

Along with funding TB contact investigation, district-based NGOs also sponsored radio airtime to increase awareness and create demand for TB services.

“…RHITES-SW provides us with materials to use, like carrier bags. They provide us with transport to do contact tracing and the information. They normally update us on each and everything that is current in contact tracing and investigation” (Respondent 5, Clinical Officer). “Other stakeholders are working hand in hand with the government and our implementing partners. I see them working as a team to sponsor airtime on radios to create awareness and give some financial assistance.” (Respondent 12, Clinical Officer).

Policies and laws: availability of updated operational guidelines

The district established favorable communication networks at district and health facility levels, facilitating efficient communication of guidelines, reference materials, and patients’ results. For example, the district had a WhatsApp group, specifically for the district TB team, to share information and monitor district activities.

“…we have a WhatsApp group of all the in charges and TB focal persons, where we discuss TB management and…share guidelines, so whoever needs guideline in TB management, he just goes there” (Respondent 1, Nurse).

Domain 3: inner setting

Available resources

The barriers that emerged under available resources included a lack of personal protective equipment (PPE), stock-outs of Xpert MTB cartridges and a shortage of human resources. Commodities that frequently went out of stock included toolkits for TB contact investigations and Xpert MTB cartridges for conducting Xpert MTB and RIF tests. At times health facilities spent about two months without cartridges, and health workers were notified by the laboratory team not to send sputum samples for analysis, which hampered contact investigation efforts. Additionally, VHTs reported a lack of essential tools for community visits, especially during extreme weather. Health facilities also frequently ran short of PPE for home-based contact screening, such as masks and gloves, which discouraged health workers from doing community contact tracing out of fear of acquiring TB.

“…sometimes, there are no GeneXpert (Xpert MTB) cartridges; you find that we are not doing GeneXpert (tests) because cartridges are finished…, at times we take like a month or two without cartridges and…that is not good…, the lab people tell us, ‘do not send samples this month, we do not have (cartridges)’, which means we are missing people (patients).” (Respondent 12, Clinical Officer). “At times you go to a difficult place…in a rainy season…, you climb a hill while it is raining on you. You do not have an umbrella; you do not have boots or a bag to carry the stuff (materials)…” (Respondent 4, VHT).

A shortage of human resources was also reported as a barrier. Sometimes only one health worker was available to go for community visits, yet there were multiple tasks to do, including health education, screening, and sample collection. This scarcity of human resources affected the quality of implementation, since some of the tasks were left incomplete.

“…sometimes there is a lack of manpower because…the health workers are not enough at the facility, so you find that only one person is going for contact tracing, and the work there is huge, and that person cannot do all the work alone. So, most of the things are not done. They do part of the work and leave out some” (Respondent 15, Nurse).

Two facilitators were discussed under the construct of available resources: the presence of landline telephones to aid communication and a motorcycle to support transportation during community visits. The telephones were loaded with airtime for scheduling household visits and communicating Xpert MTB/RIF results from the hub laboratory, while the motorcycle helped to reduce the cost of transportation, since community visits only required fuel for the motorcycle.

“We have a health facility motorcycle, which does not force us to put in a lot of money… We just consider the distance we are covering and then put in fuel and move, which is easier than getting a boda-boda (motorcycle taxi).” (Respondent 16, VHT).

Structural characteristics: rugged terrain and poor road network, paper-based reporting systems, and hub and spoke laboratory system

All six VHTs reported that some patients came from hard-to-reach areas characterized by rugged terrain that vehicles or motorcycles could not reach. This made it hard for health workers to visit such communities for contact investigations. Additionally, some places had poor roads that were impassable during the rainy season, which affected service delivery. In such circumstances, health workers used boda-bodas (motorcycle taxis) to a certain point and then walked the remaining distance. Sometimes the terrain was hilly and exhausting, which discouraged teams from doing community visits. Large health facility catchment areas also made it more difficult for field teams to deliver contact investigation services to distant households. As a result, contacts of index TB cases from remote places were instead asked to come to the health facility for further evaluation; however, some of them did not come.

“…for those people who come from hard-to-reach areas, going to those homes is quite challenging. Sometimes we reach a point of walking on foot because we cannot reach there using a car or a motorcycle. So, we must climb a steep hill to look for those patients” (Respondent 4, VHT). “This is a big sub-county; people come from distant areas, even neighboring districts. And of course, as a health worker, you cannot reach every homestead. So, some (contacts) are called to come to the health facility. But because of the long distances, some fail to come.” (Respondent 4, VHT).

Another barrier was the use of a paper-based reporting system. One of the TB focal persons reported that TB contact investigation reports were submitted manually on paper, which affected the timeliness of reporting. Submission of reports had to wait until someone was travelling to the district headquarters, which caused delays and eventually affected reimbursement of payments for activities.

“Sometimes, since we are sending the reports to Mbarara, they reach late because of transport issues. It becomes hard for someone to send the report since you cannot get any transport, so you get someone going to Mbarara, give them the reports, and tell that person where they should be delivering the reports. So, it also takes a bit of time” (Respondent 8, Nurse).

The laboratory system in the district used a “hub and spoke” system, in which laboratory samples are collected in peripheral laboratories and transported by motorcycle riders to the central laboratory for analysis. However, participants reported that this system was dysfunctional because of the long turnaround time for results, which compromised early TB diagnosis and treatment and affected TB contact investigation coverage. In some cases, health workers spent up to two months waiting for Xpert MTB results.

“And we have a challenge with hub riders… Sometimes, the hub riders take sputum samples to Mbarara, and if they do not go back to pick the results, you will never see them. And you end up spending around two months without results” (Respondent 12, Clinical Officer).

Domain 4: individuals involved

Under characteristics of the individuals involved, participants reported the presence of internal implementation leads: TB focal persons at health facility level and DTLSs at district level. These leads were responsible for coordinating the provision of TB services and for technical leadership and supervision of the TB program at different levels of care. Additionally, health workers received adequate training on various aspects of TB management, including TB contact investigation. Such training sessions equipped them with adequate knowledge and skills to confidently conduct contact investigation activities.

Domain 5: implementation process

The three constructs that emerged under implementation process were planning, engaging, and reflection and evaluation.

The DTLSs reported that leaders at the Ministry of Health had transferred the planning, coordination, and funding of TB interventions, including TB contact investigation, to implementing partners, usually local and international non-governmental organizations (NGOs), which negatively impacted the TB program at district level. Participants also reported that implementing partners tended to have different priorities. For example, these organizations mainly focused on HIV interventions and less on TB. It was therefore challenging to redirect them toward district priorities, since their priorities were often guided by donor funding.

“Also, the Ministry of Health has deliberately left this work (TB contact investigation)…to implementing partners, and it has killed everything. And in that line, I think we can eradicate TB, but if the government is putting in (effort), not leaving this disease for the implementing partners.” (Respondent 1, Nurse). “They tell you their priority is HIV, and you cannot shift them. They have their…operational guidelines that you cannot change.” (Respondent 1, Nurse).

Reflection and evaluation

Data use to inform program decisions by the district health team was identified as a facilitator. The district held quarterly performance and reflection meetings with the participation of the district’s NGOs, community-based organizations, district health management team, and healthcare providers from the various health centers. In these meetings, attendees discussed their performance, challenges across the different technical areas, and strategies for bridging the gaps.

Regular engagement of all stakeholders within the district, including health facility teams, district teams, NGOs, and community-based organizations involved in the TB program, to review implementation progress and performance and to plan improvement strategies was reported as a facilitator. NGOs were actively involved in discussions regarding potential funding opportunities for specific activities.

“…we normally have the district stakeholders meeting, where they (external stakeholders) normally come here, and we discuss performance in different areas - MCH (maternal and child health) and HIV; TB is also given a platform. We tell them about our challenges.” (Respondent 1, Nurse).

The stigma associated with TB was reported as a common challenge by all participants in this study. Index TB patients preferred not to be visited at home by a health worker, out of fear of being stigmatized if neighbors and other community members found out that they had TB. Some index TB patients even tried to avoid being visited by giving health workers incorrect phone numbers and physical addresses. Patients with TB and HIV co-infection had an increased fear of disclosing their status because of the misconception that every TB patient has HIV. Additionally, poverty was found to be a challenge because contacts of TB patients lacked funds for transport to the health facility for assessment, diagnosis, and treatment. As a result, health workers had to collect sputum samples in the community and bring them to the health facility for analysis. This, however, was not always feasible, leaving some of the contacts of TB patients unevaluated.

“…some patients give us wrong telephone contacts, we call the number, it is not on, or a different person picks it. So, we fail to trace that person. Some fear health workers going to their homes. Mostly when the index TB patient is also HIV positive, they do not want people in their villages to see any health care worker coming to their home because they may identify them” (Respondent 11, VHT).

This study explored the factors influencing TB contact investigation coverage in three rural primary health facilities in Southwestern Uganda. The study is unique in its rural focus, unlike previous studies in Uganda and Kenya, which were conducted in cities [ 7 , 15 , 31 ]. The barriers and facilitators identified in this study were diverse and covered all five domains of the CFIR. Although some studies have used other implementation research tools to identify the barriers and facilitators to implementing TB contact investigation, this study used the CFIR to explore the factors influencing TB contact investigation coverage in Africa.

The key challenges that emerged from this study included health system challenges, such as the lack of funding for TB contact investigation, insufficient PPE and inadequate Xpert MTB equipment for diagnostic testing. The rugged terrain and poor road networks in rural communities also made it difficult for health workers to access patients in the community, and vice versa. Poverty and TB- and COVID-19-related stigma were also perceived as barriers. On the other hand, the facilitators to TB contact investigation included an increased awareness of TB contact investigation, adequate knowledge of the Ugandan MoH guidelines, confidence in delivering the intervention and on-the-job training of health workers. In addition, the availability of a telephone and transport to schedule and make household visits was reported as a facilitator. The support of key district stakeholders involved in TB contact investigations and quarterly performance review meetings also emerged as facilitators.

The health system barriers that emerged from this research were inadequate or irregular funding, human resource shortages, lack of PPE supplies (face masks, gloves, raincoats, and gumboots), stock-outs of Xpert MTB cartridges and lack of airtime for communication. In addition, inadequate or inconsistent funding limited the frequency of the DTLS visits to health facilities for supervision and delayed the payment of travel costs and allowances to field teams, which hampered TB contact investigation operations. This finding is in contrast with a study conducted in urban Kenya, which found that the TB program received sustainable funding for infrastructure and health workforce for contact investigation [ 32 ]. That study used the WHO health systems framework and focused on stakeholder perspectives of the barriers and facilitators to optimizing TB contact investigation in Nairobi, the capital of Kenya. This funding disparity between rural and urban areas could be due to a higher TB prevalence in most urban settings, which attracts the attention of policy makers and leads them to allocate more resources there [ 33 ].

Consistent with this study, three studies conducted in Botswana, Ethiopia and Uganda reported human resource shortages as a considerable hindrance to TB contact investigation coverage [ 3 , 15 , 16 ]. In urban Uganda, health workers had other competing duties in the TB clinics and therefore did not have sufficient time for community-level activities, including household contact tracing [ 15 ]. In this study, sometimes only one health worker was available for community visits, and they could not complete multiple tasks, such as health education, screening, sample collection, HIV testing and documentation in the registers. The staff shortage is partly attributed to the small number of staff trained in TB and to their being assigned responsibilities in units outside the TB unit [ 3 ].

Another challenge identified in this study was a lack of PPE materials such as masks, gloves, raincoats and gumboots for health workers to protect themselves against TB and other infectious diseases (such as COVID-19). Health staff were hesitant to conduct household contact investigations without masks and gloves, to avoid contracting TB and COVID-19. Similarly, protective gear for harsh weather conditions, such as raincoats and gumboots, was not provided to health workers. There is limited literature on the influence of PPE materials on TB contact investigation coverage, which calls for more research in this area. These findings indicate that the supply chain management system for essential infection control materials is weak and emphasize the need to strengthen mechanisms that guarantee sufficient PPE supplies and sustain the supply chain for these products.

The context within which an intervention is implemented is recognized as a significant determinant of implementation success [ 18 ]. Contextual factors refer to issues about a person or their environment that can positively or negatively affect the delivery of an intervention [ 18 ]. Socio-economic, policy-related, and geographical barriers emerged as contextual barriers in this research. The socio-economic factors included poverty, lack of phones on which patients could be contacted to confirm household visit appointments, stigma, and reluctance to report cough for fear of being labelled as having COVID-19.

In Botswana, Kenya, Ethiopia, and Uganda, the stigma associated with tuberculosis has been reported as a barrier to TB contact investigation [ 3 , 7 , 15 , 16 ]. Although these studies did not specifically focus on TB contact investigation coverage, stigma hindered household visits, because index TB patients avoided home visits by health workers for fear that their status would be disclosed to the community and that they would face discrimination, which could eventually affect demand for and coverage of the intervention. An important observation in our study was that stigma was aggravated by the misconception that every TB patient has HIV, and by the emergence of the COVID-19 pandemic. Tuberculosis and COVID-19 have common respiratory symptoms (cough, fever, and breathing difficulties), making it difficult to distinguish the two. This causes diagnostic confusion, and health workers may also avoid such patients for fear of contracting COVID-19 [ 34 ]. Furthermore, because of the new COVID-19 stigma, patients with a chronic cough might fear coming to the health facilities for diagnosis, thus complicating the two pandemics [ 34 ].

The COVID-19 lockdown policy implemented in 2020 by the Government of Uganda posed significant challenges to TB contact investigation efforts. Both health staff and patients could not access health facilities because of stringent lockdown measures, including travel restrictions and prohibitions on public and private transportation. Additionally, health providers could not conduct home visits to screen the contacts. Similar findings were reported in a study on the impact of COVID-19 on TB programs in Western Pacific nations [ 35 ]. Other COVID-19-related problems encountered in the Western Pacific study included a change in priorities towards the COVID-19 response, as demonstrated by the relocation of TB program staff to the COVID-19 response, and a reduced willingness of patients and contacts to visit health facilities [ 35 ]. Therefore, innovative strategies are required to streamline TB contact investigation in the context of the COVID-19 pandemic.

As reported by Cattamanchi et al., geographical challenges contribute to the failure of TB patients and contacts to present at health facilities for TB care [ 36 ]. In their study, health workers reported that the physical remoteness of patients’ homes from the health facility and the rugged terrain encountered during travel were challenges [ 36 ]. Likewise, in this study, health workers reported that some index TB patients and contacts came from distant and challenging areas, with steep hills and poor road networks, preventing access to health facilities. This challenge was aggravated by poverty: patients and contacts from the periphery of the county could not travel to health facilities because of high transport costs.

Facilitators

All health workers interviewed in this study reported awareness of the intervention. They had even engaged in relevant programs to improve its uptake, including enlisting household contacts, home visits, screening, and sputum sample collection. In addition, the clarification of the various steps demonstrated health workers’ adherence to the organizational protocols for TB contact investigations. The increased awareness and fidelity to the guidelines may be attributed to the development and dissemination of local contact investigation guidelines through training and the use of electronic media, such as WhatsApp. In contrast, a similar study conducted in rural Ethiopia found that awareness of and adherence to the guidelines were poor because of a lack of refresher training [ 3 ].

The health system facilitators that emerged from this study include good provider knowledge and access to information, performance review meetings at the district level, and engagement of district stakeholders to obtain their support. In contrast to other studies in Uganda, Ethiopia, and the USA, provider knowledge and confidence (self-efficacy) worked as a facilitator in this study because staff involved in TB contact investigation had received on-the-job training on various aspects of TB management, including contact investigation, diagnosis, and management [ 3 , 15 , 37 ]. In this study, health workers reported that they had the knowledge, skills, and confidence to conduct TB contact investigations successfully. These results are partly attributed to the quarterly district performance review meetings, in which an orientation on TB contact investigation was done and guidelines were shared with health workers.

Reflection and evaluation in TB contact investigation performance were demonstrated by Karamagi et al. in a quality improvement study to improve case finding in Northern Uganda [ 38 ]. A review meeting was held to discuss progress on active case finding and develop scale-up plans for the intervention [ 38 ]. Similarly, this study found that quarterly district review meetings were held to discuss district and health facility performance, challenges, and improvement strategies in various program components, including TB contact investigation. These reflection meetings involved district-based stakeholders such as NGOs, health workers, TB focal persons, and health facility managers, which promoted ownership of the interventions and helped in resource mobilization. The meetings were also used to review quarterly TB performance and develop action plans to improve multiple TB indicators, including TB contact investigation.

Strengths and limitations of the study

This study had the following strengths. First, we included various health provider categories at different levels of the district healthcare system, including community, health facility and district levels, to obtain different perspectives from the participants. Second, this study used implementation science methods such as the CFIR to investigate perceptions of the challenges and enablers of TB contact investigation coverage in a rural setting. The CFIR provided a framework for developing the semi-structured interview guides and interpreting the study findings, which promotes transferability of these results to other settings.

Some weaknesses were also observed. First, index TB patients and their contacts were not interviewed; therefore, some information on the challenges and enablers of contact investigation coverage from the patients’ and caregivers’ perspective may have been missed. Second, data collection was conducted during the COVID-19 lockdown, and some health workers were inaccessible, especially laboratory personnel involved in pandemic control activities at the time. Consequently, the laboratory may have challenges that were not identified in this study. Third, the COVID-19 pandemic may have aggravated some challenges, which were not so pronounced before the pandemic. Finally, the generalizability of our results to other geographical locations may be limited, because this study was conducted in one district in Uganda, which gives it a smaller scope. However, we included three health facilities in different counties, which may improve transferability to other settings.

Conclusion

This study explored health providers’ perceptions of the barriers and facilitators of TB contact investigation in rural Mbarara district, Southwestern Uganda. It found that most of the challenges limiting TB contact investigations in rural communities are related to the health system, for example inadequate or delayed funding and human resource shortages. The Ministry of Health in Uganda must therefore strengthen the health system building blocks, particularly health financing and human resources, to establish a robust TB control program that will enable the efficient identification of missing TB patients. The study also demonstrated the unique challenges affecting rural settings regarding tuberculosis contact investigation, including lack of personal protective equipment, stock-out of Xpert MTB cartridges, shortage of airtime for communication, TB-related stigma, and inconsistent funding for TB contact investigation. Further research is needed to determine the effectiveness of potential implementation strategies for eliminating these barriers in rural communities. In addition, given the disruptive effect of the COVID-19 pandemic on the achievement of optimal TB contact investigation coverage, there is a need to develop measures for integrating COVID-19 and TB contact investigation interventions.

Data availability

The datasets used in the current study are available from the corresponding author on reasonable request.

References

1. World Health Organization. Global Tuberculosis Report 2023. 2023. https://www.who.int/teams/global-tuberculosis-programme/tb-reports/global-tuberculosis-report-2023

2. World Health Organization. Global Tuberculosis Report. 2019. https://www.who.int/tb/global-report-2019

3. Tesfaye L, Lemu YK, Tareke KG, Chaka M, Feyissa GT. Exploration of barriers and facilitators to household contact tracing of index tuberculosis cases in Anlemo District, Hadiya Zone, Southern Ethiopia: qualitative study. PLoS ONE. 2020;15(5):e0233358.

4. Zumla A, Petersen E, Nyirenda T, Chakaya J. Tackling the tuberculosis epidemic in sub-Saharan Africa–unique opportunities arising from the second European developing countries clinical trials Partnership (EDCTP) programme 2015–2024. Int J Infect Dis. 2015;32:46–9.

5. Joint United Nations Programme on HIV/AIDS (UNAIDS). Tuberculosis and HIV. 2020. https://www.unaids.org/sites/default/files/media_asset/tb-and-hiv_en.pdf

6. World Health Organization. Recommendations for investigating contacts of persons with infectious tuberculosis in low- and middle-income countries. 2012. https://www.who.int/tb/publications/2012/contact_investigation2012/en/

7. Marangu DM. Optimizing tuberculosis contact investigation and linkage to care in Nairobi, Kenya: TB KWISHA. University of Nairobi; 2018.

8. Uganda Ministry of Health. Tuberculosis contact investigation in Uganda, operational guide 2019. Kampala, Uganda: Ministry of Health; 2019.

9. Uganda Ministry of Health. Manual for management and control of tuberculosis and leprosy in Uganda. 2017. https://health.go.ug/sites/default/files/NTLP%20Manual%203rd%20edition_17th%20Aug_final.pdf

10. Blok L, Sahu S, Creswell J, Alba S, Stevens R, Bakker MI. Comparative meta-analysis of tuberculosis contact investigation interventions in eleven high burden countries. PLoS ONE. 2015;10(3):e0119822.

11. Armstrong-Hough M, Turimumahoro P, Meyer AJ, Ochom E, Babirye D, Ayakaka I, Mark D, Ggita J, Cattamanchi A, Dowdy D, et al. Drop-out from the tuberculosis contact investigation cascade in a routine public health setting in urban Uganda: a prospective, multi-center study. PLoS ONE. 2017;12(11):e0187145.

12. Centres for Disease Control and Prevention. Finding the missing cases: The role of enhanced diagnostics and case-finding in reaching all people with TB (fact sheet). 2020. https://www.cdc.gov/globalhivtb/who-we-are/resources/keyareafactsheets/finding-the-missing-4-million.pdf

13. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. 2015. https://www.unfpa.org/resources/transforming-our-world-2030-agenda-sustainable-development

14. Theobald S, Brandes N, Gyampong M, El-Saharty M, Proctor E, Diaz T, Wanji S, Elloker S, Raven J, Elsey H, et al. Implementation research: new imperatives and opportunities in global health. Lancet. 2018;392(10160):2214–28.

15. Ayakaka I, Ackerman S, Ggita JM, Kajubi P, Dowdy D, Haberer JE, Fair E, Hopewell P, Handley MA, Cattamanchi A, et al. Identifying barriers to and facilitators of tuberculosis contact investigation in Kampala, Uganda: a behavioral approach. Implement Sci. 2017;12(1):33.

16. Tlale L, Frasso R, Kgosiesele O, Selemogo M, Mothei Q, Habte D, Steenhoff A. Factors influencing health care workers’ implementation of tuberculosis contact tracing in Kweneng, Botswana. Pan Afr Med J. 2016;24:229.

17. Uganda Ministry of Health. The Uganda National Tuberculosis Prevalence Survey, 2014–2015 Survey Report. 2016. https://health.go.ug/sites/default/files/Uganda%20National%20TB%20Prevalence%20Survey%202014-2015_final%2023rd%20Aug17.pdf

18. Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implement Sci. 2009;4:50.

19. Adamu AA, Uthman AO, Gadanya MA, Wiysonge CS. Using the consolidated framework for implementation research (CFIR) to assess the implementation context of a quality improvement program to reduce missed opportunities for vaccination in Kano, Nigeria: a mixed methods study. Hum Vaccin Immunother. 2019;16(2):465–75.

20. Gyamfi J, Allegrante JP, Iwelunmor J, Williams O, Plange-Rhule J, Blackstone S, Ntim M, Apusiga K, Peprah E, Ogedegbe G. Application of the Consolidated Framework for Implementation Research to examine nurses’ perception of the task shifting strategy for hypertension control trial in Ghana. BMC Health Serv Res. 2020;20(1):65.

21. Warner G, Lawson B, Sampalli T, Burge F, Gibson R, Wood S. Applying the consolidated framework for implementation research to identify barriers affecting implementation of an online frailty tool into primary health care: a qualitative study. BMC Health Serv Res. 2018;18(1):395.

22. Muddu M, Tusubira AK, Nakirya B, Nalwoga R, Semitala FC, Akiteng AR, Schwartz JI, Ssinabulya I. Exploring barriers and facilitators to integrated hypertension-HIV management in Ugandan HIV clinics using the Consolidated Framework for Implementation Research (CFIR). Implement Sci Commun. 2020;1:45.

23. Uganda Bureau of Statistics. Mbarara District Local Government statistical abstract 2016/17. 2017. https://www.mbarara.go.ug/sites/default/files/downloads/Statistical%20Abstract%202017%20Final.pdf

24. Uganda Ministry of Health. National Health Facility Master List; a complete list of all health facilities in Uganda. 2018.

25. Turyamureba M, Bruno LY, Oryema JB. Health Care Delivery System in Uganda: a review. Tanzan J Health Res. 2023;24(2).

26. Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int J Qual Health Care. 2007;19(6):349–57.

27. Saunders B, Sim J, Kingstone T, Baker S, Waterfield J, Bartlam B, Burroughs H, Jinks C. Saturation in qualitative research: exploring its conceptualization and operationalization. Qual Quant. 2018;52(4):1893–907.

28. Rowlands J. Interviewee Transcript Review as a Tool to Improve Data Quality and participant confidence in Sensitive Research. Int J Qual Methods. 2021;20:16094069211066170.

29. Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol. 2006;3(2):77–101.

30. O’Connor C, Joffe H. Intercoder Reliability in Qualitative Research: debates and practical guidelines. Int J Qual Methods. 2020;19:1–13.

31. Davis JL, Turimumahoro P, Meyer AJ, Ayakaka I, Ochom E, Ggita J, Mark D, Babirye D, Okello DA, Mugabe F, et al. Home-based Tuberculosis contact investigation in Uganda: a household randomised trial. ERJ Open Res. 2019;5(3).

32. Marangu D, Mwaniki H, Nduku S, Maleche-Obimbo E, Jaoko W, Babigumira J, John-Stewart G, Rao D. Stakeholder perspectives for optimization of tuberculosis contact investigation in a high-burden setting. PLoS ONE. 2017;12(9):e0183749.

33. Mutembo S, Mutanga JN, Musokotwane K, Kanene C, Dobbin K, Yao X, Li C, Marconi VC, Whalen CC. Urban-rural disparities in treatment outcomes among recurrent TB cases in Southern Province, Zambia. BMC Infect Dis. 2019;19(1087):1–8.

34. Togun T, Kampmann B, Stoker NG, Lipman M. Anticipating the impact of the COVID-19 pandemic on TB patients and TB control programmes. Ann Clin Microbiol Antimicrob. 2020;19(1):21.

35. Chiang CY, Islam T, Xu C. The impact of COVID-19 and the restoration of tuberculosis services in the Western Pacific Region. Eur Respir J. 2020;56:2003054.

36. Cattamanchi A, Miller CR, Tapley A, Haguma P, Ochom E, Ackerman S, Davis JL, Katamba A, Handley MA. Health worker perspectives on barriers to delivery of routine tuberculosis diagnostic evaluation services in Uganda: a qualitative study to guide clinic-based interventions. BMC Health Serv Res. 2015;15:10.

37. Wilce M, Shrestha-Kuwahara R, Taylor Z, Qualls N, Marks S. Tuberculosis Contact Investigation policies, practices, and challenges in 11 U.S. communities. J Public Health Manag Pract. 2017;8(6):69–78.

38. Karamagi E, Sensalire S, Muhire M, Kisamba H, Byabagambi J, Rahimzai M, Mugabe F, George U, Calnan J, Seyoum D, et al. Improving TB case notification in northern Uganda: evidence of a quality improvement-guided active case finding intervention. BMC Health Serv Res. 2018;18(1):954.

Acknowledgements

I acknowledge the contribution of Grace Ayebazibwe (GA), who supported me during the data collection and analysis by taking field notes, transcription, and translation of audio recordings.

This research work was supported by TDR, the Special Program for Research and Training in Tropical Diseases, which is hosted at the World Health Organization, and co-sponsored by UNICEF, UNDP, the World Bank and WHO. TDR grant number: B40299, first author ORCID ID: 0000-0001-9722-1202. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funder.

Author information

Authors and affiliations.

School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

Paddy Mutungi Tukamuhebwa, Pascalia Munyewende & Ntombizodwa Ndlovu

Department of Epidemiology and Biostatistics, School of Public Health, Makerere University, Kampala, Uganda

Nazarius Mbona Tumwesigye

Department of Health Policy, Planning and Management, School of Public Health, Makerere University, Kampala, Uganda

Juliet Nabirye

Contributions

PT, NN and PM participated in the conceptualization and design of the study, developing interview guides, writing the initial version of the manuscript, and reviewing subsequent versions, with substantial input from NMT. With assistance from NN and PM, PT and JN conducted the data analysis. Each author contributed to the writing of the manuscript, and they all reviewed and gave their approval for publishing of the final draft.

Corresponding author

Correspondence to Paddy Mutungi Tukamuhebwa.

Ethics declarations

Ethics approval and consent to participate.

This study was cleared by the Human Research Ethics Committees at the University of the Witwatersrand Medical (M200101), and the Research Ethics Committee at Mbarara University of Science and Technology (MUREC 1/7). Permission to conduct the study was obtained from the Uganda National Council of Science and Technology (HS569ES). All participants provided written informed consent.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article.

Tukamuhebwa, P.M., Munyewende, P., Tumwesigye, N.M. et al. Health worker perspectives on barriers and facilitators of tuberculosis investigation coverage among index case contacts in rural Southwestern Uganda: a qualitative study. BMC Infect Dis 24, 867 (2024). https://doi.org/10.1186/s12879-024-09798-9

Received: 04 May 2024

Accepted: 22 August 2024

Published: 26 August 2024

DOI: https://doi.org/10.1186/s12879-024-09798-9

  • TB contact investigation
  • Consolidated Framework for Implementation Research

BMC Infectious Diseases

ISSN: 1471-2334
