Big data applications: overview, challenges and future

  • Open access
  • Published: 16 September 2024
  • Volume 57, article number 290 (2024)


  • Afzal Badshah
  • Ali Daud
  • Riad Alharbey
  • Ameen Banjar
  • Amal Bukhari
  • Bader Alshemaimri

Big data (e.g., social big data, vehicular big data, and healthcare big data) refers to massive and complex data that requires special technologies and approaches for storage, processing, and analysis. Similarly, big data applications are software and systems that use large and complex datasets to extract insights, support decision-making, and address diverse business and societal challenges. Recently, the significance of big data applications has grown immensely for organizations across diverse sectors as they increasingly rely on insights derived from data. This growing reliance has rendered traditional technologies and platforms inefficient because of their scalability limitations and performance issues. This study contributes by identifying key domains impacted by big data, examining its effect on decision-making, addressing inherent complexities and opportunities, exploring core technologies, and offering solutions for potential concerns. Additionally, it conducts a comparative analysis to demonstrate the superiority of this research. These contributions provide valuable insights into the evolving landscape shaped by big data applications.


1 Introduction

In the present digital era, big data has emerged as a transformative force, reshaping how organizations collect, store, and analyze extensive datasets. The significant increase in data produced from sources such as social media, autonomous vehicles, and sensors has made Big Data Analytics (BDA) essential for businesses and industries globally. Big data, as defined by the 3 Vs, encompasses large volumes of diverse and rapidly arriving data with potential uncertainties about its quality and availability (Laney 2001). The three Vs comprise volume (referring to large datasets), variety (involving diverse data formats), and velocity (indicating the rapid generation of data) (Badshah et al. 2024).

Fig. 1 Big data ecosystem

The Big Data Ecosystem comprises six key categories of tools essential for efficient large-scale data management (shown in Fig. 1). Data Technologies, including Apache Hadoop and Apache Spark, analyze and process Big Data beyond traditional capabilities. Analytics and Visualization tools, such as Tableau and SAS, uncover patterns, while Business Intelligence tools like Cognos transform raw data for business analysis. Cloud Service Providers, like AWS and GCP, offer fundamental infrastructure. NoSQL Databases, including MongoDB and Cassandra, handle Big Data processing, and Programming Tools like R and Python perform analytical tasks and operationalize Big Data, completing this vital ecosystem (Coursera 2023).

The applications of big data are diverse and far-reaching, spanning healthcare, supply chain and logistics, marketing and advertising, smart cities, media and entertainment, cybersecurity, climate & earth science, industry, and education. The primary objective of big data lies in its analysis for diverse purposes. Harnessing the capabilities of BDA enables organizations to discover important insights, recognize patterns, and make informed, data-driven decisions. These decisions, in turn, enhance operational efficiency, drive innovation, and improve customer experiences. From personalized healthcare treatments to predictive maintenance in manufacturing, big data is transforming industries and shaping the future of how we live and work (Himeur et al. 2023 ; Talaoui et al. 2023 ).

In the current technological research landscape, big data plays a pivotal role, focusing on the analysis, processing, and extraction of valuable information from extensive and intricate datasets. The foundation of BDA is intricately linked with advanced technologies, specifically Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and the Internet of Things (IoT). Figure  2 illustrates the processing stages of big data. For a better understanding of this article, Table 1 outlines the terminologies used in the study.

Fig. 2 Structure of the big data cycle

Fig. 3 Devices, data and revenue forecast from 2017 to 2025

The global big data market is expected to grow remarkably, with revenue projected to reach USD 473.6 billion by 2030, reflecting a growth rate of 12.7% from 2022 to 2030 (Research and Consulting 2023). This substantial growth underscores the increasing recognition of big data’s critical role across industries and sectors. Simultaneously, current estimates indicate a massive increase in data generation, with the world expected to produce 175 zettabytes of data by 2025 (Statista 2023), as shown in Fig. 3. This exponential increase highlights the expanding scope and importance of big data as a critical tool for managing, analyzing, and deriving insights from this colossal volume of information.
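To make the market projection concrete, the short sketch below works backwards from the 2030 figure. It assumes that the quoted 12.7% is a compound annual growth rate applied over the eight years from 2022 to 2030; the source report may define the rate differently, so the implied 2022 value is only illustrative.

```python
# Assumption: 12.7% is a compound annual growth rate (CAGR) over 2022-2030.
TARGET_REVENUE_BN = 473.6   # projected 2030 revenue in USD billion (from the text)
GROWTH_RATE = 0.127         # growth rate quoted in the text
START_YEAR, END_YEAR = 2022, 2030


def implied_base(target: float, rate: float, years: int) -> float:
    """Revenue in the start year implied by compounding `rate` for `years` years."""
    return target / (1.0 + rate) ** years


years = END_YEAR - START_YEAR
base_2022 = implied_base(TARGET_REVENUE_BN, GROWTH_RATE, years)

# Print the implied year-by-year trajectory under the CAGR assumption.
for offset in range(years + 1):
    revenue = base_2022 * (1.0 + GROWTH_RATE) ** offset
    print(f"{START_YEAR + offset}: USD {revenue:.1f} billion")
```

Under this assumption the market would need to be a little under USD 200 billion in 2022 to reach the stated 2030 value.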

The widespread adoption of big data is propelled by the exponential surge in data volume, the extensive use of cloud computing, global digital transformation, increasing internet and smartphone usage, and accelerated uptake due to the COVID-19 pandemic. Leading companies such as Google, Amazon, and other tech giants play a crucial role in the big data ecosystem, contributing significantly to the development and advancement of big data technologies, influencing trends, and shaping the future trajectory of this dynamic field.

Numerous literature surveys have explored big data applications. Focusing on healthcare, Hong et al. (2018), Abouelmehdi et al. (2018), Rajabion et al. (2019), and Galetsi et al. (2019) conducted thorough reviews of big data’s impact in the healthcare sector, often termed Healthcare Big Data (HBD). Vehicular Big Data (VBD), referring to big data in vehicles, has received significant attention, with comprehensive reviews in Nguyen et al. (2018), Torre-Bastida et al. (2018), Ghofrani et al. (2018), and Mishra et al. (2018). Concurrently, Urban Big Data (UBD), associated with smart cities, has been deeply explored in Allam and Dhunny (2019), Karimi et al. (2021), Mohammadi and Al (2018), and Huang et al. (2021). Exploring the intersection of big data and cybersecurity, Alani (2021), Ullah and Babar (2019), and Srivastava and Jaiswal (2019) provide comprehensive reviews. The industrial sector, often referred to as Industrial Big Data (IBD), is examined in Qi (2020), Misra et al. (2020), and Mosavi et al. (2018), while the education sector is reviewed in Luan et al. (2020), Baig et al. (2020), and Li and Jiang (2021). Notably, Akter and Wamba (2019), Amani et al. (2020), and Huang et al. (2018) have extensively explored the use of big data in earth sciences and disaster management, usually referred to as Earth Big Data (EBD). This collective exploration paints a comprehensive picture of the diverse applications and impacts of big data across various domains.

After an in-depth analysis of the available literature, it becomes apparent that individual literature reviews have been conducted across various domains, such as big data in healthcare, vehicles, finance, agriculture, education, etc. However, a wide gap exists in the collective analysis of big data applications. In bridging this gap, it is crucial to undertake a comprehensive assessment of how big data substantially contributes to diverse fields, discern the challenges it presents, delve into ethical concerns, and illuminate emerging applications. Therefore, this article aims to make the following contributions:

This research systematically identifies and analyzes key domains profoundly influenced by big data applications, providing a comprehensive understanding through the exploration of prominent use cases.

The study examines the transformation of decision-making processes in these domains due to big data, emphasizing how data-driven insights contribute to informed decision-making and enrich the existing knowledge on the subject.

This research addresses the limitations and potential of big data applications in diverse fields, emphasizing inherent complexities and opportunities.

This research delves into the core technologies employed for storing, processing, and analyzing large datasets, elucidating their significance in big data applications.

This research systematically identifies and addresses potential concerns within big data, offering viable solutions and mitigation strategies.

The study conducts a comparative analysis of this research survey with related surveys to demonstrate the unique contributions and superiority of this study.

The remainder of this article is organized as follows. Section 2 explains the research methodology employed for this study, covering the search string as well as the inclusion and exclusion criteria. Section 3 classifies the literature on big data applications and analyzes this body of work. Section 4 explores the technologies used to process and store big data. Section 5 examines potential concerns with the use of big data in various domains and introduces solutions that can address them. Section 6 compares this study in detail with the related literature to show the uniqueness of this paper. Finally, Sect. 7 concludes the study.

2 Research methodology

For this review of big data applications, we considered only articles published from 2018 to 2023. Our search encompassed Google Scholar, Scopus, IEEE Xplore, and ScienceDirect to identify pertinent papers. Google Scholar offers access to papers published in any journal, whereas research libraries provide access to a more limited but high-quality selection of papers published in affiliated journals and by specific publishers.

In any search of electronic libraries, a pivotal element is the search string, which determines the quality of the search. The search string incorporates keywords that encapsulate the population, methodology, and outcomes. The methodology of this research paper is organized into three stages: (i) the Planning phase, (ii) the Conducting phase, and (iii) the Reporting phase. The remainder of this section examines these phases.

2.1 Planning the review

In the initial stage, we designed the review’s framework, which encompassed formulating the study protocol, identifying relevant journals and research papers, establishing inclusion and exclusion criteria, and defining our reporting strategy. The planning phase serves two fundamental purposes: (i) emphasizing the significance and necessity of this study, setting it apart from similar research endeavours; and (ii) formulating a robust protocol for conducting a comprehensive search of relevant studies while establishing clear criteria for their inclusion and exclusion.

Creating a well-defined review protocol is of utmost importance. A suitable protocol guides us towards a comprehensive review, while an invalid one may divert authors from the main focus. Therefore, this stage covers developing the research questions, the search approach, and the selection criteria.

2.2 Conducting the review

During this stage, the research is executed following the protocol delineated in Phase 1. The primary emphasis lies in identifying pertinent research studies and subjecting them to scrutiny based on three pivotal criteria.

2.2.1 Population

To start, we define the population of the study. This research encompasses various domains impacted by big data applications, so the population of this review includes diverse fields where big data plays a significant role. The primary keyword is ‘Big Data’. This overarching term encompasses various facets, including ‘Earth Big Data (EBD)’, ‘Vehicular Big Data (VBD)’, ‘Healthcare Big Data (HBD)’, ‘Urban Big Data (UBD)’, ‘Industrial Big Data (IBD)’, and ‘Education Big Data (EdBD)’. The inclusion of these related terms ensures a comprehensive exploration of diverse fields impacted by big data applications.

2.2.2 Methodology or technique

The second criterion involves the methodology or technique employed to achieve the intended outcomes. In this context, big data applications serve as the fundamental technique for obtaining desired results across different domains. The methodology or technique employed in the research is centred around the domain of Data Processing Techniques. The keywords utilized for this area include ‘Data Analytics’, ‘Machine Learning’, and ‘Artificial Intelligence’. These keywords are chosen to encapsulate the fundamental techniques employed to achieve intended outcomes across different domains.

2.2.3 Outcome

The third and final check revolves around the outcomes achieved in each research study. In the case of this review, the outcomes pertain to the applications and impacts of big data within the specified domains. The keywords employed in this context encompass ‘Healthcare’, ‘Supply-chain’, ‘Logistics’, ‘Marketing’, ‘Advertisement’, ‘Smart Cities’, ‘Media’, ‘Cybersecurity’, ‘Climate’, ‘Industry’, and ‘Education’. These keywords are strategically chosen to align with specific domains, ensuring a focused investigation into the applications and impacts of big data within each specified area.

Therefore, the following search string was used to search for papers on the different platforms. Table 2 shows the details of the search string.

(Big Data) AND (Data Analytics OR Machine Learning OR Artificial Intelligence) AND (Healthcare OR Supply Chain OR Transport OR Marketing OR Advertisement OR Smart Cities OR Social Media OR Climate OR Earth Science OR Industry)
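The structure of this search string (keywords joined with OR inside each criterion, criteria joined with AND) can be sketched in a few lines. The group names `population`, `technique`, and `outcome` are labels chosen here for illustration; the keyword lists are copied from the string above.

```python
# Keyword groups taken from the search string; the variable names are illustrative.
population = ["Big Data"]
technique = ["Data Analytics", "Machine Learning", "Artificial Intelligence"]
outcome = [
    "Healthcare", "Supply Chain", "Transport", "Marketing", "Advertisement",
    "Smart Cities", "Social Media", "Climate", "Earth Science", "Industry",
]


def build_search_string(*groups: list[str]) -> str:
    """Join keywords with OR inside each group and join the groups with AND."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)


query = build_search_string(population, technique, outcome)
print(query)
```

Keeping the groups as separate lists makes it straightforward to adapt the string to the query syntax of each search platform.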

Following these criteria, the next critical step is formulating the research questions. In the context of this review, the research questions are as follows:

Which key domains are profoundly influenced by big data applications, and what are the prominent use cases within these domains?

How has big data transformed decision-making processes in these key domains, and how do data-driven insights contribute to informed decision-making?

What are the limitations and potentials associated with big data applications in diverse fields, and what inherent complexities and opportunities do they present?

What core technologies are employed for storing, processing, and analyzing large datasets in big data applications, and what is their significance?

What potential concerns exist within big data applications, and what viable solutions and mitigation strategies can address these concerns?

These research questions guide the review process, providing a structured framework for analyzing the selected research studies and synthesizing their findings. The questions address various aspects of big data applications, from their impact on different domains to the challenges, ethical considerations, and emerging trends associated with their use.

2.3 Quality assessment

The evaluation of this study’s quality depends on various essential parameters. To ensure the robustness and relevance of the papers included in this review, the following parameters have been established.

Inclusion criteria:

Relevance to big data applications: Selected papers must address topics related to big data applications, ensuring the content aligns with the central theme of this review.

Methodology and results presentation: Included research articles should present their methodology and results in a clear and organized manner, enhancing the comprehensibility of their findings.

Citation threshold: Research articles under consideration must have a minimum of 10 citations, reflecting their impact and recognition within the academic community. However, to include recent articles, this count is reduced to 5 for 2023 publications.

Publication year: Selected research articles should have been published since 2018 to ensure the incorporation of recent developments in the field.

Exclusion criteria:

Irrelevance to big data role: Research papers that do not discuss the role of big data in any domain will be excluded from the review, as they fall outside the scope of this study.

Inadequate presentation of results and methodology: Papers that do not adequately present the results and methodology used to achieve desired outcomes will be excluded to maintain the quality and rigour of the review.

Insufficient citations: Research papers that have failed to garner at least 10 citations (5 for 2023 publications) may be excluded to ensure the inclusion of well-recognized and influential works.

Publication year: Research articles not published between 2018 and 2023 will be excluded to focus on more recent and relevant literature.
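The inclusion and exclusion criteria above can be read as a single screening predicate. The sketch below is one possible encoding, not the authors' actual tooling: the `Paper` record and its field names are hypothetical, and the `relevant` and `clear_methods` flags stand in for judgements a reviewer would make by reading each paper.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for a candidate paper; field names are illustrative."""
    title: str
    year: int
    citations: int
    relevant: bool        # reviewer judgement: addresses big data applications
    clear_methods: bool   # reviewer judgement: methodology and results clearly presented


def min_citations(year: int) -> int:
    """Citation threshold: 10 in general, relaxed to 5 for 2023 publications."""
    return 5 if year == 2023 else 10


def is_included(paper: Paper) -> bool:
    """Apply the four criteria: relevance, presentation, citations, publication year."""
    return (
        paper.relevant
        and paper.clear_methods
        and 2018 <= paper.year <= 2023
        and paper.citations >= min_citations(paper.year)
    )


# Made-up candidates used only to exercise the predicate.
candidates = [
    Paper("Example A", 2019, 42, True, True),
    Paper("Example B", 2023, 6, True, True),
    Paper("Example C", 2021, 8, True, True),
    Paper("Example D", 2017, 300, True, True),
]
selected = [p.title for p in candidates if is_included(p)]
print(selected)
```

The example candidates are invented; they show that a 2023 paper with 6 citations passes while a 2021 paper with 8 citations does not.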

2.4 Reporting the review

In the final stage, this study extracts and presents papers that are relevant to the keywords and research questions. The impact of the review depends on how the final assessment is presented in the paper.

We conducted a Google search that yielded a total of 22,080 results across various categories, comprising Conferences (17,582), Journals (3,485), Books (443), Magazines (328), Early Access Articles (228), Standards (12), and Courses (2).

We used the above-mentioned inclusion and exclusion criteria to filter the content. After careful consideration, we selected only the 125 papers that fulfilled the inclusion criteria.
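As a quick consistency check on the reported figures, the sketch below sums the per-category counts and computes the share of results that survived screening. The numbers are taken from the text; the variable names are illustrative.

```python
# Per-category result counts as reported in the text.
results_by_category = {
    "Conferences": 17582,
    "Journals": 3485,
    "Books": 443,
    "Magazines": 328,
    "Early Access Articles": 228,
    "Standards": 12,
    "Courses": 2,
}
REPORTED_TOTAL = 22080
SELECTED_PAPERS = 125

total = sum(results_by_category.values())
selection_share = SELECTED_PAPERS / total

print(f"Sum of categories: {total} (reported: {REPORTED_TOTAL})")
print(f"Share of results selected: {selection_share:.2%}")
```

The category counts add up to the reported total, and well under one percent of the initial results were retained.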

Table 3 shows the number of publications by year and by category from 2018 to 2023, while Table 13 displays the category-wise publications since 2018.

Fig. 4 Big data applications classification

3 Big data application classification

Big data applications have become ubiquitous, establishing themselves as a fundamental technology with a pervasive role across various fields, much like computers. This section systematically categorizes these applications, delving into their impact, challenges, and prospects. The primary domains of big data applications include Healthcare, Supply Chain and Transport, Marketing and Advertising, Smart Cities, Media, Cyber Security, Earth Science, Industry, Education, and others. Figure 4 visually presents the classification of these diverse big data applications.

3.1 Big data in healthcare

Big data plays a central role in Health 4.0, reshaping healthcare through data-driven research. Analyzing biomedical omics and clinical data offers both challenges and opportunities for healthcare improvement (Ahmed et al. 2023 ). The healthcare industry generates vast data, including hospital records, medical exams, and research, necessitating proper management for meaningful insights (Philip et al. 2022 ). Healthcare’s BDA can enable personalized medicine, clinical risk management, and forecasting, alongside standardizing medical terminology and patient registration (Tohka and Van Gils 2021 ; Masood et al. 2018b ). Table 4 summarizes the big data applications and Fig.  5 shows the HBD categorization.

Fig. 5 Big data in healthcare

Integrating biomedical and healthcare data empowers modern organizations to revolutionize medical therapies and personalize treatment (Mehta and Pandit 2018 ). Big data and e-business complement modern hospital management, transforming fragmented systems into comprehensive, omnidirectional healthcare management (Dash et al. 2019 ). Health analytics using big data aids in developing effective medical policies, improving healthcare services, and enhancing disease prediction, drug recommendations, and treatment outcomes (Zhang et al. 2023 ).

A robust BDA platform at Xiangya Hospital’s Gastroenterology Department, China, facilitates comprehensive digestive medicine analysis. The platform combines electronic medical records and colonoscopy data, offering insights into optimal colorectal cancer screening ages and improving healthcare management (Yan et al. 2019). Leveraging prescription big data can enhance dosage prediction in pediatric medication, since traditional clinical decision support systems often lack accurate pediatric data. In Wu et al. (2019), the authors propose a data-driven approach for precise pediatric medication dosage predictions. The authors in Zhou et al. (2021) introduce a trackable patient health data search system for smart city hospital management, ensuring data privacy and efficient analysis. In Makkie et al. (2018), the authors discuss challenges in analyzing MRI big data and introduce a distributed computing platform using Hadoop and Spark for fMRI data processing.

In telemedicine, an innovative big data visualization methodology is proposed (Galletta et al. 2018). This graphical tool allows remote monitoring of patient health using coloured circles to represent various health data, adhering to the GeoJSON standard for data classification. Additionally, the authors in Hong et al. (2019) suggest a medical-history-based algorithm for predicting potential diseases accurately. This algorithm utilizes HBD and DL technology, providing references for targeted medical examinations and reducing delays in treatment due to unclear symptoms or limited professional knowledge. Similarly, the authors in Yadav and Jadhav (2019) employ medical big data in disease recognition.
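To illustrate the idea of coloured markers carried in GeoJSON, the sketch below encodes one reading per patient as a GeoJSON `Feature` with a colour property. It is not the data model of Galletta et al.: the property names (`patient_id`, `heart_rate_bpm`, `marker_colour`), the heart-rate thresholds, and the coordinates are all assumptions made for illustration. Only the `FeatureCollection`/`Feature`/`Point` structure and the longitude-then-latitude coordinate order come from the GeoJSON format itself.

```python
import json


def colour_for_heart_rate(bpm: int) -> str:
    """Map a heart rate to a marker colour (illustrative thresholds, not clinical guidance)."""
    if bpm < 50 or bpm > 120:
        return "red"
    if bpm < 60 or bpm > 100:
        return "orange"
    return "green"


def patient_feature(patient_id: str, lon: float, lat: float, bpm: int) -> dict:
    """Build a GeoJSON Feature; coordinates are [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "patient_id": patient_id,        # hypothetical property name
            "heart_rate_bpm": bpm,           # hypothetical property name
            "marker_colour": colour_for_heart_rate(bpm),
        },
    }


# Made-up readings at arbitrary coordinates.
collection = {
    "type": "FeatureCollection",
    "features": [
        patient_feature("p-001", 10.00, 45.00, 72),
        patient_feature("p-002", 10.05, 45.02, 110),
        patient_feature("p-003", 10.10, 45.01, 135),
    ],
}
print(json.dumps(collection, indent=2))
```

A map viewer that understands GeoJSON could then draw each point as a circle and read its fill colour from the `marker_colour` property.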

While health big data is vital for disease detection, migrating it to the cloud faces challenges such as data standards and data sensitivity. The authors in Wu et al. (2019) advocate a cloud-native healthcare data ingestion service to address these challenges and establish best practices. Similarly, the authors in Zhou et al. (2020) present a scalable system that securely stores and analyzes healthcare data from IoT devices using big data systems and blockchain architecture.

In the future, healthcare organizations will increasingly embrace big data for success. The use of HBD will enhance marketing strategies, especially with the growing popularity of wearable technology and the IoT. The integration of constant patient monitoring data from these sources will provide valuable insights, enabling healthcare marketers to identify and engage patients more effectively.

However, several concerns are associated with the utilization of HBD. One of the primary challenges involves network congestion and delays. Massive data generation, particularly during peak hours, congests the network, and real-time healthcare applications running during these times are significantly affected (Adeghe et al. 2024). This is the main reason why real-time healthcare applications do not trust the network. Furthermore, HBD is directly linked to lives; therefore, it will take time and maturity to build trust in this technology. Healthcare data is also used for several tasks, such as research and treatment, yet no consent is obtained in this regard (Al Teneiji et al. 2024).

3.2 Big data in logistics and transport

The integration of big data in logistics and transport has gained significant attention (Yadav and Jadhav 2019). Researchers have delved into BDA within supply chain management (SCM), identifying its potential to rectify deficiencies, enhance efficiency, and reduce costs (Lwin et al. 2019; Jahani et al. 2023). In particular, in the context of the COVID-19 pandemic, logistics firms harnessed big data and supply chain integration (SCI) to optimize supply chain performance (Ved and B 2019; Fosso Wamba et al. 2018). Table 5 summarizes the big data applications and Fig. 6 shows the logistics and transport big data categorization.

Fig. 6 Big data in supply chain and logistics

Moreover, the synergy between big data analytics technology capability (BDATC) and SCI has been observed to bolster supply chain performance by fostering proactive and reactive capabilities, as well as resource reconfiguration (Leng et al. 2020 ). Blockchain technology has also made inroads in logistics and supply chain systems, lending technical support and mitigating risks (Chen et al. 2022b ). In tandem, AI and big data analysis are utilized to scrutinize logistics service supply chain models, augmenting customer satisfaction and optimizing logistics operations (Farchi et al. 2023 ).

Significantly, the assessment of service capability in maritime logistics enterprises relies heavily on the extensive big data resources derived from the IoT supply chain system. This evaluation is crucial due to the numerous factors influencing maritime logistics, including overseas transportation routes. In Zhu and Du ( 2022 ), the authors suggest an approach for evaluating the service capabilities of maritime logistics enterprises by leveraging big data from the IoT supply chain system.

Moreover, an advanced cloud blockchain and Internet of Everything (IoE) enabled quality control platform seeks to improve quality management and bolster consumer confidence in perishable supply chain logistics, as discussed in Yang et al. ( 2022 ). This platform enables swift sensor data acquisition, ensuring authentication and transparency within cold supply chain logistics. In Jiang ( 2019 ), the authors endorse an intelligent supply chain model based on the IoT and big data. The objective of this model is to enhance information collaboration efficiency while mitigating the risks of supply chain disruption.

In the context of internet supply chain finance, compressed sensing proves to be a valuable method for conducting risk assessments within big data. The authors in Lyu and Zhao (2019) investigated the development of a risk assessment system for internet supply chain finance, harnessing compressed sensing and big data analysis. Furthermore, blockchain technology emerges as a robust solution to address security challenges in the integration of intelligent transportation systems (ITS) and big data (Zhili et al. 2021). By using blockchain, data trustworthiness, transparency, and integrity are assured, surpassing the security standards of centralized databases.
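The integrity property attributed to blockchain here rests on hash chaining: each block stores the hash of its predecessor, so altering any earlier record invalidates every later link. The sketch below shows only that mechanism, using Python's standard `hashlib`. It is a deliberately simplified, single-node illustration with no consensus, signatures, or networking, and the record fields (`vehicle`, `event`) are made up.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor hash for the first block


def block_hash(index: int, payload: dict, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps({"index": index, "payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: list, payload: dict) -> None:
    """Append a block that commits to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "prev": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    })


def is_valid(chain: list) -> bool:
    """Recompute every hash and check that each block points at its predecessor."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["payload"], block["prev"]):
            return False
        prev_hash = block["hash"]
    return True


# Made-up transport records.
chain: list = []
append_block(chain, {"vehicle": "truck-17", "event": "departed depot"})
append_block(chain, {"vehicle": "truck-17", "event": "arrived at port"})
valid_before = is_valid(chain)

chain[0]["payload"]["event"] = "never departed"  # tamper with an early record
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Editing the first record changes its recomputed hash, so validation fails even though the later block was left untouched.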

The future holds significant promise for integrating big data in logistics and transportation. As highlighted by a research study (Insider 2023 ), last-mile delivery, a substantial portion of total shipping expenses, faces challenges such as carrier collaboration, manual processes, driver retention, fuel costs, WISMO ("Where is my order?") calls, and return costs. These challenges provide opportunities for optimization through the effective application of big data solutions.

While the integration of big data in logistics and transport is essential, there are associated concerns that need consideration. The primary concern is the privacy of drivers’ locations, which may be misused. Similarly, the utilization of big data in logistics may also jeopardize customer privacy. Therefore, it is necessary to address all these concerns when planning the future of big data in logistics and transport (Albqowr et al. 2024 ).

3.3 Big data in marketing and advertising

Big Data has a substantial influence on marketing and advertising, enabling organizations to collect and scrutinize vast data reservoirs for informed decision-making (Craig and Ludloff 2011). It empowers precise targeting and customization of advertising messages, guided by consumer behaviours and preferences (Chen 2022; Cockcroft and Russell 2018). Behavioural data marketing entails the collection of internet-driven behavioural data for in-depth analysis of advertising content, timing, and format. This, in turn, fosters more effective customer relationship management and enhances customer retention (Del Vecchio et al. 2022; Beauvisage et al. 2023). Table 6 summarizes the big data applications and Fig. 7 shows the marketing and advertising big data categorization.

Fig. 7 Big data in marketing and advertising

In the financial sector, Big Data has ascended to prominence, with companies leveraging its capabilities for market analysis, customer insights, and informed decision-making. The authors in Hassani et al. (2018) explore the pertinence of Big Data approaches in the financial realm, particularly within corporate banking, highlighting opportunities for technological advancements.

Furthermore, in the context of telecom big data, authors in Jia et al. ( 2019 ) propose a meticulous user classification scheme based on decision trees, aimed at amplifying marketing efficiency and effectiveness. The advent of Big Data technology has ushered in a paradigm shift in online advertising delivery, seamlessly integrating data, users, platforms, and businesses. Authors in Jieyu ( 2020 ) investigated the development of a precise online delivery system hinged on Big Data technology.
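To show what decision-tree-based user classification involves, the following sketch builds a very small tree by choosing, at each node, the threshold split with the lowest weighted Gini impurity. It is a generic textbook construction in plain Python, not the scheme of Jia et al. The two features (monthly data usage in GB and monthly call minutes), the segment labels, and all data points are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature, threshold, weighted_gini) of the best binary split, or None."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build_tree(rows, labels, depth=0, max_depth=2):
    """Recursively build a tree; leaves are labels, inner nodes are tuples."""
    split = None
    if depth < max_depth and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    f, t, _ = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f, t,
        build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Toy data: (monthly_data_gb, monthly_call_minutes) -> hypothetical segment label.
rows = [(20, 50), (25, 80), (18, 40), (2, 600), (3, 700), (1, 550), (1, 30), (2, 60), (3, 20)]
labels = ["data_heavy"] * 3 + ["voice_heavy"] * 3 + ["light"] * 3

tree = build_tree(rows, labels)
print(predict(tree, (22, 45)), predict(tree, (2, 650)), predict(tree, (1, 25)))
```

In practice a library implementation with pruning and validation would be used on real subscriber data; the point here is only how threshold splits partition users into segments that marketing can target separately.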

Cloud computing and Big Data technology have found extensive applications in e-commerce advertising promotion, elevating the core competitiveness of enterprises within this industry. The authors in Zhang (2022) investigate the utilization of Big Data and cloud computing technology to enhance e-commerce advertising. They propose a distributed system built on Hadoop for this purpose. Similarly, the authors in Ducange et al. (2018) furnish an in-depth analysis of social big data (SBD) and its application in shaping marketing strategies, encompassing a comprehensive methodology and classification of contemporary use cases.

E-commerce and advertising cannot survive without big data. Nowadays, the action plans of e-commerce and advertising agencies rely heavily on big data analysis. This technology enhances targeted advertising, enabling businesses to reach potential customers more effectively. Therefore, it can be stated that the e-commerce and advertising domains represent significant applications of big data.

However, the deployment of Big Data in marketing and advertising gives rise to substantial concerns regarding privacy and the potential for government surveillance, as discussed in Tang et al. ( 2022 ). Despite its advantages, Big Data in marketing and advertising presents challenges such as the crucial need for consent and the complexities surrounding transparency, identity, power dynamics, and inclusivity (Yin et al. 2021 ). Therefore, it is necessary to prioritize customer data privacy when planning the integration of big data in commerce and advertising.

3.4 Big data in smart cities

The combination of the IoT and BDA technologies holds the potential to be a game-changer in the construction of smart cities (Bibri 2019 ). These technologies provide opportunities for efficient disaster management activities, analysis, and the acquisition of valuable information for decision-making (Shah et al. 2019 ; Ding et al. 2023 ). Table 7 summarizes the big data applications and Fig.  8 shows the UBD categorization.

Fig. 8 Big data in smart cities

A plethora of devices connected to the internet in smart cities continuously generates vast amounts of data. Addressing this data deluge, researchers in Wang et al. ( 2018 ) propose enhanced multi-order distributed algorithms to efficiently process this big data in the realm of smart city services. Similarly, authors in Alahakoon et al. ( 2020 ) advocate for a comprehensive framework designed to handle the substantial data inflow from sources such as sensors, IoT devices, and social networks within smart cities. This framework encompasses data processing workflows, ML algorithms, and statistical techniques aimed at extracting meaningful insights from the data.

The evolution of smart cities has resulted in the generation of massive quantities of data. Unfortunately, a significant portion of this data often goes to waste due to the absence of established mechanisms and standards for extracting valuable information. The authors in Chang (2021) discuss the issues and approaches linked with leveraging big data and ML to enable cognitive smart cities, thereby enhancing the utilization of this data.

In alignment with this, authors in Wu et al. ( 2018 ) present a framework designed to efficiently process the large amounts of data generated by sensors in smart cities. This architectural model comprises various layers and components for data processing and analysis. ML techniques are integral to this framework, ensuring the acquisition of accurate data and the delivery of precise information to end-users, ultimately resulting in an elevated Quality of Experience (QoE) performance.

With the growing prevalence of cameras in smart cities, video surveillance is becoming a key component of data collection. This evolution necessitates the development of efficient techniques for processing substantial volumes of video data, and several papers address this topic. Tian et al. ( 2018 ) propose a block-level background modelling (BBM) algorithm for efficient video coding, complemented by a rate-distortion optimization algorithm designed to enhance compression performance.

The role of big data in the implementation of smart cities is crucial, as it enables the analysis of extensive data volumes to extract valuable insights. In He et al. ( 2018 ), the authors utilize special technologies for municipal governance and planning in smart cities. Similarly, in Kandt and Batty ( 2021 ), the authors examine the value of big data in shaping long-term urban planning. They emphasize how urban analytics can inform long-term urban policies within smart cities.

The outlook for big data in smart cities promises transformative advancements. The integration of big data and the IoT is set to revolutionize urban living, with more sophisticated data analytics, real-time insights for resource management, improved infrastructure planning, and AI-driven solutions to address urban challenges. This evolution aims to create proactive, sustainable, and resilient smart cities.

However, the widespread use of big data in smart cities brings critical concerns. Security and privacy issues surrounding the vast data generated by IoT devices and sensors need careful attention. Protecting data from unauthorized access and ensuring citizen privacy requires robust security measures and regulatory frameworks. Ethical considerations in data collection, storage, and usage demand scrutiny to prevent misuse. Striking a balance between reaping the benefits of big data in urban development and safeguarding individual privacy is crucial for fostering trust and ensuring sustainable and inclusive smart city growth (Thilagavathi et al. 2019 ; Elhoseny et al. 2018 ).

3.5 Big data in media

The intersection of big data and entertainment is a dynamic field with vast potential for insights, innovation, and, at the same time, several challenges to navigate (Abbasi et al. 2018 ; Daud et al. 2013 ). Table 8 summarizes the SBD applications and Fig.  9 shows the media big data categorization.

figure 9

Big data in media

Social media platforms are prolific producers of what’s referred to as SBD (Badshah et al. 2022b ). This treasure trove of data is a window into user behaviour, trends, and interactions, offering valuable insights (Esfahani et al. 2019 ). Companies recognize the power of this data and utilize it to personalize marketing strategies, pinpoint specific demographics, and boost sales (Ghani et al. 2019 ; Rahman and Reza 2022 ). Social media also serves as a powerful platform for businesses to engage with their customer base, foster loyalty, and even function as online retail spaces (Liu et al. 2021 ; Hayat et al. 2019 ).

However, the use of SBD raises significant concerns related to privacy and the potential misuse of personal information, as highlighted in Bansal et al. ( 2018 ). Thus, the combination of big data and social media presents a dual landscape: it offers opportunities for innovation, effective marketing, and improved decision-making, yet it is laden with challenges and ethical considerations. Similarly, authors in Mani and Chouk ( 2022 ) and Vargo et al. ( 2018 ) discussed privacy and security issues in media big data.

To investigate the role of social media big data, the authors in Jimenez-Marquez et al. ( 2019 ) propose a comprehensive two-stage framework tailored for the big data era. The first stage emphasizes data preparation and the selection of a ML model, while the second stage utilizes established layers of big data architectures to extract insights from the data. This versatile framework accommodates both large and small datasets and is illustrated through a case study focused on analyzing reviews of hotel-related businesses. Similarly, in the study (Zhang et al. 2022 ), the authors introduce the Big Data-assisted Social Media Analytics for Business (BD-SMAB) model to enhance decision-making in marketing strategies and competitive analysis.

Social media is a focal point for marketing, especially for business-to-business (B2B) organizations aiming to sustain and expand through strategic operations and marketing activities, as explained by authors in Sivarajah et al. ( 2020 ).

The potential of SBD is also recognized in the realm of urban sustainability research and practice. Its unique advantages, including vast scale and near-real-time observation, offer insights into human behaviour within urban environments. Authors in Ilieva and McPhearson ( 2018 ) delve into the potential and issues associated with harnessing social media data for urban sustainability research and practice, shedding light on a promising avenue for urban development.

The integration of big data in entertainment and social media is currently revolutionizing user experiences, content creation, and industry dynamics. With ongoing technological advancements, big data is driving personalized content recommendations, offering predictive insights, enhancing user engagement, enabling targeted advertising, optimizing content distribution, and facilitating real-time trend analysis (Hariri et al. 2019 ). Emphasizing data security and privacy measures, these developments are transforming the industry, providing tailored and immersive experiences, improving content relevance, and ensuring efficiency in advertising and content distribution. To remain competitive in these evolving sectors, a seamless integration of BDA is essential to meet the dynamic expectations of users in today’s rapidly advancing technological landscape (Amalina et al. 2019 ).

Despite its benefits, social media big data faces challenges such as misinformation and limited data, making it difficult to distinguish the truth. Current solutions struggle with scalability in large-scale events (Zhang et al. 2018 ). Furthermore, companies misuse this data when they share it with commercial entities, which then promote their own narratives through advertising. Therefore, these concerns must be addressed when working with SBD in entertainment, and particularly on social media.

3.6 Big data in cyber security

Playing a crucial role in cybersecurity, big data is especially significant in domains such as intrusion detection, anomaly detection, spamming and spoofing detection, malware and ransomware detection, code security, and cloud security (Walters and Novak 2021 ). The integration of BDA with ML can effectively address unknown risks and insider threats, providing advanced threat analytics (Saravanan and Prakash 2021 ). It enables the discovery of irregularities and suspicious activities, leading to the deployment of effective intrusion detection systems (França et al. 2021 ). Additionally, BDA can enhance data security and privacy, mitigating cybersecurity breaches and supporting secure information sharing (Rassam et al. 2017 ). The application of BDA in cybersecurity is an emerging trend, presenting potential future directions for research and development (Wang and Jones 2021 ). Table 9 summarizes the applications and Fig.  10 shows the cybersecurity big data categorization.

By leveraging big data and advanced analytics techniques, organizations can improve their operational intelligence and security capabilities, staying ahead of evolving cyber threats. Authors in Kantarcioglu and Xi ( 2016 ) discussed security issues faced in the big data environment, particularly in the context of cloud computing.

figure 10

Big data in cyber security

Monitoring the security of the IoT through multidimensional streaming big data encounters various challenges, including substantial data volumes, redundancy, and scalability issues. To tackle these obstacles, the authors in Ullah et al. ( 2022 ) present an algorithm called ODIS. This algorithm extracts vital information from data across distributed sensor nodes, considering the spatial and temporal dependence structure of the data. ODIS establishes a precise data structure model to understand IoT system behaviours and employs testing methods to quantify the uncertainty linked with monitoring tasks. Adversarial data mining is an emerging field that combines BDA with cybersecurity. Authors in Li et al. ( 2019 ) used adversarial data mining techniques to handle malicious adversaries in cyber security applications.

In Tao et al. ( 2019 ), the authors introduced a parameter-wise adaptation that autonomously initiates the tuning process. This system adjusts the configuration parameters of the framework for various security datasets and subsequently executes the BDCA system with the adapted configuration. Similarly, Rawat et al. ( 2019 ) explore the economic aspects of safeguarding big data security and privacy, encompassing investment decisions and cyber insurance.

To tackle the challenges posed by cyber threats in the cloud, the authors in Subroto and Apriyana ( 2019 ) have devised a cloud computing-based system for cybersecurity management. This system aims to streamline the analysis process of extensive network data. The constructed system is built on the MapReduce framework and encompasses end-user devices, cloud infrastructure, and a monitoring center.

Big data is advancing cybersecurity, making it more intelligent for the future. This increased intelligence will enable systems to promptly counter cyber attacks. Consequently, cybersecurity experts are acquiring additional skills in both big data and cybersecurity, driven by the recognition of the crucial role played by these combined capabilities (Zhang and Ghorbani 2021 ).

Big data in cybersecurity offers potent advantages but introduces challenges, including privacy concerns, security issues, data accuracy, scalability, and cost management. Successfully navigating these hurdles requires a comprehensive strategy addressing legal compliance, robust security measures, data quality assurance, and cost-effective implementation (Rao and Lakshmanan 2024 ).

3.7 Big data in earth science

An extensive array of data about our planet, usually referred to as Earth Big Data (EBD), is generated from Earth observation systems on diverse platforms, such as satellites, aeroplanes, and ground-based setups. This includes geoscience, statistical, and social data (Yang et al. 2019 ). Integrating Earth observation data with other forms within a geographic context offers the potential to model Earth systems more accurately, linking human activities with their impacts on Earth processes (EOS 2023 ). Table 10 summarizes the applications and Fig.  11 shows the earth’s big data categorization.

figure 11

Big data in earth science

Big data applications in climate and earth studies have gained increasing importance in recent years. These applications involve the utilization of large volumes of data generated from climate and weather modelling (Huang et al. 2018 ). The analysis of this big climate data has led to advancements in understanding climate change, assessing environmental conditions, and predicting future climate trends. Leveraging BDA, including data mining techniques and the integration of heterogeneous data sources, has empowered researchers to study climate change in a more comprehensive and interdisciplinary manner. Open data resources, like Google Earth Engine, have been used to evaluate environmental conditions and assess vulnerability to climate change in specific regions (Amani et al. 2020 ). Overall, big data tools and techniques have provided valuable insights into climate-related issues and have the potential to contribute to sustainability and resilience-building efforts.

Big data on climate and earth is used for several purposes. The foremost use is monitoring. Authors in Hassani et al. ( 2019 ) designed BDA to enhance seasonal change monitoring and understanding of climate change. The second major use of big data is predicting the climate and related conditions. Authors in Knüsel et al. ( 2019 ) used big data techniques in rainfall prediction, helping farmers make wise decisions on crop yield and studying the timing of floods or droughts. Similar concepts are discussed by the authors in Sebestyén et al. ( 2021 ), who use the big data collected by different sensors for climate monitoring and prediction. Authors in Silva et al. ( 2018 ) discussed in detail the studies that investigated big data climate monitoring and prediction.

Along with climate monitoring and prediction, big data is used for sustainable urban planning and infrastructure. Authors in Leung et al. ( 2019 ) and Ameer and Shah ( 2018 ) used big data and its analytics tools in urban planning and smart city decision management. Similarly, authors in Sarker et al. ( 2020 ) used BDA for smart cities’ air pollution prediction. They introduced a Spark-based architecture for smart urban planning that utilizes BDA to classify air quality. This architecture is implemented on a dataset of vehicle pollution in Aarhus City, Denmark.

Disaster management has become a significant concern, and big data is being utilized for natural disaster management. Authors in Yu et al. ( 2018 ) utilized big data for disaster management derived from remote sensing imagery, social media data, crowdsourced data, GIS, and mobile metadata. Similarly, in Sarker et al. ( 2020b ), the authors reviewed several studies exploring the use of big data in disaster management.

The main challenge associated with the Earth’s big data is its continuous growth. Every country deploys satellites, balloons, aeroplanes, and other tools that consistently gather data. However, reaping benefits from this data is contingent upon having appropriate tools. Given the sheer volume of Earth’s big data, our current tools are not advanced enough to thoroughly analyze it (Sudmanns et al. 2019 ).

The foremost concern regarding Earth Big Data is individual privacy. This data is constantly generated without regard for the privacy of specific locations, making it accessible to anyone for various purposes. The data finds application in numerous fields, including science, weather prediction, and defence. The issue of precisely identifying the responsible party or owner of this data remains unresolved (Farley et al. 2018 ). Therefore, there is a need to explore whether it is feasible to collect this data with individual consent and whether regulations can be established to govern this vast dataset.

3.8 Big data in industry

Big data is being applied in various industries, including construction, sports, tourism, and the legal field. In the construction industry, big data is utilized to enhance construction efficiency, reduce material waste and expenses, improve planning and decision-making processes, and enhance construction site safety (Nguyen et al. 2020 ). In the sports industry, big data analysis and AI are used to analyze player performance, broadcast events, and improve sports marketing strategies (Patel et al. 2020 ). In tourism, big data is used for revenue management, marketing strategies, customer experience, and market research, aiding in the development and recovery of the industry (Li et al. 2022 ). In the legal industry, BDA tools are used for tasks such as billing, marketing, and identifying trends in cases (Bhure and Desai 2023 ). Table 11 summarizes the applications and Fig.  12 shows the industry big data categorization.

figure 12

Big data in industry

Authors in Lies ( 2019 ) covered big data’s transformative role in automotive marketing, emphasizing precision marketing and data-driven consumer insights. A similar theme is explored in Liu and You ( 2021 ), where big data correlates with a 2.895% increase in new energy vehicle technology innovation, advocating its integration with the industry for national benefits.

Classification benefits from big data too, as seen in Li et al. ( 2019 ), where cellular company customer records are categorized to enhance marketing efficiency. In Chen et al. ( 2022a ), big data analysis is used to create tailored data packages. The chemical industry harnesses big data for intelligent manufacturing, evaluating strengths, weaknesses, and future trends (Jiyang et al. 2020 ). Similarly, Huabei Oilfield adopts big data with a "seven-step method" system and data mining for oil production engineering, enhancing data-driven processes (Mohammadpoor and Torabi 2020 ).

Big data is reshaping industries, particularly production, in alignment with market analysis. The expanding realm of big data is certain to amplify its influence on the industry. Utilizing big data analysis will enhance customer-centric production strategies, ultimately leading to improved revenue outcomes (Vassakis et al. 2018 ).

Big data utilized for market analysis is collected from various sources, raising significant concerns about the privacy and security of this data. Therefore, it is imperative to ensure that the data collection and analysis do not compromise someone’s privacy and security (Del Vecchio et al. 2018 ).

3.9 Big data in education

Big data has the potential to enhance teaching and learning, improve educational research, and advance education governance (Fischer et al. 2020 ). Although the utilization of big data in education is not a new concept, recent technological advancements have spurred increased research in this area (Ray and Saeed 2018 ; Amjad et al. 2018 ). There is an interest in leveraging big data to analyze student behavior and performance, enhance the educational system, and integrate big data into the curriculum (Baig et al. 2020 ). Popular tools and techniques for working with big data in the education industry include educational data mining and learning analytics (Qian et al. 2022 ). The convergence of the ability to collect, store, manage, and process data, along with data from online educational platforms, presents unprecedented opportunities for educational institutions, learners, educators, and researchers. Table 12 provides a summary of the applications, and Fig.  13 illustrates the categorization of big data in education.

figure 13

Big data in education

In educational technology, the most investigated topic these days is personalized learning. With personalized learning, tailored content or subjects are recommended to learners, who can then learn at their own pace (Munshi and Alhindi 2021 ). Authors in Yuwen et al. ( 2018 ) carried out experiments on suggesting appropriate courses to learners using BDA. Their results show that the accuracy of their course recommendations is much better than that of existing algorithms. Similarly, authors in Kanth et al. ( 2018 ) highlighted the challenges of identifying student misconceptions, predicting dropouts, and improving educational quality, with a focus on leveraging data and advanced technologies. The authors aim to enhance personalized learning and propose various supervised learning methods as solutions.

Student management and discipline represent significant challenges in educational institutions. The authors in Zhang et al. ( 2021 ) addressed this issue by leveraging big data. Through the analysis of students’ daily routines, learning styles, and behavior, they obtained insights to aid in student management. In Liang ( 2020 ), the authors present an education management model utilizing big data, demonstrating improved information levels and a broader application of big data in educational management. Similarly, authors in Badshah ( 2023a , 2023b ) utilize similar concepts for student management and enhancing their productive engagement.

Big data is also changing the way of teaching. Flipped classrooms (Hao 2021 ) and homeschooling (Inayatulloh et al. 2022 ) are the leading examples. Authors in Hu et al. ( 2022 ) explored the same by proposing a hybrid teaching method. Their investigation shows that students were more actively engaged in learning than in conventional classes.

Education is intricately linked with big data as both a producer and consumer. Millions of individuals, whether learners, teachers, or administrators, are actively engaged in this dynamic field. The demand for virtual classes has surged during the COVID-19 pandemic, further emphasizing the role of big data in meeting these evolving educational needs. Concepts like personalized learning and home-based schooling are gaining prominence, relying entirely on the insights and capabilities provided by big data. In this interconnected landscape, the symbiotic relationship between education and big data continues to shape the future of learning.

While there is a considerable list of advantages, the use of big data in education also raises several concerns. Foremost among them is the risk of misuse, as the data of thousands of learners, including institutional geography and learner locations, may be mishandled. Additionally, concerns about data bias and algorithmic bias pose potential challenges that need careful consideration to ensure fair and equitable outcomes (Lin et al. 2024 ).

4 Key technologies

In big data, several key enabling technologies play pivotal roles in facilitating the storage, processing, and analysis of extensive and intricate datasets. These technologies serve as the backbone for the vast potential of big data applications. Here are some of the key enabling technologies discussed.

4.1 Apache Hadoop

At the forefront, Hadoop stands as a distributed storage and processing framework that enables parallelized handling of large datasets. Its architecture allows for efficient and scalable data processing, making it a cornerstone in the big data ecosystem (Apache 2023a ).
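
The parallelized handling described here is commonly associated with the MapReduce programming model that Hadoop popularized. The following is a minimal single-process sketch of that model in plain Python, offered only as an illustration: it does not use Hadoop's actual API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions introduced for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # On a real cluster each split would be mapped on a different node;
    # here the splits are simply processed one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data big insights", "data drives decisions"])
```

Because the map step treats each split independently and the reduce step treats each key independently, both steps can in principle be spread across many machines, which is the property that makes the model scale.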

4.2 Apache Spark

Complementing Hadoop, Spark emerges as an in-memory data processing engine that significantly enhances the speed and efficiency of BDA. It excels in iterative computations and ML algorithms, contributing to improved data processing capabilities, as discussed in Apache ( 2023c ).

4.3 NoSQL databases

In the era of diverse data types, NoSQL databases like MongoDB (Apache 2023d ) and Cassandra (Cassandra 2023 ) play a vital role. These non-relational databases accommodate unstructured and varied data, providing flexibility and scalability crucial for managing the complexities of modern data.

4.4 Data warehousing

Technologies such as Amazon Redshift (Amazone 2023 ) and Google BigQuery (Google 2023 ) exemplify the capacity to store and retrieve large volumes of structured data. These solutions for data warehousing enable organizations to effectively handle and retrieve their data for analytical purposes.

4.5 Machine learning

The integration of ML algorithms and frameworks, including TensorFlow (Tensor 2023 ) and scikit-learn (Learning 2023 ), empowers data scientists to derive actionable insights and predictions from vast datasets. ML becomes an invaluable tool in uncovering patterns and trends within the data.

4.6 Data integration tools

Apache NiFi (Apache 2023b ) and Talend ( 2023 ) exemplify the significance of data integration tools. These platforms facilitate the seamless integration of diverse data sources, ensuring a unified and coherent dataset ready for comprehensive analysis.

4.7 Data visualization tools

Platforms like Tableau ( 2023 ) and Power BI (Microsoft 2023 ) add a layer of accessibility to big data insights. These visualization tools transform complex datasets into digestible visualizations, enabling stakeholders to interpret and understand data-driven narratives.

4.8 Blockchain technology

Highlighting security and transparency, blockchain technology contributes to safeguarding the integrity of transactions and data sharing in the realm of big data. The decentralized nature of blockchain enhances both trust and data immutability (Badshah 2023c ).
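
The immutability property mentioned here rests on each block storing the hash of its predecessor. A minimal sketch, assuming a simple in-memory list of blocks, is given below. It deliberately omits decentralization and consensus, and the names `block_hash`, `append_block`, and `is_valid` are hypothetical helpers, not part of any blockchain library.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder predecessor for the first block


def block_hash(index, data, prev_hash):
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a new block that is linked to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "data": data,
            "prev_hash": prev_hash,
            "hash": block_hash(index, data, prev_hash),
        }
    )
    return chain


def is_valid(chain):
    """Check every stored hash and every link to the preceding block."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else GENESIS_PREV_HASH
        if block["prev_hash"] != expected_prev:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev_hash"]):
            return False
    return True
```

Changing the data of any earlier block changes its recomputed hash, so `is_valid` fails unless every later block is rewritten as well.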

4.9 Edge computing

To fulfil the demand for real-time analytics, edge computing facilitates data handling near the data source. This minimizes latency and improves the efficiency of analytics for applications such as the IoT (Amjad et al. 2018 ).
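
One way to picture data handling near the source is an edge node that forwards a compact summary per window of readings instead of every raw value. The sketch below assumes a numeric sensor stream, a fixed window size, and a simple alert threshold; all names and parameters are illustrative and not taken from the cited work.

```python
from statistics import mean


def summarize_window(readings, alert_threshold):
    """Reduce one window of raw sensor readings to a compact summary."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "alert": max(readings) > alert_threshold,
    }


def edge_process(stream, window_size, alert_threshold):
    """Yield one summary per full window instead of forwarding every reading."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            yield summarize_window(window, alert_threshold)
            window = []
    # A trailing partial window is dropped in this sketch.


summaries = list(edge_process([20, 21, 22, 23, 80, 24], window_size=3, alert_threshold=50))
```

In this arrangement only the summaries travel upstream, which reduces both traffic and the delay before an alert can be raised locally.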

4.10 Cloud computing

Services offered by AWS, Azure, and Google provide a scalable and flexible infrastructure for big data storage and processing. Cloud computing has become a cornerstone, furnishing organizations with the resources required to manage the continuously expanding volumes of data (Badshah 2023a ).

5 Potential concerns and solutions

As we have explored the expansive landscape of employing big data across various applications, it becomes imperative to acknowledge and address potential challenges and concerns associated with its widespread utilization (Ajah and Nweke 2019 ). These concerns, spanning privacy, security, biases, and misuse, highlight the need for understanding the implications and risks inherent in BDA (Ikegwu et al. 2024 ). In this section, we delve into these concerns, acknowledging the multifaceted nature of navigating complexities when harnessing vast datasets. Additionally, we present solutions to tackle these challenges, offering a roadmap for a more secure and responsible digital environment. This section provides a detailed examination of proactive measures, carefully crafted to address distinct aspects of concern.

5.1 Privacy

A substantial concern associated with the utilization of big data across various applications is the issue of privacy. Almost every field grapples with this concern due to the underdeveloped nature of regulations on data security and privacy. The existing rules lack maturity, posing a challenge in adequately protecting user data (Amaithi Rajan and V 2023 ; Masood et al. 2018a ).

To address privacy issues linked with big data, it is crucial to advocate for the development and implementation of robust regulations governing data security and privacy. Collaborating with regulatory bodies and policymakers to create comprehensive and mature frameworks will enhance the protection of user data (Price and Cohen 2019 ).

5.2 Security

The lack of data privacy raises security concerns, not just for the data itself but also for individual security. Organizational data exposure or the public availability of individual locations can lead to notable security problems. The connection between data vulnerability and individual security issues exacerbates the overall concern (Khan and Ahmad 2023 ).

Mitigating security risks involves reinforcing data privacy measures. Implementing encryption, access controls, and regular security audits can fortify the protection of organizational and individual data. Additionally, fostering awareness about cybersecurity practices among users is essential for minimizing vulnerabilities (Ikegwu et al. 2022 ).
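
Of the measures listed above, access controls and audit trails can be sketched with the standard library alone; encryption is left out because it would require a third-party cryptography package. The role table, the `access` function, and the log format below are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping; a real deployment would load this
# from a policy store rather than hard-code it.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "admin": {"read", "write", "delete"},
}

audit_log = []


def is_allowed(role, action):
    """Unknown roles get no permissions (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, set())


def access(user, role, action, resource):
    """Decide on a request and record the decision for later security audits."""
    allowed = is_allowed(role, action)
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "action": action,
            "resource": resource,
            "allowed": allowed,
        }
    )
    return allowed
```

Recording denied requests as well as granted ones gives a regular security audit something concrete to review.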

5.3 Biases

Algorithmic bias, especially in IoT devices, is a common problem nowadays. Similarly, BDA may also exhibit biases in its calculations, disrupting decision-making processes (Rehman et al. 2022 ).

Addressing algorithmic biases in BDA requires continuous monitoring and evaluation of algorithms. Implementing diversity in datasets and adopting ethical guidelines for algorithm development can help mitigate biases, ensuring fair and unbiased decision-making processes (Favaretto et al. 2019 ; Amjad et al. 2012 ).
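
Continuous monitoring of this kind needs a measurable quantity. One common choice, used here purely as an illustration and not prescribed by the cited studies, is the demographic parity gap: the difference between the highest and lowest rate of positive decisions across groups. The function names and the record format are assumptions.

```python
from collections import defaultdict


def positive_rates(records):
    """records: iterable of (group, decision) pairs, where decision is 0 or 1."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(records):
    """Largest difference in positive-decision rate between any two groups."""
    rates = positive_rates(records)
    return max(rates.values()) - min(rates.values())


decisions = [
    ("a", 1), ("a", 1), ("a", 0), ("a", 0),
    ("b", 1), ("b", 0), ("b", 0), ("b", 0),
]
gap = demographic_parity_gap(decisions)
```

A gap near zero does not prove that a system is fair, but a gap that grows over time is a useful signal that the data or the model deserves a closer look.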

5.4 Misuse

Misuse of big data is a major concern, with companies often utilizing this data without considering the welfare of customers. Many individuals are unaware of how their data is being used for the benefit of companies. Mitigating potential misuse requires increased transparency and ethical considerations (Stegenga et al. 2023 ).

Preventing the misuse of big data involves enhancing transparency in data usage and fostering ethical considerations. Implementing clear data usage policies, obtaining explicit consent from users, and educating individuals about how their data is utilized contribute to responsible and ethical data practices (Bag et al. 2023 ).

5.5 Different cyber laws

The internet has transformed the world into a global village. However, cyber rules and regulations vary widely: every country has different rules, leading to conflicts on the internet. An action may be a crime in some countries and not in others, highlighting the need for harmonizing international cyber regulations (Rawat et al. 2023 ).

When the big data concerns are considered collectively, it becomes apparent that all of them are linked with international cyber laws, and that the gap in these laws underlies many of the issues in the digital world. It is therefore important to move toward international cyber rules that apply equally in all countries (Favaretto et al. 2019 ).

5.6 Doubted accuracy

Despite its advantages, social media big data faces challenges such as misinformation and limited data, making it difficult to distinguish the truth. Therefore, it is not always guaranteed that the big data used for decision-making is correct (Badshah et al. 2022b ).

Ensuring the accuracy of big data used for decision-making requires implementing rigorous data validation processes. Incorporating fact-checking mechanisms, promoting data transparency, and investing in data quality assurance measures contribute to the reliability of information derived from big data (Khan et al. 2016 ).
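
A rigorous validation process can start with explicit per-field rules applied before records reach any analysis step. The sketch below assumes records are plain dictionaries; the rule set `RULES` and the field names `user_id` and `age` are hypothetical examples, not fields from any dataset discussed in this article.

```python
def validate_record(record, rules):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, check in rules.items():
        if field not in record or record[field] is None:
            problems.append(f"{field}: missing")
        elif not check(record[field]):
            problems.append(f"{field}: invalid value {record[field]!r}")
    return problems


def split_valid(records, rules):
    """Separate clean records from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for record in records:
        problems = validate_record(record, rules)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


# Illustrative rules: each field maps to a predicate that must hold.
RULES = {
    "user_id": lambda v: isinstance(v, int) and v > 0,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}
```

Keeping the rejection reasons alongside the rejected records supports the data transparency mentioned above, since it documents why a record was excluded from decision-making.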

5.7 Reason for network congestion

The rapid growth in data generation leads to network congestion, slowing internet speed and impeding real-time communication. This poses challenges, particularly in critical applications like hospitals, where trust in the network’s reliability is compromised. Addressing congestion is crucial for ensuring seamless real-time interactions and maintaining the dependability of data-driven systems (Anitha et al. 2023 ).

Addressing network congestion involves optimizing data transmission protocols, investing in network infrastructure, and implementing load-balancing techniques. Prioritizing network reliability in critical sectors like healthcare ensures that real-time communication remains unaffected even during periods of high data traffic (Al-Jumaili et al. 2023 ).
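
Load balancing can take many forms. The sketch below shows one simple policy, least connections, in which each new request goes to the server currently handling the fewest active requests. The class name and server labels are assumptions for illustration, and the policy is one option among several rather than a method proposed by the cited work.

```python
class LeastConnectionsBalancer:
    """Route each request to the server with the fewest active connections."""

    def __init__(self, servers):
        # Track how many requests each server is currently handling.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        """Pick the least-loaded server (ties go to the earliest listed)."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Mark one request on the given server as finished."""
        if self.active[server] > 0:
            self.active[server] -= 1
```

A typical usage pattern is to call `acquire()` when a request arrives and `release(server)` when it completes, so that slow or busy servers naturally receive fewer new requests.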

5.8 Special hardware and software

In big data, concerns emerge regarding the need for specialized hardware and software. Access and compatibility challenges risk obstructing positive outcomes. Processing vast data volumes demands specialized tools, yet this processing is hindered by limitations in hardware or software and by the complexity of multiple data formats. The substantial processing needs also contribute to higher costs, necessitating careful cost management for optimal resource utilization (Badshah et al. 2022a ).

To overcome challenges related to specialized hardware and software, organizations should invest in versatile and scalable technologies. Collaborating with technology providers to develop solutions that enhance accessibility and compatibility can facilitate positive outcomes without compromising on processing efficiency (Selmy et al. 2023 ).

5.9 Dependency on tech experts

One limitation of big data lies in its dependency on technology experts for its collection, filtering, and processing. This reliance poses a challenge in ensuring that the necessary expertise is consistently available for the effective utilization of big data resources (Badshah 2023b ).

Reducing the dependency on tech experts requires investing in user-friendly interfaces and tools. Implementing training programs for non-experts and promoting the development of intuitive big data platforms can empower a wider range of professionals to harness the power of big data resources effectively (Selmy et al. 2023 ).

6 Comparative analysis

This section compares the current literature study with related surveys. Scholars have extensively studied and investigated big data and its applications. However, existing reviews often focus on a single application of big data, failing to explore it comprehensively. Big data has potential and challenges in every domain, necessitating a thorough investigation. Additionally, no studies have categorized big data applications or comprehensively discussed their future potentials and concerns.

To evaluate and compare our study with similar ones, we applied the criteria outlined in Table 14. The criteria included examining challenges (C1), future potentials (C2), domain categorization (C3), privacy concerns (C4), and specific domains such as healthcare (C5), supply chain and logistics (C6), marketing and advertising (C7), smart cities (C8), media and entertainment (C9), cybersecurity (C10), climate and earth science (C11), industry (C12), and education (C13). Table 15 shows the overall comparison with the related survey literature.

Big data applications in healthcare have been extensively reviewed, focusing on the benefits and challenges in this domain. The study by Hong et al. (2018) offers a comprehensive overview of big data in healthcare, addressing challenges (C1) and exploring applications (C5). The authors emphasize the importance of privacy (C4) and regulatory frameworks. Subsequently, Abouelmehdi et al. (2018) investigate the transformative potential of big data within the healthcare domain (C2), highlighting privacy concerns (C4). This study provides valuable insights into disease prediction and cost reduction (C5). Furthermore, Rajabion et al. (2019) contribute to understanding data processing mechanisms in healthcare (C3). Lastly, Galetsi et al. (2019) emphasize the value of personalized services in healthcare (C5), acknowledging privacy concerns (C4).

Big data applications within the supply chain and logistics domain have shown significant potential for optimization and efficiency improvements. The study by Torre-Bastida et al. (2018) offers a comprehensive overview of big data applications within the transportation industry, addressing challenges (C1) and exploring opportunities in routing, planning, monitoring, and network design (C6). Building upon this, Nguyen et al. (2018) extend the analysis to the broader supply chain management domain (C6), proposing a classification framework and identifying research gaps (C2). Focusing on the railway sector, Ghofrani et al. (2018) contribute to the understanding of big data applications in operations, maintenance, and safety, leveraging Mayring’s framework (C6). A broader perspective is offered by Mishra et al. (2018), who provide a bibliometric analysis of big data in supply chain management (C6), identifying key research clusters and managerial insights.

The application of big data in marketing and advertising has been explored to understand its impact on digital marketing strategies and customer engagement. The study by Miklosik and Evans (2020) delves into the application of big data and ML in digital marketing (C7), uncovering unexplored avenues for future research. Subsequently, Anshari et al. (2019) explore the integration of big data into CRM, emphasizing its role in personalized marketing strategies (C7). Survey papers such as Kushwaha et al. (2021), Sestino et al. (2020), and Lee et al. (2023) offer comprehensive overviews of big data in marketing and advertising (C7).

Big data plays a crucial role in the development and management of smart cities, enhancing sustainability and livability. The study (Karimi et al. 2021 ) delves into the urban potential of AI within Smart Cities, emphasizing the integration of culture, metabolism, and governance for sustainability and livability (C8). It prioritizes the livability of the urban fabric alongside economic growth, showcasing the potential of AI and Big Data integration. In alignment with this perspective, authors in Mohammadi and Al ( 2018 ) conduct a comprehensive review of big data handling in smart cities, categorizing techniques and exploring key ideas (C3). The study introduces crucial factors such as scalability, time, availability, and accuracy, contributing to the understanding of big data’s role in smart city development (C8). Similarly, the study in Huang et al. ( 2021 ) addresses the underutilized data in smart cities by proposing a three-level framework employing semi-supervised deep reinforcement learning to optimize control policies (C8). The interconnected studies collectively contribute to a more holistic understanding and advancement of AI and big data applications in the context of smart cities.

The media and entertainment industry has been significantly impacted by big data, and a significant portion of this data is generated by social media platforms. The study by Abkenar et al. (2021) explores the types of social big data (SBD), laying the groundwork for understanding its potential applications in this domain (C9). The studies by Sebei et al. (2018) and Muhammad et al. (2018) also contribute to the growing body of knowledge in this area (C9).

The application of BDA in cybersecurity is critical for enhancing security measures and protecting against cyber threats. The study in Alani ( 2021 ) surveys the applications of BDA in cybersecurity, covering areas such as intrusion detection, spamming detection, and cloud security (C10). It highlights the rapid increase in data generation due to the growing number of internet users. Building on this foundation, authors in Ullah and Babar ( 2019 ) and Srivastava and Jaiswal ( 2019 ) further explore the role of big data in cybersecurity, expanding the knowledge base in this domain (C10).

Big data has significant applications in earth sciences and disaster management, aiding in visualization, analysis, and prediction. The systematic review by Akter and Wamba (2019) examines the application of big data in natural disaster management (C11), emphasizing visualization, analysis, and prediction. It highlights the role of emerging technologies in enhancing disaster response and recovery strategies, and it outlines research trends, challenges, and future directions. Expanding on this, Amani et al. (2020) delve into the utilization of Google Earth Engine (GEE) in various domains, including land classification, hydrology, and climate analysis (C11). Shifting focus to agriculture, Huang et al. (2018) explore the application of big data in precision agriculture, addressing challenges and proposing a management framework.

BDA is revolutionizing industries, offering advanced analytics, optimization, decision-making, modelling, and predictions. The study by Mosavi et al. (2018) explores the adoption of big data technologies in the engineering domain, highlighting its role in enhancing competitiveness. It reviews academic literature on big data applications within the engineering field (C12). Expanding the focus to industry-specific challenges, Qi (2020) delves into the mining industry, addressing hurdles in implementing big data management (BDM) (C12). The study outlines data sources, challenges, and future prospects for the mining industry (C1, C2). Furthermore, Misra et al. (2020) explore the impact of IoT, big data, and AI on agri-food systems (C12). Their review covers applications across the supply chain, from agriculture to food quality assessment, emphasizing commercialization and translational research outcomes.

The exploration of big data applications in education reveals a growing body of research. Authors in Luan et al. ( 2020 ) delve into challenges and trends (C1, C2), advocating for a balanced approach to technology integration (C13). A study in Baig et al. ( 2020 ) contributes by analyzing 40 studies, focusing on learner behaviour and performance (C13), while the investigation in Li and Jiang ( 2021 ) examines the impact of COVID-19 on educational big data, highlighting the role of educational psychology (C13).

In the context of these contributions and the existing literature, this study represents a pioneering investigation that deeply probes big data applications, categorization, challenges, and potential futures. This collective exploration paints a comprehensive picture of the diverse applications and impacts of big data across various domains.

7 Conclusion

This research explored the dynamic landscape of Big Data applications, unveiling their profound impact across diverse domains. The literature was categorized into distinct segments: healthcare, supply chain and logistics, marketing and advertising, smart cities, media and entertainment, cybersecurity, climate and earth science, industry, and education. Furthermore, the study examined the transformative effects on decision-making processes, emphasizing the role of data-driven insights in various domains. Challenges and issues related to Big Data were investigated, and recommendations were presented to overcome these hurdles. Additionally, core technologies for storing, processing, and analyzing large datasets were explored. The study also identified and addressed potential concerns within Big Data, offering solutions and mitigation strategies. Through a comprehensive comparative analysis with related surveys, this research highlights its unique contributions and superiority. These contributions collectively bridge the existing gap in collective analysis, providing a holistic perspective on multifaceted Big Data applications.

Data availability

No datasets were generated or analysed during the current study.

Abbasi RA, Maqbool O, Mushtaq M, Aljohani NR, Daud A, Alowibdi JS, Shahzad B (2018) Saving lives using social media: analysis of the role of twitter for personal blood donation requests and dissemination. Telemat Inform 35(4):892–912

Abkenar SB, Kashani MH, Mahdipour E, Jameii SM (2021) Big data analytics meets social media: a systematic review of techniques, open issues, and future directions. Telemat Inform 57:101517

Abouelmehdi K, Beni-Hessane A, Khaloufi H (2018) Big healthcare data: preserving security and privacy. J Big Data 5(1):1–18

Adeghe EP, Okolo CA, Ojeyinka OT (2024) The role of big data in healthcare: a review of implications for patient outcomes and treatment personalization. World J Biol Pharm Health Sci 17(3):198–204

Ahmed A, Xi R, Hou M, Shah SA, Hameed S (2023) Harnessing big data analytics for healthcare: a comprehensive review of frameworks, implications, applications, and impacts. IEEE Access 11:112891–112928

Ajah IA, Nweke HF (2019) Big data and business analytics: trends, platforms, success factors and applications. Big Data Cognitive Comput 3(2):32

Akter S, Wamba SF (2019) Big data and disaster management: a systematic review and agenda for future research. Ann Oper Res 283:939–959

Al-Jumaili AHA, Muniyandi RC, Hasan MK, Paw JKS, Singh MJ (2023) Big data analytics using cloud computing based frameworks for power management systems: Status, constraints, and future recommendations. Sensors 23(6):2952

Alahakoon D, Nawaratne R, Xu Y, De Silva D, Sivarajah U, Gupta B (2020) Self-building artificial intelligence and machine learning to empower big data analytics in smart cities. Inf Syst Front. https://doi.org/10.1007/s10796-020-10056-x

Alani MM (2021) Big data in cybersecurity: a survey of applications and future trends. J Reliab Intell Environ 7(2):85–114

Albqowr A, Alsharairi M, Alsoussi A (2024) Big data analytics in supply chain management: a systematic literature review. VINE J Inform Knowled Manag Syst 54(3):657–682

Allam Z, Dhunny ZA (2019) On big data, artificial intelligence and smart cities. Cities 89:80–91

Amaithi Rajan A, V V (2023) Systematic survey: secure and privacy-preserving big data analytics in cloud. J Comput Inform Syst 64:1–21

Amalina F, Hashem IAT, Azizul ZH, Fong AT, Firdaus A, Imran M, Anuar NB (2019) Blending big data analytics: review on challenges and a recent study. IEEE Access 8:3629–3645

Amani M, Ghorbanian A, Ahmadi SA, Kakooei M, Moghimi A, Mirmazloumi SM, Moghaddam SHA, Mahdavi S, Ghahremanloo M, Parsian S, Wu Q, Brisco B (2020) Google earth engine cloud computing platform for remote sensing big data applications: a comprehensive review. IEEE J Select Topics Appl Earth Observ Remote Sens 13:5326–5350

Amazone (2023). Amazon redshift. https://aws.amazon.com/redshift/

Ameer S, Shah MA (2018) Exploiting big data analytics for smart urban planning. In 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), (pp. 1–5). IEEE

Amjad T, Sher M, Daud A (2012) A survey of dynamic replication strategies for improving data availability in data grids. Futur Gener Comput Syst 28(2):337–349

Amjad T, Daud A, Aljohani NR (2018) Ranking authors in academic social networks: a survey. Library Hi Tech 36(1):97–128

Anitha P, Vimala H, Shreyas J (2023) Comprehensive review on congestion detection, alleviation, and control for iot networks. J Netw Comput Appl. https://doi.org/10.1016/j.jnca.2023.103749

Anshari M, Almunawar MN, Lim SA, Al-Mudimigh A (2019) Customer relationship management and big data enabled: personalization & customization of services. Appl Comput Inf 15(2):94–101

Apache (2023a). Apache hadoop. https://hadoop.apache.org/

Apache (2023b). Apache nifi. https://Nifi.apache.org/

Apache (2023c). Apache spark. https://spark.apache.org/

Apache (2023d). Mongodb. https://www.mongodb.com

Badshah A (2023a) Cloud storage future and its opportunities

Badshah A (2023b) The race of lead in ai: microsoft, google and openai

Badshah A (2023c) Why edge computing. https://afzalbadshah.com/index.php/2022/12/11/why-edge-computing/

Badshah A, Ghani A, Daud A, Chronopoulos AT, Jalal A (2022a) Revenue maximization approaches in IAAS clouds: research challenges and opportunities. Trans Emerg Telecommun Technol 33(7):e4492

Badshah A, Iwendi C, Jalal A, Hasan SSU, Said G, Band SS, Chang A (2022b) Use of regional computing to minimize the social big data effects. Comput Ind Eng 171:108433

Badshah A, Nasralla MM, Jalal A, Farman H (2023a) Smart education in smart cities: challenges and solution. In 2023 IEEE International Smart Cities Conference (ISC2), (pp. 1–8). IEEE. https://doi.org/10.1109/ISC257844.2023.10293615

Badshah A, Ghani A, Daud A, Jalal A, Bilal M, Crowcroft J (2023b) Towards smart education through internet of things: a survey. ACM Comput Surv 56(2):1–33

Badshah A, Daud A, Khan HU, Alghushairy O, Bukhari A (2024) Optimizing the over and underutilization of network resources during peak and off-peak hours. IEEE Access

Bag S, Rahman MS, Srivastava G, Shore A, Ram P (2023) Examining the role of virtue ethics and big data in enhancing viable, sustainable, and digital supply chain performance. Technol Forecasting and Social Change 186:122154

Baig MI, Shuib L, Yadegaridehkordi E (2020) Big data in education: a state of the art, limitations, and future research directions. Int J Educ Technol Higher Educ 17(1):1–23

Bansal S, Kumar P, Rawat S, Choudhury T (2018) Analysis and impact of social media and it’s privacy on big data. In 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE), pages 248–253. IEEE

Beauvisage T, Beuscart J-S, Coavoux S, Mellet K (2023) How online advertising targets consumers: the uses of categories and algorithmic tools by audience planners. New Med Soc. https://doi.org/10.1177/14614448221146174

Bhure A, Desai S (2023) Exploring the intersection of big data and legal analytics: a survey of its application in the legal industry. J Big Data Technol Bus Anal (e-ISSN: 2583-7834) 2(1):5–14

Bibri SE (2019) On the sustainability of smart and smarter cities in the era of big data: an interdisciplinary and transdisciplinary literature review. J Big Data 6(1):1–64

Cassandra (2023). Apache cassandra. https://Cassandra.apache.org/

Chang V (2021) An ethical framework for big data and smart cities. Technol Forecast Social Change 165:120559

Chen B, Nie G, Jiang S, Hu N (2022a) Research on the big data-based product quality data package construction and application. In 2022 4th International Conference on Advances in Computer Technology, Information Science and Communications (CTISC), pages 1–6. IEEE

Chen L, Zhang Y, Wang Z (2022b) Logistics service supply chain model applying artificial intelligence and big data analysis. Secur Commun Netw 2022:1575813

Chen X (2022) High-concurrency big data precision marketing and advertising recommendation under 5g wireless communication network environment. J Sens 2022:7609555

Cockcroft S, Russell M (2018) Big data opportunities for accounting and finance practice and research. Aust Account Rev 28(3):323–333

Coursera (2023). Introduction to big data with spark hadoop. https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop/home/week/1

Craig T, Ludloff ME (2011) Privacy and big data: the players, regulators, and stakeholders. O’Reilly Media, Inc.

Dash S, Shakyawar SK, Sharma M, Kaushik S (2019) Big data in healthcare: management, analysis and future prospects. J Big Data 6(1):1–25

Daud A, Abbasi R, Muhammad F (2013) Finding rising stars in social networks. In Database Systems for Advanced Applications: 18th International Conference, DASFAA 2013, Wuhan, China, April 22-25, 2013. Proceedings, Part I 18, pp. 13–24. Springer

Del Vecchio P, Di Minin A, Petruzzelli AM, Panniello U, Pirri S (2018) Big data for open innovation in smes and large corporations: trends, opportunities, and challenges. Creativity Innov Manag 27(1):6–22

Del Vecchio P, Mele G, Siachou E, Schito G (2022) A structured literature review on big data for customer relationship management (crm): toward a future agenda in international marketing. Int Market Rev 39(5):1069–1092

Demirbaga U, Aujla GS (2022) Mapchain: a blockchain-based verifiable healthcare service management in iot-based big data ecosystem. IEEE Trans Netw Service Manag 19(4):3896–3907

Ding X, Gan Q, Shaker MP (2023) Optimal management of parking lots as a big data for electric vehicles using internet of things and long-short term memory. Energy 268:126613

Ducange P, Pecori R, Mezzina P (2018) A glimpse on big data analytics in the framework of marketing strategies. Soft Comput 22(1):325–342

Elhoseny H, Elhoseny M, Riad AM, Hassanien AE (2018) A framework for big data analysis in smart cities. In The international conference on advanced machine learning technologies and applications (AMLTA2018), pages 405–414. Springer

EOS (2023) Analyzing big earth data progress challenges opportunities. https://eos.org/editors-vox/analyzing-big-earth-data-progress-challenges-opportunities

Esfahani H, Tavasoli K, Jabbarzadeh A (2019) Big data and social media: a scientometrics analysis. Int J Data Netw Sci 3(3):145–164

Farchi F, Farchi C, Touzi B, Mabrouki C (2023) A comparative study on ai-based algorithms for cost prediction in pharmaceutical transport logistics. Acadlore Trans Mach Learn 2(3):129–141

Farley SS, Dawson A, Goring SJ, Williams JW (2018) Situating ecology as a big-data science: current advances, challenges, and solutions. BioScience 68(8):563–576

Favaretto M, De Clercq E, Elger BS (2019) Big data and discrimination: perils, promises and solutions: a systematic review. J Big Data 6(1):1–27

Fischer C, Pardos ZA, Baker RS, Williams JJ, Smyth P, Yu R, Slater S, Baker R, Warschauer M (2020) Mining big data in education: affordances and challenges. Rev Res Educ 44(1):130–160

Fosso Wamba S, Gunasekaran A, Papadopoulos T, Ngai E (2018) Big data analytics in logistics and supply chain management. Int J Logist Manag 29(2):478–484

França RP, Monteiro ACB, Arthur R, Iano Y (2021) The fundamentals and potential for cybersecurity of big data in the modern world. In: Maleh Y, Shojafar M, Alazab M, Baddi Y (eds) Machine intelligence and big data analytics for cybersecurity applications. Studies in computational intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-57024-8_3

Galetsi P, Katsaliaki K, Kumar S (2019) Values, challenges and future directions of big data analytics in healthcare: a systematic review. Social Sci Med 241:112533

Galletta A, Carnevale L, Bramanti A, Fazio M (2018) An innovative methodology for big data visualization for telemedicine. IEEE Trans Indus Inform 15(1):490–497

Ghani NA, Hamid S, Hashem IAT, Ahmed E (2019) Social media big data analytics: a survey. Comput Human Behav 101:417–428

Ghofrani F, He Q, Goverde RM, Liu X (2018) Recent applications of big data analytics in railway transportation systems: a survey. Transport Res Part C: Emerging Technol 90:226–246

Google (2023) Google big data query. https://cloud.google.com/bigquery

Hao W (2021) Empirical study on the application of flipper classroom innovation teaching under the context of big data. In 2021 International Conference on Computer Technology and Media Convergence Design (CTMCD), pages 138–142

Hariri RH, Fredericks EM, Bowers KM (2019) Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data 6(1):1–16

Hassani H, Huang X, Silva E (2018) Digitalisation and big data mining in banking. Big Data Cognit Comput 2(3):18

Hassani H, Huang X, Silva E (2019) Big data and climate change. Big Data Cognit Comput 3(1):12

Hayat MK, Daud A, Alshdadi AA, Banjar A, Abbasi RA, Bao Y, Dawood H (2019) Towards deep learning prospects: insights for social media analytics. IEEE Access 7:36958–36979

He X, Wang K, Huang H, Liu B (2018) Qoe-driven big data architecture for smart city. IEEE Commun Mag 56(2):88–93

Himeur Y, Elnour M, Fadli F, Meskin N, Petri I, Rezgui Y, Bensaali F, Amira A (2023) Ai-big data analytics for building automation and management systems: a survey, actual challenges and future perspectives. Artif Intell Rev 56(6):4929–5021

Hong L, Luo M, Wang R, Lu P, Lu W, Lu L (2018) Big data in health care: applications and challenges. Data Inform Manag 2(3):175–197

Hong W, Xiong Z, Zheng N, Weng Y (2019) A medical-history-based potential disease prediction algorithm. IEEE Access 7:131094–131101

Hu G, Liu W, Xu H (2022) Research on hybrid teaching assessment driven by big data. In 2022 2nd International Conference on Big Data Engineering and Education (BDEE), pages 206–209

Huang Y, Chen Z, Yu T, Huang X, Gu X (2018) Agricultural remote sensing big data: management and applications. J Integr Agricul 17(9):1915–1931

Huang H, Yao XA, Krisp JM, Jiang B (2021) Analytics of location-based big data for smart cities: opportunities, challenges, and future directions. Comput Environ Urban Syst 90:101712

Ikegwu AC, Nweke HF, Anikwe CV, Alo UR, Okonkwo OR (2022) Big data analytics for data-driven industry: a review of data sources, tools, challenges, solutions, and research directions. Cluster Comput 25(5):3343–3387

Ikegwu AC, Nweke HF, Mkpojiogu E, Anikwe CV, Igwe SA, Alo UR (2024) Recently emerging trends in big data analytic methods for modeling and combating climate change effects. Energy Inform 7(1):6

Ilieva RT, McPhearson T (2018) Social-media data for urban sustainability. Natu Sustain 1(10):553–565

Inayatulloh Prabowo H, Warnars H LHS, Napitupulu TA, Khairil, Deviarti H (2022) Extended e-learning model to support home schooling with collaboration between teacher, parents and student. In 2022 IEEE International Conference of Computer Science and Information Technology (ICOSNIKOM), (pp. 1–6)

Insider (2023) Last mile delivery shipping explained. https://www.insiderintelligence.com/insights/last-mile-delivery-shipping-explained/

Jahani H, Jain R, Ivanov D (2023) Data science and big data analytics: a systematic review of methodologies used in the supply chain and logistics research. Ann Oper Res. https://doi.org/10.1007/s10479-023-05390-7

Jia Y, Chao K, Cheng X, Xu L, Zhao X, Yao L (2019) Telecom big data based precise user classification scheme. In 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), (pp. 1517–1520). IEEE

Jiang W (2019) An intelligent supply chain information collaboration model based on internet of things and big data. IEEE Access 7:58324–58335

Jieyu L (2020) Research on network advertisement precise delivery system based on big data technology. In 2020 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), pages 794–797. IEEE

Jimenez-Marquez JL, Gonzalez-Carrasco I, Lopez-Cuadrado JL, Ruiz-Mezcua B (2019) Towards a big data framework for analyzing social media content. Int J of Inform Manag 44:1–12

Jiyang Y, Yanbin Z, Jian G (2020) Research on the development of intelligent chemical manufacturing industry in shandong province based on big data analysis. In 2020 2nd International Conference on Industrial Artificial Intelligence (IAI), (pp. 1–6). IEEE

Kandt J, Batty M (2021) Smart cities, big data and urban policy: Towards urban analytics for the long run. Cities 109:102992

Kantarcioglu M, Xi B (2016) Adversarial data mining: Big data meets cyber security. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, (pp. 1866–1867)

Kanth R, Laakso M-J, Nevalainen P, Heikkonen J (2018) Future educational technology with big data and learning analytics. In 2018 IEEE 27th International Symposium on Industrial Electronics (ISIE), (pp. 906–910)

Karimi Y, Haghi Kashani M, Akbari M, Mahdipour E (2021) Leveraging big data in smart cities: a systematic review. Concurr Comput: Pract Exp 33(21):e6379

Khan J, Ahmad N (2023) Security and privacy technique in big data: A review. In 2023 10th International Conference on Computing for Sustainable Global Development (INDIACom), (pp. 1575–1579)

Khan W, Daud A, Nasir JA, Amjad T (2016) A survey on the state-of-the-art machine learning models in the context of nlp. Kuwait J Sci, 43(4)

Knüsel B, Zumwald M, Baumberger C, Hirsch Hadorn G, Fischer EM, Bresch DN, Knutti R (2019) Applying big data beyond small problems in climate research. Natu Climate Change 9(3):196–202

Kushwaha AK, Kar AK, Dwivedi YK (2021) Applications of big data in emerging management disciplines: a literature review using text mining. Int J Inf Manage Data Insights 1(2):100017

Laney D et al (2001) 3d data management: controlling data volume, velocity and variety. META Group Res Note 6(70):1

Learning S (2023) Scikit learning. https://scikit-learn.org/stable/

Lee SE, Ju N, Lee K-H (2023) Service chatbot: Co-citation and big data analysis toward a review and research agenda. Int J Inf Manage Data Insights 194:122722

Leng P, Xiang L, Lin Y, Xiao W, Yang Z, Li D, Nai W (2020) Logistic regression based on artificial fish swarm algorithm with t-distribution parameters. In 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), volume 9, pages 1912–1915

Leung CK, Braun P, Hoi CS, Souza J, Cuzzocrea A (2019) Urban analytics of big transportation data for supporting smart cities. In Big Data Analytics and Knowledge Discovery: 21st International Conference, DaWaK 2019, Linz, Austria, August 26–29, 2019, Proceedings 21, (pp. 24–33). Springer International Publishing

Li B, Zhao S, Zhang R, Shi Q, Yang K (2019a) Anomaly detection for cellular networks using big data analytics. IET Commun 13(20):3351–3359

Li F, Xie R, Wang Z, Guo L, Ye J, Ma P, Song W (2019b) Online distributed iot security monitoring with multidimensional streaming big data. IEEE Internet of Things J 7(5):4387–4394

Li C, Chen Y, Shang Y (2022) A review of industrial big data for decision making in intelligent manufacturing. Eng Sci Technol Int J 29:101021

Li J, Jiang Y (2021) The research trend of big data in education and the impact of teacher psychology on educational development during covid-19: a systematic review and future perspective. Front Psychol 12:753388

Liang J (2020) Research on the application of big data in the informatization of higher education management mode. In 2020 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), (pp. 799–802)

Lies J (2019) Marketing intelligence and big data: digital marketing techniques on their way to becoming social engineering techniques in marketing

Lin L, Zhou D, Wang J, Wang Y (2024) A systematic review of big data driven education evaluation. SAGE Open 14(2):21582440241242180

Liu X, You J (2021) Research on the impact of big data application on technological innovation of chinese new energy vehicle industry. In 2021 2nd International Conference on Big Data Economy and Information Management (BDEIM), pages 323–327. IEEE

Liu X, Shin H, Burns AC (2021) Examining the impact of luxury brands’ social media marketing on customer engagement: using big data analytics and natural language processing. J Bus Res 125:815–826

Luan H, Geczy P, Lai H, Gobert J, Yang SJ, Ogata H, Baltes J, Guerra R, Li P, Tsai C-C (2020) Challenges and future directions of big data and artificial intelligence in education. Front Psychol 11:580820

Lwin KK, Sekimoto Y, Takeuchi W, Zettsu K (2019) City geospatial dashboard: Iot and big data analytics for geospatial solutions provider in disaster management. In 2019 International Conference on Information and Communication Technologies for Disaster Management (ICT-DM), pages 1–4

Lyu X, Zhao J (2019) Compressed sensing and its applications in risk assessment for internet supply chain finance under big data. IEEE Access 7:53182–53187

Makkie M, Li X, Quinn S, Lin B, Ye J, Mon G, Liu T (2018) A distributed computing platform for fmri big data analytics. IEEE Trans Big Data 5(2):109–119

Mani Z, Chouk I (2022) Impact of privacy concerns on resistance to smart services: does the ‘big brother effect’ matter? In The role of smart technologies in decision making, (pp. 94–113). Routledge

Masood I, Wang Y, Daud A, Aljohani NR, Dawood H (2018a) Privacy management of patient physiological parameters. Telem Inform 35(4):677–701

Masood I, Wang Y, Daud A, Aljohani NR, Dawood H (2018b) Towards smart healthcare: patient data privacy and security in sensor-cloud infrastructure. Wireless Commun Mobile Comput 2018:1–23

Mehta N, Pandit A (2018) Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform 114:57–65

Microsoft (2023) Power bi. https://www.microsoft.com/en-us/power-platform/products/power-bi

Miklosik A, Evans N (2020) Impact of big data and machine learning on digital transformation in marketing: a literature review. IEEE Access 8:101284–101292

Mishra D, Gunasekaran A, Papadopoulos T, Childe SJ (2018) Big data and supply chain management: a review and bibliometric analysis. Ann Oper Res 270:313–336

Misra N, Dixit Y, Al-Mallahi A, Bhullar MS, Upadhyay R, Martynenko A (2020) Iot, big data, and artificial intelligence in agriculture and food industry. IEEE Internet Things J 9(9):6305–6324

Mohammadi M, Al A (2018) Enabling cognitive smart cities using big data and machine learning: approaches and challenges. IEEE Commun Mag 56(2):94–101

Mohammadpoor M, Torabi F (2020) Big data analytics in oil and gas industry: an emerging trend. Petroleum 6(4):321–328

Mosavi A, Lopez A, Varkonyi-Koczy AR (2018) Industrial applications of big data: state of the art survey. In Recent Advances in Technology Research and Education: Proceedings of the 16th International Conference on Global Research and Education Inter-Academia 2017 16, pages 225–232. Springer

Muhammad SS, Dey BL, Weerakkody V (2018) Analysis of factors that influence customers’ willingness to leave big data digital footprints on social media: a systematic review of literature. Inf Syst Front 20:559–576

Munshi AA, Alhindi A (2021) Big data platform for educational analytics. IEEE Access 9:52883–52890

Nguyen T, Zhoul L, Spiegler V, Ieromonachou P, Lin Y (2018) Big data analytics in supply chain management: a state-of-the-art literature review. Comput Oper Res 98:254–264

Nguyen T, Gosine RG, Warrian P (2020) A systematic review of big data analytics for oil and gas industry 4.0. IEEE Access 8:61183–61201

Patel D, Shah D, Shah M (2020) The intertwine of brain and body: a quantitative analysis on how big data influences the system of sports. Ann Data Sci 7:1–16

Philip NY, Razaak M, Chang J, M, S., O’Kane, M., and Pierscionek, B. K. (2022) A data analytics suite for exploratory predictive, and visual analysis of type 2 diabetes. IEEE Access 10:13460–13471

Price WN, Cohen IG (2019) Privacy in the age of medical big data. Natu Med 25(1):37–43

Qi C-C (2020) Big data management in the mining industry. Int J Mineral Metall Mater 27:131–139

Qian R, Sengan S, Juneja S (2022) English language teaching based on big data analytics in augmentative and alternative communication system. Int J Speech Technol 25(2):409–420

Rahman MS, Reza H (2022) A systematic review towards big data analytics in social media. Big Data Min Anal 5(3):228–244

Rajabion L, Shaltooki AA, Taghikhah M, Ghasemi A, Badfar A (2019) Healthcare big data processing mechanisms: the role of cloud computing. Int J Inf Manage 49:271–289

Rao SUM, Lakshmanan L (2024) Securing communicating networks in the age of big data: an advanced detection system for cyber attacks. Opt Quantum Electron 56(1):116

Rassam MA, Maarof M, Zainal A, et al (2017) Big data analytics adoption for cybersecurity: A review of current solutions, requirements, challenges and trends. J Inf Assur Secur, 12(4)

Rawat DB, Doku R, Garuba M (2019) Cybersecurity in big data era: from securing big data to data-driven security. 14:2055–2072. IEEE

Rawat R, Oki OA, Sankaran KS, Olasupo O, Ebong GN, Ajagbe SA (2023) A new solution for cyber security in big data using machine learning approach. In Mobile Computing and Sustainable Informatics: Proceedings of ICMCSI 2023, pages 495–505. Springer

Ray S, Saeed M (2018) Applications of educational data mining and learning analytics tools in handling big data in higher education. In: Alani M, Tawfik H, Saeed M, Anya O (eds) Applications of big data analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-76472-6_7

Rehman GU, Zubair M, Qasim I, Badshah A, Mahmood Z, Aslam M, Jilani SF (2022) Ems: Efficient monitoring system to detect non-cooperative nodes in iot-based vehicular delay tolerant networks (vdtns). Sensors 23(1):99

Research A, Consulting (2023) Big data market size to reach usd 473.6 billion by 2030. https://www.acumenresearchandconsulting.com/press-releases/big-data-market

Saravanan S, Prakash G (2021) A comprehensive survey on big data technology based cybersecurity analytics systems. Applied soft computing and communication networks: proceedings of ACN 2020:123–143

Sarker MNI, Peng Y, Yiran C, Shouse RC (2020a) Disaster resilience through big data: way to environmental sustainability. Int J Disaster Risk Reduct 51:101769

Sarker MNI, Yang B, Yang L, Huq ME, Kamruzzaman M (2020b) Climate change adaptation and resilience through big data. Int J Adv Comput Sci Appl. https://doi.org/10.14569/ijacsa.2020.0110368

Sebei H, Hadj Taieb MA, Ben Aouicha M (2018) Review of social media analytics process and big data pipeline. Social Netw Anal Min 8(1):30

Sebestyén V, Czvetkó T, Abonyi J (2021) The applicability of big data in climate change research: the importance of system of systems thinking. Front Environ Sci 9:70

Selmy HA, Mohamed HK, Medhat W (2023) Big data analytics deep learning techniques and applications: a survey. Inform Syst. https://doi.org/10.1016/j.is.2023.102318

Sestino A, Prete MI, Piper L, Guido G (2020) Internet of things and big data as enablers for business digitalization strategies. Technovation 98:102173

Shah SA, Seker DZ, Rathore MM, Hameed S, Yahia SB, Draheim D (2019) Towards disaster resilient smart cities: can internet of things and big data analytics be the game changers? IEEE Access 7:91885–91903

Silva BN, Khan M, Jung C, Seo J, Muhammad D, Han J, Yoon Y, Han K (2018) Urban planning and smart city decision management empowered by real-time data processing using big data analytics. Sensors 18(9):2994

Sivarajah U, Irani Z, Gupta S, Mahroof K (2020) Role of big data and social media analytics for business to business sustainability: a participatory web context. Indus Market Manag 86:163–179

Srivastava N, Jaiswal UC (2019) Big data analytics technique in cyber security: a review. In 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), (pp. 579–585). IEEE

Statista (2023). Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025. https://www.statista.com/statistics/871513/worldwide-data-created/

Stegenga SM, Steltenpohl CN, Lustick H, Meyer MS, Renbarger R, Standiford Reyes L, Lee LE (2024) Qualitative research at the crossroads of open science and big data: ethical considerations. Social Personal Psychol Compass 18(1):12912

Subroto A, Apriyana A (2019) Cyber risk prediction through social media big data analytics and statistical machine learning. J Big Data 6(1):50

Sudmanns M, Tiede D, Lang S, Bergstedt H, Trost G, Augustin H, Baraldi A, Blaschke T (2019) Big earth data: disruptive changes in earth observation data management and analysis? Int J Digital Earth 13(7):832–850

Tableau (2023). Tableau. https://www.tableau.com/

Talaoui Y, Kohtamäki M, Ranta M, Paroutis S (2023) Recovering the divide: a review of the big data analytics—strategy relationship. Long Range Plan 56(2):102290

Talend (2023). Talent. https://www.talend.com/

Tang L, Li J, Du H, Li L, Wu J, Wang S (2022) Big data in forecasting research: a literature review. Big Data Res 27:102290

Tao H, Bhuiyan MZA, Rahman MA, Wang G, Wang T, Ahmed MM, Li J (2019) Economic perspective analysis of protecting big data security and privacy. Futur Gener Comput Syst 98:660–671

Teneiji AL, Salim TYA, Riaz Z (2024) Factors impacting the adoption of big data in healthcare: a systematic literature review. Int J Med Inform. https://doi.org/10.1016/j.ijmedinf.2024.105460

Tensor (2023). Tensorflow. https://www.tensorflow.org/

Thilagavathi C, Rajeswari M, Sheethal M, Devassy D, Priya K, Divya R (2019) Security issues on internet of things in smart cities. In Handbook of Research on Implementation and Deployment of IoT Projects in Smart Cities, (pp. 149–164). IGI Global

Tian L, Wang H, Zhou Y, Peng C (2018) Video big data in smart city: background construction and optimization for surveillance video processing. Futur Gener Comput Syst 86:1371–1382

Tohka J, Van Gils M (2021) Evaluation of machine learning algorithms for health and wellness applications: a tutorial. Comput Biol Med 132:104324

Torre-Bastida AI, Del Ser J, Laña I, Ilardia M, Bilbao MN, Campos-Cordobés S (2018) Big data for transportation and mobility: recent advances, trends and challenges. IET Intell Trans Syst 12(8):742–755

Ullah F, Babar MA (2019) Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw 151:81–118

Ullah F, Babar MA, Aleti A (2022) Design and evaluation of adaptive system for big data cyber security analytics. Expert Syst Appls 207:117948

Vargo CJ, Guo L, Amazeen MA (2018) The agenda-setting power of fake news: a big data analysis of the online media landscape from 2014 to 2016. New Med Society 20(5):2028–2049

Vassakis K, Petrakis E, Kopanakis I (2018) Big data analytics: Applications, prospects and challenges. A roadmap from models to technologies, Mobile big data, pp 3–20

Ved M, B, R (2019) Big data analytics in telecommunication using state-of-the-art big data framework in a distributed computing environment: A case study. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), volume 1, pages 411–416

Walters R, Novak M (2021) Cyber Security, Artificial Intelligence. Springer, Data Protection & the Law

Wang L, Jones R (2021) Big data analytics in cyber security: network traffic and attacks. J Comput Inform Syst 61(5):410–417

Wang X, Yang LT, Chen X, Deen MJ, Jin J (2018) Improved multi-order distributed hosvd with its incremental computing for smart city services. IEEE Trans Sustain Comput 6(3):456–468

Wu SM, Chen T-C, Wu YJ, Lytras M (2018) Smart cities in Taiwan: a perspective on big data applications. Sustainability 10(1):106

Wu M, Hong L, Zhao Y, Chen L, Wang J (2019) Dosage prediction in pediatric medication leveraging prescription big data. IEEE Access 7:94285–94292

Yadav SS, Jadhav SM (2019) Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data 6(1):1–18

Yan L, Huang W, Wang L, Feng S, Peng Y, Peng J (2019) Data-enabled digestive medicine: a new big data analytics platform. IEEE/ACM Trans Comput Biol Bioinform 18(3):922–931

Yang C, Yu M, Li Y, Hu F, Jiang Y, Liu Q, Sha D, Xu M, Gu J (2019) Big earth data analytics: a survey. Big Earth Data 3(2):83–107

Yang C, Lan S, Zhao Z, Zhang M, Wu W, Huang GQ (2022) Edge-cloud blockchain and ioe-enabled quality management platform for perishable supply chain logistics. IEEE Internet Things J 10(4):3264–3275

Yin P, Huang H, Zhao M, Zhu Y (2021) Application of big data marketing in customer relationship management. In Proceedings of the 2021 5th International Conference on E-Education, E-Business and E-Technology, (p. 1)

Yu M, Yang C, Li Y (2018) Big data in natural disaster management: a review. Geosciences 8(5):165

Yuwen Z, Changqin H, Qintai H, Jia Z, Yong T (2018) Personalized learning full-path recommendation model based on lstm neural networks. Inform Sci 444:135–152

Zhang J (2022) Application of computer big data and cloud computing technology in the promotion of e-commerce advertising. In 2022 IEEE 2nd International Conference on Data Science and Computer Application (ICDSCA), pages 834–837. IEEE

Zhang X, Ghorbani AA (2021) Human factors in cybersecurity: issues and challenges in big data. Res Anthol Privat Secur Data, (pp. 1695–1725)

Zhang D, Wang D, Vance N, Zhang Y, Mike S (2018) On scalable and robust truth discovery in big data social media sensing applications. IEEE Trans Big Data 5(2):195–208

Zhang N, Geng B, Hu W, Wen R (2021) The applications of big data analysis in student management education. In 2021 2nd International Conference on Big Data and Informatization Education (ICBDIE) , pages 55–58

Zhang H, Zang Z, Zhu H, Uddin MI, Amin MA (2022) Big data-assisted social media analytics for business model for business decision making system competitive analysis. Inform Process Manag 59(1):102762

Zhang Y, Hong J, Chen S (2023) Medical big data and artificial intelligence for healthcare. Appl Sci 13(6):3745

Zhili Wang, M., Huang, J., Lin, S., and Lv, Z. (2021) Blockchain in big data security for intelligent transportation with 6g. IEEE Transactions on Intelligent Transportation Systems 23(7):9736–9746

Zhou S, He J, Yang H, Chen D, Zhang R (2020) Big data-driven abnormal behavior detection in healthcare based on association rules. IEEE Access 8:129002–129011

Zhou R, Zhang X, Wang X, Yang G, Guizani N, Du X (2021) Efficient and traceable patient health data search system for hospital management in smart cities. IEEE Internet Things J 8(8):6425–6436

Zhu S, Du G (2022) Evaluation of the service capability of maritime logistics enterprises based on the big data of the internet of things supply chain system. IEEE Consum Electron Mag 12(2):100–108

Download references

Author information

Authors and affiliations

Department of Software Engineering, University of Sargodha, Sargodha, Pakistan

Afzal Badshah

Faculty of Resilience, Rabdan Academy, Abu Dhabi, United Arab Emirates

Ali Daud

Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia

Riad Alharbey, Ameen Banjar & Amal Bukhari

Software Engineering Department, College of Computing and Information Sciences, King Saud University, Riyadh, Saudi Arabia

Bader Alshemaimri

Contributions

Afzal and Ali wrote a major part of the paper under the supervision of Riad and Ameen. Ameen, Bader, and Riad helped design and improve the methodology and wrote the initial draft with Afzal and Ali. Ameen and Amal helped improve sections of the paper, such as the review methodology, datasets, and challenges and future directions. Amal, Bader, and Ameen improved the technical writing. All authors were involved in critically revising the manuscript and approved the final version.

Corresponding author

Correspondence to Ali Daud.

Ethics declarations

Conflict of interest.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Badshah, A., Daud, A., Alharbey, R. et al. Big data applications: overview, challenges and future. Artif Intell Rev 57, 290 (2024). https://doi.org/10.1007/s10462-024-10938-5

Accepted: 29 August 2024

Published: 16 September 2024

DOI: https://doi.org/10.1007/s10462-024-10938-5

  • Big data applications
  • Data analytics
Five Key Trends in AI and Data Science for 2024

These developing issues should be on every leader’s radar screen, data executives say.

Artificial intelligence and data science became front-page news in 2023. The rise of generative AI, of course, drove this dramatic surge in visibility. So, what might happen in the field in 2024 that will keep it on the front page? And how will these trends really affect businesses?

During the past several months, we’ve conducted three surveys of data and technology executives. Two involved MIT’s Chief Data Officer and Information Quality Symposium attendees: one sponsored by Amazon Web Services (AWS) and another by Thoughtworks. The third survey was conducted by Wavestone, formerly NewVantage Partners, whose annual surveys we’ve written about in the past. In total, the new surveys involved more than 500 senior executives, perhaps with some overlap in participation.

Surveys don’t predict the future, but they do suggest what those people closest to companies’ data science and AI strategies and projects are thinking and doing. According to those data executives, here are the top five developing issues that deserve your close attention:

1. Generative AI sparkles but needs to deliver value.

As we noted, generative AI has captured a massive amount of business and consumer attention. But is it really delivering economic value to the organizations that adopt it? The survey results suggest that although excitement about the technology is very high, value has largely not yet been delivered. Large percentages of respondents believe that generative AI has the potential to be transformational; 80% of respondents to the AWS survey said they believe it will transform their organizations, and 64% in the Wavestone survey said it is the most transformational technology in a generation. A large majority of survey takers are also increasing investment in the technology. However, most companies are still just experimenting, either at the individual or departmental level. Only 6% of companies in the AWS survey had any production application of generative AI, and only 5% in the Wavestone survey had any production deployment at scale.

Production deployments of generative AI will, of course, require more investment and organizational change, not just experiments. Business processes will need to be redesigned, and employees will need to be reskilled (or, probably in only a few cases, replaced by generative AI systems). The new AI capabilities will need to be integrated into the existing technology infrastructure.

Perhaps the most important change will involve data — curating unstructured content, improving data quality, and integrating diverse sources. In the AWS survey, 93% of respondents agreed that data strategy is critical to getting value from generative AI, but 57% had made no changes to their data thus far.

2. Data science is shifting from artisanal to industrial.

Companies feel the need to accelerate the production of data science models. What was once an artisanal activity is becoming more industrialized. Companies are investing in platforms, processes and methodologies, feature stores, machine learning operations (MLOps) systems, and other tools to increase productivity and deployment rates. MLOps systems monitor the status of machine learning models and detect whether they are still predicting accurately. If they’re not, the models might need to be retrained with new data.
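To make the monitoring idea concrete, here is a minimal Python sketch of one way such a check could work. It is an illustration under stated assumptions, not a description of any vendor's MLOps product: the `AccuracyMonitor` class, the window size, and the accuracy threshold are all hypothetical choices.

```python
from collections import deque


class AccuracyMonitor:
    """Hypothetical helper: tracks rolling accuracy of a deployed model."""

    def __init__(self, window=100, threshold=0.9):
        self.window = window                  # assumed number of recent predictions to keep
        self.threshold = threshold            # assumed minimum acceptable accuracy
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong

    def record(self, prediction, actual):
        """Store whether a prediction matched the outcome observed later."""
        self.outcomes.append(1 if prediction == actual else 0)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag the model once a full window falls below the threshold."""
        if len(self.outcomes) < self.window:
            return False  # not enough evidence yet
        return self.accuracy() < self.threshold
```

In use, each prediction would be recorded once the true outcome becomes known, and a positive `needs_retraining()` result would prompt a review or a retraining job with newer data. Real systems typically track more than accuracy, for example shifts in the input data.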

Most of these capabilities come from external vendors, but some organizations are now developing their own platforms. Although automation (including automated machine learning tools, which we discuss below) is helping to increase productivity and enable broader data science participation, the greatest boon to data science productivity is probably the reuse of existing data sets, features or variables, and even entire models.

3. Two versions of data products will dominate.

In the Thoughtworks survey, 80% of data and technology leaders said that their organizations were using or considering the use of data products and data product management. By data product, we mean packaging data, analytics, and AI in a software product offering, for internal or external customers. It’s managed from conception to deployment (and ongoing improvement) by data product managers. Examples of data products include recommendation systems that guide customers on what products to buy next and pricing optimization systems for sales teams.

But organizations view data products in two different ways. Just under half (48%) of respondents said that they include analytics and AI capabilities in the concept of data products. Some 30% view analytics and AI as separate from data products and presumably reserve that term for reusable data assets alone. Just 16% say they don’t think of analytics and AI in a product context at all.

We have a slight preference for a definition of data products that includes analytics and AI, since that is the way data is made useful. But all that really matters is that an organization is consistent in how it defines and discusses data products. If an organization prefers a combination of “data products” and “analytics and AI products,” that can work well too, and that definition preserves many of the positive aspects of product management. But without clarity on the definition, organizations could become confused about just what product developers are supposed to deliver.

4. Data scientists will become less sexy.

Data scientists, who have been called “unicorns” and the holders of the “sexiest job of the 21st century” because of their ability to make all aspects of data science projects successful, have seen their star power recede. A number of changes in data science are producing alternative approaches to managing important pieces of the work. One such change is the proliferation of related roles that can address pieces of the data science problem. This expanding set of professionals includes data engineers to wrangle data, machine learning engineers to scale and integrate the models, translators and connectors to work with business stakeholders, and data product managers to oversee the entire initiative.

Another factor reducing the demand for professional data scientists is the rise of citizen data science, wherein quantitatively savvy businesspeople create models or algorithms themselves. These individuals can use AutoML, or automated machine learning tools, to do much of the heavy lifting. Even more helpful to citizens is the modeling capability available in ChatGPT called Advanced Data Analysis. With a very short prompt and an uploaded data set, it can handle virtually every stage of the model creation process and explain its actions.

Of course, there are still many aspects of data science that do require professional data scientists. Developing entirely new algorithms or interpreting how complex models work, for example, are tasks that haven’t gone away. The role will still be necessary but perhaps not as much as it was previously — and without the same degree of power and shimmer.

5. Data, analytics, and AI leaders are becoming less independent.

This past year, we began to notice that increasing numbers of organizations were cutting back on the proliferation of technology and data “chiefs,” including chief data and analytics officers (and sometimes chief AI officers). That CDO/CDAO role, while becoming more common in companies, has long been characterized by short tenures and confusion about the responsibilities. We’re not seeing the functions performed by data and analytics executives go away; rather, they’re increasingly being subsumed within a broader set of technology, data, and digital transformation functions managed by a “supertech leader” who usually reports to the CEO. Titles for this role include chief information officer, chief information and technology officer, and chief digital and technology officer; real-world examples include Sastry Durvasula at TIAA, Sean McCormack at First Group, and Mojgan Lefebvre at Travelers.

This evolution in C-suite roles was a primary focus of the Thoughtworks survey, and 87% of respondents (primarily data leaders but some technology executives as well) agreed that people in their organizations are either completely, to a large degree, or somewhat confused about where to turn for data- and technology-oriented services and issues. Many C-level executives said that collaboration with other tech-oriented leaders within their own organizations is relatively low, and 79% agreed that their organization had been hindered in the past by a lack of collaboration.

We believe that in 2024, we’ll see more of these overarching tech leaders who have all the capabilities to create value from the data and technology professionals reporting to them. They’ll still have to emphasize analytics and AI because that’s how organizations make sense of data and create value with it for employees and customers. Most importantly, these leaders will need to be highly business-oriented, able to debate strategy with their senior management colleagues, and able to translate it into systems and insights that make that strategy a reality.

About the Authors

Thomas H. Davenport (@tdav) is the President’s Distinguished Professor of Information Technology and Management at Babson College, a fellow of the MIT Initiative on the Digital Economy, and senior adviser to the Deloitte Chief Data and Analytics Officer Program. He is coauthor of All in on AI: How Smart Companies Win Big With Artificial Intelligence (HBR Press, 2023) and Working With AI: Real Stories of Human-Machine Collaboration (MIT Press, 2022). Randy Bean (@randybeannvp) is an industry thought leader, author, founder, and CEO and currently serves as innovation fellow, data strategy, for global consultancy Wavestone. He is the author of Fail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI (Wiley, 2021).

MINI REVIEW article

How big is big data? A comprehensive survey of data production, storage, and streaming in science and industry

Luca Clissa,

  • 1 Department of Physics and Astronomy, University of Bologna, Bologna, Italy
  • 2 National Institute for Nuclear Physics, Bologna, Italy
  • 3 CERN, Genève, Switzerland

The contemporary surge in data production is fueled by diverse factors, with contributions from numerous stakeholders across various sectors. Comparing the volumes at play among different big data entities is challenging due to the scarcity of publicly available data. This survey aims to offer a comprehensive perspective on the orders of magnitude involved in yearly data generation by some leading public and private organizations, using an array of online sources for estimation. These estimates are based on meaningful, individual data production metrics and plausible per-unit sizes. The primary objective is to offer insights into the comparative scales of major big data players, their sources, and data production flows, rather than striving for precise measurements or incorporating the latest updates. The results are succinctly conveyed through a visual representation of the relative data generation volumes across these entities.

1. Introduction

In the last twenty years, we have witnessed an unprecedented and ever-increasing trend in data production. Hilbert and López (2011) date the rise of this phenomenon back to 2002, marking the onset of the digital age. Indeed, the transition from analog to digital storage devices dramatically augmented the capacity for data accumulation, thereby ushering in the Big Data era.

The term “big data” was first coined in the 1990s (Mashey, 1998; Lohr, 2013) and is typically used to denote datasets whose size exceeds the potential to manipulate and analyze them within reasonable time limits (Snijders et al., 2012). However, the expression does not refer to any specific storage size but assumes a more profound meaning that extends far beyond the sheer volume of data points. In fact, big data embrace a broad spectrum of data sources including structured, semi-structured and, predominantly, unstructured data (Dedić and Stanier, 2016). Although multiple connotations have been attributed to the concept of big data over the years, a commonly shared definition revolves around the so-called 5 Vs (Jain, 2016):

• Volume: the actual quantity of generated data is large, in the order of magnitude of terabytes and petabytes (Sagiroglu and Sinanc, 2013). More generally, it indicates volumes that are too large and complex to be handled with conventional data storage and processing technologies;

• Variety: the data can originate from a multitude of sources and types, including sensors, social media, log files and more, and it covers a diverse range of formats like text, images, audio or video;

• Velocity: the data are generated and/or processed at high rates (Kitchin and McArdle, 2016), typically nearly real-time;

• Value: the data must carry valuable information that provides business value and profitable insights (Uddin et al., 2014). In a scientific context, this translates to information that contributes to the advancement of human knowledge;

• Veracity: the data sources must be reliable and generate high-quality data that can yield value (Schroeck et al., 2012; Onay and Öztürk, 2018).

However, the community has yet to reach a full consensus on the definition of big data (Grimes, 2013; Kitchin and McArdle, 2016), with some authors advocating for a shift in characterization from the intrinsic data properties to the techniques employed for acquisition, storage, circulation and analysis (Balazka and Rodighiero, 2020).

1.1. Big data origins and trends

The rise of the big data era is not solely due to advancements in storage capabilities. In fact, numerous other factors have significantly amplified data generation. The widespread adoption of the internet and the evolution of computer technologies have expanded processing capabilities and simplified data access, catalyzing further data generation. Consequently, there has been an increased contribution from various stakeholders, including tech giants, traditional industries, governments, healthcare institutions, scientific collaborations, and others. Moreover, the emergence of smart everyday objects designed for both receiving and producing data exponentially increased individual contributions to the overall data produced. Modern objects are often equipped with technologies that enable data collection and sharing via a network, commonly referred to as the Internet of Things (Ashton et al., 2009). This phenomenon has further fueled the data production rate. For example, sensors measuring status and operation are now commonly used in industrial machinery and household appliances, simplifying their control and enabling automated maintenance. This trend has also extended to the personal items market, with tech companies increasingly investing in wearable devices such as watches and glasses. These objects allow users to stay connected to a rapidly evolving environment, track personal progress, and explore the world through virtual reality in unprecedented ways. Furthermore, digitization solutions are being explored to address the emerging challenges of our times. For instance, consider the urgent need for modernization of institutional processes posed by the pandemic. The massive spread of infections has required unprecedented access to health assistance. However, the inability to scale up services and equipment correspondingly has led to significant issues and compromised people's safety. In such circumstances, intelligent systems capable of remotely monitoring patients' conditions and providing specialist support would have been enormously beneficial.

Essentially, the trends observed in data production are primarily driven by two key factors: the digital services provided by a multitude of stakeholders from diverse sectors, and their extensive adoption by millions of users globally. This study thoroughly explores this phenomenon by integrating various sources and making two significant contributions: (i) providing informed and up-to-date “guesstimates” of the yearly data production for some of the current top big data entities, and (ii) enabling comparisons among different sectors or data streams, including data production, storage, and transmission.

1.2. Prominent big data producers and sources

The list of organizations contributing to the generation and dissemination of digital data in modern society is extensive, encompassing tech companies, media agencies, institutions, research centers and more. Conducting a comprehensive survey involving all these stakeholders would be exceedingly challenging, if not impossible. Consequently, this study focuses solely on a subset of these entities and conducts a comparative analysis of their yearly data production. Specifically, various online sources are extensively mined to gather information about the volume of content produced, hosted or streamed by some of the major players in the field of big data. The corresponding yearly production rates are then derived based on reasonable estimates of unitary sizes for such content, e.g., the average size of emails or pictures, average data traffic for one hour of video, and so on. Notably, considerations related to storage space are omitted due to the lack of information regarding data management policies, such as data replication and redundancy. Figure 1 illustrates the results of this comparative analysis, while Table 1 summarizes the estimation procedure and the sources of information considered. The reported values are not meant to be pinpoint accurate; rather, they provide a general understanding of the orders of magnitude involved.
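The estimation procedure described above reduces to a single multiplication: items per period, times periods per year, times an assumed per-item size. The following Python sketch illustrates that idea only; it is not the authors' code. The function name `yearly_volume_bytes`, the decimal (SI) unit constants, and the sample service are assumptions introduced here for illustration.

```python
# Decimal (SI) byte units. Assumption: the source does not state whether
# it uses decimal or binary prefixes.
MB = 10**6
PB = 10**15

MINUTES_PER_YEAR = 60 * 24 * 365


def yearly_volume_bytes(items_per_period, periods_per_year, unit_size_bytes):
    """Estimate a yearly volume as count per period x periods per year x size per item."""
    return items_per_period * periods_per_year * unit_size_bytes


# Hypothetical service: 1,000 uploads per minute at an assumed 2 MB each.
example = yearly_volume_bytes(1_000, MINUTES_PER_YEAR, 2 * MB)
print(f"{example / PB:.2f} PB per year")
```

Because every factor is a rough average, the result is best read as an order of magnitude rather than a precise figure, in line with the caveat above.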


Figure 1. Big data sizes. Orders of magnitude involved in different data sources for several big data players. The area of each bubble represents the amount of data streamed, hosted or generated. The accompanying text annotations emphasize the crucial factors considered in the estimation process. Average per-unit sizes are indicated in parentheses, where italic denotes measures derived from reasonable assumptions due to the absence of available references.


Table 1. Summary of the estimation process.

Despite not being widely known among the mainstream audience, the CERN community ( CERN, 2023b ) holds a prominent position in terms of big data production. Indeed, the readout electronics of the physics experiments conducted by CERN scientists using the Large Hadron Collider (LHC) ( CERN, 2023a ) generated roughly 40 zettabytes (ZB) of raw data during its last run (2018) ( Grandi, 2017 ). In comparison, Amazon Simple Storage Service (S3) had stored over 100 trillion objects by 2021, according to Amazon Web Services (AWS) chief evangelist Barr (2021). Assuming an average size of 5 MB per object in a representative S3 bucket (for instance, see Hampton, 2021 ), the total amount of data produced by LHC collisions in one year would exceed the total size of files ever stored on Amazon cloud storage services by a factor of about 80, i.e., 40 ZB against roughly 500 exabytes (EB). However, storing the raw readout data is currently unattainable with existing technology and budget constraints. Moreover, only a fraction of that data is genuinely relevant for the study of new physics phenomena, making it unnecessary to retain all the information. Consequently, the vast majority of raw data is promptly discarded using hardware and software trigger selection systems, significantly reducing the recorded data volume. As a result of this cut, the actual acquisition rate stands at nearly 1 petabyte (PB) per day ( CERN, 2017 ), equivalent to roughly 160 PB in 2018 (see footnote 1). In addition to the actual data collected by the LHC, physics analyses require comparing experimental results with Monte Carlo data, simulated based on current theories, resulting in ~1–2 times as much additional data (see footnote 2) ( Grandi, 2017 ). Furthermore, the CERN community is actively working on enhancing the capabilities of the Large Hadron Collider for the High Luminosity (HL-LHC) upgrade ( Aberle et al., 2020 ). As a consequence, the generated data are expected to increase by a factor of ≥5 ( Aberle et al., 2020 ), resulting in an estimated 800 PB of new data each year by 2026. As for other renowned big data stakeholders such as Google and Meta, the services they provide generate a yearly data production comparable to the effective figures of the LHC, amounting to a few hundred petabytes.
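The comparison between the LHC and Amazon S3 can be retraced with a few lines of arithmetic. The sketch below is a hedged reconstruction that uses only the figures quoted in the text; the decimal unit constants and the variable names are assumptions made here, and the 161 days of data taking come from footnote 1.

```python
# Decimal (SI) byte units (an assumption; the source does not specify).
MB, PB, EB, ZB = 10**6, 10**15, 10**18, 10**21

# Amazon S3: 100 trillion objects at an assumed average of 5 MB each.
s3_total = 100 * 10**12 * 5 * MB

# LHC raw readout during the 2018 run, as quoted in the text.
lhc_raw = 40 * ZB

# Recorded data: about 1 PB per day over 161 days of physics data taking.
lhc_recorded = 1 * PB * 161

# HL-LHC: recorded volume scaled by the lower bound of the expected factor (>= 5).
hl_lhc = lhc_recorded * 5

print(f"S3 total:     {s3_total / EB:.0f} EB")
print(f"LHC raw / S3: {lhc_raw / s3_total:.0f}x")
print(f"LHC recorded: {lhc_recorded / PB:.0f} PB")
print(f"HL-LHC (x5):  {hl_lhc / PB:.0f} PB")
```

The HL-LHC line gives 805 PB here because it starts from 161 PB rather than the rounded 160 PB; the text's 800 PB is the same estimate after rounding.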

For instance, the Google search index tracked at least 30 billion webpages in 2021 ( Van den Bosch et al., 2016 ; Indig, 2020 ; De Kunder, 2021 ; Djuraskovic, 2021 ), which gives a total of 62 PB when considering an average page size of 2.15 MB ( Teague et al., 2021 ). As for YouTube video uploads, 720 thousand hours of footage were uploaded daily ( Dean, 2021b ), resulting in roughly 263 PB per year when assuming an average size of 1 GB per hour ( Vera et al., 2019 ). Similarly, the photos shared on Instagram and Facebook amount to an estimated 68 PB and 252 PB, respectively, given that 65,000 and 240,000 pictures were shared every minute on these social media ( Domo, 2021 ) and assuming 2 MB as the average picture size ( Adobe, 2021 ). The yearly data production increases even further when considering storage services like Dropbox. In 2020, the company reported 100 million new users, 1.17 million of which were paid subscriptions ( Dean, 2021a ). Assuming that free accounts utilized 75% of the 2 GB storage available, and that paid accounts occupied 25% of the total 2 TB, the amount of new storage required by Dropbox users in 2020 is ~768 PB.
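Three of the figures above follow directly from the quoted rates and per-unit sizes. The sketch below retraces them under stated assumptions: decimal (SI) units, a 365-day year, and variable names chosen here for illustration. It is not the authors' own computation.

```python
# Decimal (SI) byte units (an assumption; the source does not specify).
MB, GB, PB = 10**6, 10**9, 10**15

DAYS_PER_YEAR = 365
MINUTES_PER_YEAR = 60 * 24 * DAYS_PER_YEAR

# YouTube: 720,000 hours uploaded per day, assumed 1 GB per hour of video.
youtube = 720_000 * DAYS_PER_YEAR * 1 * GB

# Instagram and Facebook: pictures shared per minute, assumed 2 MB per picture.
instagram = 65_000 * MINUTES_PER_YEAR * 2 * MB
facebook = 240_000 * MINUTES_PER_YEAR * 2 * MB

for name, volume in [("YouTube", youtube), ("Instagram", instagram), ("Facebook", facebook)]:
    print(f"{name:<10} {volume / PB:6.1f} PB per year")
```

Swapping in binary prefixes or a different average size shifts each value by a few percent, which does not change the order of magnitude.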

Apart from the nominal values of generated information, data streaming constitutes a significant slice of the big data market. The continuous flow of small- to medium-sized files results in massive traffic when scaled up to millions of users. For instance, Statista reports that nearly 131 trillion electronic communications were exchanged from October 2020 to September 2021, comprising 71 trillion emails and 60 trillion spam messages ( Statista Research Department, 2021 ). Assuming average sizes of 75 and 5 KB for standard ( Tschabitscher, 2021 ) and junk ( Baker, 2014 ) emails, respectively, this leads to an estimated 5.7 EB of traffic during the analyzed period, surpassing the amounts discussed so far. Another example of substantial data streaming is Netflix, which operates on an even larger scale. The company's user base has experienced significant growth in recent years, particularly due to changes in daily routines imposed by the pandemic. According to the 9th edition of the Data Never Sleeps report by Domo, Netflix users consumed 140 million hours of streaming per day in 2021 ( Domo, 2021 ). This translates to a total of roughly 51.1 EB assuming 1 GB of data per hour of standard definition video ( Perry, 2021 ). Surprisingly, the scientific community also plays an important role in the data streaming context. Indeed, large collaborations comprising thousands of researchers worldwide orchestrate the LHC experiments at CERN. Consequently, the data collected at CERN are continuously transferred via the Worldwide LHC Computing Grid to fuel innovative research ( WLCG, 2023 ). For example, a throughput of 60 GB/s was achieved in 2018 ( WLCG, 2019 ), resulting in a yearly projection of 1.9 EB, which is roughly a third of the global email traffic and only one order of magnitude lower than Netflix usage.
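The streaming estimates can be checked in the same way. The sketch below assumes decimal (SI) units and a 365-day year, and the variable names are illustrative. Under these assumptions the email figure comes out near 5.6 EB, slightly below the 5.7 EB quoted above, which may reflect a different unit convention in the original estimate.

```python
# Decimal (SI) byte units (an assumption; the source does not specify).
KB, GB, EB = 10**3, 10**9, 10**18

SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Email: 71 trillion regular messages at 75 KB plus 60 trillion spam messages at 5 KB.
emails = 71 * 10**12 * 75 * KB + 60 * 10**12 * 5 * KB

# Netflix: 140 million hours streamed per day, assumed 1 GB per hour.
netflix = 140 * 10**6 * 365 * 1 * GB

# WLCG: a sustained 60 GB/s projected over a full year.
wlcg = 60 * GB * SECONDS_PER_YEAR

for name, volume in [("Email", emails), ("Netflix", netflix), ("WLCG", wlcg)]:
    print(f"{name:<8} {volume / EB:5.2f} EB per year")
```

The WLCG projection assumes the peak throughput is sustained all year, so it is best read as an upper-end estimate.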

2. Discussion

The data production rate is currently at its peak, and this trend is expected to continue growing in the coming years. Conducting an exact comparison of the information generated by various organizations contributing to this surge is extremely challenging, if not practically unfeasible. This study aims to offer reasonable indications of the latest orders of magnitude of yearly data production for some of today's main players in the realm of big data. However, it is important to note that the lack of official sources prevents precise estimations of the big data volumes produced by individual organizations. For the same reason, the amount of storage space occupied by these organizations is not considered in this study, as it would require more detailed information about their data management policies.

A fundamental observation that emerges from this survey is that streaming data already account for a significant portion of the big data market, and this is expected to persist in the future due to the growing adoption of smart everyday objects capable of generating and sharing data.

Additionally, a noteworthy finding is that the experimental data collected by the scientific community play a substantial role in the big data phenomenon. Specifically, the data volumes generated by nuclear physics experiments conducted at CERN are comparable to the traffic experienced by some of the most prominent commercial players, such as Google, Meta, and Dropbox.

Author contributions

LC: Conceptualization, Data curation, Investigation, Methodology, Project administration, Validation, Visualization, Writing—original draft, Writing—review and editing. ML: Validation, Writing—review and editing. LR: Writing—review and editing.

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 8 “Pervasive AI” and the European Commission under the NextGeneration EU programme.

Acknowledgments

This work was a reviewed version of the content described by the authors in Clissa (2022a , b ).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. ^ LHC registered 161 days of physics data taking in 2018 ( Todd et al., 2018 ).

2. ^ A factor of 1.5 was adopted here for the bubble plot.

Aberle, O., Béjar Alonso, I., Brüning, O., Fessia, P., Rossi, L., et al. (2020). High-Luminosity Large Hadron Collider (HL-LHC): Technical Design Report . CERN Yellow Reports: Monographs. Geneva: CERN.


Adobe (2021). The Ultimate Guide to Facebook Image Sizes. Blog post . Available online at: https://www.adobe.com/express/discover/sizes/facebook

Ashton, K. (2009). That ‘internet of things' thing. RFID J . 22, 97–114.

Baker, I. (2014). Why Are Email Files So Large? Blog post . Available online at: https://medium.com/@raindrift/how-big-is-email-305bbdb69776

Balazka, D., and Rodighiero, D. (2020). Big data and the little big bang: an epistemological (R) evolution. Front. Big Data 3, 31. doi: 10.3389/fdata.2020.00031


Barr, J. (2021). Celebrate 15 Years of Amazon s3 With ‘Pi Week' Livestream Events. Blog post . Available online at: https://aws.amazon.com/blogs/aws/amazon-s3s-15th-birthday-it-is-still-day-1-after-5475-days-100-trillion-objects/

CERN (2017). What Data to Record? CERN . Available online at: https://home.cern/science/computing/storage

CERN (2023a). Large Hadron Collider. CERN . Available online at: https://home.cern/science/accelerators/large-hadron-collider

CERN (2023b). Who We Are. CERN . Available online at: https://home.cern/about/who-we-are

Clissa, L. (2022a). Supporting Scientific Research through Machine and Deep Learning: Fluorescence Microscopy and Operational Intelligence Use Cases . Bologna: Alma Mater Studiorum Universitá di Bologna.

Clissa, L. (2022b). Survey of big data sizes in 2021. arXiv preprint arXiv: 2202.07659. doi: 10.48550/arXiv.2202.07659


De Kunder, M. (2021). Daily Estimated Size of the World Wide Web. WorldWideWebSize.com . Available online at: https://www.worldwidewebsize.com/

Dean, B. (2021a). Dropbox Usage and Revenue Stats. Blog post. Backlinko . Available online at: https://backlinko.com/dropbox-users

Dean, B. (2021b). How Many People Use Youtube in 2021? Blog post. Backlinko . Available online at: https://backlinko.com/youtube-users#youtube-statistics

Dedić, N., and Stanier, C. (2016). “Towards differentiating business intelligence, big data, data analytics and knowledge discovery,” in International Conference on Enterprise Resource Planning Systems (Cham: Springer), 114–122. doi: 10.1007/978-3-319-58801-8_10

Djuraskovic, O. (2021). Google Search Statistics and Facts 2021 (You Must Know). Blog post. First Site Guide . Available online at: https://firstsiteguide.com/google-search-stats/

Domo (2021). Data Never Sleeps 9. Businesswire . Available online at: https://www.businesswire.com/news/home/20210929005835/en/Domo-Releases-Ninth-Annual-%E2%80%9CData-Never-Sleeps%E2%80%9D-Infographic

Grandi, C. (2017). Computing for Hep Experiments. CERN . Available online at: https://indico.cern.ch/event/605204/contributions/2440577/attachments/1471783/2277732/Calcolo-HEP-Perugia-20170605.pdf

Grimes, S. (2013). Big Data: Avoid ‘Wanna V' Confusion. InformationWeek . Available online at: https://www.informationweek.com/big-data-analytics/big-data-avoid-wanna-v-confusion

Hampton, M. (2021). Why Is AWS S3 Object Count Is Measured iin Units Used to Measure File Size? Serverfault . Available online at: https://www.ringingliberty.com/2021/06/30/why-is-aws-s3-object-count-is-measured-iin-units-used-to-measure-file-size/

Hilbert, M., and López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science 332, 60–65. doi: 10.1126/science.1200970

Indig, K. (2020). Google's Index Is Smaller Than We Think - and Might Not Grow at All. Blog post. kevin-indig.com . Available online at: https://www.kevin-indig.com/googles-index-is-smaller-than-we-think-and-might-not-grow-at-all/

Jain, A. (2016). The 5 V's of Big Data. Blog post. Watson Health Perspectives . Available online at: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/

Kitchin, R., and McArdle, G. (2016). What makes big data, big data? Exploring the ontological characteristics of 26 datasets. Big Data Soc . 3, 2053951716631130. doi: 10.1177/2053951716631130

Lohr, S. (2013). The Origins of ‘Big Data': an Etymological Detective Story. Blog post. The New York Times . Available online at: https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/

Mashey, J. R. (1998). “Big data …and the next wave of infrastress,” in Proceedings of the 1999 USENIX Annual Technical Conference (USENIX) . Available online at: https://static.usenix.org/event/usenix99/invited_talks/mashey.pdf

Onay, C., and Öztürk, E. (2018). A review of credit scoring research in the age of big data. J. Financ. Regul. Compliance . 26, 382–405. doi: 10.1108/JFRC-06-2017-0054

Perry, N. (2021). How Much Data Does Netflix Use? Blog post. Digitaltrends . Available online at: https://www.digitaltrends.com/movies/how-much-data-does-netflix-use/

Sagiroglu, S., and Sinanc, D. (2013). “Big data: a review,” in 2013 International Conference on Collaboration Technologies and Systems (CTS) (San Diego, CA: IEEE), 42–47. doi: 10.1109/CTS.2013.6567202

Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., and Tufano, P. (2012). Analytics: The Real-world Use of Big Data. How Innovative Organizations are Extracting Value from Uncertain Data. New York, NY: IBM Institute for Business Value.

Snijders, C., Matzat, U., and Reips, U.-D. (2012). ‘big data': big gaps of knowledge in the field of internet science. Int. J. Internet Sci . 7, 1–5.

Statista Research Department. (2021). Average Daily Spam Volume Worldwide From October 2020 to September 2021. Statista . Available online at: https://www.statista.com/statistics/1270424/daily-spam-volume-global/

Teague, J., Karamalegos, S., Rebecca, H., Peck, J., and Pollard, B. (2021). Web Almanac. HTTP Archive . Available online at: https://almanac.httparchive.org/en/2021/page-weight

Todd, B., Ponce, L., Apollonio, A., and Walsh, D. J. (2018). LHC Availability 2018: Proton Physics . Technical report. Geneva: CERN.

Tschabitscher, H. (2021). Why Are Email Files So Large? Blog post. Lifewire . Available online at: https://www.lifewire.com/what-is-the-average-size-of-an-email-message-1171208

Uddin, M. F., and Gupta, N. (2014). “Seven v's of big data understanding big data to extract value,” in Proceedings of the 2014 zone 1 conference of the American Society for Engineering Education (Bridgeport, CT: IEEE), 1–5. doi: 10.1109/ASEEZone1.2014.6820689

Van den Bosch, A., Bogers, T., and De Kunder, M. (2016). Estimating search engine index size variability: a 9-year longitudinal study. Scientometrics 107, 839–856. doi: 10.1007/s11192-016-1863-z

Vera, P., James, M., and Dan, S. (2019). What Is the File Size of a One Hour Youtube Video? Quora . Available online at: https://www.quora.com/What-is-the-file-size-of-a-one-hour-YouTube-video

WLCG (2019). LHC Run 2 (2014–2018). WLCG . Available online at: https://wlcg-public.web.cern.ch/about/

WLCG (2023). Welcome to the Worldwide LHC Computing Grid. CERN . Available online at: https://wlcg.web.cern.ch/

Keywords: big data, data production, data volumes, data storage, streaming data

Citation: Clissa L, Lassnig M and Rinaldi L (2023) How big is Big Data? A comprehensive survey of data production, storage, and streaming in science and industry. Front. Big Data 6:1271639. doi: 10.3389/fdata.2023.1271639

Received: 02 August 2023; Accepted: 20 September 2023; Published: 19 October 2023.


Copyright © 2023 Clissa, Lassnig and Rinaldi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Luca Clissa, luca.clissa@bo.infn.it

† ORCID: Luca Clissa orcid.org/0000-0002-4876-5200 Mario Lassnig orcid.org/0000-0002-9541-0592 Lorenzo Rinaldi orcid.org/0000-0001-9608-9940



Top trends in big data for 2024 and beyond

Big data is driving changes in how organizations process, store and analyze data. The benefits are spurring even more innovation. Here are four big trends.

TechTarget Contributor


Big data is proving its value to organizations of all types and sizes in a wide range of industries. Enterprises that make advanced use of it are realizing tangible business benefits, from improved efficiency in operations and increased visibility into rapidly changing business environments to the optimization of products and services for customers.

The result is that as organizations find uses for these typically large stores of data, big data technologies, practices and approaches are evolving. New types of big data architectures and techniques for collecting, processing, managing and analyzing the gamut of data across an organization continue to emerge.

Dealing with big data is more than just dealing with large volumes of stored information. Volume is just one of the many V's of big data that organizations need to address. There usually is also a significant variety of data -- from structured information sitting in databases distributed throughout the organization to vast quantities of unstructured and semistructured data residing in files, images, videos, sensors, system logs, text and documents, including paper ones that are waiting to be digitized. In addition, this information often is created and changed at a rapid rate (velocity) and has varying levels of data quality (veracity), creating further challenges on data management, processing and analysis.

Four major trends in big data, identified by industry experts, are helping organizations meet those challenges and get the benefits they're seeking. Here's a look at the trends and what they mean for organizations that are investing in big data deployments.


1. Generative AI, advanced analytics and machine learning continue to evolve

With the vast amount of data being generated, traditional analytics approaches are challenged because they're not easily automated for data analysis at scale. Distributed processing technologies, especially those promoted by open source platforms such as Hadoop and Spark, enable organizations to process petabytes of information at rapid speed. Enterprises are then using big data analytics technologies to optimize their business intelligence and analytics initiatives, moving past slow reporting tools dependent on data warehouse technology to more intelligent, responsive applications that enable greater visibility into customer behavior, business processes and overall operations.

The evolution of big data analytics continues to focus on machine learning and AI systems. Increasingly, AI is used by organizations of all sizes to optimize and improve their business processes. In the Enterprise Strategy Group spending intentions survey, 63% of the 193 respondents familiar with AI and machine learning initiatives in their organization said they expected their organization to spend more on those tools in 2023.

Machine learning enables organizations to identify data patterns more easily, detect anomalies in large data sets, and support predictive analytics and other advanced data analysis capabilities. Some examples of that include the following:

  • Recognition systems for image, video and text data.
  • Automated classification of data.
  • Natural language processing (NLP) capabilities for chatbots and voice and text analysis.
  • Autonomous business process automation.
  • Personalization and recommendation features in websites and services.
  • Analytics systems that can find optimal solutions to business problems among a sea of data.
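To make the anomaly detection use case mentioned above more concrete, here is a minimal Python sketch. It uses a plain z-score rule rather than a trained machine learning model, so it should be read as a simplified stand-in; the function name `zscore_anomalies`, the threshold of 3 standard deviations, and the sample readings are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious outlier at index 9.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 55.0, 10.1, 9.8]
print(zscore_anomalies(readings))
```

A rule like this is easy to reason about but sensitive to the outliers it is trying to find, which is one reason production systems often prefer learned or robust statistical models.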

Indeed, with the help of AI and machine learning, companies are using their big data environments to provide deeper customer support through intelligent chatbots and more personalized interactions without requiring significant increases in customer support staff. These AI-enabled systems are able to collect and analyze vast amounts of information about customers and users, especially when paired with a data lake strategy that can aggregate a wide range of information across many sources.

Enterprises are also seeing innovations in the area of data visualization. People understand the meaning of data better when it's represented in a visualized form, such as charts, graphs and plots. Emerging forms of data visualization are putting the power of AI-enabled analytics into the hands of even casual business users. This helps organizations spot key insights that can improve decision-making. Advanced forms of visualization and analytics tools even let users ask questions in natural language, with the system automatically determining the right query and showing the results in a context-relevant manner.

Generative AI and large language models (LLMs) improve an organization's data operations even more with benefits across the entire data pipeline. Generative AI can help automate data observability monitoring functions, improve quality and efficiency with proactive alerts and fixes for identified issues, and even write lines of code. It can scan large sets of data for errors or inconsistencies or identify patterns and generate reports or visualizations of the most important details for data teams. LLMs provide new data democratization capabilities to organizations. Data cataloging, integration, privacy, governance and sharing are all on the rise as generative AI weaves itself into data management processes.

The power of generative AI and LLMs depends on the quality of the data used to train the model. Data quality is more important than ever as interest in and use of generative AI continue to rise in all industries. Data teams must carefully monitor the results of all AI-generated data operations. Incorrect or misguided data can lead to wrong decisions and costly outcomes.

Four major trends in big data.

2. More data, increased data diversity drive advances in processing and the rise of edge computing

The pace of data generation continues to accelerate. Much of this data isn't generated from the business transactions that happen in databases -- instead, it comes from other sources, including cloud systems, web applications, video streaming and smart devices such as smartphones and voice assistants. This data is largely unstructured and in the past was left mostly unprocessed and unused by organizations, turning it into so-called dark data.

That brings us to the biggest trend in big data: Non-database sources will continue to be the dominant generators of data, in turn forcing organizations to reexamine their needs for data processing. Voice assistants and IoT devices, in particular, are driving a rapid ramp-up in big data management needs across industries as diverse as retail, healthcare, finance, insurance, manufacturing and energy and in a wide range of public-sector markets. This explosion in data diversity is compelling organizations to think beyond the traditional data warehouse as a means for processing all this information.

In addition, the need to handle the data being generated is moving to the devices themselves, as industry breakthroughs in processing power have led to the development of increasingly advanced devices capable of collecting and storing data on their own without taxing network, storage and computing infrastructure. For example, mobile banking apps can handle many tasks for remote check deposit and processing without having to send images back and forth to central banking systems for processing.

The use of devices for distributed processing is embodied in the concept of edge computing, which shifts the processing load to the devices themselves before the data is sent to the servers. Edge computing optimizes performance and storage by reducing the need for data to flow through networks. That lowers computing and processing costs, especially cloud storage, bandwidth and processing expenses. Edge computing also helps to speed up data analysis and provides faster responses to the user.
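One way to picture the edge computing idea is a device that summarizes its raw readings locally and uploads only the summaries. The Python sketch below is illustrative only: the function `summarize_on_device`, the window size, and the sample temperature readings are hypothetical and not tied to any specific edge platform.

```python
from statistics import mean


def summarize_on_device(readings, window=5):
    """Collapse raw readings into per-window min/max/mean summaries before upload."""
    summaries = []
    for start in range(0, len(readings), window):
        chunk = readings[start:start + window]
        summaries.append({"min": min(chunk), "max": max(chunk), "mean": mean(chunk)})
    return summaries


# Hypothetical temperature readings collected on the device.
raw = [20.0, 20.5, 21.0, 20.5, 20.0, 22.0, 22.5, 23.0, 22.5, 22.0]
payload = summarize_on_device(raw)
print(len(raw), "raw readings ->", len(payload), "summaries sent")
```

The trade-off is that detail discarded on the device cannot be recovered later, so the choice of summary depends on what the central system actually needs.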

3. Big data storage needs spur innovations in cloud and hybrid cloud platforms, growth of data lakes

To deal with the inexorable increase in data generation, organizations are spending more of their resources storing this data in a range of cloud-based and hybrid cloud systems optimized for all the V's of big data. In previous decades, organizations handled their own storage infrastructure, resulting in massive data centers that enterprises had to manage, secure and operate. The move to cloud computing changed that dynamic. By shifting the responsibility to cloud infrastructure providers -- such as AWS, Google, Microsoft, Oracle and IBM -- organizations can deal with almost limitless amounts of new data and pay for storage and compute capability on demand without having to maintain their own large and complex data centers.

Some industries are challenged in their use of cloud infrastructure due to regulatory or technical limitations. For example, heavily regulated industries -- such as healthcare, financial services and government -- have restrictions that prevent the use of public cloud infrastructure. As a result, over the past decade, cloud providers have developed ways to provide more regulatory-friendly infrastructure, as well as hybrid approaches that combine aspects of third-party cloud systems with on-premises computing and storage to meet critical infrastructure needs. The evolution of both public cloud and hybrid cloud infrastructures will no doubt progress as organizations seek the economic and technical advantages of cloud computing.

In addition to innovations in cloud storage and processing, enterprises are shifting toward new data architecture approaches that allow them to handle the variety, veracity and volume challenges of big data. Rather than trying to centralize data storage in a data warehouse that requires complex and time-intensive extract, transform and load processes, enterprises are evolving the concept of the data lake. Data lakes store structured, semistructured and unstructured data sets in their native format. This approach shifts the responsibility for data transformation and preparation to end users who have different data needs. The data lake can also provide shared services for data analysis and processing.
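The idea of keeping data in its native format and transforming it only when it is read is often called schema-on-read. The toy Python sketch below illustrates that idea under stated assumptions: the in-memory `lake` list, the two sample payloads, and the `read_orders` function are hypothetical and stand in for real object storage and query engines.

```python
import csv
import io
import json

# A toy "lake": raw payloads kept exactly as they arrived, tagged only with their format.
lake = [
    ("json", '{"customer": "a17", "amount": "19.90"}'),
    ("csv", "customer,amount\nb42,5.00\n"),
]


def read_orders(lake):
    """Apply a schema at read time: yield (customer, amount as float) tuples."""
    for fmt, payload in lake:
        if fmt == "json":
            record = json.loads(payload)
            yield record["customer"], float(record["amount"])
        elif fmt == "csv":
            for row in csv.DictReader(io.StringIO(payload)):
                yield row["customer"], float(row["amount"])


print(list(read_orders(lake)))
```

Each consumer can define its own reader over the same raw payloads, which is the sense in which transformation work moves from a central load step to the end users.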

4. DataOps and data stewardship move to the fore

Many aspects of big data processing, storage and management will see continued evolution for years to come. Much of this innovation is driven by technology needs, but also partly by changes in the way we think about and relate to data.

One area of innovation is the emergence of DataOps, a methodology and practice that focuses on agile, iterative approaches for dealing with the full lifecycle of data as it flows through the organization. Rather than thinking about data in piecemeal fashion with separate people dealing with data generation, storage, transportation, processing and management, DataOps processes and frameworks address organizational needs across the data lifecycle from generation to archiving.

Likewise, organizations are increasingly dealing with data governance, privacy and security issues, a situation that is exacerbated by big data environments. In the past, enterprises often were somewhat lax about concerns around data privacy and governance, but new regulations make them much more liable for what happens to personal information in their systems. Generative AI adds another layer of privacy and ethics concerns for organizations to consider.

Due to widespread security breaches, eroding customer trust in enterprise data-sharing practices, and challenges in managing data over its lifecycle, organizations are becoming more focused on data stewardship and working harder to properly secure and manage data, especially as it crosses international boundaries. New tools are emerging to make sure that data stays where it needs to stay, is secured at rest and in motion, and is appropriately tracked over its lifecycle.

Collectively, these big data trends will continue to shape the big data landscape in 2024.

Editor's note: Trends were identified by industry experts and research. This article was written in 2021. TechTarget editors revised it in 2024 to improve the reader experience.


Big data: The next frontier for innovation, competition, and productivity

The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office. Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented managers. The increasing volume and detail of information captured by enterprises, the rise of multimedia, social media, and the Internet of Things will fuel exponential growth in data for the foreseeable future.

MGI studied big data in five domains—healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally. Big data can generate value in each. For example, a retailer using big data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. Two-thirds of that would be in the form of reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture $600 billion in consumer surplus. The research offers seven key insights.

1. Data have swept into every industry and business function and are now an important factor of production, alongside labor and capital. We estimate that, by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart's data warehouse in 1999) per company with more than 1,000 employees.

2. There are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at much higher frequency. Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for basic low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time. Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services. Fourth, sophisticated analytics can substantially improve decision-making. Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

3. The use of big data will become a key basis of competition and growth for individual firms. From the standpoint of competitiveness and the potential capture of value, all companies need to take big data seriously. In most industries, established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value from deep and up-to-real-time information. Indeed, we found early examples of such use of data in every sector we examined.

4. The use of big data will underpin new waves of productivity growth and consumer surplus. For example, we estimate that a retailer using big data to the full has the potential to increase its operating margin by more than 60 percent. Big data offers considerable benefits to consumers as well as to companies and organizations. For instance, services enabled by personal-location data can allow consumers to capture $600 billion in economic surplus.

5. While the use of big data will matter across sectors, some sectors are set for greater gains. We compared the historical productivity of sectors in the United States with the potential of these sectors to capture value from big data (using an index that combines several quantitative metrics), and found that the opportunities and challenges vary from sector to sector. The computer and electronic products and information sectors, as well as finance and insurance, and government are poised to gain substantially from the use of big data.

6. There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

7. Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also structure workflows and incentives to optimize the use of big data. Access to data is critical—companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.


The Top 5 Data Science And Analytics Trends In 2023


Data is increasingly the differentiator between winners and also-rans in business. Today, information can be captured from many different sources, and technology to extract insights is becoming increasingly accessible.

Moving to a data-driven business model – where decisions are made based on what we know to be true rather than “gut feeling” – is core to the wave of digital transformation sweeping through every industry in 2023 and beyond. It helps us to react with certainty in the face of uncertainty – especially when wars and pandemics upset the established order of things.

But the world of data and analytics never stands still. New technologies are constantly emerging that offer faster and more accurate access to insights. And new trends emerge, bringing us new thinking on the best ways to put it to work across business and society at large. So, here’s my rundown of what I believe are the most important trends that will affect the way we use data and analytics to drive business growth in 2023.

Data Democratization

One of the most important trends will be the continued empowerment of entire workforces – rather than data engineers and data scientists – to put analytics to work. This is giving rise to new forms of augmented working, where tools, applications, and devices push intelligent insights into the hands of everybody in order to allow them to do their jobs more effectively and efficiently.

In 2023, businesses will understand that data is the key to understanding customers, developing better products and services, and streamlining their internal operations to reduce costs and waste. However, it’s becoming increasingly clear that this won’t fully happen until the power to act on data-driven insights is available to frontline, shop floor, and non-technical staff, as well as functions such as marketing and finance.

Some great examples of data democracy in practice include lawyers using natural language processing (NLP) tools to scan pages of documents of case law, or retail sales assistants using hand terminals that can access customer purchase history in real time and recommend products to up-sell and cross-sell. Research by McKinsey has found that companies that make data accessible to their entire workforce are 40 times more likely to say analytics has a positive impact on revenue.

Artificial Intelligence

Artificial intelligence (AI) is perhaps the one technology trend that will have the biggest impact on how we live, work and do business in the future. Its effect on business analytics will be to enable more accurate predictions, reduce the amount of time we spend on mundane and repetitive work like data gathering and data cleansing, and to empower workforces to act on data-driven insights, whatever their role and level of technical expertise (see Data Democratization, above).

Put simply, AI allows businesses to analyze data and draw out insights far more quickly than would ever be possible manually, using software algorithms that get better and better at their job as they are fed more data. This is the basic principle of machine learning (ML), which is the form of AI used in business today. AI and ML technologies include NLP, which enables computers to understand and communicate with us in human languages; computer vision, which enables computers to understand and process visual information using cameras, just as we do with our eyes; and generative AI, which can create text, images, sounds and video from scratch.

Cloud and Data-as-a-Service

I’ve put these two together because cloud is the platform that enables data-as-a-service technology to work. Basically, it means that companies can access data sources that have been collected and curated by third parties via cloud services on a pay-as-you-go or subscription-based billing model. This reduces the need for companies to build their own expensive, proprietary data collection and storage systems for many types of applications.

As well as raw data, DaaS companies offer analytics tools as-a-service. Data accessed through DaaS is typically used to augment a company’s proprietary data that it collects and processes itself in order to create richer and more valuable insights. It plays a big part in the democratization of data mentioned previously, as it allows businesses to work with data without needing to set up and maintain expensive and specialized data science operations. In 2023, it’s estimated that the value of the market for these services will grow to $10.7 billion.

Real-Time Data

When digging into data in search of insights, it's better to know what's going on right now – rather than yesterday, last week, or last month. This is why real-time data is increasingly becoming the most valuable source of information for businesses.

Working with real-time data often requires more sophisticated data and analytics infrastructure, which means more expense, but the benefit is that we’re able to act on information as it happens. This could involve analyzing clickstream data from visitors to our website to work out what offers and promotions to put in front of them, or in financial services, it could mean monitoring transactions as they take place around the world to watch out for warning signs of fraud. Social media sites like Facebook analyze hundreds of gigabytes of data per second for various use cases, including serving up advertising and preventing the spread of fake news. And in South Africa’s Kruger National Park, a joint initiative between the WWF and ZSL analyzes video footage in real-time to alert law enforcement to the presence of poachers.

As more organizations look to data to provide them with a competitive edge, those with the most advanced data strategies will increasingly look towards the most valuable and up-to-date data. This is why real-time data and analytics will be the most valuable big data tools for businesses in 2023.

Data Governance and Regulation

Data governance will also be big news in 2023 as more governments introduce laws designed to regulate the use of personal and other types of data. In the wake of the likes of European GDPR, Canadian PIPEDA, and Chinese PIPL, other countries are likely to follow suit and introduce legislation protecting the data of their citizens. In fact, analysts at Gartner have predicted that by 2023, 65% of the world’s population will be covered by regulations similar to GDPR.

This means that governance will be an important task for businesses over the next 12 months, wherever they are located in the world, as they move to ensure that their internal data processing and handling procedures are adequately documented and understood. For many businesses, this will mean auditing exactly what information they have, how it is collected, where it is stored, and what is done with it. While this may sound like extra work, in the long term, the idea is that everyone will benefit as consumers will be more willing to trust organizations with their data if they are sure it will be well looked after. Those organizations will then be able to use this data to develop products and services that align more closely with what we need at prices we can afford.

Bernard Marr


  • Open access
  • Published: 28 May 2023

Time series big data: a survey on data stream frameworks, analysis and algorithms

  • Ana Almeida 1 , 2 ,
  • Susana Brás 2 , 3 ,
  • Susana Sargento 1 , 2 &
  • Filipe Cabral Pinto 1 , 4  

Journal of Big Data volume 10, Article number: 83 (2023)


Big data has a substantial role nowadays, and its importance has significantly increased over the last decade. Big data’s biggest advantages are providing knowledge, supporting the decision-making process, and improving the use of resources, services, and infrastructures. The potential of big data increases when we apply it in real-time by providing real-time analysis, predictions, and forecasts, among many other applications. Our goal with this article is to provide a viewpoint on how to build a system capable of processing big data in real-time, performing analysis, and applying algorithms. A system should be designed to handle vast amounts of data and provide valuable knowledge through analysis and algorithms. This article explores the current approaches and how they can be used for real-time operations and predictions.

Introduction

The concept of big data was mentioned for the first time in a paper published in 1997 [ 1 ]. The authors called the problem of dealing with large data sets, “the problem of big data”. These large data sets were characterized by not fitting in the main memory, making it challenging or even impossible to analyze and visualize them. Even 25 years later, most computers cannot load 100 GB to memory, let alone process it.

In the current era, data is produced at high rates, information plays a decisive role, and most computers cannot process vast amounts of data; new ways of processing data were therefore needed. These factors were the main impetus for the emergence of big data technologies.

The first approach to dealing with big data sets was to divide them into smaller segments. However, even then, the segments could be very large in most cases. Besides, few computers were able to perform this type of processing. To tackle this issue, frameworks started to appear to deal with batches of data. Nevertheless, none of these approaches deals with one big problem: what can be done if the data set keeps growing, and data continues to be received over time? To answer this question, several frameworks that deal with data streams have appeared.
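To make the contrast between batch and stream processing concrete, the following is a minimal sketch of incremental computation over data that never has to fit in memory. It uses Welford's online update for mean and variance, which is an illustrative choice and not a technique taken from the surveyed frameworks; the function name `running_mean_variance` and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_variance(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, sample variance) after each value (Welford's update).

    Only three numbers are kept in memory, so the input may be unbounded.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        variance = m2 / (count - 1) if count > 1 else 0.0
        yield count, mean, variance


if __name__ == "__main__":
    readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sensor values
    for count, mean, variance in running_mean_variance(readings):
        print(count, round(mean, 3), round(variance, 3))
```

Because the state is constant in size, the same loop works whether the input is a finite list or a generator that keeps producing values, which is the property that batch-only approaches lack.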

The main goals of using big data in multidimensional, large-sample datasets are (1) predicting future events and (2) gaining insights and discovering relationships [ 2 ]. However, these goals bring challenges in terms of computation and methods.

Predicting future events is also known as forecasting. Forecasting tasks typically involve time series data. Processing and analyzing time series data in real-time can be a game-changer for an organization. This article will focus on time series data. Three tasks stand out in the analysis and prediction of time series data: monitoring, forecasting, and anomaly detection. These tasks benefit from being executed in real-time. Moreover, these tasks can be applied to many contexts and use cases. Therefore, it is important to use a streaming framework to process data as it arrives.

Anomaly detection in data streams is beneficial and essential for organizations to detect problems before they grow larger: for instance, noticing an intrusion before the intruder can steal or damage data. Another example is detecting unexpected traffic congestion and alerting the responsible authorities. Therefore, anomaly detection in time series data will also be addressed in this article.
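As a rough illustration of detecting anomalies while data arrives, the sketch below flags a value when it lies more than a chosen number of standard deviations from the mean of the preceding window. This rolling z-score rule is a simple statistical baseline chosen for illustration; the function name `detect_anomalies`, the window length, and the threshold are all assumptions rather than values from the survey.

```python
from collections import deque
from math import sqrt
from typing import Deque, Iterable, Iterator, Tuple


def detect_anomalies(
    stream: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for points far from the mean of the previous `window` values."""
    recent: Deque[float] = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(recent) == window:
            mean = sum(recent) / window
            std = sqrt(sum((v - mean) ** 2 for v in recent) / window)
            # Skip the test when the window is constant (std == 0).
            if std > 0 and abs(x - mean) > threshold * std:
                yield i, x
        # The value joins the window even if it was flagged.
        recent.append(x)


if __name__ == "__main__":
    traffic = [10, 11, 10, 9, 10, 11, 50, 10, 9, 11]  # hypothetical measurements
    print(list(detect_anomalies(traffic)))
```

One limitation of this sketch is that a flagged value still enters the window and inflates the next few standard deviations; a production detector would usually exclude or down-weight it.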

Using data streams in different contexts allows us to extract knowledge and make decisions in real-time (or near real-time). This article will explore how we can deal with big data, particularly, time series big data. This article will also analyse which algorithms can be applied to data to make forecasts and detect anomalies.

The main contributions of this work can be summarized as follows:

A comparative analysis of Stream Processing Engines (SPEs), including their characteristics and provenance, processing techniques, delivery of events, performance, and popularity.

A discussion on forecasting algorithms, including statistical and Machine Learning (ML) algorithms, and the advantages and disadvantages of using each type of algorithm.

A discussion on anomaly detection algorithms, the challenges of working with datasets containing anomalies, and the methods used to detect anomalies, such as statistical and ML approaches.

A comparative analysis of SPEs led us to conclude that Spark is the most popular framework; however, Flink is better for data-intensive applications, and Heron scales better. Forecasting and anomaly detection methods bring value to organizations. While forecasting can allow better management of resources, anomaly detection can mitigate and eliminate problems. Regarding the type of methods used, statistical methods are usually lighter and more explainable, while machine learning methods are better when we have complex hidden patterns. The most recent published papers show a preference for deep learning techniques.

Working with huge amounts of streaming time series data can be a challenging task. With this in mind, we want to guide the reader on how this can be achieved. We will focus on three key relevant aspects:

Stream processing frameworks: these frameworks make it possible to process huge amounts of data, perform analysis, and apply algorithms in real-time.

Forecasting algorithms: these algorithms make it possible to predict future events. Therefore, they are essential for many organizations to make informed decisions, manage resources, and improve services, among others.

Anomaly detection algorithms: these algorithms make it possible to identify abnormal or unusual patterns, which can be early symptoms of something going wrong and deserve attention. They help us to improve security, quality, and efficiency.
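To give a flavour of the lightweight statistical forecasting methods mentioned above, here is a minimal sketch of simple exponential smoothing updated one observation at a time. It is an illustrative baseline and not a method the survey singles out; the class name `ExponentialSmoothingForecaster` and the smoothing factor `alpha` are assumptions.

```python
from typing import Optional


class ExponentialSmoothingForecaster:
    """One-step-ahead forecaster that keeps a single smoothed level as state."""

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.level: Optional[float] = None

    def update(self, value: float) -> None:
        """Fold a new observation into the smoothed level."""
        if self.level is None:
            self.level = value  # initialise with the first observation
        else:
            self.level = self.alpha * value + (1 - self.alpha) * self.level

    def forecast(self) -> float:
        """Return the forecast for the next time step."""
        if self.level is None:
            raise ValueError("no observations yet")
        return self.level


if __name__ == "__main__":
    model = ExponentialSmoothingForecaster(alpha=0.5)
    for demand in [10.0, 20.0, 30.0]:  # hypothetical observations
        model.update(demand)
    print(model.forecast())
```

Larger values of `alpha` react faster to recent observations, while smaller values smooth out noise; the model has no notion of trend or seasonality, which is where richer statistical or ML models come in.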

Although the main focus of this work is the literature review on streaming frameworks, since we aim to work with time series data, we will also review the forecasting and anomaly detection algorithms; they play a crucial role in taking advantage of real-time processing capabilities. Therefore, with this survey, we aim to:

Identify the most relevant state-of-the-art regarding both data streams and algorithms.

Evaluate and compare different frameworks and methods to highlight each method or framework’s strengths, weaknesses, and limitations and when they should be applied.

Provide a guide for future research by identifying gaps in the current literature, areas that need further investigation, and other opportunities.

Related work

This subsection provides an overview of other related surveys presented in the literature. Table  1 summarizes the subjects mentioned in the works presented in this article, both surveys and research works. In this section we will address the survey articles.

This article presents a literature review on how to process huge amounts of time series that are continuously being produced over time and need to be processed in real-time. Therefore, in Table  1 , we consider papers regarding big data, stream processing, real-time processing, machine learning and deep learning, forecasting, and anomaly detection. In addition, we reviewed both surveys and research articles. To the best of our knowledge, no single paper analyzes all these topics. Nevertheless, we will compare our study with the most relevant works.

The most significant difference with work [ 9 ] regarding big data streams is that the authors of work [ 9 ] compared several tools, technologies, methods and techniques regarding data streams. However, we are more focused on data stream processing frameworks. In addition, the authors of [ 3 ] also discussed the concept of real-time associated with the processing of data streams, while the authors of [ 10 ] only perform a brief comparison of streaming processing frameworks. The authors of [ 10 ] conducted some practical evaluations of the streaming processing frameworks. Our survey presents a literature review. Similar to the work presented in [ 11 ], we are also researching progress in big data-oriented stream data mining; however, we focus on time series related problems, namely forecasting and anomaly detection.

Article structure

The remainder of this article is organized as follows. " Big data stream processing frameworks " section is focused on big data and data stream processing frameworks. It starts by discussing the problem definition, followed by existing solutions, and then presents the elaborations and a summary. This section characterizes big data and discusses its relationship with data streams, forecasting methods, and anomaly detection. We also present frameworks for processing data streams, compare them, and discuss some example cases where each one can be applied. Next, " Analysis and algorithms for streaming data " section discusses algorithms that can be applied in the context of big data, namely forecasting concepts and methods (" Time series forecasting " section) and anomaly detection strategies (" Anomaly detection " section). In this section, we focus on statistical, ML, and Deep Learning (DL) methods and their advantages and disadvantages. Each of these two sections is organized in a similar way. Finally, " Conclusions and future research directions " section presents the conclusions and the challenges envisaged for future work, as well as some future research directions.

Big data stream processing frameworks

Problem definition

The evolution of traditional systems to streaming systems brings new processing and analysis capabilities and challenges. Firstly, we are no longer limited to bounded data, since we can process bounded and unbounded data. We are no longer required to divide or process data into multiple steps. Usually, a single step is enough. Besides, we no longer have to wait long periods for data to be processed. As we receive data, we process and obtain results and insights.

Designing the architecture of an application is an important task that should be well thought out. Since stream processing is one part of a larger system, the first step in deploying this component is to analyze the system requirements and prioritize the tasks. Choosing an SPE is no different. Some of the requirements that might be considered for real-time data stream processing are:

Process large volumes of data;

Integrate data from multiple data sources;

Deal with data with different properties (multi-dimensional data, multiple entities, spatial-temporal dependencies);

Deal with bounded and unbounded data streams;

Deal with unsorted data, or delayed data;

Detect data anomalies;

Computation performance metrics (low latency, high throughput, high availability, high scalability).
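The requirement to handle unsorted or delayed data can be illustrated with event-time windows and a watermark, ideas that several stream processing engines implement in their own ways. The sketch below is plain Python and does not reproduce any engine's API; the function `tumbling_window_sums`, the `(event_time, value)` record layout, and the `allowed_lateness` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, float]  # (event_time_in_seconds, value); hypothetical layout


def tumbling_window_sums(
    events: Iterable[Event], window_size: int = 10, allowed_lateness: int = 5
) -> Tuple[List[Tuple[int, float]], List[Event]]:
    """Sum values per event-time window while tolerating out-of-order arrival.

    A window [start, start + window_size) is emitted once the watermark
    (largest event time seen minus allowed_lateness) reaches its end.
    Events whose window has already closed are returned as dropped.
    """
    open_windows: Dict[int, float] = defaultdict(float)
    emitted: List[Tuple[int, float]] = []
    dropped: List[Event] = []
    max_time = float("-inf")
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, value))  # too late, window is closed
            continue
        open_windows[start] += value
        max_time = max(max_time, event_time)
        watermark = max_time - allowed_lateness
        # Emit every open window that the watermark has passed.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                emitted.append((s, open_windows.pop(s)))

    # End of a bounded input: flush whatever is still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows[s]))
    return emitted, dropped


if __name__ == "__main__":
    arrivals = [(1, 1.0), (3, 2.0), (12, 5.0), (8, 4.0), (17, 1.0), (4, 9.0), (25, 2.0)]
    print(tumbling_window_sums(arrivals))
```

In this example the event at time 8 arrives after the event at time 12 but is still counted, whereas the event at time 4 arrives after its window has closed and is dropped. Raising `allowed_lateness` trades higher latency for fewer dropped events.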

As we stated before, the true value of big data comes from extracting insights from the data and supporting decision-makers. Therefore, efficient and precise algorithms implemented on scalable frameworks are needed to exploit the data's potential. If we consider ML and DL in our analysis, we might add the model performance (error and training time) to the list. In the context of forecasting, metrics such as the Mean Squared Error (MSE) or the \(R^{2}\) score can be useful [ 38 ]. In the case of anomaly detection, we may favor a method with high accuracy, high precision or even high recall [ 16 ]. Since explanations play a crucial role in decision-making, the explainability of the ML model should also be considered [ 78 ].
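For readers less familiar with these metrics, the following minimal sketch computes MSE and the coefficient of determination for a forecast, and precision and recall for binary anomaly labels, using their standard textbook definitions. The function names and the toy values are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - SS_res / SS_tot; equals 1.0 for a perfect forecast."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


def precision_recall(labels: Sequence[int], predictions: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels where 1 marks an anomaly."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    observed = [3.0, 5.0, 7.0, 9.0]   # hypothetical observations
    forecast = [2.0, 5.0, 8.0, 9.0]   # hypothetical predictions
    print(mean_squared_error(observed, forecast), r2_score(observed, forecast))
    print(precision_recall([1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 0]))
```

Precision matters most when false alarms are costly, while recall matters most when a missed anomaly is costly, which is why the preferred metric depends on the application.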

There are several SPEs. Each SPE provides different features and has different properties. Moreover, each one can be more or less adequate according to the application.

The concept of big data has evolved through the years. First, big data started being depicted as a massive amount of data that does not fit in the main memory and requires more sophisticated ways of processing and visualizing [ 1 ]. This definition remains true; however, it is incomplete, since it is always being updated due to the data explosion [ 18 ] that occurred during the last decades. Defining big data is not a simple task because of its complexity. Figure  1 summarizes big data characteristics, challenges and opportunities.

Figure 1: Big data taxonomy, with information collected from [ 2 , 5 , 15 , 17 , 19 ]

As previously mentioned, this massive amount of data is characterized by massive sample size and high dimensionality [ 2 ]. Besides, data can arrive at high velocities and different flow rates [ 19 ]. Moreover, data can come from different sources [ 2 ], making it more complex. Data stream frameworks can receive data from multiple sources and process huge volumes of data, continuously arriving at high velocities. Several factors increase the complexity of dealing with big data, such as the variety of data that can be received [ 19 ]. For example, we can receive numerical values, text, images, sounds, video, or a combination of more than one type. In addition, our data can have a temporal component that brings additional complexity to the problem.

The maximum potential of big data is achieved when we trust the data and take advantage of it by analyzing it. Thus, we must identify inaccurate and uncertain data and deal with it [ 19 ]. In this context, the importance of anomaly detection methods is highlighted, especially the real-time detection of anomalies in data streams to mitigate anomalies as soon as they happen.

Some of these characteristics bring statistical, computational, and visualization problems. For example, we can have algorithm instability, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors regarding statistical problems [ 2 ]. On the other hand, regarding computation problems, we have storage, scalability, and bottleneck problems [ 2 , 79 ]. Finally, visualization can be complex or even impossible when we have high-dimensional data.

Statistical problems can bring dangerous consequences, since they can lead to wrong statistical inferences or false scientific discoveries. For instance, an excellent example of a spurious correlation is the strong correlation (99.79%) between “US spending on science, space, and technology” and “Suicides by hanging, strangulation and suffocation” [ 80 ]. As we can understand, these two phenomena are unrelated. This is a well-known phenomenon in statistics, meaning that correlation does not imply causality. However, spurious correlations can go unnoticed depending on the context and the available knowledge.
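The effect is easy to reproduce. The sketch below builds two synthetic series that share nothing except an upward trend and computes their Pearson correlation. The series, their coefficients, and the helper name `pearson` are made-up assumptions for illustration only; they are not the data from [ 80 ].

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Two unrelated synthetic quantities that both happen to grow over time.
years = list(range(20))
series_a = [100 + 5 * t + 3 * math.sin(t) for t in years]
series_b = [40 + 2 * t + 2 * math.cos(1.7 * t) for t in years]

# The shared trend dominates, so the correlation is very high even though
# neither series was generated from the other.
r = pearson(series_a, series_b)
```

In this sketch the high coefficient comes entirely from the common trend, which is one reason why trending time series are often compared only after the trend has been removed.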

To summarize, big data requires demanding computational resources, and its potential is unlocked through trust in data analysis. Therefore, several streaming frameworks emerged to process large amounts of data with low latency, high throughput, and high scalability. Furthermore, anomaly detection methods are essential in data streams [ 19 ], since streams can suffer security attacks, receive data from malfunctioning devices, or be affected by unexpected events. We can also execute these methods in batch; however, they achieve their full potential when applied to real-time streaming data. Besides, big data allows us to (1) forecast future events, and (2) gain insights and discover relationships in data [ 2 ], both being important tasks, especially for decision-makers.

Big data analysis, forecasting, and anomaly detection are achieved through statistical, machine learning, or deep learning methods. Note that deep learning is a subset of machine learning. Figure 2 depicts Google search trends through the years, by keyword. Big data, machine learning, and deep learning show a growing trend over the years. On the other hand, anomaly detection had a very slight increase. The search trend for forecasting decreases and reaches its peak in 2022; however, other terms, such as prediction, can be used to express forecasting. Note that Google Trends does not allow complex queries.

figure 2

Google search trends over time—data collected from [ 81 ]

We can apply big data to a vast amount of scientific fields. We will present examples of use cases and applications for analyzing time series data streams in real-time. We will also include some examples that benefit from forecasting or anomaly detection methods.

In finance and economics, monitoring the stock market, detecting fraud, and forecasting the performance of assets are highly relevant tasks. In [ 25 ], the authors used Artificial Neural Networks (ANNs) and data streams to forecast stock prices. Monika Arya et al. [ 21 ] proposed a real-time method to detect credit card fraud in data streams, using ANNs with ensemble trees.

Regarding health care and well-being, monitoring patients and having real-time processing capabilities can save lives. For instance, Leo Kobayashi et al. [ 82 ] created a patient monitor system using streams and multimodal data fusion. Their approach allowed them to analyse the data, conduct experiments and develop and apply algorithms. Another interesting application is to monitor and forecast the spread of infectious diseases. For instance, Ensheng Dong et al. [ 83 ] created an interactive dashboard to monitor COVID-19 using data streams.

We can also find works in informatics and communications that benefit from using frameworks to process data streams, for tasks such as monitoring resource usage or detecting security attacks. In [ 4 ], the authors propose an internet traffic monitoring system using streaming frameworks. In [ 7 ], Liu et al. perform resource management and scheduling.

Other main areas with big data characteristics are smart cities and industry 4.0. One significant advantage is that they allow the creation of living labs, creating a space for learning and innovation. We can find several works to monitor and improve urban mobility, monitor water consumption and detect water leaks [ 84 ], and forecast traffic flow [ 38 ], among many others. Leonhard Hennig et al. [ 23 ] built a system to extract mobility and industry events from data streams. Qinglong Dai et al. [ 13 ] used a data stream framework with customized changes to process data from smart grids. Still, in the context of energy systems, Philsy Baban [ 24 ] could process and validate real-time streaming data. In [ 8 ], Sahal et al. discussed streaming frameworks and other tools to perform predictive maintenance for railway transportation and wind energy.

As can be observed, we can find big data applications in several different fields. Society can benefit greatly from big data; however, big data can also be dangerous. For instance, it can serve for mass surveillance and persecution or increase the disparities among minorities. In this article, we will not explore this “dark side” of big data; we hope that governments and institutions use big data for good. In this context, a new research area has emerged: “fair AI”, whose main goal is to combat racism, sexism, and other types of discrimination against minorities [ 85 ].

Real-time data stream processing

We use the term “big data” to define huge amounts of data [ 1 ] and the term “stream” to express data continuously being created and arriving [ 86 ]. This data can come from different sources and have different formats; its processing is not always trivial, especially if it is required in real-time.

Big data applications can have five types of components: data sources, a messaging platform, a processing module, a storage mechanism, and a presentation module. The data sources can be, among others, Internet of Things (IoT) sensors and social networks. These sources of information usually come from users, devices or activity logs. The messaging platform is responsible for sending data between modules. The processing module can be a streaming processing framework to ensure real-time processing capabilities. The storage mechanism can be a database or a data warehouse. Processed data can be presented in different ways, such as a web application, a mobile application, and a technical report. Figure  3 depicts the components of big data applications.

figure 3

Big data applications components

Existing solutions

Fundamental concepts.

In " Problem definition " we mentioned application requirements that can restrict the choice of a SPE. Now, we will discuss fundamental concepts that make it possible to have different data-processing techniques.

We may consider three types of processing: batch-based, stream-based, or event-based [ 87 ]. Batch processing is characterized by processing bounded data streams with a beginning and an ending. On the contrary, stream processing is characterized by the processing of unbounded data streams that do not have a known end. Besides, the data processing is performed as data arrives. If our application requires that we generate alerts or triggers if our data meets some conditions, we have event-based processing.
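The three processing styles can be contrasted on the same small dataset. The following is a minimal single-process sketch; the function names, the sensor readings, and the alert threshold are illustrative assumptions and do not correspond to the API of any SPE.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch: the input is bounded, so the result is computed once at the end."""
    return sum(values) / len(values)


def stream_means(values: Iterable[float]) -> Iterator[float]:
    """Stream: the input has no known end, so the result is updated per arrival."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


def threshold_events(values: Iterable[float], threshold: float) -> Iterator[str]:
    """Event-based: emit an alert only when a sample meets a condition."""
    for index, value in enumerate(values):
        if value > threshold:
            yield f"alert: sample {index} = {value} exceeds {threshold}"


# Hypothetical sensor readings.
readings = [10.0, 20.0, 30.0, 40.0]
final_mean = batch_mean(readings)
running_means = list(stream_means(readings))
alerts = list(threshold_events(readings, threshold=25.0))
```

The batch function needs the complete list, while the two generators would work equally well on a source that never terminates, which is the practical difference between bounded and unbounded processing.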

Concerning the processing model, we also have three types: at most once, at least once, and exactly once. At most once processing does not guarantee that the data is processed or persisted. In case of failure, we may have to deal with missing data. Usually, applications that choose at most once processing are more concerned with latency than reliability. On the contrary, at least once processing may process or persist duplicated data, but it guarantees that every record is processed or persisted at least one time. Lastly, exactly once processing processes or persists each record only once.
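The practical consequences of the three guarantees can be illustrated with a deterministic simulation. This sketch is an assumption-laden toy: the message format `(id, amount)`, the function names, and the choice of which messages are lost or retried are all hypothetical, and deduplication by message id is only one possible way to obtain an exactly-once effect.

```python
from typing import List, Set, Tuple

Message = Tuple[int, int]  # (message id, amount), a hypothetical format


def deliver_at_most_once(messages: List[Message], lost_ids: Set[int]) -> List[Message]:
    """No retries: a message that is lost in transit never arrives."""
    return [m for m in messages if m[0] not in lost_ids]


def deliver_at_least_once(messages: List[Message], retried_ids: Set[int]) -> List[Message]:
    """Retries on a missing acknowledgement: some messages arrive twice."""
    delivered: List[Message] = []
    for message in messages:
        delivered.append(message)
        if message[0] in retried_ids:
            delivered.append(message)  # duplicate caused by the retry
    return delivered


def apply_exactly_once(delivered: List[Message]) -> int:
    """Ignore already-seen ids so each message changes the total only once."""
    seen: Set[int] = set()
    total = 0
    for message_id, amount in delivered:
        if message_id in seen:
            continue
        seen.add(message_id)
        total += amount
    return total


messages = [(1, 10), (2, 20), (3, 30)]
total_at_most_once = sum(a for _, a in deliver_at_most_once(messages, lost_ids={2}))
duplicated = deliver_at_least_once(messages, retried_ids={2})
total_at_least_once = sum(a for _, a in duplicated)
total_exactly_once = apply_exactly_once(duplicated)
```

With the true total being 60, the at-most-once path undercounts, the naive at-least-once path overcounts, and the deduplicating consumer recovers the correct value at the cost of remembering which ids it has seen.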

Window mechanisms specify how to divide the stream in order to aggregate time series data. There are six main processing techniques [ 26 , 88 ]. The most basic mechanism is single-pass processing, in which we process each new sample only once. The remaining five are windowing mechanisms, where a window can be defined as a function of time or of the number of events [ 27 ]. A sliding window is a window with a fixed size that slides over the data stream [ 26 ]. Tumbling windows are non-overlapping sliding windows [ 88 ]. Session windows are similar to tumbling windows; however, in session windows, there is a gap between windows [ 88 ]. In a landmark window, a sample is specified from which the window keeps growing [ 26 ]; this sample can be updated from time to time. Lastly, the damped window mechanism uses a fading mechanism in which the most recent samples have a larger weight and, as time goes by, samples lose their weight [ 26 ]. Figure 4 represents some of these window mechanisms.

figure 4

Processing window mechanisms
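The window mechanisms described above can be sketched over an in-memory list. This is a simplified illustration under our own assumptions: windows are count-based except for the session window, which uses timestamps, the damped window is reduced to an exponentially weighted average, and none of the function names belong to a real framework.

```python
from typing import List, Sequence


def tumbling_windows(samples: Sequence[float], size: int) -> List[List[float]]:
    """Fixed-size, non-overlapping windows; a trailing partial window is dropped."""
    return [list(samples[i:i + size]) for i in range(0, len(samples) - size + 1, size)]


def sliding_windows(samples: Sequence[float], size: int, step: int = 1) -> List[List[float]]:
    """Fixed-size windows that advance by `step`, so consecutive windows may overlap."""
    return [list(samples[i:i + size]) for i in range(0, len(samples) - size + 1, step)]


def session_windows(timestamps: Sequence[float], gap: float) -> List[List[float]]:
    """Start a new window whenever two consecutive timestamps are more than `gap` apart."""
    sessions: List[List[float]] = []
    for ts in timestamps:
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def landmark_window(samples: Sequence[float], landmark: int) -> List[float]:
    """Everything from the landmark position up to the most recent sample."""
    return list(samples[landmark:])


def damped_average(samples: Sequence[float], decay: float) -> float:
    """Weighted average in which a sample of age `a` has weight decay**a (0 < decay < 1)."""
    n = len(samples)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)


data = [1, 2, 3, 4, 5, 6]
tumbling = tumbling_windows(data, size=2)
sliding = sliding_windows(data, size=3)
sessions = session_windows([1, 2, 3, 10, 11, 20], gap=2)
landmark = landmark_window(data, landmark=3)
damped = damped_average([1.0, 3.0], decay=0.5)
```

Note that the tumbling window is the special case of the sliding window in which the step equals the window size, which matches the description of tumbling windows as non-overlapping sliding windows.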

Regarding stream-based processing, its methods can be considered stateless or stateful. If the processing is stateless, then the state is not preserved. We can use stateless processing if we want to know how many people buy a specific game per month. On the other hand, if the state is retained, the processing is stateful. This can be useful to measure how many people buy the game over time in a cumulative manner.
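Taking the game-purchase example, a per-month count needs no memory between months, whereas a cumulative count must carry a running total. The sketch below illustrates the difference; the data layout (one list of buyers per month) and the names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

MonthlyPurchases = Tuple[str, List[str]]  # (month, buyers in that month)


def monthly_counts(purchases: Iterable[MonthlyPurchases]) -> Iterator[Tuple[str, int]]:
    """Stateless: each output depends only on the current month's input."""
    for month, buyers in purchases:
        yield month, len(buyers)


def cumulative_counts(purchases: Iterable[MonthlyPurchases]) -> Iterator[Tuple[str, int]]:
    """Stateful: a running total is retained from one month to the next."""
    total = 0
    for month, buyers in purchases:
        total += len(buyers)
        yield month, total


# Hypothetical purchase log for one game.
log = [
    ("2024-01", ["ana", "bo"]),
    ("2024-02", ["cy"]),
    ("2024-03", ["di", "ed", "flo"]),
]
per_month = list(monthly_counts(log))
running_total = list(cumulative_counts(log))
```

In a distributed engine, the retained state (here the variable `total`) is what must be checkpointed and restored after a failure, which is why stateful processing is more demanding for fault tolerance.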

Data processing frameworks

As aforementioned, we will discuss and compare different SPEs. We selected six SPEs: Apache Spark, Apache Flink, Apache Storm, Apache Heron, Apache Samza, and Amazon Kinesis. Besides, we decided to include Apache Hadoop for historical reasons.

Hadoop Footnote 1 was the first framework that appeared to process large datasets using the MapReduce programming model. Hadoop is very scalable, since it can run on a single machine or be spread across clusters of multiple machines. Moreover, Hadoop takes advantage of distributed storage to improve performance by transmitting the code that processes the data instead of the data [ 89 ]. Besides, Hadoop provides high availability and high throughput. However, it can have efficiency problems when dealing with small files.
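For readers unfamiliar with the MapReduce programming model, the following single-process sketch shows its three conceptual phases on a word-count task. It only mimics the model in plain Python; it is not the Hadoop API, and the function names are our own.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: List[str]) -> Dict[str, int]:
    pairs = [pair for document in documents for pair in map_phase(document)]
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data", "Big streams"])
```

In a real cluster, the map and reduce calls would run in parallel on the machines that hold the data, which is the sense in which the code is sent to the data rather than the data to the code.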

The major drawback of using Hadoop is that it does not support real-time stream processing. To deal with this problem, Apache Spark emerged. Spark Footnote 2 is a framework for processing batch and streaming data, and allows distributed processing. According to Matei Zaharia [ 90 ], the creator of Spark, Spark was designed to respond to three big problems of Hadoop:

Avoid iterative algorithms that make several passes through the data;

Allow real-time streaming;

Allow interactive queries.

Instead of MapReduce, Spark uses Resilient Distributed Datasets (RDDs) that are fault-tolerant and can be processed in parallel. Spark also provides scalability, and since its early releases, it has proved to outperform Hadoop [ 33 ]. Spark is helpful for data science related projects. Besides its main component, Spark provides several libraries for Exploratory Data Analysis (EDA), ML, graph analysis, stream processing and SQL analytics.

Two years later, Apache Flink Footnote 3 and Apache Storm Footnote 4 were created. While Spark uses micro-batches for stream processing, Flink and Storm can perform stream processing natively. Flink can process batch and streaming data. In Flink, we can process streams with specific temporal requirements. For example, we may consider processing time or event time. In the case of event time, Flink can deal with delayed events. Besides, Flink provides watermark support, allowing a trade-off between latency and completeness of data. Storm and Flink are similar frameworks, which has generated some discussion regarding their differences [ 91 ], of which the following stand out:

Storm only allows stream processing;

They both can perform stream processing with low latency;

The API offered by Flink is more high-level and provides more functionalities;

They have different strategies to provide fault tolerance (Storm employs record-level acknowledgements while Flink uses a snapshot algorithm).
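The trade-off between latency and completeness that watermarks introduce in event-time processing can be shown with a small simulation. This sketch is a conceptual illustration only and does not use Flink's API: the watermark is assumed to trail the largest event time seen so far by a fixed bound, and the function and parameter names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (event time, payload), listed in arrival order


def split_by_watermark(
    events: List[Event], max_out_of_orderness: int
) -> Tuple[List[Event], List[Event]]:
    """Classify each arriving event as on time or late with respect to the watermark."""
    max_seen = float("-inf")
    on_time: List[Event] = []
    late: List[Event] = []
    for event_time, payload in events:
        # The watermark asserts that no older event is still expected.
        watermark = max_seen - max_out_of_orderness
        if event_time < watermark:
            late.append((event_time, payload))
        else:
            on_time.append((event_time, payload))
        max_seen = max(max_seen, event_time)
    return on_time, late


# Events arrive out of order: the event with time 3 shows up after time 10.
arrivals = [(1, "a"), (2, "b"), (10, "c"), (9, "d"), (3, "e")]
on_time_small, late_small = split_by_watermark(arrivals, max_out_of_orderness=2)
on_time_large, late_large = split_by_watermark(arrivals, max_out_of_orderness=8)
```

A small bound lets results be emitted sooner but treats the delayed event as late, while a large bound accepts it at the price of waiting longer before a window can be considered complete.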

Storm is a good streaming framework; however, its capabilities to scale are not enough for more demanding applications. Besides this, debugging and managing Storm can be complex tasks. In this context, Apache Heron Footnote 5 emerges, as the successor of Storm. A paper published in 2015 [ 34 ] announced this transition at Twitter.

Apache Samza Footnote 6 is a framework that provides real-time processing, event-based applications, and Extract, Transform and Load capabilities. Samza provides several APIs and presents an architecture similar to Hadoop, but instead of using MapReduce, it has the Samza API, and it uses Kafka instead of the Hadoop Distributed File System.

Finally, Amazon Kinesis Footnote 7 is the only framework presented in this article that does not belong to the Apache Software Foundation. Kinesis is actually a set of four frameworks instead of a data stream framework. In this work, we refer to Amazon Kinesis to talk about the Kinesis Data Streams framework to simplify. Kinesis can easily be integrated with Flink.

Elaboration

The processing frameworks present different properties, which makes it challenging to choose one framework without understanding the differences. Therefore, we should choose the framework that best suits our use case.

Firstly, we decided to look at the nature of each framework. Although several frameworks belong to the Apache ecosystem, most were not created by Apache. They were later integrated into the Apache family through The Apache Incubator. Footnote 8 Table 2 summarizes the nature of each one of them.

Table 3 contains information about the processing techniques available (batch or stream) and the delivery of events (at most once, at least once, exactly once). As we already mentioned, Hadoop only provides batch processing. Storm and Heron only provide stream processing. All other frameworks offer both batch and stream processing. However, Spark provides stream processing through micro-batches. Regarding the delivery of events, most frameworks guarantee that the events are processed exactly once or at least once. Heron offers three types of delivery: the two mentioned above and at most once. Besides, these frameworks provide drivers for several programming languages, the most popular being Python and Java.

Performance-wise, some experiments have been conducted to compare the different SPEs. Note that it is difficult to make a fair comparison due to the lack of experiments that contemplate all frameworks. Therefore, we started with a performance comparison of the frameworks based on the information available in the official documentation of each framework, which is presented in Table 4. One of the most important characteristics when choosing a framework is the ability to process information in real-time. However, there needs to be a consensual definition of what real-time means. Gomes et al. [ 3 ] focused their study on this concept in the context of data streams and big data. According to the authors, there are different intents when discussing real-time. For example, real-time could mean an immediate response. Another possibility is the guarantee of low latency: some consider the time within which the system should answer, while others refer to the time within which the system must answer. For a fairer comparison, in this discussion, we will focus on real-time as the property of having low latency.

Most of these frameworks present low latency, which is good when we are processing significant amounts of data and want to process it in real-time. Hadoop is the only one that is considered to have high latency. All frameworks present high throughput and high scalability. However, Hadoop only allows scaling vertically. Regarding fault tolerance mechanisms, all frameworks deal with fault tolerance.

After this initial study, we looked for works that compare some of these frameworks to make an unbiased comparison. In 2015, Namiot et al. [ 10 ] made an introductory comparison of the properties of Storm, Spark, Samza, Apache Flume, Apache Kafka, Amazon Kinesis, and IBM InfoSphere.

Besides the noticeable differences between Hadoop and Spark, Pooja Choudhary et al. [ 28 ] conducted some experiments to compare these two frameworks. They concluded that Spark uses more memory than Hadoop, needing less execution time. However, the authors of [ 35 ] mentioned that Spark might not be the best framework if our application requires low latency and high throughput.

The authors of [ 29 ] compared the performance of Spark, Flink, and Storm under saturation conditions (the maximum streaming load that the frameworks could support without delay). This comparison is insightful if we want to choose the best framework for a data-intensive application. Flink presented the highest saturation level, while Storm had the worst CPU usage. When failure recovery mechanisms are activated, Storm's performance decreases by 50%, while Flink's only decreases by 10%. Nevertheless, Spark can surpass Flink if we are not concerned with latency.

Inoubli et al. [ 12 ] performed experiments in which they compared Spark, Storm, Flink, and Samza. They observed that Spark achieved the worst processing rates compared to the other three frameworks. Flink and Samza were more efficient, especially when messages had a more considerable size. Flink CPU usage was lower; however, Flink could outperform Storm if the CPU consumption allowed was increased. Spark requires more RAM and less disk access, is slower in processing messages, and uses less bandwidth.

In 2019, in the context of a smart city, Hamid Nasiri et al. [ 30 ] evaluated three different frameworks: Spark, Flink and Storm. They started by fixing the input rate and compared the performance with two nodes versus eight nodes. With two nodes, Flink presented the lowest latency and the highest throughput. Flink delivered a similar performance with a slightly higher throughput with eight nodes. The improvements on Spark and Storm were more significant, but Flink was still the best. On the other hand, Spark had the worst latency. With eight nodes, Spark presented a similar throughput to Flink; however, it reached the highest throughput peaks. They analyzed the impact of changing the input rate and the number of worker nodes. We can conclude that the performance of Flink is similar to Storm, even when using no acknowledgements in Storm. The most significant difference is the throughput in which Flink is better than Storm; however, Storm seems to scale better, and with eight nodes, Spark is the best of them all in terms of throughput. At last, they measured CPU and network utilization. Flink achieved the lowest CPU utilization and the highest network utilization. Storm and Spark achieved similar performances.

Kolajo et al. [ 9 ] compared 19 tools and technologies for data streaming; however, only half of them supported both batch and streaming processing. In another work [ 31 ], in 2019, the authors compared the performance of five stream processing systems: Storm, Flink, Spark, Kafka Stream, and Hazelcast Jet. Storm has the best memory consumption and presents good stability. Flink presents the lowest latency. Spark presents the highest throughput and has good compatibility with ML libraries.

In 2020, LinkedIn published a post [ 92 ] showing some improvements performed on Samza. These improvements provided Samza with more considerable throughput capabilities when compared with Flink.

Later in 2021, Krzysztof Wecel et al. [ 32 ] selected six frameworks, but chose to focus their analysis on comparing Spark and Flink. They concluded that Spark is more memory efficient while Flink is more CPU efficient. The authors also mentioned that, while performing their experiment, they found a problem that led to delays in the implementation phase: missing detailed documentation. We were already aware of this problem, especially with Flink.

Heron brings an extensive set of advantages to users that want to transition from Storm to a more scalable framework. The API available for Heron is compatible with the one available for Storm. Heron requires fewer resources (less CPU usage) and provides performance improvements (more throughput and less latency). Currently, Heron is in the incubating phase at The Apache Incubator [ 93 ].

To understand the frameworks' popularity, we decided to perform two experiments using Scopus. Footnote 9 These experiments were performed on August 9th, 2022. In the first experiment, we try to understand the popularity of the different frameworks over the years. In the second experiment, we try to perceive how many publications exist when we consider different criteria.

For the first experiment, we created three queries. The example below contains the queries for the Apache Hadoop framework. Similar queries were performed for the remaining frameworks.

apache w/ hadoop

TITLE-ABS-KEY (apache w/ hadoop)

TITLE-ABS-KEY (apache w/ hadoop) AND (LIMIT-TO (SUBJAREA,"COMP") OR LIMIT-TO (SUBJAREA,"ENGI"))

Firstly, we perform a general search using only the framework’s name. Secondly, we restrict the search to papers with the framework’s name in the title, abstract or keywords. Lastly, we limit the subject area to papers published in the engineering field or computer science.

Figure  5 contains the results of the first query. We can visualize that Hadoop is the dominant framework in the first years. This happens because Hadoop is the oldest, and most frameworks did not exist or did not belong to the Apache Software Foundation at the time. The most popular streaming framework is Spark. Following Spark, the popularity of Flink and Storm is similar. Finally, Heron, Samza and Kinesis are the most unpopular frameworks.

figure 5

Data processing frameworks: Popularity over the years first query

Figure 6 presents the results of the second query. When we restrict the search to papers with the framework’s name in the title, abstract or keywords, we can see that Spark is the dominant framework. This might indicate that most papers that mention Hadoop only mention it because it was the first relevant framework. Another explanation is that Hadoop is the framework used in the study, but was not the subject of the study. Therefore, this second query is more focused on studying the framework, not its usage.

figure 6

Data processing frameworks: Popularity over the years second query

What we can visualize in Fig.  6 is intensified in Fig.  7 when we limit the subject area. Figure  7 shows the results of the third query.

figure 7

Data processing frameworks: Popularity over the years third query

In the second experiment, we evaluate the number of papers that considered stream-related concepts and algorithms. Our goal is to understand, for instance, how many articles that addressed forecasting also addressed streams. We started with basic queries. First, query 4 helps to understand how many papers contain the word forecast or other words derived from it, such as forecasting or forecasts. Query 5 helps to understand how many papers include anomaly detection or outlier detection. Query 6 is an additional query to understand how many papers also include ML or DL.

(anomaly w/ detection) OR (outlier w/ detection)

(machine w/ learning) OR (deep w/ learning)

Figure 8 contains the results for forecasting terms. We start by performing query 4, which we named forecast-term. Then, we also included query 6, which we called ML-term. Then, we selected only the papers that had both terms in the title, abstract, or keywords. The next step was to limit by subject area (as in the first experiment). Then, we limited the search to the years from 2012 until 2023. Finally, we included different terms in order to answer our initial question. We separated the term stream and the several frameworks. As we can see, we started with 1.5 million papers, and in the end, only 1 thousand had terms related to streams.

figure 8

Forecast versus Stream

Figure 9 contains the results for anomaly detection. The only change relative to Fig. 8 was the initial term, which in this case was the anomaly detection term, query 5. As we can see, we began with 136 thousand papers, and in the end, only five hundred had terms related to streams.

figure 9

Anomaly detection versus Stream

Only a few papers consider streaming and forecasting concepts because a forecasting algorithm, to provide the most benefits, should perform real-time forecasting. Moreover, given the complexity of implementing a stream-based forecasting system and a forecasting algorithm, researchers can be more focused on developing one of these tasks when they publish their work. The same can be applied to anomaly detection concepts and other applications.

Choosing the best SPE is a critical engineering task that should consider the following. Foremost, only Spark, Flink, Samza and Kinesis allow both batch and stream processing. In addition, Spark and Flink do not allow missing or repeated data. However, Heron enables the choice of any delivery. Flink is the best framework for data-intensive applications, presenting the lowest latency and highest throughput. However, Storm seems to scale better. More recent results suggest that Samza has a better throughput than Flink, and that Heron scales better than Storm. Nevertheless, Spark and Storm are the most popular stream frameworks. Heron is a good substitute for Storm, allowing Storm users to transition easily.

Analysis and algorithms for streaming data

In the scope of ML, several tasks can take advantage of streaming technologies, such as regression, classification, clustering, forecasting, anomaly detection, and frequent pattern mining.

In this section, we decided to focus on two tasks related to time series: forecasting (" Time series forecasting " section) and anomaly detection (" Anomaly detection " section).

Time series forecasting

Humans are constantly trying to predict the future. Thousands of years ago, when we started counting time, we also began to make predictions. One of the questions that most haunts humanity, and that several societies, religions and individuals have tried to guess, is when doomsday will occur. Several dates have been proposed over the years, but until now, none of them has been correct.

Forecasting is a prediction task in which we try to predict future events accurately. To make good forecasts, we should understand the phenomenon and the causes that influence the phenomenon. We can use historical data, events that may occur, and other information that may contribute to the forecasting task [ 94 ]. For example, when we look at the sky and see dark clouds, we can (most certainly) guess it will rain.

Depending on the domain of our problem, we should look for data other than the phenomenon’s data. For instance, Wasiat Khan et al. [ 45 ] used data from social media and financial news to predict the stock market’s performance. However, the authors recognize that not all stocks are influenced the same way. Besides, the authors noticed that some stocks were more influenced by social media news, while others were more influenced by financial news. Ahmad Ali et al. [ 46 ] considered the spatial-temporal dependencies and several temporal patterns (current, daily, and weekly) to predict crowd flows. The use of external factors, such as weather conditions, holidays, and events was also crucial in this context.

Forecasting tasks can be classified as short, medium or long-term forecasts [ 94 ]. These terms are used if the forecast is made for the near future, medium future or distant future. For instance, we may want to predict how many people will travel to a tourist destination in the next hour, in the next week, or in the next year.

Usually, short-term forecasting is only relevant in a short interval. Therefore, we might benefit from performing the forecasting in real-time or near-real-time. On the other hand, medium and long-term forecasting is not needed immediately; therefore, we can perform them offline.

Forecasting problems use time series data. A time series is the evolution of one variable (or more) over time. A time series is a stochastic process, time-indexed, thus making statistical properties relevant. When we only have one variable, we have a univariate time series. We have a multivariate time series when we have more than one variable. Usually, when we are in the presence of a univariate time series, we call it a time series [ 94 , 95 , 96 ].

figure 10

Forecasting methods

Time series data is similar to streaming data, since we can look at the data arriving from the streaming with a temporal component and a sequential order. However, this does not mean that all data from streams are time series, even though they might have a timestamp associated.

There are three types of forecasting methods: historical, statistical, and ML. Historical methods only look at past values to forecast new ones. The most popular historical method is the Historical Average (HA), which can be found in the literature [ 47 ], especially as a baseline. Statistical methods are mainly based on the Auto Regressive (AR) method. They are also usually considered as a baseline. For instance, we can find Auto Regressive Integrated Moving Average (ARIMA) in [ 47 ]. ML approaches, particularly DL, have been highlighted more recently, and several novel methods have been proposed.
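The two baseline families can be sketched in a few lines. The code below implements the Historical Average and a first-order autoregressive model, AR(1), fitted by ordinary least squares. It is an illustrative simplification under our own assumptions (a full ARIMA model adds differencing and moving-average terms), and the function names and sample series are hypothetical.

```python
from typing import List, Sequence, Tuple


def historical_average(history: Sequence[float]) -> float:
    """HA baseline: forecast the mean of all past observations."""
    return sum(history) / len(history)


def fit_ar1(history: Sequence[float]) -> Tuple[float, float]:
    """Fit y_t = c + phi * y_{t-1} by least squares; returns (c, phi)."""
    previous = history[:-1]
    current = history[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    var_prev = sum((p - mean_prev) ** 2 for p in previous)
    if var_prev == 0:
        raise ValueError("cannot fit AR(1) on a constant series")
    phi = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(previous, current)) / var_prev
    intercept = mean_curr - phi * mean_prev
    return intercept, phi


def ar1_forecast(history: Sequence[float], steps: int = 1) -> List[float]:
    """Iterate the fitted recurrence to forecast `steps` values ahead."""
    intercept, phi = fit_ar1(history)
    forecasts: List[float] = []
    last = history[-1]
    for _ in range(steps):
        last = intercept + phi * last
        forecasts.append(last)
    return forecasts


# Hypothetical series that follows y_t = 7 + 0.5 * y_{t-1} exactly.
series = [10.0, 12.0, 13.0, 13.5, 13.75]
ha_forecast = historical_average(series)
ar_forecast = ar1_forecast(series, steps=1)[0]
```

On this series the HA forecast lags behind the most recent values, while AR(1) continues the recurrence; this illustrates why both are useful as simple reference points against which ML and DL methods are compared.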

We can find forecasting works related to energy consumption and pricing. Bangzhu Zhu et al. [ 48 ] used an SVM-based method with mixture kernels to forecast carbon prices. Razak Olu-Ajayi et al. [ 49 ] predicted the energy consumption of buildings using ML and DL models, and concluded that ANNs are more suitable to make predictions. In [ 50 ], Zhang et al. proposed a Multi-view Ensemble Learning Model (MELM) to forecast traffic of base stations to save power in cellular networks. Their multi-view methods had four views: a temporal, a spatial, one dedicated to events, and the last view for residual information. For the temporal component, they analyzed the auto-correlation, the trend, and the seasonality of the data, and they used the Seasonal Auto Regressive Integrated Moving Average (SARIMA) to perform short and long-term forecasting. They used a spreading model based on a grid system to observe and capture the spatial dependencies. The authors observed that different regions have a different number of users, and they observed mobility transferring from nearby regions. They used a decision tree to capture the influence of events, since they cause changes in traffic. They considered four types of events (holidays, weather, concerts, and news). For the residual information, they used a top-k regression tree.

Another explored topic is related to traffic. To predict the flow of crowds, a framework called Forecasting Citywide Crowd Flows (FCCF) is proposed in [ 51 ]. The authors used human mobility data, weather conditions, and road network data. First, they divided the human mobility data into two edge flow categories: inflow and outflow. Besides that, they split the region into small regions. Then, they decomposed the flows into seasonal, trend, and residual components and built a model for each one of them. For the seasonal and trend components, they created an Intrinsic Gaussian Markov-Random-Field (IGMRF) for each component. For the residual, they explored the spatiotemporal dependence and built a spatiotemporal residual model that uses a Bayesian network. Then, the models were aggregated to give the final prediction.

The authors in [ 52 ] proposed a multi-view network model called Deep Multi-View Spatial-Temporal Network (DMVST-NET). They observed that, in most cases, including a region that presents a weak correlation with the region we want to predict decreases the model’s performance. Usually, distant regions are less correlated, but this is not always true. Considering this all, the authors chose to create three views: a view for the temporal component, another for the spatial component (they only consider nearby regions), and the last one for semantic relations (the regions are far away but present similar demands). They used a Long Short-Term Memory (LSTM) for the temporal component, a Convolutional Neural Network (CNN) for the spatial component, and a Graph Neural Network (GNN) to capture the semantic relations.

In [ 53 ], the Multi-Task Learning Temporal Convolutional Neural Network (MTLTCNN) method is proposed for short-term passenger demand prediction. The authors started by using a Spatio-Temporal Dynamic Time Warping (ST-DTW) algorithm to select the most relevant features. The proposed method is multi-task, having one task per region. Each task comprises a Temporal Convolutional Neural Network (TCNN), and the tasks share information between them, namely spatiotemporal correlations. Ahmad Ali et al. [ 46 ] proposed an ANN model based on graphs and convolution to predict crowd flows. In addition, they explored spatiotemporal dependencies and external factors. The authors of [ 47 ] proposed an architecture that uses graphs, convolution, and recurrency to forecast traffic. Their approach explores spatiotemporal dependencies.

In 2018, Spyros Makridakis et al. [ 39 ] published the results of the fourth edition of a forecasting accuracy competition (M4). This competition discouraged the submission of complicated ML models that required high computational capabilities. Most of the best methods were combinations of statistical models. One of the best methods was a hybrid of ML (a Recurrent Neural Network (RNN)) and a statistical approach (exponential smoothing). In contrast, some of the submitted methods were based only on ML and achieved the worst results. Later, in 2021, Spyros Makridakis et al. [ 40 ] published the results of the fifth edition of the competition (M5). The goal was to predict the sales of a retail company represented by 42,840 time series. Most of the competitors used methods based on LightGBM, a tree-based ML method. In the top five, the first two methods were essentially weighted combinations of LightGBM models, the third was a weighted combination of Neural Network (NN) models, the fourth was a non-recursive LightGBM, and the fifth was a recursive LightGBM.
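To make the idea of combining forecasts concrete, the following sketch blends a simple exponential smoothing forecast with a naive (last value) forecast using a weighted average. It does not reproduce any competition entry; the function names, the smoothing factor alpha, and the weights are assumptions chosen for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast of simple exponential smoothing.

    alpha in (0, 1] controls how quickly old observations are forgotten.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def combine_forecasts(forecasts, weights):
    """Weighted average of several forecasts for the same target."""
    total = sum(weights)
    return sum(f * w for f, w in zip(forecasts, weights)) / total


history = [10.0, 12.0, 14.0]
smoothed = simple_exponential_smoothing(history, alpha=0.5)
naive = history[-1]
# Equal weights here; in practice weights could be tuned on validation data.
combined = combine_forecasts([smoothed, naive], weights=[1.0, 1.0])
```

The same combine_forecasts helper applies unchanged when the individual forecasts come from more complex models, which is the spirit of the weighted ensembles mentioned above.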

A literature review on deep learning methods for financial time series forecasting [ 43 ] presented eight commonly used methods: Deep Multi-Layer Perceptrons (DMLPs), RNNs, LSTMs, CNNs, Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Autoencoders (AEs), and Deep Reinforcement Learning (DRL). The authors highlight the preference of researchers for using RNNs, especially LSTMs, with financial data. However, as the authors identified, CNNs and graph-based networks still need to be explored with financial data. Meanwhile, Masini et al. [ 44 ] reviewed both ML and DL methods for financial forecasting; their main focus was NNs, regression trees, bagging, and regression. The authors emphasized the use of ML models (including DL models) in the presence of large datasets.

Table 5 summarizes the reviewed works. In this comparison, we did not include the survey articles. As we can see, different approaches have emerged over the last years for both ML and DL methods. Most of the authors used more than one metric to compare the methods.

Figure 10 contains some of the methods used in forecasting tasks. Forecasting may be accomplished using statistical methods or DL-based methods. Both approaches have advantages and disadvantages. Depending on the context, statistical methods may be more advantageous than DL methods and vice-versa. Statistical methods are explainable and usually more robust in short-term predictions, where they present the best results. However, they are usually not suitable for long-term forecasting.

ANNs present some disadvantages. The first problem is to find the weights of the inputs. The training process will update the model weights in each iteration; however, the optimization algorithm used may not lead to the minimum error or loss and can lead to overfitting. The training process can be extensive, making its adoption difficult in some contexts. ANNs also require a lot of information and great computational power when compared with statistical methods.

One of the big problems with ML algorithms is the lack of transparency, especially in ANNs. ANNs are often seen as “black boxes” [ 41 ]. In order to solve this issue, a new topic has emerged in the scope of ML: explainable models. Explainability plays a crucial role in the understanding of a particular problem. A correct prediction is not always enough, since it can have real impacts in terms of security, ethics, mismatched objectives, privacy, and others [ 42 ].

The most relevant advantage of using DL-based methods is the possibility of working with multidimensional data, in some cases exploring the relationships between space, time, and other factors that may influence the prediction. Statistical methods may be more beneficial for forecasting with real-time stream processing, since they are computationally lighter. However, we should consider the application requirements, the data, and the trade-off between execution time and other performance metrics.

We decided to compare the types of methods used in forecasting in terms of popularity over the years, highlighting the most recent years. Figure 11 shows the number of documents retrieved from Scopus over the years when we perform queries such as the example query Q7. As we can observe, the use of machine learning and deep learning for forecasting increased over the last few years.

figure 11

Evolution of the popularity of type of methods regarding forecasting over the years. ML stands for Machine Learning, DL for Deep Learning, SL for Statistical Learning, and RL for Reinforcement Learning

TITLE-ABS-KEY ( forecasting AND ( “machine learning” OR “ml”) )

We also compared the methods used. Figure 12 contains the obtained results. Before 2018, the most frequently mentioned type of method was the ANN. This can happen for two reasons: either the generic ANN architecture was used, or the authors used the term when referring to a specific type of ANN. For instance, an LSTM is a type of ANN. Over the years, we can observe an increase in the use of LSTMs, CNNs, RNNs, AEs, and GNNs. The popularity of Deep Learning methods does not mean that the statistical ones are not important. It just reflects the evolution and trends of research methods.

figure 12

Evolution of the popularity of methods regarding forecasting over the years. ANN stands for Artificial Neural Network, SVM for Support Vector Machine, LSTM for Long Short-Term Memory, A&S for ARIMA and SARIMA, RNN for Recurrent Neural Network, CNN for Convolutional Neural Network, FNN for Feedforward Neural Network, AE for Autoencoder, GNN for Graph Neural Network, DBN for Deep Belief Network, LGBM for LightGBM, HA for Historical Average and RBM for Restricted Boltzmann Machines

Forecasting is an essential task when working with time series datasets. We can have different forecasting horizons, such as short, medium, and long-term. We can apply this type of method to different contexts and use cases.

Classical methods are mainly based on auto-regression. Regarding machine learning methods, LightGBM proved to be efficient. In the case of deep learning methods, the most used are based on LSTMs, CNNs, AEs, and GNNs. As we discussed, all methods have their positive and negative aspects. In addition, the application and the purpose of the problem can make the choice of technique easier.
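As a minimal illustration of auto-regression, the sketch below fits a first-order autoregressive model, x[t] = intercept + phi * x[t-1], by ordinary least squares and iterates it to forecast several steps ahead. It is a simplified example rather than a full ARIMA or SARIMA implementation, and the names fit_ar1 and forecast_ar1 are hypothetical.

```python
def fit_ar1(series):
    """Fit x[t] = intercept + phi * x[t-1] by ordinary least squares."""
    previous, current = series[:-1], series[1:]
    mean_prev = sum(previous) / len(previous)
    mean_curr = sum(current) / len(current)
    covariance = sum(
        (p - mean_prev) * (c - mean_curr) for p, c in zip(previous, current)
    )
    variance = sum((p - mean_prev) ** 2 for p in previous)
    phi = covariance / variance  # assumes the series is not constant
    intercept = mean_curr - phi * mean_prev
    return intercept, phi


def forecast_ar1(series, steps):
    """Forecast `steps` future values by feeding each forecast back in."""
    intercept, phi = fit_ar1(series)
    forecasts, last = [], series[-1]
    for _ in range(steps):
        last = intercept + phi * last
        forecasts.append(last)
    return forecasts
```

Because each forecast is fed back as the next input, errors accumulate with the horizon, which is consistent with the observation that such models are better suited to short-term forecasting.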

  • Anomaly detection

An anomaly occurs when something unexpected happens. We can observe anomalies in our daily lives, for instance, a cold day (as if it were winter) in the middle of the summer. Anomalies can also be visualized in data. If we look at a chart of the daily temperatures measured in the summer, we would see a point that is anomalous in relation to the others. However, not all anomalies are expressed in the same way. Anomalies can be classified by their nature as point anomalies, contextual anomalies, or collective anomalies [ 54 ].

A point anomaly can be identified when we compare it with the rest of the data [ 55 ]. Remembering the “cold day in the middle of the summer” example, if we only had data from the summer, we would have a point anomaly if the observed temperature was very different from all others.

A contextual anomaly happens in a particular context [ 55 ]. If we had data from the entire year, we would observe that in the winter there are low temperatures. The point is anomalous because it happens in the summer and not in the winter. This is similar to a conditional anomaly, which depends on the context to be classified as an anomaly.

A collective anomaly is a collection of points that are considered anomalous when compared with the remaining dataset [ 56 ]. An example is an abrupt change in temperature during the summer. Another example is a day with an unusually small variation in temperature: temperatures are higher in the summer, but they normally still fluctuate throughout the day. From the examples above, we can conclude that anomalies can also be present in time series, and can be isolated outliers or abrupt changes.

There are several challenges associated with the detection of anomalies. Anomalies are not always known or noticeable, and it is difficult to define what should be considered anomalous. Besides that, there is always some noise associated with anomaly detection. As an example, network attacks can change, evolve, and adapt, which makes this a complex problem in which false negatives and false positives in the analysis can have negative impacts [ 54 , 57 ].

Anomalies are known for being rare in datasets. It is because of that property that they are considered anomalies. Consequently, if our goal is to identify the anomalies in a dataset, we will have a class imbalance problem. This problem is amplified when dealing with big data. There are three different types of techniques to address this issue [ 16 ]:

Data-based techniques: using sampling methods, we can reduce the level of imbalance;

Algorithm-based techniques: we can reduce the bias towards the majority group;

Hybrid techniques.
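A data-based technique can be sketched with random undersampling of the majority class. This is only one of several possible sampling strategies and is not taken from [ 16 ]; the function name random_undersample and its ratio and seed parameters are hypothetical.

```python
import random


def random_undersample(records, labels, majority_label, ratio=1.0, seed=0):
    """Keep all minority samples and a random subset of the majority class.

    ratio is the desired number of majority samples per minority sample.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    keep = min(len(majority), int(round(ratio * len(minority))))
    kept = sorted(minority + rng.sample(majority, keep))
    return [records[i] for i in kept], [labels[i] for i in kept]


# 95 normal points (label 0) and 5 anomalies (label 1).
records = list(range(100))
labels = [0] * 95 + [1] * 5
balanced_records, balanced_labels = random_undersample(records, labels, majority_label=0)
```

Undersampling discards information from the majority class, so it trades a lower level of imbalance for a smaller training set; oversampling the minority class is the complementary option.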

Some learners, such as decision trees and logistic regression, can have difficulties identifying anomalies, especially in highly imbalanced datasets [ 16 ]. Moreover, some evaluation metrics are more sensitive to imbalanced classes than others. Metrics such as accuracy and error rate are highly affected and are not recommended. Other metrics, such as precision and recall, can be used, but they alone are usually not enough [ 16 ]. The F-measure is a weighted harmonic mean of precision and recall and is widely used in this context.
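The effect of class imbalance on these metrics can be checked with a small worked example. The sketch below computes accuracy, precision, recall, False Positive Rate, and the F-measure from binary labels; the function name imbalance_metrics and the convention of returning 0.0 for undefined ratios are assumptions made for this illustration.

```python
def imbalance_metrics(y_true, y_pred, positive=1, beta=1.0):
    """Return common metrics for a binary anomaly-detection task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    def ratio(numerator, denominator):
        # Undefined ratios (0/0) are reported as 0.0 to keep the sketch simple.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called the True Positive Rate
    b2 = beta ** 2
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "fpr": ratio(fp, fp + tn),
        # Weighted harmonic mean of precision and recall.
        "f_measure": ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }


# 10 anomalies among 1000 points, and a detector that never raises an alarm.
y_true = [1] * 10 + [0] * 990
always_normal = [0] * 1000
metrics = imbalance_metrics(y_true, always_normal)
```

For this detector the accuracy is 0.99 even though the recall and the F-measure are both 0, which is why accuracy alone is misleading on imbalanced data.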

To detect anomalies, statistical learning approaches can be used. In [ 58 ], Hochenbaum et al. used seasonal decomposition to extract the trend and the seasonal components. They proposed two techniques: the seasonal Extreme Studentized Deviate (ESD), and the seasonal hybrid ESD, which additionally uses the median and the Median Absolute Deviation (MAD).
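The role of the median and the Median Absolute Deviation can be illustrated with a robust score. The sketch below is not the seasonal hybrid ESD test of [ 58 ]; it only shows how the median and MAD replace the mean and standard deviation when scoring points. The function name mad_scores and the threshold of 3 are assumptions, while 1.4826 is the usual constant that makes the MAD comparable to the standard deviation for normally distributed data.

```python
from statistics import median


def mad_scores(values):
    """Score each value by its robust distance from the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half of the values are identical; no robust scale exists.
        return [0.0 for _ in values]
    scale = 1.4826 * mad
    return [abs(v - center) / scale for v in values]


# Daily summer temperatures with one cold day.
temperatures = [20, 21, 19, 20, 22, 21, 20, 5]
scores = mad_scores(temperatures)
anomalies = [i for i, s in enumerate(scores) if s > 3.0]
```

Unlike the mean and standard deviation, the median and MAD are barely affected by the cold day itself, so the anomalous point cannot mask its own detection.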

Some methods to detect anomalies are signal-based. In [ 59 ], the authors could effectively detect sharp increases in the local variance using wavelet filters and pseudo-spline filters. In [ 97 ], Muñoz et al. used correlation-based techniques.

Principal Component Analysis (PCA) based approaches were explored in [ 60 , 61 ]. In [ 60 ], the authors applied wavelet transformations to network traffic data. Then, they applied PCA to extract the nature of the anomalies. Finally, they used a mapping function to detect the anomalies. In [ 62 ], the authors could also localize the source of anomalies by incorporating the network structure information into the PCA model. They used the Karhunen-Loève Expansion to get spatial and temporal correlations. In [ 61 ], the authors proposed the use of Minimum Covariance Determinant (MCD) with Robust Principal Component Analysis (rPCA). Since outliers can contaminate the subspace estimated by PCA, rPCA tackles this issue, at a computational cost. The use of MCD helps to ease the computational cost.
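The basic idea behind PCA-based detection can be sketched as follows: project the data onto its main direction of variation and score each point by how far it lies from that direction. This is a simplified, non-robust illustration that keeps a single principal axis found by power iteration; it is not the method of [ 60 ], [ 61 ], or [ 62 ], and the function name pca_residual_scores is hypothetical.

```python
import math


def pca_residual_scores(rows, iterations=100):
    """Score each row by its squared distance from the first principal axis."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Scatter matrix, proportional to the sample covariance matrix.
    scatter = [
        [sum(c[a] * c[b] for c in centered) for b in range(d)] for a in range(d)
    ]
    # Power iteration: multiply and normalize repeatedly to approximate the
    # top eigenvector. Assumes the start vector is not orthogonal to it.
    axis = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(scatter[a][b] * axis[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in product))
        if norm == 0:
            break
        axis = [v / norm for v in product]
    scores = []
    for c in centered:
        projection = sum(c[j] * axis[j] for j in range(d))
        # Squared norm minus squared projection = squared residual distance.
        scores.append(sum(v * v for v in c) - projection ** 2)
    return scores


# Points on the line y = x plus one point that breaks the correlation.
rows = [(t, t) for t in range(-5, 6)] + [(0, 4)]
scores = pca_residual_scores(rows)
suspect = scores.index(max(scores))
```

Note that the outlier is included when the axis is estimated, so it slightly tilts the fitted direction; this contamination is the weakness that robust variants such as rPCA are meant to address.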

We can also find in the literature approaches based on the k-Nearest Neighbors (KNN) algorithm. In [ 63 ], the authors proposed a Transductive Confidence Machine (TCM) with KNN for online anomaly detection. They could improve their results by applying instance selection. The authors of [ 22 ] compared Naive Bayes, Support Vector Machine (SVM), and decision trees, and Naive Bayes is also used in [ 36 ].
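A common way to use nearest neighbors for anomaly detection is to score each point by the distance to its k-th nearest neighbor, so that isolated points receive high scores. The sketch below illustrates this generic idea only; it is not the TCM-KNN method of [ 63 ], and the function name knn_scores is hypothetical.

```python
import math


def knn_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Brute force, O(n^2) distance computations; assumes len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(points, k=2)
most_isolated = scores.index(max(scores))
```

The quadratic cost of this brute-force version is one reason why instance selection, as mentioned above, matters for online detection.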

figure 13

Anomaly detection methods

Several works are based on ANNs, such as [ 37 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 ]. In [ 64 ], motivated by the high rate of false alarms and the need to improve accuracy, Hussain et al. proposed a FeedForward Neural Network (FNN) to detect anomalies in cellular networks. They accomplished high accuracy and a low False Positive Rate (FPR), proving the usefulness of FNNs. The work in [ 65 ] used an LSTM to detect network attacks through the anomalies present in data. They tested two types of baselines. In the first one, they only used clean data (without anomalies) to train the model. In the second one, they used dirty data (with anomalies) to train the model. They concluded that the dirty baseline models achieved the best results, which is useful when no completely clean dataset exists. In [ 66 ], the Parallel Subagging-GRU-based network (PSB-GRU) method is proposed. The model uses a Gated Recurrent Unit (GRU) network for long-term dependencies, a genetic algorithm to optimize the training process, the Spark platform to improve training efficiency, and subagging to improve the model’s generalization.

In [ 67 ], the performance of several RNN-based methods is compared. The authors concluded that LSTM networks achieve the best results in terms of performance; however, the other RNN-based networks also achieved good results. The works in [ 65 , 66 , 67 ] allow us to conclude that sequential NNs are suitable for detecting anomalies. In [ 68 ], a CNN-based method is proposed to extract spatio-temporal and other features from data, with a threshold-based separation method to detect anomalies. The architecture had four convolutional layers. The authors achieved good results; however, they recognize that they need a more lightweight method to perform online anomaly detection. The authors of [ 74 ] also used a CNN. In some cases, they achieved better performance with architectures with one convolutional layer than with two or three convolutional layers. However, their methods did not outperform RNN-based methods. The authors of [ 69 ] explored how CNNs can fail. They concluded that a one-pixel attack can mislead CNN-based networks. Increasing the number of layers (three convolution and three pooling layers) and retraining contribute to a more robust detection.

The authors of [ 70 ] proposed an ensemble method based on RBM and SVM. They tested their method in real time and achieved good performance. The work in [ 71 ] used Self-Organizing-Maps (SOM). Their model is computationally light, presenting results with a very low delay. In [ 37 ], the authors also use SOM, combined with k-medoids, and they perform a two-step clustering. They achieved fast online detection and a multistage decision to distinguish different anomalies. In [ 72 ], an autoencoder-based method with convolution is proposed. The use of autoencoders allowed the authors to capture non-linear correlations between features. The use of convolution also reduced the training time. In [ 73 ], stacked autoencoders are used with a one-class classification model. The use of autoencoders allows the selection of the most relevant features and the reduction of data dimensionality.

Other approaches, such as those proposed in [ 75 , 76 ], are tensor-based. A tensor is a structure similar to a multidimensional array with three or more dimensions. When we have one dimension, we have a vector (denoted as a first-order tensor), and if we have two dimensions, we have a matrix (second-order tensor) [ 76 ]. In [ 75 ], the proposed method is based on tensor decomposition. The method in [ 76 ] is based on tensor factorization and performs anomaly detection in two phases. Tensor-based methods are useful when we have complex data with high-dimensional orders.
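The notion of tensor order can be illustrated with nested lists. The sketch below is only a data-structure example and does not implement the decomposition or factorization methods of [ 75 , 76 ]; the function name tensor_order and the time, region, and feature axes are hypothetical.

```python
def tensor_order(value):
    """Return the nesting depth of a regular, non-empty nested list.

    A scalar has order 0, a vector order 1, a matrix order 2, and so on.
    """
    order = 0
    while isinstance(value, list):
        order += 1
        value = value[0]
    return order


vector = [1.0, 2.0, 3.0]                      # first-order tensor
matrix = [[1.0, 2.0], [3.0, 4.0]]             # second-order tensor
# Third-order tensor indexed as traffic[time][region][feature].
traffic = [[[0.0 for _ in range(4)] for _ in range(3)] for _ in range(2)]
orders = [tensor_order(vector), tensor_order(matrix), tensor_order(traffic)]
```

Keeping time, region, and feature as separate axes, instead of flattening them into one matrix, preserves the multi-way structure that tensor-based methods exploit.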

Table 6 summarizes the reviewed works for anomaly detection. We can see different types of methods. In anomaly detection, one of the most important tasks is the fair evaluation of the methods. Usually, in an anomaly detection problem, we have the class imbalance problem, as mentioned above. To better compare the evaluation metrics used, we decided to create Table 7. False Positive Rate, True Positive Rate, and accuracy are the most frequently used metrics. The class imbalance highly affects the accuracy, and this metric should not be used, especially without other metrics.

Figure 13 contains some methods used in anomaly detection. Traditional statistical methods can fail in the face of big data and data with several dimensions. On the other hand, ML methods can deal with high dimensionality. Supervised methods achieve good performance in detecting anomalies [ 6 ]. However, they have problems detecting new, unseen types of anomalies. Unsupervised methods are good at detecting new anomalies [ 14 ].

Figure  14 contains the evolution of the popularity of the type of anomaly detection methods over the last few years. The use of statistical methods decreased while the use of deep learning methods increased. Currently, most of the published works use machine learning and deep learning. Similarly, Fig.  15 contains the evolution of the popularity of techniques over the last few years. As we can observe, methods such as PCA, SVM, and KNN lost popularity over time, while the focus evolved to the use of CNNs, RNNs, LSTMs and AE.

figure 14

Evolution of the popularity of type of methods regarding anomaly detection over the years. ML stands for Machine Learning, DL for Deep Learning, SL for Statistical Learning, and RL for Reinforcement Learning

figure 15

Evolution of the popularity of methods regarding anomaly detection over the years. ESD stands for Extreme Studentized Deviate, PCA for Principal Component Analysis, rPCA for Robust Principal Component Analysis, MCD for Minimum Covariance Determinant, KNN for k-Nearest Neighbors, NB for Naive Bayes, SVM for Support Vector Machine, DT for Decision Trees (and includes random forest), ANN for Artificial Neural Network, FNN for Feedforward Neural Network, LSTM for Long Short-Term Memory, RNN for Recurrent Neural Network, CNN for Convolutional Neural Network, SOM for Self-Organizing-Maps, RBM for Restricted Boltzmann Machines, AE for Autoencoder and DBSCAN for Density-Based Spatial Clustering of Applications with Noise

As can be concluded from the above, several methods can be applied to anomaly detection. Regardless of the chosen method, we must take into consideration some problems associated with the nature of the data. The first class of problems to which the methods can be vulnerable is data poisoning attacks. In this context, a data poisoning attack may lead the model to consider as normal something that was abnormal in the training phase. In [ 77 ], the authors deal with this problem by separating the training phase from the learning process.

Different methods should be considered when dealing with anomalies in data streams, since no single method is able to detect all types of anomalies. Furthermore, data streams are very susceptible to data poisoning attacks, since supervised models do not know the most recent data and need to be regularly updated. Moreover, we should evaluate, once more, the trade-off between execution time and other performance metrics. Finally, in the context of big data and ML, we should take into account that we are dealing with a class imbalance problem.

Conclusions and future research directions

Data by itself may have no value for organizations and society. However, we can transform data into knowledge and improve decision-making through analysis. Nevertheless, dealing with big data can be a complex problem, especially when the data keeps growing over time. In this context, Stream Processing Engines emerged. They are an essential tool for processing big data in real-time. In this work, we presented some frameworks to process data streams in real-time, and we compared them. Spark is not a native streaming framework since it uses micro-batches, which brings some performance issues. However, Spark is the most popular framework, with several exploratory data analysis and machine learning modules. On the other hand, Flink can deal better with data-intensive applications, while Heron seems to scale better.

We also presented approaches to deal with common big data problems, such as forecasting and anomaly detection in real-time. Applying these algorithms in real time can be very beneficial for organizations. For instance, the use of forecasting can help organizations to optimize the use of services and resources. On the other hand, using anomaly detection algorithms can prevent or minimize problems, such as network attacks, before they happen. Finally, we discussed statistical, machine learning, and deep learning approaches. Statistical methods are more explainable and computationally lighter. On the other hand, machine learning methods deal better with complex data and can forecast over longer horizons.

As future research directions, we would like to suggest real-time analytics and algorithms over big data time series streams. Namely, having time series related machine learning and deep learning algorithms take advantage of online learning for providing real-time analysis, forecasts, and anomaly detection. Another possible research direction is the development of explainable methods focused on time-series.

Availability of data and materials

Not applicable.

https://hadoop.apache.org/ .

https://spark.apache.org/ .

https://flink.apache.org/ .

https://storm.apache.org/ .

https://heron.apache.org/ .

https://samza.apache.org/ .

https://aws.amazon.com/kinesis/ .

https://incubator.apache.org/ .

www.scopus.com.

Cox M, Ellsworth D. Application-controlled demand paging for out-of-core visualization. In: Proceedings of the 8th Conference on Visualization ’97. VIS ’97, pp. 235–244. IEEE Computer Society Press, Washington, DC, USA, 1997. https://doi.org/10.1109/VISUAL.1997.663888

Fan J, Han F, Liu H. Challenges of Big Data analysis. Natl Sci Rev. 2014;1(2):293–314. https://doi.org/10.1093/nsr/nwt032 .

Gomes EHA, Plentz PDM, Rolt CRD, Dantas MAR. A survey on data stream, big data and real-time. Int J Netw Virtual Organ. 2019;20(2):143–67. https://doi.org/10.1504/IJNVO.2019.097631 .

Zhou B, Li J, Wang X, Gu Y, Xu L, Hu Y, Zhu L. Online internet traffic monitoring system using spark streaming. Big Data Mining Anal. 2018;1(1):47–56. https://doi.org/10.26599/BDMA.2018.9020005 .

Thudumu S, Branch P, Jin J, Singh J. A comprehensive survey of anomaly detection techniques for high dimensional big data. J Big Data. 2020. https://doi.org/10.1186/s40537-020-00320-x .

Es-Samaali H, Outchakoucht A, Benhadou S, Mounnan O, Abou El Kalam A. Anomaly detection for big data security: a benchmark. In: 2021 the 3rd International Conference on Big Data Engineering and Technology (BDET). BDET 2021, Association for Computing Machinery, New York, NY, USA 2021, pp. 35–39. https://doi.org/10.1145/3474944.3474950

Liu X, Buyya R. Resource management and scheduling in distributed stream processing systems: a taxonomy, review, and future directions. ACM Comput Surv. 2020. https://doi.org/10.1145/3355399 .

Sahal R, Breslin JG, Ali MI. Big data and stream processing platforms for industry 4.0 requirements mapping for a predictive maintenance use case. J Manuf Syst. 2020;54:138–51. https://doi.org/10.1016/j.jmsy.2019.11.004 .

Kolajo T, Daramola O, Adebiyi A. Big data stream analysis: a systematic literature review. J Big Data. 2019;6(1):47. https://doi.org/10.1186/s40537-019-0210-7 .

Namiot D. On big data stream processing. Int J Open Info Technol. 2015;3(8):48–51.

Wu Y. Network big data: a literature survey on stream data mining. J Softw. 2014. https://doi.org/10.4304/jsw.9.9.2427-2434 .

Inoubli W, Aridhi S, Mezni H, Maddouri M, Mephu Nguifo E. A comparative study on streaming frameworks for big data. In: Ziviani A, Hara CS, Ogasawara ES, de Macêdo JAF, Valduriez P, editors. LADaS@VLDB. Rio de Janeiro: CEUR-WS.org; 2018. p. 17–24.

Dai Q, Qian J. A distributed stream data processing platform design and implementation in smart cities. In: 2020 IEEE 3rd International Conference on Electronic Information and Communication Technology (ICEICT), 2020, pp. 688–693. https://doi.org/10.1109/ICEICT51264.2020.9334234

Ahmed M, Choudhury N, Uddin S. Anomaly detection on big data in financial markets. In: 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017, pp. 998–1001

L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM. Machine learning with big data: challenges and approaches. IEEE Access. 2017;5:7776–97. https://doi.org/10.1109/ACCESS.2017.2696365 .

Johnson J, Khoshgoftaar T. Survey on deep learning with class imbalance. J Big Data. 2019;6:27. https://doi.org/10.1186/s40537-019-0192-5 .

Luo Y, Du X, Sun Y. Survey on real-time anomaly detection technology for big data streams. In: 2018 12th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), 2018, pp. 26–30. https://doi.org/10.1109/ICASID.2018.8693216

Zhu Y, Zhong XY. Data explosion, data nature and dataology. Brain Inform. 2009;5819:147–58. https://doi.org/10.1007/978-3-642-04954-5_25 .

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manag. 2015;35(2):137–44. https://doi.org/10.1016/j.ijinfomgt.2014.10.007 .

Trifunovic N, Milutinovic V, Salom J, Kos A. Paradigm shift in big data supercomputing: dataflow vs. controlflow. J Big Data. 2015. https://doi.org/10.1186/s40537-014-0010-z .

Arya M, Sastry GH. Deal-’deep ensemble algorithm’ framework for credit card fraud detection in real-time data stream with google tensorflow. Smart Sci. 2020;8(2):71–83. https://doi.org/10.1080/23080477.2020.1783491 .

Zhao S, Chandrashekar M, Lee Y, Medhi D. Real-time network anomaly detection system using machine learning. In: 2015 11th International Conference on the Design of Reliable Communication Networks (DRCN), 2015, pp. 267–270. https://doi.org/10.1109/DRCN.2015.7149025

Hennig L, Thomas P, Ai R, Kirschnick J, Wang H, Pannier J, Zimmermann N, Schmeier S, Xu F, Ostwald J, Uszkoreit H. Real-time discovery and geospatial visualization of mobility and industry events from large-scale, heterogeneous data streams. In: Proceedings of ACL-2016 System Demonstrations. Association for Computational Linguistics, Berlin, Germany 2016, pp. 37–42. https://doi.org/10.18653/v1/P16-4007. https://aclanthology.org/P16-4007

Baban P. Pre-processing and data validation in IOT data streams. In: Proceedings of the 14th ACM International Conference on Distributed and Event-Based Systems. DEBS ’20. Association for Computing Machinery, New York, NY, USA 2020, pp. 226–229. https://doi.org/10.1145/3401025.3406443

Kovacs A, Bogdandy B, Toth Z. Predict stock market prices with recurrent neural networks using NASDAQ data stream, 2021, pp. 449–454. https://doi.org/10.1109/SACI51354.2021.9465634

Bahri M, Bifet A, Gama J, Gomes HM, Maniu S. Data stream analysis: foundations, major tasks and tools. WIREs Data Min Knowl Discov. 2021;11(3):1405. https://doi.org/10.1002/widm.1405 .

Namiot D, Sneps-Sneppe M, Pauliks R. On data stream processing in IOT applications. In: Galinina O, Andreev S, Balandin S, Koucheryavy Y, editors. Internet of things, smart spaces, and next generation networks and systems. Cham: Springer; 2018. p. 41–51.

Choudhary P, Garg K. Comparative analysis of spark and hadoop through imputation of data on big datasets. In: 2021 IEEE Bombay Section Signature Conference (IBSSC), 2021, pp. 1–6. https://doi.org/10.1109/IBSSC53889.2021.9673461

Karakaya Z, Yazici A, Alayyoub M. A comparison of stream processing frameworks. In: 2017 International Conference on Computer and Applications (ICCA), 2017, pp. 1–12. https://doi.org/10.1109/COMAPP.2017.8079733

Nasiri H, Nasehi S, Goudarzi M. Evaluation of distributed stream processing frameworks for IOT applications in smart cities. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0215-2 .

Shahverdi E, Awad A, Sakr S. Big stream processing systems: an experimental evaluation. In: 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), 2019, pp. 53–60. https://doi.org/10.1109/ICDEW.2019.00-35

Wecel K, Szmydt M, Stróżyna M. Stream processing tools for analyzing objects in motion sending high-volume location data. Bus Inf Syst. 2021;1:257–68. https://doi.org/10.52825/bis.v1i.41 .

Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10. USENIX Association, USA 2010, p. 10

Kulkarni S, Bhagat N, Fu M, Kedigehalli V, Kellogg C, Mittal S, Patel JM, Ramasamy K, Taneja S. Twitter heron: stream processing at scale. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15. Association for Computing Machinery, New York, NY, USA 2015, pp. 239–250. https://doi.org/10.1145/2723372.2742788

Salloum S, Dautov R, Chen X, Peng PX, Huang JZ. Big data analytics on apache spark. Int J Data Sci Anal. 2016;1:145–64. https://doi.org/10.1007/s41060-016-0027-9 .

Ding N, Gao H, Bu H, Ma H. Radm:real-time anomaly detection in multivariate time series based on bayesian network. In: 2018 IEEE International Conference on Smart Internet of Things (SmartIoT), 2018, pp. 129–134. https://doi.org/10.1109/SmartIoT.2018.00-13

Qin X, Tang S, Chen X, Miao D, Wei G. Sqoe kqis anomaly detection in cellular networks: fast online detection framework with hourglass clustering. China Commun. 2018;15(10):25–37. https://doi.org/10.1109/CC.2018.8485466 .

Almeida A, Brás S, Oliveira I, Sargento S. Vehicular traffic flow prediction using deployed traffic counters in a city. Futur Gener Comput Syst. 2022;128:429–42. https://doi.org/10.1016/j.future.2021.10.022 .

Makridakis S, Spiliotis E, Assimakopoulos V. The m4 competition: results, findings, conclusion and way forward. Int J Forecast. 2018;34(4):802–8. https://doi.org/10.1016/j.ijforecast.2018.06.001 .

Makridakis S, Spiliotis E, Assimakopoulos V. M5 accuracy competition: results, findings, and conclusions. Int J Forecast. 2022. https://doi.org/10.1016/j.ijforecast.2021.11.013 .

Karlaftis MG, Vlahogianni EI. Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transp Res Part C Emerg Technol. 2011;19(3):387–99. https://doi.org/10.1016/j.trc.2010.10.004 .

Carvalho DV, Pereira EM, Cardoso JS. Machine learning interpretability: a survey on methods and metrics. Electronics (Switzerland). 2019. https://doi.org/10.3390/electronics8080832 .

Sezer OB, Gudelek MU, Ozbayoglu AM. Financial time series forecasting with deep learning: a systematic literature review: 2005–2019. Appl Soft Comput. 2020;90: 106181. https://doi.org/10.1016/j.asoc.2020.106181 .

Masini RP, Medeiros MC, Mendes EF. Machine learning advances for time series forecasting. J Econ Surv. 2023;37(1):76–111. https://doi.org/10.1111/joes.12429 .

Khan W, Ghazanfar MA, Azam MA, Karami A, Alyoubi K, Alfakeeh A. Stock market prediction using machine learning classifiers and social media news. J Ambient Intell Humaniz Comput. 2022. https://doi.org/10.1007/s12652-020-01839-w .

Ali A, Zhu Y, Zakarya M. Exploiting dynamic spatio-temporal graph convolutional neural networks for citywide traffic flows prediction. Neural Netw. 2022;145:233–47. https://doi.org/10.1016/j.neunet.2021.10.021 .

Guo K, Hu Y, Qian Z, Liu H, Zhang K, Sun Y, Gao J, Yin B. Optimized graph convolution recurrent neural network for traffic prediction. IEEE Trans Intell Transp Syst. 2021;22(2):1138–49. https://doi.org/10.1109/TITS.2019.2963722 .

Zhu B, Ye S, Wang P, Chevallier J, Wei Y-M. Forecasting carbon price using a multi-objective least squares support vector machine with mixture kernels. J Forecast. 2022;41(1):100–17.

Article   MathSciNet   Google Scholar  

Olu-Ajayi R, Alaka H, Sulaimon I, Sunmola F, Ajayi S. Building energy consumption prediction for residential buildings using deep learning and other machine learning techniques. J Build Eng. 2022;45: 103406. https://doi.org/10.1016/j.jobe.2021.103406 .

Zhang S, Zhao S, Yuan M, Zeng J, Yao J, Lyu MR, King I. Traffic prediction based power saving in cellular networks: a machine learning method. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’17. Association for Computing Machinery, New York, NY, USA 2017) https://doi.org/10.1145/3139958.3140053

Hoang MX, Zheng Y, Singh AK. FCCF: Forecasting citywide crowd flows based on big data. In: Proceeding of the 24rd ACM International Conference on Advances in Geographical Information Systems (ACM SIGSPATIAL 2016). ACM SIGSPATIAL 2016, 2016. https://www.microsoft.com/en-us/research/publication/forecasting-citywide-crowd-flows-based-big-data/

Yao H, Wu F, Ke J, Tang X, Jia Y, Lu S, Gong P, Ye J, Li Z. Deep multi-view spatial-temporal network for taxi demand prediction. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 2588–2595. AAAI Press, 2018. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16069

Zhang K, Liu Z, Zheng L. Short-term prediction of passenger demand in multi-zone level: temporal convolutional neural network with multi-task learning. IEEE Trans Intell Transp Syst. 2020;21(4):1480–90. https://doi.org/10.1109/TITS.2019.2909571 .

Junior G, Rodrigues J, Carvalho L, Al-Muhtadi J, Proença M. A comprehensive survey on network anomaly detection. Telecommun Syst. 2019. https://doi.org/10.1007/s11235-018-0475-8 .

Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv 2009. https://doi.org/10.1145/1541880.1541882

Ahmed M, Naser Mahmood A, Hu J. A survey of network anomaly detection techniques. J Netw Comput Appl. 2016;60:19–31. https://doi.org/10.1016/j.jnca.2015.11.016 .

Zhu M, Ye K, Xu C-Z. Network anomaly detection and identification based on deep learning methods. In: Luo M, Zhang L-J, editors. Cloud computing–CLOUD 2018. Cham: Springer; 2018. p. 219–34.

Hochenbaum J, Vallis OS, Kejariwal A. Automatic anomaly detection in the cloud via statistical learning. CoRR abs/1704.07706 (2017). 1704.07706

Barford, P., Kline, J., Plonka, D., Ron, A.: A signal analysis of network traffic anomalies. In: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurment. IMW ’02. Association for Computing Machinery, New York, NY, USA 2002, pp. 71–82. https://doi.org/10.1145/637201.637210

Jiang D, Yao C, Xu Z, Qin W. Multi-scale anomaly detection for high-speed network traffic. Trans Emerg Telecommun Technol. 2015;26(3):308–17. https://doi.org/10.1002/ett.2619 .

Matsuda T, Morita T, Kudo T, Takine T. Traffic anomaly detection based on robust principal component analysis using periodic traffic behavior. IEICE Trans Commun E100.B(5), 2017, pp. 749–761 . https://doi.org/10.1587/transcom.2016EBP3239 .

Jiang R, Fei H, Huan J. A family of joint sparse PCA algorithms for anomaly localization in network data streams. IEEE Trans Knowl Data Eng. 2013;25(11):2421–33. https://doi.org/10.1109/TKDE.2012.176 .

Li Y, Lu T, Guo L, Tian Z, Qi L. Optimizing network anomaly detection scheme using instance selection mechanism. In: GLOBECOM 2009–2009 IEEE Global Telecommunications Conference, 2009, pp. 1–7. https://doi.org/10.1109/GLOCOM.2009.5425547

Hussain B, Du Q, Zhang S, Imran A, Imran MA. Mobile edge computing-based data-driven deep learning framework for anomaly detection. IEEE Access. 2019;7:137656–67. https://doi.org/10.1109/ACCESS.2019.2942485 .

Radford BJ, Apolonio LM, Trias AJ, Simpson JA. Network traffic anomaly detection using recurrent neural networks. CoRR 2018.

Tao X, Peng Y, Zhao F, Yang C, Qiang B, Wang Y, Xiong Z. Gated recurrent unit-based parallel network traffic anomaly detection using subagging ensembles. Ad Hoc Netw. 2021. https://doi.org/10.1016/j.adhoc.2021.102465 .

Ravi V, Kp S, Poornachandran P. Evaluation of recurrent neural network and its variants for intrusion detection system (IDs). Int J Inf Syst Model Des. 2017;8:43–63. https://doi.org/10.4018/IJISMD.2017070103 .

Nie L, Li Y, Kong X. Spatio-temporal network traffic estimation and anomaly detection based on convolutional neural network in vehicular ad-hoc networks. IEEE Access. 2018;6:40168–76. https://doi.org/10.1109/ACCESS.2018.2854842 .

Ogawa, Y., Kimura, T., Cheng, J.: Vulnerability assessment for machine learning based network anomaly detection system. In: 2020 IEEE International Conference on Consumer Electronics–Taiwan (ICCE-Taiwan), 2020, pp. 1–2 . https://doi.org/10.1109/ICCE-Taiwan49838.2020.9258068

Garg S, Kaur K, Kumar N, Rodrigues JJPC. Hybrid deep-learning-based anomaly detection scheme for suspicious flow detection in SDN: a social multimedia perspective. IEEE Trans Multimedia. 2019;21(3):566–78. https://doi.org/10.1109/TMM.2019.2893549 .

Sarasamma ST, Zhu QA, Huff J. Hierarchical kohonenen net for anomaly detection in network security. IEEE Trans Syst Man Cybern Syst. 2005;35(2):302–12. https://doi.org/10.1109/TSMCB.2005.843274 .

Chen Z, Yeo C, Lee B-S, Lau C. Autoencoder-based network anomaly detection. 2018 Wireless Telecommunications Symposium (WTS), 2018, p. 1–5. https://doi.org/10.1109/WTS.2018.8363930 .

Dai S, Yan J, Wang X, Zhang L. A deep one-class model for network anomaly detection. IOP Conf Ser Mater Sci Eng. 2019;563: 042007. https://doi.org/10.1088/1757-899X/563/4/042007 .

Kwon, D., Natarajan, K., Suh, S., Kim, H., Kim, J.: An empirical study on network anomaly detection using convolutional neural networks. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), 2018, pp. 1595–1598. https://doi.org/10.1109/ICDCS.2018.00178 .

Kasai H, Kellerer W, Kleinsteuber M. Network volume anomaly detection and identification in large-scale networks based on online time-structured traffic tensor tracking. IEEE Trans Netw Serv Manag. 2016;13(3):636–50. https://doi.org/10.1109/TNSM.2016.2598788 .

Xie K, Li X, Wang X, Xie G, Wen J, Cao J, Zhang D. Fast tensor factorization for accurate internet anomaly detection. IEEE/ACM Trans Netw. 2017;25(6):3794–807. https://doi.org/10.1109/TNET.2017.2761704 .

Moustafa N, Choo K-KR, Radwan I, Camtepe S. Outlier dirichlet mixture mechanism: adversarial statistical learning for anomaly detection in the fog. IEEE Trans Inf Forensics Secur. 2019;14(8):1975–87. https://doi.org/10.1109/TIFS.2018.2890808 .

Zhou J, Gandomi AH, Chen F, Holzinger A. Evaluating the quality of machine learning explanations: a survey on methods and metrics. Electronics. 2021. https://doi.org/10.3390/electronics10050593 .

Buhl H, Roeglinger M, Moser F, Heidemann J. Big data: a fashionable topic with(out) sustainable relevance for research and practice? Bus Inf Syst Eng. 2013;5:65–9. https://doi.org/10.1007/s12599-013-0249-5 .

Vigen T. Spurious correlations. 2022. https://www.tylervigen.com/spurious-correlations . Accessed 7 Sep 2022.

Google: google trends. 2022. https://trends.google.com/trends/explore . Accessed 07 Sept 2022.

Kobayashi L, Oyalowo A, Agrawal U, Chen S-L, Asaad W, Hu X, Loparo KA, Jay GD, Merck DL. Development and deployment of an open, modular, near-real-time patient monitor datastream conduit toolkit to enable healthcare multimodal data fusion in a live emergency department setting for experimental bedside clinical informatics research. IEEE Sensors Lett. 2019;3(1):1–4. https://doi.org/10.1109/LSENS.2018.2880140 .

Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533–4. https://doi.org/10.1016/s1473-3099(20)30120-1 .

Schultz W, Javey S, Sorokina A. Smart water meters and data analytics decrease wasted water due to leaks. J Am Water Works Assoc. 2018;110(11):24–30. https://doi.org/10.1002/awwa.1124 .

Feuerriegel S, Dolata M, Schwabe G. Fair AI: challenges and opportunities. Bus Inf Syst Eng. 2020. https://doi.org/10.1007/s12599-020-00650-3 .

Confluent: what is streaming data? How it works, examples, and use cases. 2022. https://www.confluent.io/learn/data-streaming/ . Accessed 30 Aug 2022.

Flink A. Stateful computations over data streams. 2022. https://flink.apache.org/ . Accessed 28 Jun 2022.

Flink A. Windows: Apache Flink. 2022. https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/ h. Accessed 28 Jul 2022.

Lam C. Hadoop in action. 1st ed. USA: Manning Publications Co.; 2010.

of the ACM C: Apache spark: a unified engine for big data processing on VIMEO. 2022. https://vimeo.com/185645796 . Accessed 21 Jul 2022.

Hueske F. What is/are the main difference(s) between Flink and Storm? Stack Overflow. https://stackoverflow.com/a/30719138 . Accessed 28 Jun 2022.

Zhang Y. Building a better and faster Beam Samza runner: LinkedIn engineering. https://engineering.linkedin.com/blog/2020/building-a-better-and-faster-beam-samza-runner . Accessed 30 Jun 2022.

Foundation TAS. Apache Heron. A realtime, distributed, fault-tolerant stream processing engine. 2022. https://heron.apache.org/ . Accessed 30 Aug 2022.

Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. 3rd ed. Melbourne: OTexts; 2021.

Pal A, Prakash P. Practical time series analysis: master time series data processing, visualization, and modeling using python. UK: Packt Publishing; 2017.

Brownlee J. Introduction to time series forecasting with python: how to prepare data and develop models to predict the future. Machine Learning Mastery, San Juan, Puerto Rico, 2017. https://books.google.pt/books?id=-AiqDwAAQBAJ

Muñoz P, Barco R, Serrano I, Gómez-Andrades A. Correlation-based time-series analysis for cell degradation detection in son. IEEE Commun Lett. 2016;20(2):396–9. https://doi.org/10.1109/LCOMM.2016.2516004 .

Download references

Acknowledgements

This work is supported by FEDER, through POR LISBOA 2020 and COMPETE 2020 of the Portugal 2020 Project CityCatalyst POCI-01-0247-FEDER-046119. Ana Almeida acknowledges the Doctoral Grant from Fundação para a Ciência e Tecnologia (2021.06222.BD). Susana Brás is funded by national funds, European Regional Development Fund, FSE, through COMPETE2020 and FCT, in the scope of the framework contract foreseen in the numbers 4, 5 and 6 of the article 23, of the Decree-Law 57/2016, of August 29, changed by Law 57/2017, of July 19.

Author information

Authors and affiliations.

Instituto de Telecomunicações, Aveiro, Portugal

Ana Almeida, Susana Sargento & Filipe Cabral Pinto

Departamento de Eletrónica, Telecomunicações e Informática, Universidade de Aveiro, Aveiro, Portugal

Ana Almeida, Susana Brás & Susana Sargento

IEETA, DETI, LASI, Universidade de Aveiro, Aveiro, Portugal

Susana Brás

Altice Labs, Aveiro, Portugal

Filipe Cabral Pinto

Contributions

Conceptualization: AA; Data curation: AA; Formal analysis: AA; Investigation: AA; Methodology: AA; Software: AA; Validation: AA, SB; Visualization: AA; Writing—original draft: AA; Funding acquisition: SS; Project administration: SS; Supervision: SB, SS, FCP; Writing—review & editing: SB, SS, FCP. All authors read the final manuscript.

Corresponding author

Correspondence to Ana Almeida.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article.

Almeida, A., Brás, S., Sargento, S. et al. Time series big data: a survey on data stream frameworks, analysis and algorithms. J Big Data 10, 83 (2023). https://doi.org/10.1186/s40537-023-00760-1

Received: 12 October 2022

Accepted: 08 May 2023

Published: 28 May 2023

DOI: https://doi.org/10.1186/s40537-023-00760-1

  • Time series
  • Stream processing engines
  • Forecasting
  • Machine learning

IMAGES

  1. Latest Big Data Trends and Predictions to watch out for in 2023

  2. Top 20 Latest Big Data Trends for 2021

  3. Latest Big Data Analytics Research Proposal [Novel Ideas]

  4. Recent Big data trends in 2021

  5. 10 Latest Trends in Big Data Analytics for 2023

  6. 140 Excellent Big Data Research Topics to Consider

VIDEO

  1. Research in National Security

  2. Machine Learning vs AI vs Deep Learning

  3. Using Big Data to Revolutionize Sustainability

  4. From Stability to Differential Privacy

  5. Algorithmic High-Dimensional Geometry I

  6. Researcher Stories: Using Big Data to advise international development

COMMENTS

  1. Big Data Research

    Big Data for Medicine and Healthcare. Edited by Francesco Piccialli, Nik Bessis, Gwanggil Jeon, Fabio Giampaolo. 4 July 2022. View all issues. Read the latest articles of Big Data Research at ScienceDirect.com, Elsevier's leading platform of peer-reviewed scholarly literature.

  2. Top 20 Latest Research Problems in Big Data and Data Science

    Even though Big data is in the mainstream of operations as of 2020, there are still potential issues or challenges the researchers can address. Some of these issues overlap with the data science field. In this article, the top 20 interesting latest research problems in the combination of big data and data science are covered based on my personal experience (with due respect to the ...

  3. 15 years of Big Data: a systematic literature review

    Big Data is still gaining attention as a fundamental building block of the Artificial Intelligence and Machine Learning world. Therefore, a lot of effort has been pushed into Big Data research in the last 15 years. The objective of this Systematic Literature Review is to summarize the current state of the art of the previous 15 years of research about Big Data by providing answers to a set of ...

  4. A new theoretical understanding of big data analytics capabilities in

    Big Data Analytics (BDA) usage in the industry has increased markedly in recent years. As a data-driven tool to facilitate informed decision-making, the need for BDA capability in organizations is recognized, but few studies have communicated an understanding of BDA capabilities in a way that can enhance our theoretical knowledge of using BDA in the organizational domain.

  5. Home page

    The Journal of Big Data publishes open-access original research on data science and data analytics. Deep learning algorithms and all applications of big data are welcomed. Survey papers and case studies are also considered. The journal examines the challenges facing big data today and going forward including, but not limited to: data capture ...

  6. Big Data Research

    A Security Management Framework for Big Data in Smart Healthcare. Parsa Sarosh, Shabir A. Parah, G. Mohiuddin Bhat, Khan Muhammad. Article 100225.

  7. Big Data Research

    Read the latest articles of Big Data Research at ScienceDirect.com, Elsevier's leading platform of peer-reviewed scholarly literature.

  8. Frontiers in Big Data

    This innovative journal focuses on the power of big data - its role in machine learning, AI, and data mining, and its practical application from cybersecurity to climate science and public health. ... Submit your research. Start your submission and get more impact for your research by publishing with us. Author guidelines.

  9. Systematic analysis of healthcare big data analytics for efficient care

    The underlying research on finding the key features will not only help in integrating big data in healthcare domain, but it will also assist in findings new gateways for future research directions ...

  10. Big data applications: overview, challenges and future

    The big data market is expected to have remarkable growth globally, with revenue projections ranging to USD 473.6 Billion by 2030, reflecting a growth rate of 12.7% from 2022 to 2030 (Research and Consulting 2023). This substantial growth underscores the increasing recognition of big data's critical role across industries and sectors.

  11. COVID‐19 Pandemic in the New Era of Big Data Analytics: Methodological

    Introduction. The spread of the COVID-19 global pandemic has generated an exponentially mounting and extraordinary volume of data that can be harnessed to improve our understanding of big data management research as well as exemplifying the necessity among scholars, practitioners and policymakers for a better and deeper understanding of a range of analytical tools that could be utilized to ...

  12. Five Key Trends in AI and Data Science for 2024

    5. Data, analytics, and AI leaders are becoming less independent. This past year, we began to notice that increasing numbers of organizations were cutting back on the proliferation of technology and data "chiefs," including chief data and analytics officers (and sometimes chief AI officers).

  13. Frontiers

    1. Introduction. In the last twenty years, we have witnessed an unprecedented and ever-increasing trend in data production. Hilbert and López (2011) date the rise of this phenomenon back to 2002, marking the onset of the digital age. Indeed, the transition from analog to digital storage devices dramatically augmented the capacity for data accumulation, thereby ushering in the Big Data era.

  14. Big data in Earth science: Emerging practice and promise

    More recently, big data have been used to support research toward the UN Sustainable Development Goals (SDGs), such as climate action (SDG 13) and life below water (SDG 14) . ... Big data-based analyses provide new insights into climate processes in ocean-climate interactions, severe weather, and wind potential. ...

  15. Articles

    In recent years, mobile applications have proliferated across domains such as E-banking, Augmented Reality, E-Transportation, and E-Healthcare. These applications are often built using microservices, an archit... Abdul Rasheed Mahesar, Xiaoping Li and Dileep Kumar Sajnani. Journal of Big Data 2024 11:123. Research Published on: 4 September 2024.

  16. Top Trends in Big Data for 2024 and Beyond

    Incorrect or misguided data can lead to wrong decisions and costly outcomes. Big data continues to drive major changes in how organizations process, store and analyze data. 2. More data, increased data diversity drive advances in processing and the rise of edge computing. The pace of data generation continues to accelerate.

  17. Data mining

    Latest Research and Reviews. ... This atlas takes advantage of the integration of big data, enabling the discovery of putative neural progenitors in adults and microglial regional variations.

  18. Big Data Research

    Deep Learning Techniques for Enhanced Mangrove Land use and Land change from Remote Sensing Imagery: A Blue Carbon Perspective. Huimin Han, Zeeshan Zeeshan, Muhammad Assam, Dr Faheem Ullah Khan, ... Nadia Sarhan. In Press, Journal Pre-proof, Available online 13 June 2024. View PDF.

  19. Big Data, new epistemologies and paradigm shifts

    With respect to the sciences, access to Big Data and new research praxes has led some to proclaim the emergence of a new fourth paradigm, one rooted in data-intensive exploration that challenges the established scientific deductive approach. At present, whilst it is clear that Big Data is a disruptive innovation, presenting the possibility of a ...

  20. Big data: The next frontier for innovation, competition, and

    The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office. Leaders in every sector will have to grapple ...

  21. Business analytics and big data research in information systems

    The "Business Analytics and Big Data" track as a melting pot for topics in information systems (IS) and neighbouring disciplines has a long and successful history at the European Conference on Information Systems (ECIS). From its initial year in 2012 to 2021, the track has received 512 submissions.

  22. The Top 5 Data Science And Analytics Trends In 2023

    Today, information can be captured from many different sources, and technology to extract insights is becoming increasingly accessible. The Top 5 Data Science And Analytics Trends In 2023. Adobe ...

  23. Privacy Prevention of Big Data Applications: A Systematic Literature

    Big Data is a relatively new IT concept, and it is apparent that further research is needed in this area. Many papers, however, show a substantial gap, suggesting that research is skewed toward traditional techniques and that Big Data is under-researched. In the bulk of the articles, the study findings are one-sided and incomplete.

  24. Time series big data: a survey on data stream frameworks, analysis and

    Big data has a substantial role nowadays, and its importance has significantly increased over the last decade. Big data's biggest advantages are providing knowledge, supporting the decision-making process, and improving the use of resources, services, and infrastructures. The potential of big data increases when we apply it in real-time by providing real-time analysis, predictions, and ...

  25. Latest science news, discoveries and analysis

    Latest science news and analysis from the world's leading research journal. ... Data from giant project show how withdrawn research propagates through the literature.