ML, AI, Big Data - IT Job QnS

Machine Learning, Artificial Intelligence, Big Data Analytics

Theory Coming soon

Previous Job Question with Answer on ML, AI & Big Data

Topic Wise Question Bank

Big Data, ML & AI

- 1
  Big Data, ML & AI | Probability
  Server X and Server Y distribute incoming web traffic. Server X handles 60% of requests, while Server Y handles the remaining 40%. The probability of high latency is 2% for Server X and 4% for Server Y. Determine the total probability that a randomly selected request encounters high latency and the conditional probability that a delayed request was handled by Server X.
  ICB, AP, 26 | Assistant Programmer
  
  This is a conditional probability problem using the law of total probability and Bayes' theorem.
  Given:
  P(X)=0.6, P(Y)=0.4
  P(L|X)=0.02, P(L|Y)=0.04
  Step 1: Total probability of latency
  P(L) = P(X)P(L|X) + P(Y)P(L|Y)
  = 0.6×0.02 + 0.4×0.04
  = 0.012 + 0.016 = 0.028
  Step 2: Conditional probability
  P(X|L) = (P(X)P(L|X)) / P(L)
  = 0.012 / 0.028 = 3/7 ≈ 0.4286
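  Conclusion: The overall probability of high latency is 2.8%. Given a delayed request, the probability that Server X handled it is about 42.9%, so the request is more likely (about 57.1%) to have come from Server Y.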
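
The two steps above can be checked with a short script. This is a minimal sketch using exact fractions; the variable names are illustrative and not part of the original solution.

```python
from fractions import Fraction

# Prior share of traffic per server and P(high latency | server), from the question.
priors = {"X": Fraction("0.6"), "Y": Fraction("0.4")}
latency_given = {"X": Fraction("0.02"), "Y": Fraction("0.04")}

# Law of total probability: P(L) = sum over servers of P(server) * P(L | server).
p_latency = sum(priors[s] * latency_given[s] for s in priors)

# Bayes' theorem: P(server | L) = P(server) * P(L | server) / P(L).
posterior = {s: priors[s] * latency_given[s] / p_latency for s in priors}

print(float(p_latency))   # 0.028
print(posterior["X"])     # 3/7
print(posterior["Y"])     # 4/7
```

Because the posteriors must sum to 1, P(Y|L) = 4/7 follows directly from P(X|L) = 3/7.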
  
  Probability Solution
  Given:
  P(X)=0.6, P(Y)=0.4
  P(L|X)=0.02, P(L|Y)=0.04
  Total probability:
  P(L)=0.028
  Conditional probability:
  P(X|L)=3/7 ≈ 0.4286
- 2
  Big Data, ML & AI | Big Data Analytics
  From the following Hadoop ecosystem components: MapReduce, HDFS, YARN, HBase, ZooKeeper, Pig, Hive, Mahout, Chukwa, Cassandra, Avro, Oozie, Flume, Sqoop, identify five components that you would use to design this pipeline. Justify the role of each chosen component in this scenario, focusing on its functionality.
  Combined Bank, AE(IT)/AME, 26 | AME/ANE/AE
  Selected Hadoop Components for Big Data Pipeline
  1. HDFS (Hadoop Distributed File System)
  - Used to store massive volumes of data (petabytes).
  - Provides distributed and fault-tolerant storage.
  2. YARN (Yet Another Resource Negotiator)
  - Manages cluster resources and schedules jobs.
  - Ensures efficient execution of multiple applications.
  3. MapReduce
  - Used for batch data processing.
  - Processes large datasets in parallel across nodes.
  4. Hive
  - Provides SQL-like interface for querying big data.
  - Useful for data analysis and reporting.
  5. Flume
  - Used for real-time data ingestion (e.g., web server logs).
  - Efficiently collects and transfers streaming data to HDFS.
  Together, these components support data ingestion, storage, processing, resource management, and analytics for a scalable big data pipeline.
  1. HDFS
  - Used to store large volumes of data (petabytes).
  - Provides distributed, fault-tolerant storage.
  2. YARN
  - Manages cluster resources and schedules jobs.
  - Helps run multiple applications efficiently.
  3. MapReduce
  - Used for batch data processing.
  - Processes large data in parallel.
  4. Hive
  - Provides a SQL-like query interface.
  - Suitable for data analysis and reporting.
  5. Flume
  - Used for real-time data ingestion (such as web logs).
  - Sends streaming data efficiently to HDFS.
  Together, these components ensure data ingestion, storage, processing, resource management, and analytics.
- 3
  Big Data, ML & AI | Preprocessing
  Suppose your dataset has missing values and noise. How would you preprocess it?
  Combined Bank, AP-23, 26 | Assistant Programmer
  Data Preprocessing for Missing Values and Noise
  Data preprocessing is the step of cleaning and transforming raw data before feeding it into a model. Missing values and noise are two common problems that can reduce accuracy and mislead results. Below are the methods to handle them.
  1. Handling Missing Values
  Missing values occur when some data points are not recorded or are lost. They must be handled carefully to avoid biased results.
  - Deletion: Remove rows or columns that have too many missing values. This is simple but only works when the missing data is small and random.
  - Mean/Median/Mode Imputation: Fill missing numeric values with the average (mean), middle value (median), or most common value (mode) of that column.
  - Forward/Backward Fill: Use the previous or next available value to fill the gap, useful in time-series data.
  - Predictive Imputation: Use machine learning models like KNN or regression to predict and fill the missing values based on other columns.
  - Constant Value Fill: Replace missing values with a fixed number like 0 or a label like "Unknown" for categorical data.
  Example: In an age column, if 10% of entries are blank, you can fill them with the mean age of all other rows.
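
The age-column example can be sketched as follows, using only the Python standard library. The sample values and the `impute_mean` helper are hypothetical, chosen only to illustrate mean imputation.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the recorded entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

# Hypothetical age column with two blank entries.
ages = [25, None, 31, 40, None, 24]
filled = impute_mean(ages)
print(filled)  # [25, 30, 31, 40, 30, 24]
```

Swapping `mean` for `statistics.median` gives median imputation, which is usually the safer choice when the column contains extreme values.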
  2. Handling Noise
  Noise refers to random errors or unwanted variation in data that hides true patterns. It must be reduced to improve model performance.
  - Binning: Sort data into groups or bins and replace values with the bin average or median. This smooths out local fluctuations.
  - Regression: Fit a regression line to the data and use the predicted values to replace noisy points.
  - Outlier Detection: Identify extreme values using the Z-score or IQR method and remove or cap them.
  - Smoothing Filters: Use moving average or Gaussian filters to reduce random spikes in numerical data.
  - Clustering: Group similar data points together and remove points that do not belong to any major cluster.
  Example: In a temperature dataset, a sudden reading of 500°C is clearly noise. Using the IQR method, this outlier can be detected and replaced with the median.
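
The temperature example can be sketched in the same way. This is an illustrative implementation under stated assumptions: the readings are invented, the fence multiplier `k=1.5` is the conventional default rather than something the answer specifies, and the replacement value is taken as the median of the non-outlier readings.

```python
from statistics import median, quantiles

def replace_outliers_iqr(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median of the inliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    fill = median(inliers)
    return [v if low <= v <= high else fill for v in values]

# Hypothetical temperature readings in degrees Celsius, with one spurious spike.
temps = [21.0, 22.5, 23.0, 22.0, 500.0, 21.5, 23.5, 22.8]
cleaned = replace_outliers_iqr(temps)
print(cleaned)  # the 500.0 reading is replaced by 22.5
```

Capping the outlier at the nearest fence instead of replacing it with the median is an equally common variant.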
  Data Preprocessing for Missing Values and Noise
  Data preprocessing is the step of cleaning and transforming raw data before it is fed into a model. Missing values and noise are two common problems that reduce accuracy and can mislead results.
  1. Handling Missing Values
  Missing values occur when some data points are not recorded or are lost. They must be handled carefully to avoid biased results.
  - Deletion: Remove rows or columns that contain many missing values. This is simple but works only when the missing data is small and random.
  - Mean/Median/Mode Imputation: Fill missing numeric values with the column's average (mean), middle value (median), or most common value (mode).
  - Forward/Backward Fill: Fill the gap with the previous or next available value; useful for time-series data.
  - Predictive Imputation: Use machine learning models such as KNN or regression to predict and fill missing values from the other columns.
  - Constant Value Fill: Replace missing values with a fixed number such as 0, or with an "Unknown" label for categorical data.
  Example: If 10% of the entries in an age column are blank, they can be filled with the mean age of the other rows.
  2. Handling Noise
  Noise is random error or unwanted variation in the data that hides the true patterns. It must be reduced to improve model performance.
  - Binning: Sort the data into groups or bins and replace values with the bin average or median. This smooths local fluctuations.
  - Regression: Fit a regression line to the data and replace noisy points with the predicted values.
  - Outlier Detection: Identify extreme values with the Z-score or IQR method, then remove or cap them.
  - Smoothing Filters: Use moving-average or Gaussian filters to reduce random spikes in numerical data.
  - Clustering: Group similar data points together and remove points that do not belong to any major cluster.
  Example: In a temperature dataset, a sudden reading of 500°C is clearly noise. The IQR method can detect this outlier, which can then be replaced with the median.
- 4
  Big Data, ML & AI | Types of ML
  Explain the concepts of Reinforcement Learning (RL), Deep Learning (DL), and Federated Learning (FL) in the context of Machine Learning. Briefly describe how each approach differs in its learning mechanism, data usage, and real-world applications. (10 Marks)
  Combined Bank, SO-IT, 25 | Senior Officer (IT)
  
  Reinforcement Learning (RL)
  Reinforcement Learning is a type of machine learning where an agent learns by interacting with an environment. The agent takes actions and receives rewards or penalties, and its goal is to maximize cumulative reward over time.
  Learning Mechanism: Trial-and-error with reward feedback.
  Data Usage: Generated through interaction with the environment.
  Applications: Robotics, game playing (AlphaGo), autonomous vehicles, recommendation systems.
  Deep Learning (DL)
  Deep Learning is a subset of machine learning that uses multi-layer neural networks to automatically learn features from large datasets. It is inspired by the human brain structure.
  Learning Mechanism: Supervised/unsupervised learning using deep neural networks.
  Data Usage: Requires large, labeled or unlabeled datasets.
  Applications: Image recognition, speech recognition, natural language processing, medical diagnosis.
  Federated Learning (FL)
  Federated Learning is a distributed learning approach where the model is trained across multiple devices without sharing raw data. Only model updates are sent to a central server.
  Learning Mechanism: Collaborative learning with local model updates.
  Data Usage: Data remains on local devices (privacy-preserving).
  Applications: Mobile keyboards, healthcare data analysis, IoT systems, privacy-sensitive applications.
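
To make the Federated Learning mechanism concrete, the sketch below shows one server-side aggregation round. It assumes a sample-weighted average of client weights (the approach commonly called federated averaging), which the answer above does not name explicitly; the client data is hypothetical.

```python
def federated_average(client_updates):
    """Combine local weights into global weights, weighting each client by its sample count."""
    total_samples = sum(n for _, n in client_updates)
    n_weights = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total_samples
        for i in range(n_weights)
    ]

# Hypothetical updates: (locally trained weights, number of local samples).
# Only these weights reach the server; the raw training data stays on each device.
client_updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
global_weights = federated_average(client_updates)
print(global_weights)  # [2.5, 3.5]
```

The client with 30 samples pulls the global weights three times as strongly as the client with 10, which is the design choice behind weighting by sample count.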
  
  Reinforcement Learning (RL)
  Reinforcement Learning is a machine learning method in which an agent learns by interacting with an environment. The agent takes actions and receives rewards or penalties, and the goal is to maximize total reward.
  Learning Mechanism: Trial-and-error, reward-based learning.
  Data Usage: Data is generated through interaction with the environment.
  Applications: Robotics, game playing (AlphaGo), autonomous vehicles, recommendation systems.
  Deep Learning (DL)
  Deep Learning is a part of machine learning in which multi-layer neural networks automatically learn features from large datasets. It is inspired by the structure of the human brain.
  Learning Mechanism: Supervised or unsupervised learning using deep neural networks.
  Data Usage: Requires large labeled or unlabeled datasets.
  Applications: Image recognition, speech recognition, NLP, medical diagnosis.
  Federated Learning (FL)
  Federated Learning is a distributed learning method in which a model is trained across multiple devices without sharing raw data. Only model updates are sent to the central server.
  Learning Mechanism: Local training and global model aggregation.
  Data Usage: Data stays on the local device (privacy is preserved).
  Applications: Mobile keyboard prediction, healthcare data analysis, IoT systems, privacy-sensitive applications.
- 5
  Big Data, ML & AI | Types of ML
  What is machine learning? Differentiate among supervised learning vs unsupervised learning vs reinforcement learning.
  Combined Bank, SO(IT), 24 | Senior Officer (IT)
  Machine Learning (ML) is a branch of Artificial Intelligence in which computers learn from data, identify patterns, and make decisions without being explicitly programmed.
  - Purpose: To analyze data and make accurate predictions or decisions automatically.
  - Example: Spam filtering, recommendation systems, image recognition.
  - Key Idea: The system improves its performance over time by learning from experience (data).
  Types of Machine Learning
  Supervised Learning:
  - Uses labeled data where input and correct output are already known.
  - The model learns a mapping function from input to output.
  - Used for prediction tasks.
  - Example: Email classification (spam/not spam), house price prediction.
  Unsupervised Learning:
  - Uses unlabeled data without predefined output.
  - The model discovers hidden patterns or relationships in data.
  - Used for grouping and data analysis.
  - Example: Customer segmentation, clustering.
  Reinforcement Learning:
  - An agent interacts with an environment and learns by trial and error.
  - Uses reward and punishment to improve decisions.
  - Focuses on maximizing long-term reward.
  - Example: Game playing (Chess, AI agents), robotics control.
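
The reinforcement learning loop described above (act, receive a reward, update) can be sketched with tabular Q-learning. Everything here is an illustrative assumption: a hypothetical four-state corridor with a reward at the right end, purely random exploration, and arbitrary values for the learning rate `ALPHA` and discount `GAMMA`.

```python
import random

N_STATES = 4          # states 0..3; state 3 is the goal
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Environment: move along the corridor; reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(steps=5000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    state = 0
    for _ in range(steps):
        a = rng.randrange(2)                     # trial and error: act at random
        next_state, reward = step(state, ACTIONS[a])
        done = next_state == N_STATES - 1
        # Target = immediate reward plus discounted best future value.
        target = reward if done else reward + GAMMA * max(q[next_state])
        q[state][a] += ALPHA * (target - q[state][a])
        state = 0 if done else next_state        # start a new episode after the goal
    return q

q_table = train()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

No labeled examples are supplied; the agent's only training signal is the reward, and the discount factor is what makes it prefer actions that lead toward the long-term reward.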
  Machine Learning (ML) is a branch of Artificial Intelligence in which a computer learns patterns from data and can make decisions on its own without being separately programmed.
  - Purpose: To analyze data and make accurate predictions or decisions.
  - Example: Spam detection, recommendation systems, image recognition.
  - Key Idea: The system learns from data and improves its performance over time.
  Types of Machine Learning
  Supervised Learning:
  - Uses labeled data, where the output is known for each input.
  - The model learns the relationship between input and output.
  - Used for prediction.
  - Example: Spam detection, house price prediction.
  Unsupervised Learning:
  - Uses unlabeled data, where the output is not known in advance.
  - Finds hidden patterns or groups within the data.
  - Used for data analysis and grouping.
  - Example: Customer segmentation, clustering.
  Reinforcement Learning:
  - An agent interacts with an environment and learns through trial and error.
  - Learning is driven by reward and punishment.
  - Emphasizes maximizing long-term reward.
  - Example: Game playing, robotics control.
- 6
  Big Data, ML & AI | Types of ML
  Compare and contrast the three fundamental paradigms of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
  Combined Bank, AP-22, 26 | Assistant Programmer
  
  Machine Learning Paradigms
  Machine Learning can be broadly classified into three fundamental paradigms based on how models learn from data: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
  1) Supervised Learning
  In supervised learning, the model is trained using labeled data, where the correct output is already known.
  Learning Mechanism: Learns by mapping inputs to known outputs.
  Data Usage: Uses labeled datasets.
  Applications: Spam detection, image classification, disease prediction.
  2) Unsupervised Learning
  In unsupervised learning, the model works with unlabeled data and tries to discover hidden patterns or structures in the data.
  Learning Mechanism: Pattern discovery and grouping.
  Data Usage: Uses unlabeled datasets.
  Applications: Customer segmentation, clustering, anomaly detection.
  3) Reinforcement Learning
  In reinforcement learning, an agent learns by interacting with an environment and receives rewards or penalties for actions.
  Learning Mechanism: Trial-and-error based on reward feedback.
  Data Usage: Data generated through interaction with the environment.
  Applications: Robotics, game playing, autonomous systems.
  Key Differences
  Supervised learning relies on labeled data, unsupervised learning finds patterns without labels, and reinforcement learning focuses on decision-making through rewards.
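
The difference in data usage between the first two paradigms can be shown on the same toy dataset. This is a minimal sketch with invented one-dimensional data: a nearest-centroid classifier stands in for supervised learning and a two-cluster k-means stands in for unsupervised learning; neither algorithm is prescribed by the answer above.

```python
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["low", "low", "low", "high", "high", "high"]

def fit_centroids(xs, ys):
    """Supervised: use the known labels to compute one centroid per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def two_means(xs, iterations=10):
    """Unsupervised: split the points into two groups without using any labels."""
    centers = [min(xs), max(xs)]
    assignment = []
    for _ in range(iterations):
        assignment = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]
        for c in (0, 1):
            members = [x for x, a in zip(xs, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment

centroids = fit_centroids(points, labels)
print(predict(centroids, 4.0))  # high
print(two_means(points))        # [0, 0, 0, 1, 1, 1]
```

The supervised model returns a named class because it was given the names; the unsupervised model recovers the same grouping but can only report anonymous cluster indices.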
  
  The Three Main Paradigms of Machine Learning
  Machine Learning is divided into three main types according to how learning takes place: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
  1) Supervised Learning
  In supervised learning, the model is trained with labeled data, where the correct output for each input is already known.
  Learning Mechanism: Learns the mapping between inputs and known outputs.
  Data Usage: Uses labeled datasets.
  Applications: Spam detection, image classification, disease prediction.
  2) Unsupervised Learning
  In unsupervised learning, the model works with unlabeled data and finds the patterns or structure within it.
  Learning Mechanism: Pattern and group discovery.
  Data Usage: Uses unlabeled datasets.
  Applications: Customer segmentation, clustering, anomaly detection.
  3) Reinforcement Learning
  In reinforcement learning, an agent learns by interacting with an environment and receives a reward or penalty according to its actions.
  Learning Mechanism: Trial-and-error, reward-based learning.
  Data Usage: Data is generated from interaction with the environment.
  Applications: Robotics, game playing, autonomous systems.
  Key Differences
  Supervised learning needs labeled data, unsupervised learning looks for patterns without labels, and reinforcement learning learns decision-making through rewards.
- 7
  Big Data, ML & AI | Generative AI & XAI
  Imagine a government agency is developing an AI-based citizen service chatbot that can automatically generate responses, summarize documents, and provide policy information to citizens. Explain how Generative AI can be used to power such a system, and how Explainable AI (XAI) techniques can ensure that its responses are transparent, reliable, and accountable. (10 Marks)
  Combined Bank, SO-IT, 25 | Senior Officer (IT)
  Use of Generative AI in a Government Citizen Service Chatbot
  Generative AI can be used to power a government chatbot by enabling it to automatically generate human-like responses, summarize long policy documents, and provide accurate policy-related information to citizens. Using large language models, the chatbot can understand citizens’ questions in natural language and generate clear, context-aware answers. It can also analyze official documents, extract key points, and present simplified summaries, making government services more accessible and efficient.
  Role of Explainable AI (XAI)
  Explainable AI (XAI) techniques help ensure that the chatbot’s responses are transparent, reliable, and accountable. XAI allows the system to explain why a particular response was generated by showing the source policy, rules, or reasoning behind the answer. This helps government officials and citizens trust the system and verify that the information is correct and unbiased.
  Benefits of Using XAI
  - Transparency: Citizens can understand how and from where the answer was derived.
  - Reliability: Officials can audit and validate chatbot decisions.
  - Accountability: The system can justify responses, reducing the risk of misinformation.
  - Public Trust: Clear explanations increase confidence in AI-based public services.
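
The idea of showing the source policy behind each answer can be sketched as a retrieval step that returns its evidence. This is only an illustration of traceability, not of a production system: the passages are invented, the word-overlap matching is a stand-in for a real retriever, and in practice a language model would generate the final wording from the retrieved passage.

```python
# Hypothetical policy passages keyed by a document identifier.
POLICY_PASSAGES = {
    "passport-fees": "The fee for a regular passport renewal is paid online before the appointment.",
    "birth-certificate": "A birth certificate application requires the hospital record and parent identification.",
}

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def answer_with_source(question, passages):
    """Return the best-matching passage together with its source and the matched terms."""
    q_words = tokenize(question)
    best_id, best_terms = None, set()
    for doc_id, text in passages.items():
        shared = q_words & tokenize(text)
        if len(shared) > len(best_terms):
            best_id, best_terms = doc_id, shared
    if best_id is None:
        return {"answer": None, "source": None, "matched_terms": []}
    return {
        "answer": passages[best_id],
        "source": best_id,                      # lets officials audit where the answer came from
        "matched_terms": sorted(best_terms),    # a simple explanation of why it was chosen
    }

result = answer_with_source("How do I pay the passport renewal fee?", POLICY_PASSAGES)
print(result["source"])         # passport-fees
print(result["matched_terms"])  # ['fee', 'passport', 'renewal', 'the']
```

Returning the source identifier alongside the answer is what supports the transparency and accountability benefits listed above: a citizen or auditor can open the cited document and verify the response.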
  Use of Generative AI in a Government Citizen Service Chatbot
  Generative AI can be used to build a government chatbot that automatically produces human-like answers, summarizes long policy documents, and gives citizens policy information in a simple form. Using a large language model, the chatbot can understand citizens' questions and provide relevant, clear answers. It can also analyze government documents and present the important information in brief, making citizen services easier and faster.
  Role of Explainable AI (XAI)
  Explainable AI (XAI) helps make the chatbot's answers transparent, reliable, and accountable. With XAI, the system can show why a particular answer was given, for example which policy, rule, or piece of information the answer is based on. This lets both citizens and the government place trust in the AI system.
  Benefits of Using XAI
  - Transparency: Citizens can understand how the answer was produced.
  - Reliability: The government can easily verify the chatbot's decisions.
  - Accountability: The risk of wrong or misleading information is reduced.
  - Public Trust: Answers that come with explanations increase confidence in AI-based services.
- 8
  Big Data, ML & AI | Hadoop Ecosystem
  A financial services provider needs to handle massive streaming and historical log data to perform fraud analytics and ML-driven maintenance prediction. Identify five Hadoop ecosystem technologies appropriate for this use case and describe their roles.
  CB, AME/AE(IT-23), 26 | AME/ANE/AE
  To process large-scale streaming and historical log data, different Hadoop ecosystem technologies can be used together for storage, processing, querying, and machine learning.
  1. HDFS (Hadoop Distributed File System)
  - HDFS is used for distributed storage of massive amounts of data across multiple servers.
  - It stores historical logs, transaction records, and streaming data reliably with fault tolerance.
  Role: Large-scale distributed data storage.
  2. Apache Kafka
  - Kafka is used for real-time data streaming and message collection.
  - It collects continuous transaction logs, user activities, and system events from different sources.
  Role: Real-time streaming data ingestion.
  3. Apache Spark
  - Spark is a fast data processing framework used for big data analytics and machine learning.
  - It can process both streaming data and historical data efficiently.
  Role: Real-time analytics, fraud detection, and ML processing.
  4. Apache Hive
  - Hive is a data warehouse tool used to query large datasets using SQL-like language.
  - Analysts can generate reports and analyze fraud-related historical data easily.
  Role: SQL-based querying and data analysis.
  5. Apache Mahout
  - Mahout provides machine learning algorithms for big data applications.
  - It can be used for predictive analytics, anomaly detection, and maintenance prediction models.
  Role: Machine learning and predictive analytics.
  Question: A financial services provider wants to use massive streaming and historical log data for fraud analytics and ML-driven maintenance prediction. Describe five Hadoop ecosystem technologies suitable for this task and their roles.
  To process large-scale streaming and historical log data, several Hadoop ecosystem technologies are used together for storage, processing, querying, and machine learning.
  1. HDFS (Hadoop Distributed File System)
  - HDFS is used to store very large amounts of data in a distributed manner across many servers.
  - It stores historical logs, transaction records, and streaming data safely with fault tolerance.
  Role: Large-scale distributed data storage.
  2. Apache Kafka
  - Kafka is used for real-time data streaming and message collection.
  - It collects continuous transaction logs, user activity, and system events from different sources.
  Role: Real-time streaming data ingestion.
  3. Apache Spark
  - Spark is a fast data processing framework used for big data analytics and machine learning.
  - It can process both streaming data and historical data quickly.
  Role: Real-time analytics, fraud detection, and ML processing.
  4. Apache Hive
  - Hive is a data warehouse tool that helps query large datasets using a SQL-like language.
  - It lets analysts easily analyze fraud-related historical data and produce reports.
  Role: SQL-based querying and data analysis.
  5. Apache Mahout
  - Mahout provides machine learning algorithms for big data applications.
  - It is used to build predictive analytics, anomaly detection, and maintenance prediction models.
  Role: Machine learning and predictive analytics.
- 9
  Big Data, ML & AI
  Design a scalable big data analytics pipeline for an e-commerce platform that ingests real-time logs, stores petabyte-scale data, supports batch and stream processing, delivers near real-time recommendations, and runs large-scale machine learning models; from the following Hadoop ecosystem components - MapReduce, HDFS, YARN, HBase, ZooKeeper, Pig, Hive, Mahout, Chukwa, Cassandra, Avro, Oozie, Flume, Sqoop - select any five and explain the role of each component in this system.
  CB, AME/AE(IT-23), 26 | AME/ANE/AE
  
  To build a scalable Big Data Analytics Pipeline for an E-commerce Platform, the following five Hadoop Ecosystem components can be selected:
  1. Flume – Real-time Data Ingestion
  Apache Flume collects and transfers real-time web logs, clickstream data, and application events from multiple servers into the Hadoop ecosystem. It provides reliable and scalable log ingestion.
  Role: Ingests real-time logs into HDFS.
  2. HDFS – Distributed Storage
  Hadoop Distributed File System (HDFS) stores petabyte-scale structured and unstructured data across multiple machines. It provides high fault tolerance and scalability by replicating data blocks.
  Role: Stores massive datasets reliably for analytics and machine learning.
  3. YARN – Resource Management
  YARN (Yet Another Resource Negotiator) manages cluster resources and schedules jobs submitted by different applications, ensuring efficient utilization of the Hadoop cluster.
  Role: Allocates CPU and memory resources for batch and stream processing jobs.
  4. MapReduce – Batch Processing
  MapReduce performs distributed parallel processing of very large datasets stored in HDFS. It divides tasks into Map and Reduce phases for efficient computation.
  Role: Executes large-scale batch analytics such as sales analysis, customer behavior analysis, and report generation.
  5. Mahout – Machine Learning
  Apache Mahout provides scalable machine learning algorithms such as clustering, classification, and recommendation systems on large datasets.
  Role: Builds recommendation engines and large-scale machine learning models for personalized product recommendations.
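
The Map and Reduce phases mentioned under MapReduce can be illustrated in plain Python. This is a single-process sketch of the programming model only, not Hadoop's API; the log format `user,product,quantity` and the sample lines are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Map: turn one log line 'user,product,quantity' into a (product, quantity) pair."""
    _user, product, quantity = record.split(",")
    yield product, int(quantity)

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a single result."""
    return key, sum(values)

# Hypothetical purchase log lines, as they might be stored in HDFS.
logs = ["u1,shoes,2", "u2,hat,1", "u3,shoes,3"]
mapped = [pair for line in logs for pair in map_phase(line)]
totals = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(totals)  # {'shoes': 5, 'hat': 1}
```

On a real cluster the map calls run in parallel on the nodes that hold each data block and the framework performs the shuffle over the network, which is what makes the model scale to batch jobs such as the sales analysis described above.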
  
  The following five Hadoop ecosystem components can be selected to build a scalable big data analytics pipeline for an e-commerce platform.
  1. Flume – Real-time Data Ingestion
  Apache Flume collects real-time logs, clickstream data, and events from web servers and applications and sends them to the Hadoop system. It ensures reliable and scalable data ingestion.
  Role: Collects real-time logs and stores them in HDFS.
  2. HDFS – Distributed Storage
  Hadoop Distributed File System (HDFS) stores petabyte-scale structured and unstructured data in a distributed manner across multiple nodes. It ensures fault tolerance and scalability through data replication.
  Role: Stores very large amounts of data safely and reliably.
  3. YARN – Resource Management
  YARN (Yet Another Resource Negotiator) manages the Hadoop cluster's CPU, memory, and other resources and schedules jobs efficiently.
  Role: Allocates and manages resources for batch and stream processing jobs.
  4. MapReduce – Batch Processing
  MapReduce processes the very large data stored in HDFS in a distributed manner. It performs parallel processing through the Map and Reduce phases.
  Role: Performs sales analysis, customer behavior analysis, and other large-scale batch analytics.
  5. Mahout – Machine Learning
  Apache Mahout implements machine learning algorithms such as clustering, classification, and recommendation systems on large data.
  Role: Builds personalized product recommendations and large-scale machine learning models.
- 10
  Big Data, ML & AI | Precision and Recall
  An engineering model is deployed to identify critical network hardware failures. Explain why an administrator must optimize for recall instead of precision if the business penalty for a missed failure is catastrophic.
  RAKUB, ANSE, 26 | AME/ANE/AE
  Understanding Precision and Recall
  - Precision: Out of all cases predicted as failure by the model, how many are actually real failures. High precision means fewer false alarms.
  - Recall: Out of all actual failures, how many the model correctly detects. High recall means fewer missed failures.
  Why Recall Matters More Here
  When the cost of missing a failure is extremely high—such as network downtime, data center collapse, financial loss, or safety risks—recall becomes the priority. The goal is to detect as many real failures as possible, even if it increases false alarms.
  - False Positive (Low Precision): A failure is predicted but does not actually exist. Cost: extra inspection time and minor operational disruption.
  - False Negative (Low Recall): A real failure is missed by the model. Cost: system outage, financial loss, and possible harm to users.
  In such scenarios, a recall-focused approach is preferred because preventing one critical failure is more important than reducing unnecessary alerts.
  Practical Steps to Optimize Recall
  - Lower classification threshold: Increase sensitivity so more cases are classified as positive, capturing borderline failures.
  - Class weights: Assign higher penalty during training for missing actual failures to bias the model toward detection.
  - Ensemble methods: Combine multiple models so that if one misses a failure, another may detect it.
  - Anomaly detection: Flag unusual behavior for human review instead of ignoring it automatically.
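
The first step above, lowering the classification threshold, can be demonstrated numerically. This is a minimal sketch with made-up model scores and ground-truth labels (1 = real failure); it uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when scores >= threshold are predicted as failures."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))        # failures caught
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))        # false alarms
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))  # failures missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical failure scores from the model and the true outcomes.
scores = [0.95, 0.80, 0.60, 0.45, 0.40, 0.10]
labels = [1, 1, 0, 0, 1, 0]

p_default, r_default = precision_recall(scores, labels, threshold=0.5)
p_low, r_low = precision_recall(scores, labels, threshold=0.3)
print(p_default, r_default)  # one of the three real failures is missed
print(p_low, r_low)          # 0.6 1.0
```

At the lower threshold the model raises one more false alarm (precision falls from about 0.67 to 0.6) but no longer misses the failure scored at 0.40, which is the trade the answer argues for when a missed failure is catastrophic.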
  Why Recall Is Optimized in Critical Network Hardware Failure Detection
  - Precision: Of the cases the model predicted as failures, how many actually were failures. High precision means fewer false alarms.
  - Recall: Of all the failures that actually exist, how many the model correctly caught. High recall means fewer missed failures.
  Why Recall Matters More Here
  When the business penalty of a missed failure is catastrophic (such as complete network downtime, data center collapse, financial loss, or safety hazards), the system's priority is to detect as many failures as possible, even if false alarms increase somewhat.
  - False Positive (Low Precision): The model predicts a failure that does not exist. Cost: technician inspection time and minor operational disruption.
  - False Negative (Low Recall): The model misses a real failure. Cost: catastrophic outage, financial loss, reputation damage, and potential harm to users.
  In critical infrastructure, a missed failure is far more damaging, so maximizing recall is the main goal here.
  Practical Steps to Optimize Recall
  - Lower classification threshold: Lower the decision threshold so that more cases are classified as positive and borderline failures are also caught.
  - Class weights: During training, apply a higher penalty for missing a real failure to bias the model toward failure detection.
  - Ensemble methods: Combine multiple models so that if one model misses a failure, another can detect it.
  - Anomaly detection: Flag unusual behavior for human review instead of rejecting it automatically.
