Python Web Scraping with BeautifulSoup: Complete Beginner’s Guide 2026

Web scraping is one of the most useful skills in modern Python programming. Whether you’re aggregating product prices, collecting research data, building a news dashboard, or training a machine learning model, web scraping unlocks access to the vast information available online. In this comprehensive guide, we’ll teach you how to scrape websites with Python and BeautifulSoup — from absolute beginner to building production-ready scrapers.

🎯 What You’ll Learn

By the end of this guide, you’ll be able to:

Install and set up Python web scraping tools
Extract any element from any webpage
Handle errors, retries, and rate limits
Scrape multiple pages efficiently
Work with dynamic JavaScript-heavy sites
Respect website rules (robots.txt and rate limits)
Save scraped data to CSV, JSON, or databases

📦 Installation: Setting Up Your Environment

First, let’s install the libraries we need. Open your terminal and run:

pip install requests beautifulsoup4 lxml pandas

What each library does:

requests: Sends HTTP requests to download web pages
beautifulsoup4: Parses HTML and extracts data
lxml: Fast HTML parser (faster than Python’s built-in html.parser)
pandas: Saves data to CSV/Excel easily

🚀 Your First Scraper in 10 Lines

Let’s scrape a sample website to extract all article titles. Here’s the simplest possible scraper:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

titles = soup.find_all("h2")
for title in titles:
    print(title.text.strip())

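That’s it! In under 10 lines, you’ve built a working web scraper. Note that https://example.com is only a placeholder: swap in a page whose article titles sit in <h2> tags, or the loop prints nothing. Let’s break down what’s happening: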

Send the request: requests.get(url) downloads the HTML
Parse the HTML: BeautifulSoup(response.text, "lxml") turns the text into a navigable tree
Find elements: soup.find_all("h2") returns all <h2> tags
Extract text: .text.strip() gets the clean text inside each tag
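
To try the parse, find, and extract steps without any network access, the same calls can be run against an inline HTML string. This is a minimal sketch: the HTML snippet is made up for illustration, and it uses Python’s built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for response.text
html = """
<html><body>
  <h1>My Blog</h1>
  <h2> First Post </h2>
  <h2>Second Post</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; .text.strip() removes surrounding whitespace
titles = [h2.text.strip() for h2 in soup.find_all("h2")]
print(titles)
```

The <h1> is ignored because only <h2> tags were requested.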

🔍 Finding Elements: The Core Skills

BeautifulSoup gives you several ways to locate elements:

By tag name

soup.find("h1")          # First <h1> on the page
soup.find_all("p")       # All <p> tags as a list

By class

soup.find("div", class_="post-title")     # Note: class_ (with underscore) — class is reserved in Python
soup.find_all("a", class_="external-link")

By ID

soup.find(id="main-content")
soup.select_one("#main-content")           # Same thing, CSS selector style

By attribute

soup.find("img", {"alt": "Logo"})
soup.find_all("a", href=True)              # All links with href attribute

Using CSS selectors (most powerful)

soup.select("div.post > h2.title")         # <h2 class="title"> that is a direct child of <div class="post">
soup.select("a[href^='https']")            # Links whose href starts with "https"
soup.select("li:nth-of-type(2)")           # Every <li> that is the second <li> of its parent

💡 Pro tip: CSS selectors (select() and select_one()) are usually cleaner and more powerful than find()/find_all(). Learn them and you’ll write better scrapers.
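
The difference between the child combinator (>) and a plain descendant match is easiest to see on a small document. The sketch below runs the selectors from this section against made-up HTML; the markup and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<div class="post">
  <h2 class="title">Direct child title</h2>
  <div class="inner"><h2 class="title">Nested title</h2></div>
  <a href="https://example.com/a">secure</a>
  <a href="/relative">relative</a>
  <ul><li>one</li><li>two</li><li>three</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" matches direct children only, so the nested <h2> is skipped
direct = [h.get_text() for h in soup.select("div.post > h2.title")]

# A space matches any descendant, so both <h2> tags match
nested = [h.get_text() for h in soup.select("div.post h2.title")]

# [href^='https'] keeps only links whose href starts with "https"
secure = [a["href"] for a in soup.select("a[href^='https']")]

# nth-of-type(2) picks the second <li> within its parent
second = [li.get_text() for li in soup.select("li:nth-of-type(2)")]

print(direct, nested, secure, second)
```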

📋 Real Example: Scraping a Quote Website

Let’s scrape quotes.toscrape.com — a website specifically designed for practicing web scraping.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

quotes_data = []

for quote_div in soup.find_all("div", class_="quote"):
    text = quote_div.find("span", class_="text").text
    author = quote_div.find("small", class_="author").text
    tags = [tag.text for tag in quote_div.find_all("a", class_="tag")]

    quotes_data.append({
        "text": text,
        "author": author,
        "tags": ", ".join(tags)
    })

# Save to CSV
df = pd.DataFrame(quotes_data)
df.to_csv("quotes.csv", index=False, encoding="utf-8")
print(f"Saved {len(quotes_data)} quotes to quotes.csv")

Run this and you’ll have a clean CSV file with quotes, authors, and tags. It is a complete, working scraper, though a production version would also need the error handling covered later in this guide.
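
One fragile spot in the script above: find() returns None when an element is missing, so calling .text on the result raises AttributeError for any quote block that lacks a text or author element. Here is a minimal sketch of a more defensive extraction. The helper name safe_text and the sample HTML are assumptions for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def safe_text(parent, name, class_name, default=""):
    """Return the stripped text of the first matching child, or default if absent."""
    element = parent.find(name, class_=class_name)
    return element.get_text(strip=True) if element is not None else default

# Made-up HTML: the second quote has no author element
html = """
<div class="quote"><span class="text">Quote one</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Quote two</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = [
    {
        "text": safe_text(q, "span", "text"),
        "author": safe_text(q, "small", "author", default="Unknown"),
    }
    for q in soup.find_all("div", class_="quote")
]
print(rows)
```

With this approach a single malformed block yields a placeholder value instead of stopping the whole run.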

🔄 Scraping Multiple Pages (Pagination)

Most useful scraping involves multiple pages. Here’s the pattern:

import requests
from bs4 import BeautifulSoup
import time

all_quotes = []
base_url = "http://quotes.toscrape.com/page/{}/"
page = 1

while True:
    url = base_url.format(page)
    response = requests.get(url)

    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.text, "lxml")
    quotes = soup.find_all("div", class_="quote")

    if not quotes:  # Empty page = we're done
        break

    for quote_div in quotes:
        all_quotes.append({
            "text": quote_div.find("span", class_="text").text,
            "author": quote_div.find("small", class_="author").text
        })

    print(f"Scraped page {page}, total quotes: {len(all_quotes)}")
    page += 1
    time.sleep(1)  # Be polite — wait 1 second between requests

print(f"Done! Total: {len(all_quotes)} quotes")
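
An alternative to counting page numbers is to follow the page’s own "next" link until there is none. The sketch below shows that idea offline. The PAGES dictionary, the site.test URLs, and the crawl function are made up for illustration, and the "li.next > a" selector is an assumption about the markup that should be checked against the real page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up pages standing in for downloaded HTML, keyed by URL
PAGES = {
    "http://site.test/page/1/": '<div class="quote">A</div><li class="next"><a href="/page/2/">Next</a></li>',
    "http://site.test/page/2/": '<div class="quote">B</div>',
}

def crawl(start_url, get_html):
    """Follow 'next' links until a page has none; get_html(url) returns HTML text."""
    url, seen, quotes = start_url, set(), []
    while url and url not in seen:
        seen.add(url)  # guards against pagination loops
        soup = BeautifulSoup(get_html(url), "html.parser")
        quotes += [q.get_text(strip=True) for q in soup.select("div.quote")]
        next_link = soup.select_one("li.next > a")
        # urljoin resolves a relative href against the current page URL
        url = urljoin(url, next_link["href"]) if next_link else None
    return quotes

quotes = crawl("http://site.test/page/1/", PAGES.__getitem__)
print(quotes)
```

In a real scraper, get_html would download the page with requests and pause between calls.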

Why `time.sleep(1)`?

This is critical: Always add delays between requests. Without them, you could:

Overload the server (bad for them, bad for you)
Get your IP banned
Violate the website’s terms of service

1 second between requests is a polite default. For larger sites, you can go faster (0.5s or 0.2s). For smaller sites, slower (2-5s).
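
A fixed delay can be wrapped in a small helper so the pause is easy to tune per site. The sketch below also adds a little random jitter so requests are not perfectly periodic. The jitter, the helper name polite_sleep, and its parameters are assumptions added here for illustration.

```python
import random
import time

def polite_sleep(base=1.0, jitter=0.5, sleep=time.sleep):
    """Sleep for base seconds plus a random extra of up to jitter seconds.

    Returns the delay used so callers can log it.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay

# Demo with tiny values so it finishes quickly
used = polite_sleep(base=0.01, jitter=0.01)
print(f"slept {used:.3f}s")
```

In the pagination loop, polite_sleep() would take the place of time.sleep(1).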

🛡️ Handling Errors Like a Pro

Real-world scraping needs robust error handling:

import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException

def fetch_page(url, max_retries=3):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raises HTTPError for 4xx/5xx
            return response.text
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: waits 1s, then 2s, before the next attempt
            else:
                print(f"Failed to fetch {url} after {max_retries} attempts")
                return None

html = fetch_page("http://quotes.toscrape.com")
if html:
    soup = BeautifulSoup(html, "lxml")
    # Continue scraping...

This adds:

User-Agent header: Many sites block the default python-requests User-Agent, so send a browser-like one
Timeout: Don’t wait forever for slow servers (10 seconds max)
Retries with backoff: Wait longer between each retry
Graceful failure: Returns None instead of crashing
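
The retry-with-backoff pattern can be separated from the HTTP call, which makes it possible to test without a network. The sketch below is one way to do that. The names retry_with_backoff and flaky are made up for illustration, and it catches a broad Exception only to stay self-contained; real scraping code would catch RequestException as in the function above.

```python
import time

def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    Waits base_delay * 2**attempt between attempts; returns None on final failure.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None

# Demo: a fake fetch that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing sleep as a parameter lets the demo record the waits rather than actually pausing.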

🎭 Sites Need JavaScript? Use Selenium

BeautifulSoup only sees the HTML that’s returned from the server. Modern sites built with React, Vue, or Angular load content dynamically with JavaScript. For those, you need a browser automation tool like Selenium or Playwright.

Quick Selenium example:

pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless")  # Run without opening browser window

driver = webdriver.Chrome(options=options)
driver.get("https://example-spa-website.com")
time.sleep(3)  # Wait for JavaScript to load

soup = BeautifulSoup(driver.page_source, "lxml")
# Now scrape soup as usual
driver.quit()

Selenium is slower than requests but handles JavaScript. Use it only when needed.

🤖 Respecting robots.txt

Most websites publish a /robots.txt file that tells crawlers which paths they may and may not fetch. Always check it before scraping:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/products"):
    # OK to scrape
    pass
else:
    print("This URL is disallowed by robots.txt")

If a site says "Disallow: /admin/" — don’t scrape /admin/. It’s a request, not law, but respecting it keeps you ethical and safe.
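
To see how the rules are evaluated without downloading anything, RobotFileParser can also parse lines passed in directly. The robots.txt content below is made up for illustration. crawl_delay() returns the Crawl-delay value when the file sets one, which can feed the pause between requests.

```python
import urllib.robotparser

# Made-up robots.txt content for illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 2",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)  # parse() takes an iterable of lines, no download needed

allowed_products = rp.can_fetch("*", "https://example.com/products")
allowed_admin = rp.can_fetch("*", "https://example.com/admin/users")
delay = rp.crawl_delay("*")  # None when the file sets no Crawl-delay

print(allowed_products, allowed_admin, delay)
```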

💾 Saving Data: CSV, JSON, or Database

To CSV (most common)

import pandas as pd
df = pd.DataFrame(scraped_data)
df.to_csv("data.csv", index=False, encoding="utf-8-sig")  # utf-8-sig opens correctly in Excel

To JSON

import json
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(scraped_data, f, ensure_ascii=False, indent=2)

To SQLite database

import sqlite3
import pandas as pd
conn = sqlite3.connect("scraped.db")
df = pd.DataFrame(scraped_data)
df.to_sql("quotes", conn, if_exists="append", index=False)
conn.close()
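
Because if_exists="append" adds every row again each time the scraper runs, repeated runs can fill the table with duplicates. One possible way around that, sketched below with only the standard library, is a UNIQUE constraint combined with INSERT OR IGNORE. The table layout, the save helper, and the sample rows are assumptions for illustration.

```python
import sqlite3

# Made-up rows standing in for scraped_data
scraped_data = [
    {"text": "Quote one", "author": "Author A"},
    {"text": "Quote two", "author": "Author B"},
]

conn = sqlite3.connect(":memory:")  # use a filename such as "scraped.db" to persist
conn.execute(
    "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, UNIQUE(text, author))"
)

def save(rows):
    # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO quotes (text, author) VALUES (:text, :author)", rows
        )

save(scraped_data)
save(scraped_data)  # simulate a second run of the scraper

count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
print(count)
conn.close()
```

The second save() call inserts nothing, so the table still holds one row per distinct quote.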

🎯 Best Practices Cheatsheet

Always identify yourself with a User-Agent header
Always add delays (time.sleep) between requests
Always handle errors with try/except
Always check robots.txt before scraping a new site
Avoid hammering small sites: fetch pages one at a time rather than sending concurrent requests
Cache responses locally during development to avoid re-downloading
Use sessions for sites requiring login: session = requests.Session()
Rotate User-Agents for large-scale scraping
Use proxies if your IP gets blocked
Document your scraper — websites change, your code needs to be readable
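
The caching tip above can be sketched as a thin wrapper that stores each page on disk the first time it is fetched. The function cached_fetch, its parameters, and the fake fetch function are made up for illustration. In real use, fetch would be a function that downloads the page, such as fetch_page from the error-handling section.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(url, fetch, cache_dir):
    """Return HTML for url, calling fetch(url) only when no cached copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html

# Demo with a fake fetch function that records each call
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<html>hello</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("http://quotes.toscrape.com", fake_fetch, tmp)
    second = cached_fetch("http://quotes.toscrape.com", fake_fetch, tmp)

print(len(calls))
```

The second call is served from disk, so the site is contacted only once while you iterate on parsing code.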

⚠️ Legal & Ethical Considerations

Before scraping, ask yourself:

✅ Is this data publicly available? (no login required)
✅ Am I respecting robots.txt?
✅ Am I using reasonable rate limits?
✅ Is my scraping for personal/research use or commercial gain?

Some sites explicitly prohibit scraping in their Terms of Service. Examples include LinkedIn, Facebook, and certain news websites. Scraping them could lead to IP bans, legal warnings, or worse.

Safe to scrape: Public product catalogs, news headlines (with attribution), public government data, your own social media data.

Risky: Personal user data, behind login walls, sites with explicit no-scraping policies.

🚀 What’s Next?

Now that you’ve mastered the basics, consider exploring:

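Scrapy framework: For large-scale, production scrapers (built-in concurrency, so typically faster than a sequential requests + BeautifulSoup loop)
Async scraping with aiohttp + asyncio: Often several times faster for many URLs, because requests run concurrently instead of one at a time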
Playwright: Modern alternative to Selenium, faster and more reliable
API alternatives: Many sites offer official APIs — always check first
Cloud scraping platforms: Bright Data, ScraperAPI, Apify for enterprise

📚 Recommended Reading

Official BeautifulSoup docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Real Python web scraping tutorials
"Web Scraping with Python" by Ryan Mitchell (book)

🎓 Practice Project Ideas

Build these to cement your learning:

News aggregator: Scrape headlines from 3-5 news sites and email them daily
Price tracker: Monitor a product on Amazon and alert when price drops
Job board scraper: Collect Python jobs from multiple boards
Movie database: Scrape IMDb ratings for your favorite genre
Real estate tracker: Aggregate apartment listings from local sites

❓ Frequently Asked Questions

Is web scraping legal?

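In most jurisdictions, scraping publicly available data is generally legal. However, violating a site’s Terms of Service can have civil consequences. In the well-known US case hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, but hiQ was later found to have breached LinkedIn’s user agreement. Laws also vary by country.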

How fast can I scrape?

Depends on the site. As a rule of thumb:

Small sites: 1 request per 2-5 seconds
Medium sites: 1 request per 1 second
Large sites with APIs: 5-10 requests per second (or follow API rate limits)

Why does my scraper return empty results?

Most common causes:

The site loads content via JavaScript (use Selenium)
The site blocks bots (add User-Agent header)
The site requires login (use sessions or cookies)
The HTML structure changed (re-inspect the page)

Should I use Scrapy or BeautifulSoup?

BeautifulSoup for learning and small scripts (1-100 pages)
Scrapy for production scrapers (1000+ pages, multiple sites)

What if a site uses CAPTCHA?

Try to avoid triggering CAPTCHAs (slow down requests, rotate User-Agents)
For unavoidable CAPTCHAs, consider 2Captcha or Anti-Captcha services
For automated CAPTCHA solving, you usually need a paid service or ML model

Happy scraping! 🐍🕷️

Ahmad Hussain
