Monday, May 18, 2026
Connect 4 Programming

Python Web Scraping with BeautifulSoup: Complete Beginner’s Guide 2026

Web scraping is one of the most useful skills in modern Python programming. Whether you’re aggregating product prices, collecting research data, building a news dashboard, or training a machine learning model, web scraping unlocks access to the vast information available online. In this comprehensive guide, we’ll teach you how to scrape websites with Python and BeautifulSoup — from absolute beginner to building production-ready scrapers.

🎯 What You’ll Learn

By the end of this guide, you’ll be able to:

  • Install and set up Python web scraping tools
  • Extract any element from any webpage
  • Handle errors, retries, and rate limits
  • Scrape multiple pages efficiently
  • Work with dynamic JavaScript-heavy sites
  • Respect website rules (robots.txt and rate limits)
  • Save scraped data to CSV, JSON, or databases

📦 Installation: Setting Up Your Environment

First, let’s install the libraries we need. Open your terminal and run:

pip install requests beautifulsoup4 lxml pandas

What each library does:

  • requests: Sends HTTP requests to download web pages
  • beautifulsoup4: Parses HTML and extracts data
  • lxml: Fast HTML parser (typically faster than Python’s built-in html.parser)
  • pandas: Saves data to CSV/Excel easily

🚀 Your First Scraper in 10 Lines

Let’s scrape a sample website to extract all article titles. Here’s the simplest possible scraper:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder: this page has no <h2> tags, so use a page with headings
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

titles = soup.find_all("h2")
for title in titles:
    print(title.text.strip())

That’s it! In 10 lines, you’ve built a working web scraper. Let’s break down what’s happening:

  1. Send the request: requests.get(url) downloads the HTML
  2. Parse the HTML: BeautifulSoup(response.text, "lxml") turns the text into a navigable tree
  3. Find elements: soup.find_all("h2") returns all <h2> tags
  4. Extract text: .text.strip() gets the clean text inside each tag
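
To try these four steps without touching the network, the same parsing calls can be run on an inline HTML string. This is only a minimal sketch: the HTML snippet is made up for illustration, and it uses Python’s built-in "html.parser" so it runs even when lxml is not installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for response.text
html = """
<html><body>
  <h1>My Blog</h1>
  <h2>  First post </h2>
  <h2>Second <em>post</em></h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of Tag objects;
# .text joins the text of nested tags, .strip() trims the whitespace
titles = [h2.text.strip() for h2 in soup.find_all("h2")]
print(titles)
```

Swapping the inline string for response.text gives the same result on a live page that actually contains <h2> headings.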

🔍 Finding Elements: The Core Skills

BeautifulSoup gives you several ways to locate elements:

By tag name

soup.find("h1")          # First <h1> on the page
soup.find_all("p")       # All <p> tags as a list

By class

soup.find("div", class_="post-title")     # Note: class_ (with underscore) — class is reserved in Python
soup.find_all("a", class_="external-link")

By ID

soup.find(id="main-content")
soup.select_one("#main-content")           # Same thing, CSS selector style

By attribute

soup.find("img", {"alt": "Logo"})
soup.find_all("a", href=True)              # All links with href attribute

Using CSS selectors (most powerful)

soup.select("div.post > h2.title")         # h2.title that is a direct child of div.post
soup.select("a[href^='https']")            # Links starting with https
soup.select("li:nth-of-type(2)")           # Second <li> in each parent

💡 Pro tip: CSS selectors (select() and select_one()) are usually cleaner and more powerful than find()/find_all(). Learn them and you’ll write better scrapers.
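
The difference between the child combinator (>) and the descendant combinator (a space) is easy to miss, so here is a small sketch on made-up HTML. The markup and class names are assumptions chosen for the demonstration, not taken from any real site.

```python
from bs4 import BeautifulSoup

html = """
<div class="post">
  <h2 class="title">Direct child</h2>
  <section><h2 class="title">Nested deeper</h2></section>
  <ul>
    <li><a href="https://example.com/a">secure</a></li>
    <li><a href="http://example.com/b">plain</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" matches only direct children, so the nested <h2> is skipped
direct = [h.text for h in soup.select("div.post > h2.title")]

# A space (descendant combinator) matches at any depth
any_depth = [h.text for h in soup.select("div.post h2.title")]

# ^= matches attribute values by prefix
secure_links = [a["href"] for a in soup.select("a[href^='https']")]

# select_one returns the first match, or None when nothing matches
second_item = soup.select_one("li:nth-of-type(2)").text
missing = soup.select_one("table.prices")
```

Because select_one() can return None, check the result before reading .text on pages whose structure you do not control.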

📋 Real Example: Scraping a Quote Website

Let’s scrape quotes.toscrape.com — a website specifically designed for practicing web scraping.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

quotes_data = []

for quote_div in soup.find_all("div", class_="quote"):
    text = quote_div.find("span", class_="text").text
    author = quote_div.find("small", class_="author").text
    tags = [tag.text for tag in quote_div.find_all("a", class_="tag")]

    quotes_data.append({
        "text": text,
        "author": author,
        "tags": ", ".join(tags)
    })

# Save to CSV
df = pd.DataFrame(quotes_data)
df.to_csv("quotes.csv", index=False, encoding="utf-8")
print(f"Saved {len(quotes_data)} quotes to quotes.csv")

Run this and you’ll have a clean CSV file with quotes, authors, and tags. It is a compact, working scraper, though it still lacks the timeouts, error handling, and request delays that production code needs.
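
One caveat with the loop above: find() returns None when an element is missing, so chaining .text raises AttributeError on pages with incomplete markup. A minimal sketch of a defensive helper follows. The name text_or_default is hypothetical (it is not part of BeautifulSoup), and the HTML is made up so that the second quote has no author.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, selector, default=""):
    """Return stripped text of the first match for a CSS selector, or default."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default

html = """
<div class="quote"><span class="text">"A"</span><small class="author">Ann</small></div>
<div class="quote"><span class="text">"B"</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = [
    {
        "text": text_or_default(q, "span.text"),
        "author": text_or_default(q, "small.author", default="Unknown"),
    }
    for q in soup.select("div.quote")
]
```

Whether to substitute a default or skip the record entirely depends on how the data will be used; the default here is just one option.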

🔄 Scraping Multiple Pages (Pagination)

Most useful scraping involves multiple pages. Here’s the pattern:

import requests
from bs4 import BeautifulSoup
import time

all_quotes = []
base_url = "http://quotes.toscrape.com/page/{}/"
page = 1

while True:
    url = base_url.format(page)
    response = requests.get(url)

    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.text, "lxml")
    quotes = soup.find_all("div", class_="quote")

    if not quotes:  # Empty page = we're done
        break

    for quote_div in quotes:
        all_quotes.append({
            "text": quote_div.find("span", class_="text").text,
            "author": quote_div.find("small", class_="author").text
        })

    print(f"Scraped page {page}, total quotes: {len(all_quotes)}")
    page += 1
    time.sleep(1)  # Be polite — wait 1 second between requests

print(f"Done! Total: {len(all_quotes)} quotes")
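
An alternative to counting page numbers is to follow the page’s own "Next" link until it disappears. The sketch below assumes the link sits at li.next > a (this appears to match quotes.toscrape.com, but treat it as an assumption and inspect the page yourself). It uses a made-up in-memory dictionary of pages in place of real HTTP requests so it can run offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up pages standing in for HTTP responses
FAKE_SITE = {
    "http://site.test/": '<div class="quote">q1</div><li class="next"><a href="/page/2/">Next</a></li>',
    "http://site.test/page/2/": '<div class="quote">q2</div><div class="quote">q3</div>',
}

def crawl(start_url, fetch):
    """Follow 'next' links from start_url; fetch(url) must return HTML text."""
    url, seen, quotes = start_url, set(), []
    while url and url not in seen:   # 'seen' guards against link loops
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        quotes += [q.get_text(strip=True) for q in soup.select("div.quote")]
        next_link = soup.select_one("li.next > a")
        # urljoin resolves relative hrefs like "/page/2/" against the current URL
        url = urljoin(url, next_link["href"]) if next_link else None
    return quotes

all_quotes = crawl("http://site.test/", FAKE_SITE.__getitem__)
```

In a real scraper, fetch would wrap requests.get() and the loop would still pause between pages.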

Why time.sleep(1)?

This is critical: always add delays between requests. Without them, you could:

  • Overload the server (bad for them, bad for you)
  • Get your IP banned
  • Violate the website’s terms of service

1 second between requests is a polite default. For larger sites, you can go faster (0.5s or 0.2s). For smaller sites, slower (2-5s).
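
A fixed delay makes the request pattern perfectly regular, so a common variation is to add a small random jitter on top of the base delay. The helper below is a sketch under that assumption. polite_sleep is a hypothetical name, and the sleep parameter exists only so the demonstration can skip the actual waiting.

```python
import random
import time

def polite_sleep(base=1.0, jitter=0.5, sleep=time.sleep):
    """Pause for base seconds plus a random extra of up to `jitter` seconds."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay

# Demonstration with a no-op sleep so the example finishes instantly
delays = [polite_sleep(base=1.0, jitter=0.5, sleep=lambda s: None) for _ in range(5)]
```

Inside a scraping loop, calling polite_sleep() with its defaults would take the place of time.sleep(1).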

🛡️ Handling Errors Like a Pro

Real-world scraping needs robust error handling:

import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException

def fetch_page(url, max_retries=3):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raises HTTPError for 4xx/5xx
            return response.text
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s (doubles with each retry)
            else:
                print(f"Failed to fetch {url} after {max_retries} attempts")
                return None

html = fetch_page("http://quotes.toscrape.com")
if html:
    soup = BeautifulSoup(html, "lxml")
    # Continue scraping...

This adds:

  • User-Agent header: Some sites block the default python-requests User-Agent
  • Timeout: Don’t wait forever for slow servers (10 seconds max)
  • Retries with backoff: Wait longer between each retry
  • Graceful failure: Returns None instead of crashing
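
The same retry-with-backoff idea can also be delegated to the HTTP layer by mounting urllib3’s Retry on a requests.Session. This is a sketch rather than a drop-in replacement: make_session is a hypothetical helper name, the status list is one reasonable choice, and the User-Agent string is a placeholder to replace with your own.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff=1.0):
    """Build a Session that retries transient failures automatically."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,                      # grows the wait between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry only on these statuses
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    # Hypothetical identifying User-Agent; replace with your own contact details
    session.headers["User-Agent"] = "my-scraper/0.1 (contact: you@example.com)"
    return session

session = make_session()
# Usage (needs network): session.get("http://quotes.toscrape.com", timeout=10)
```

One design difference from the hand-written loop: a status such as 404 is not in the list, so it is returned straight away instead of being retried. The timeout still has to be passed on each request.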

🎭 Sites Need JavaScript? Use Selenium

BeautifulSoup only sees the HTML that’s returned from the server. Modern sites built with React, Vue, or Angular load content dynamically with JavaScript. For those, you need a browser automation tool like Selenium or Playwright.

Quick Selenium example:

pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless")  # Run without opening browser window

driver = webdriver.Chrome(options=options)
driver.get("https://example-spa-website.com")
time.sleep(3)  # Wait for JavaScript to load

soup = BeautifulSoup(driver.page_source, "lxml")
# Now scrape soup as usual
driver.quit()

Selenium is slower than requests but handles JavaScript. Use it only when needed.

🤖 Respecting robots.txt

Most websites publish a /robots.txt file that tells crawlers which paths they may and may not fetch. Always check it before scraping:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/products"):
    # OK to scrape
    pass
else:
    print("This URL is disallowed by robots.txt")

If a site says "Disallow: /admin/" — don’t scrape /admin/. It’s a request, not law, but respecting it keeps you ethical and safe.
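
To see how the parser interprets rules without fetching anything, RobotFileParser.parse() accepts the file’s lines directly. The rules below are made up for illustration. Crawl-delay is a non-standard directive that only some sites set, so crawl_delay() may well return None on real sites.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
rules = """
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse lines directly instead of downloading with read()

allowed_products = rp.can_fetch("*", "https://example.com/products")
allowed_admin = rp.can_fetch("*", "https://example.com/admin/users")
delay = rp.crawl_delay("*")  # None when the file sets no Crawl-delay

print(allowed_products, allowed_admin, delay)
```

When a delay is present, it is a natural value to pass to time.sleep() between requests.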

💾 Saving Data: CSV, JSON, or Database

To CSV (most common)

import pandas as pd
df = pd.DataFrame(scraped_data)
df.to_csv("data.csv", index=False, encoding="utf-8-sig")  # utf-8-sig opens correctly in Excel

To JSON

import json
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(scraped_data, f, ensure_ascii=False, indent=2)

To SQLite database

import sqlite3
import pandas as pd
conn = sqlite3.connect("scraped.db")
df = pd.DataFrame(scraped_data)
df.to_sql("quotes", conn, if_exists="append", index=False)
conn.close()
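
One thing to watch with if_exists="append" is that running the scraper twice stores every row twice. A possible way around this, sketched here with the standard-library sqlite3 module alone, is a UNIQUE column plus INSERT OR IGNORE. The sample records and the choice of text as the unique key are assumptions for the demonstration.

```python
import sqlite3

# Made-up records in the same shape the scrapers above produce
scraped_data = [
    {"text": "Quote one", "author": "Ann"},
    {"text": "Quote two", "author": "Bob"},
]

conn = sqlite3.connect(":memory:")   # use a filename such as "scraped.db" to persist
conn.execute("CREATE TABLE IF NOT EXISTS quotes (text TEXT UNIQUE, author TEXT)")

# INSERT OR IGNORE skips rows whose text already exists, so re-runs don't duplicate
insert_sql = "INSERT OR IGNORE INTO quotes (text, author) VALUES (:text, :author)"
conn.executemany(insert_sql, scraped_data)
conn.executemany(insert_sql, scraped_data)   # simulate running the scraper twice
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
authors = [r[0] for r in conn.execute("SELECT author FROM quotes ORDER BY author")]
conn.close()
```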

🎯 Best Practices Cheatsheet

  1. Always identify yourself with a User-Agent header
  2. Always add delays (time.sleep) between requests
  3. Always handle errors with try/except
  4. Always check robots.txt before scraping a new site
  5. Avoid hammering a server: on small sites, fetch pages one at a time instead of sending concurrent requests
  6. Cache responses locally during development to avoid re-downloading
  7. Use sessions for sites requiring login: session = requests.Session()
  8. Rotate User-Agents for large-scale scraping
  9. Use proxies if your IP gets blocked
  10. Document your scraper — websites change, your code needs to be readable
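
Tip 6 (caching during development) can be sketched with the standard library alone. cached_fetch is a hypothetical helper: it takes whatever fetch function you already have and stores each page on disk under a hash of its URL. The fake fetcher below exists only to show that the second call never reaches the network.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(url, fetch, cache_dir):
    """Return HTML for url, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length filename
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html

# Demonstration with a fake fetcher that records how often it is called
calls = []
def fake_fetch(url):
    calls.append(url)
    return "<h1>hello</h1>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/", fake_fetch, tmp)
    second = cached_fetch("https://example.com/", fake_fetch, tmp)
```

This cache never expires, which suits development runs; delete the cache directory when you need fresh pages.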

⚠️ Legal & Ethical Considerations

Before scraping, ask yourself:

  • ✅ Is this data publicly available? (no login required)
  • ✅ Am I respecting robots.txt?
  • ✅ Am I using reasonable rate limits?
  • ✅ Is my scraping for personal/research use or commercial gain?

Some sites explicitly prohibit scraping in their Terms of Service. Examples include LinkedIn, Facebook, and certain news websites. Scraping them could lead to IP bans, legal warnings, or worse.

Safe to scrape: Public product catalogs, news headlines (with attribution), public government data, your own social media data.

Risky: Personal user data, content behind login walls, sites with explicit no-scraping policies.

🚀 What’s Next?

Now that you’ve mastered the basics, consider exploring:

  1. Scrapy framework: For large-scale, production scrapers (faster than requests + BeautifulSoup)
  2. Async scraping with aiohttp + asyncio: often many times faster when fetching lots of URLs, because requests overlap instead of running one by one
  3. Playwright: Modern alternative to Selenium, faster and more reliable
  4. API alternatives: Many sites offer official APIs — always check first
  5. Cloud scraping platforms: Bright Data, ScraperAPI, Apify for enterprise

📚 Recommended Reading

  • Official BeautifulSoup docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
  • Real Python web scraping tutorials
  • "Web Scraping with Python" by Ryan Mitchell (book)

🎓 Practice Project Ideas

Build these to cement your learning:

  1. News aggregator: Scrape headlines from 3-5 news sites and email them daily
  2. Price tracker: Monitor a product on Amazon and alert when price drops
  3. Job board scraper: Collect Python jobs from multiple boards
  4. Movie database: Scrape IMDb ratings for your favorite genre
  5. Real estate tracker: Aggregate apartment listings from local sites

❓ Frequently Asked Questions

Is web scraping legal?

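In many jurisdictions, scraping publicly available data is generally permitted. However, violating a site’s Terms of Service can have civil consequences. In hiQ Labs v. LinkedIn, a US appeals court held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, although hiQ was later found to have breached LinkedIn’s user agreement. Laws also vary by country.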

How fast can I scrape?

Depends on the site. As a rule of thumb:

  • Small sites: 1 request per 2-5 seconds
  • Medium sites: 1 request per 1 second
  • Large sites with APIs: 5-10 requests per second (or follow API rate limits)

Why does my scraper return empty results?

Most common causes:

  1. The site loads content via JavaScript (use Selenium)
  2. The site blocks bots (add User-Agent header)
  3. The site requires login (use sessions or cookies)
  4. The HTML structure changed (re-inspect the page)
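
A quick way to narrow down which cause applies is to inspect the HTML your script actually received. The function below is a rough heuristic sketch, not a reliable detector: diagnose is a hypothetical name, and the 200-character threshold is an arbitrary assumption.

```python
from bs4 import BeautifulSoup

def diagnose(html, selector):
    """Return rough hints about why `selector` might match nothing in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    matches = len(soup.select(selector))
    hints = []
    if matches == 0:
        hints.append("selector matched nothing: re-inspect the page structure")
        # Heuristic only: script tags plus very little text suggests JS rendering
        if len(soup.find_all("script")) > 0 and len(soup.get_text(strip=True)) < 200:
            hints.append("page is mostly scripts: content may be rendered by JavaScript")
    return {"matches": matches, "hints": hints}

# Made-up single-page-app shell: an empty root element and one script tag
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
report = diagnose(spa_shell, "div.quote")
```

Printing response.status_code and the first few hundred characters of response.text alongside this report usually shows whether you were served a block page, a login page, or an empty JavaScript shell.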

Should I use Scrapy or BeautifulSoup?

  • BeautifulSoup for learning and small scripts (1-100 pages)
  • Scrapy for production scrapers (1000+ pages, multiple sites)

What if a site uses CAPTCHA?

  • Try to avoid triggering CAPTCHAs (slow down requests, rotate User-Agents)
  • For unavoidable CAPTCHAs, consider 2Captcha or Anti-Captcha services
  • For automated CAPTCHA solving, you usually need a paid service or ML model

Happy scraping! 🐍🕷️

Ahmad Hussain