Web scraping is one of the most useful skills in modern Python programming. Whether you’re aggregating product prices, collecting research data, building a news dashboard, or training a machine learning model, web scraping unlocks access to the vast information available online. In this comprehensive guide, we’ll teach you how to scrape websites with Python and BeautifulSoup — from absolute beginner to building production-ready scrapers.
🎯 What You’ll Learn
By the end of this guide, you’ll be able to:
- Install and set up Python web scraping tools
- Extract any element from any webpage
- Handle errors, retries, and rate limits
- Scrape multiple pages efficiently
- Work with dynamic JavaScript-heavy sites
- Respect website rules (robots.txt and rate limits)
- Save scraped data to CSV, JSON, or databases
📦 Installation: Setting Up Your Environment
First, let’s install the libraries we need. Open your terminal and run:
pip install requests beautifulsoup4 lxml pandas
What each library does:
- requests: Sends HTTP requests to download web pages
- beautifulsoup4: Parses HTML and extracts data
- lxml: Fast HTML parser (faster than the default html.parser)
- pandas: Saves data to CSV/Excel easily
🚀 Your First Scraper in 10 Lines
Let’s scrape a sample website to extract all article titles. Here’s the simplest possible scraper (example.com is only a placeholder, so swap in a page that actually contains <h2> headings):
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
titles = soup.find_all("h2")
for title in titles:
    print(title.text.strip())
That’s it! In 10 lines, you’ve built a working web scraper. Let’s break down what’s happening:
- Send the request: requests.get(url) downloads the HTML
- Parse the HTML: BeautifulSoup(response.text, "lxml") turns the text into a navigable tree
- Find elements: soup.find_all("h2") returns all <h2> tags
- Extract text: .text.strip() gets the clean text inside each tag
🔍 Finding Elements: The Core Skills
BeautifulSoup gives you several ways to locate elements:
By tag name
soup.find("h1") # First <h1> on the page
soup.find_all("p") # All <p> tags as a list
By class
soup.find("div", class_="post-title") # Note: class_ (with underscore) — class is reserved in Python
soup.find_all("a", class_="external-link")
By ID
soup.find(id="main-content")
soup.select_one("#main-content") # Same thing, CSS selector style
By attribute
soup.find("img", {"alt": "Logo"})
soup.find_all("a", href=True) # All links with href attribute
Using CSS selectors (most powerful)
soup.select("div.post > h2.title") # <h2 class="title"> that is a direct child of <div class="post">
soup.select("a[href^='https']") # Links starting with https
soup.select("li:nth-of-type(2)") # Second <li> in each parent
💡 Pro tip: CSS selectors (select() and select_one()) are usually cleaner and more powerful than find()/find_all(). Learn them and you’ll write better scrapers.
📋 Real Example: Scraping a Quote Website
Let’s scrape quotes.toscrape.com — a website specifically designed for practicing web scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
quotes_data = []
for quote_div in soup.find_all("div", class_="quote"):
    text = quote_div.find("span", class_="text").text
    author = quote_div.find("small", class_="author").text
    tags = [tag.text for tag in quote_div.find_all("a", class_="tag")]
    quotes_data.append({
        "text": text,
        "author": author,
        "tags": ", ".join(tags)
    })
# Save to CSV
df = pd.DataFrame(quotes_data)
df.to_csv("quotes.csv", index=False, encoding="utf-8")
print(f"Saved {len(quotes_data)} quotes to quotes.csv")
Run this and you’ll have a clean CSV file with quotes, authors, and tags. It is a complete, working scraper, though it still lacks the error handling and delays that production code needs.
🔄 Scraping Multiple Pages (Pagination)
Most useful scraping involves multiple pages. Here’s the pattern:
import requests
from bs4 import BeautifulSoup
import time
all_quotes = []
base_url = "http://quotes.toscrape.com/page/{}/"
page = 1
while True:
    url = base_url.format(page)
    response = requests.get(url)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, "lxml")
    quotes = soup.find_all("div", class_="quote")
    if not quotes: # Empty page = we're done
        break
    for quote_div in quotes:
        all_quotes.append({
            "text": quote_div.find("span", class_="text").text,
            "author": quote_div.find("small", class_="author").text
        })
    print(f"Scraped page {page}, total quotes: {len(all_quotes)}")
    page += 1
    time.sleep(1) # Be polite: wait 1 second between requests
print(f"Done! Total: {len(all_quotes)} quotes")
Why time.sleep(1)?
This is critical: always add delays between requests. Without them, you could:
- Overload the server (bad for them, bad for you)
- Get your IP banned
- Violate the website’s terms of service
One second between requests is a polite default. Larger sites can usually handle a faster pace (0.2-0.5s between requests), while smaller sites deserve a slower one (2-5s).
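A fixed time.sleep(1) adds its delay on top of however long the request itself took. One way to keep a steady pace instead is a small throttle that sleeps only for the time still missing from the interval. This is a minimal sketch: the Throttle class, its min_interval parameter, and the injectable clock and sleep arguments are hypothetical helpers for illustration, not part of requests or BeautifulSoup.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to honor the interval; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Demo with a short interval so it finishes quickly
throttle = Throttle(min_interval=0.2)
for page in range(1, 4):
    slept = throttle.wait()  # call this right before each request
    print(f"page {page}: waited {slept:.2f}s before requesting")
```

In the pagination loop, calling throttle.wait() immediately before requests.get(url) would take the place of the trailing time.sleep(1).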
🛡️ Handling Errors Like a Pro
Real-world scraping needs robust error handling:
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException
def fetch_page(url, max_retries=3):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status() # Raises HTTPError for 4xx/5xx
            return response.text
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt) # Exponential backoff: 1s, then 2s with the default max_retries=3
            else:
                print(f"Failed to fetch {url} after {max_retries} attempts")
    return None
html = fetch_page("http://quotes.toscrape.com")
if html:
    soup = BeautifulSoup(html, "lxml")
    # Continue scraping...
This adds:
- User-Agent header: Many sites block the default python-requests User-Agent
- Timeout: Don’t wait forever for slow servers (give up after 10 seconds without a response)
- Retries with backoff: Wait longer between each retry
- Graceful failure: Returns None instead of crashing
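To see what the backoff costs in waiting time, it helps to list the delays that time.sleep(2 ** attempt) produces. The loop sleeps only between attempts, never after the last one, so max_retries attempts yield max_retries - 1 waits. The backoff_delays helper below is a hypothetical illustration of that arithmetic, not part of fetch_page:

```python
def backoff_delays(max_retries=3, base=1):
    """Seconds waited between attempts: base * 2**attempt, with no wait after the last attempt."""
    return [base * 2 ** attempt for attempt in range(max_retries - 1)]


print(backoff_delays(3))       # [1, 2]
print(backoff_delays(5))       # [1, 2, 4, 8]
print(sum(backoff_delays(5)))  # 15 seconds of waiting in the worst case
```

Because the delays double, raising max_retries quickly increases the worst-case wait, so a small retry count is usually enough.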
🎭 Sites Need JavaScript? Use Selenium
BeautifulSoup only sees the HTML that’s returned from the server. Modern sites built with React, Vue, or Angular load content dynamically with JavaScript. For those, you need a browser automation tool like Selenium or Playwright.
Quick Selenium example:
pip install selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
options = Options()
options.add_argument("--headless") # Run without opening browser window
driver = webdriver.Chrome(options=options)
driver.get("https://example-spa-website.com")
time.sleep(3) # Wait for JavaScript to load
soup = BeautifulSoup(driver.page_source, "lxml")
# Now scrape soup as usual
driver.quit()
Selenium is slower than requests but handles JavaScript. Use it only when needed.
🤖 Respecting robots.txt
Most websites publish a /robots.txt file that tells crawlers which paths they are asked to stay away from. Always check it before scraping:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if rp.can_fetch("*", "https://example.com/products"):
    # OK to scrape
    pass
else:
    print("This URL is disallowed by robots.txt")
If a site says "Disallow: /admin/", don’t scrape /admin/. It’s a request, not law, but respecting it keeps you ethical and safe.
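To try the parser without touching a real site, the rules can be fed in directly with parse() instead of read(). In this sketch the robots.txt content is made up for illustration; it also shows crawl_delay(), which returns the Crawl-delay value when a site sets one.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())  # parse() takes an iterable of lines

allowed_products = rp.can_fetch("*", "https://example.com/products")
allowed_admin = rp.can_fetch("*", "https://example.com/admin/users")
delay = rp.crawl_delay("*")  # None when the file sets no Crawl-delay

print(allowed_products)  # True
print(allowed_admin)     # False
print(delay)             # 2
```

When crawl_delay() returns a number, using it as the pause between requests is a reasonable way to follow the site’s stated preference.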
💾 Saving Data: CSV, JSON, or Database
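Where pandas is not available, all three targets can also be reached with the standard library alone. This is a minimal sketch, assuming the scraped data is a list of flat dictionaries that share the same keys; the sample rows, the temporary output folder, and the quotes table layout are made up for illustration.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Made-up sample rows standing in for real scraped data
scraped_data = [
    {"text": "Quote one", "author": "Author A"},
    {"text": "Quote two", "author": "Author B"},
]

out_dir = Path(tempfile.mkdtemp())  # temporary folder; use your own path in practice

# CSV: DictWriter takes the column names from the first row's keys
with open(out_dir / "data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(scraped_data[0].keys()))
    writer.writeheader()
    writer.writerows(scraped_data)

# JSON: ensure_ascii=False keeps non-ASCII characters readable
with open(out_dir / "data.json", "w", encoding="utf-8") as f:
    json.dump(scraped_data, f, ensure_ascii=False, indent=2)

# SQLite: named placeholders map each dictionary key to a column
conn = sqlite3.connect(out_dir / "scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)")
conn.executemany("INSERT INTO quotes VALUES (:text, :author)", scraped_data)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
conn.close()

print(f"Wrote {row_count} rows to {out_dir}")
```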
To CSV (most common)
import pandas as pd
df = pd.DataFrame(scraped_data)
df.to_csv("data.csv", index=False, encoding="utf-8-sig") # utf-8-sig opens correctly in Excel
To JSON
import json
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(scraped_data, f, ensure_ascii=False, indent=2)
To SQLite database
import sqlite3
import pandas as pd
conn = sqlite3.connect("scraped.db")
df = pd.DataFrame(scraped_data)
df.to_sql("quotes", conn, if_exists="append", index=False)
conn.close()
🎯 Best Practices Cheatsheet
- Always identify yourself with a User-Agent header
- Always add delays (time.sleep) between requests
- Always handle errors with try/except
- Always check robots.txt before scraping a new site
- Avoid hammering small sites: fetch pages one at a time instead of sending concurrent requests
- Cache responses locally during development to avoid re-downloading
- Use sessions for sites requiring login: session = requests.Session()
- Rotate User-Agents for large-scale scraping
- Use proxies if your IP gets blocked
- Document your scraper — websites change, your code needs to be readable
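The caching tip is easy to apply with a few lines of standard-library code. The sketch below stores each page on disk under a hash of its URL and downloads only on a cache miss. The names cached_fetch and fake_fetch are hypothetical; fake_fetch stands in for a real downloader such as the fetch_page function shown earlier, and the sketch assumes the downloader returns a string.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the page body for `url`, calling `fetch(url)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html


# Stand-in for a real downloader; it records every call it receives
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html><h2>Hello</h2></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("http://quotes.toscrape.com", fake_fetch, tmp)
    second = cached_fetch("http://quotes.toscrape.com", fake_fetch, tmp)

print(len(calls))  # 1: the second call was served from disk
```

During development this lets you rerun the parsing code as often as you like without sending new requests; delete the cache folder when you need fresh pages.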
⚠️ Legal & Ethical Considerations
Before scraping, ask yourself:
- ✅ Is this data publicly available? (no login required)
- ✅ Am I respecting robots.txt?
- ✅ Am I using reasonable rate limits?
- ✅ Is my scraping for personal/research use or commercial gain?
Some sites explicitly prohibit scraping in their Terms of Service. Examples include LinkedIn, Facebook, and certain news websites. Scraping them could lead to IP bans, legal warnings, or worse.
Safe to scrape: Public product catalogs, news headlines (with attribution), public government data, your own social media data.
Risky: Personal user data, content behind login walls, sites with explicit no-scraping policies.
🚀 What’s Next?
Now that you’ve mastered the basics, consider exploring:
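- Scrapy framework: For large-scale, production scrapers (its built-in concurrency makes it faster than sequential requests + BeautifulSoup on big crawls)
- Async scraping with aiohttp + asyncio: Often far faster for many URLs, because requests overlap instead of running one at a time
- Playwright: Modern alternative to Selenium, often faster and more reliable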
- API alternatives: Many sites offer official APIs — always check first
- Cloud scraping platforms: Bright Data, ScraperAPI, Apify for enterprise
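The speedup from async scraping comes from overlapping the time spent waiting on the network. Here is a rough sketch of the pattern using only asyncio. The fake_fetch coroutine is a hypothetical stand-in that sleeps instead of making a real aiohttp call, and max_concurrent is an illustrative cap that keeps concurrency from turning into hammering the site.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP call (for example via aiohttp); it just waits briefly."""
    await asyncio.sleep(0.05)
    return f"<html>{url}</html>"


async def fetch_all(urls, fetch=fake_fetch, max_concurrent=3):
    """Fetch all URLs concurrently, with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"http://quotes.toscrape.com/page/{n}/" for n in range(1, 7)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))  # 6
```

With six URLs and a cap of three, the simulated downloads finish in about two rounds of waiting rather than six, which is where the time saving comes from.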
📚 Recommended Reading
- Official BeautifulSoup docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Real Python web scraping tutorials
- "Web Scraping with Python" by Ryan Mitchell (book)
🎓 Practice Project Ideas
Build these to cement your learning:
- News aggregator: Scrape headlines from 3-5 news sites and email them daily
- Price tracker: Monitor a product on Amazon and alert when price drops
- Job board scraper: Collect Python jobs from multiple boards
- Movie database: Scrape IMDb ratings for your favorite genre
- Real estate tracker: Aggregate apartment listings from local sites
❓ Frequently Asked Questions
Is web scraping legal?
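In many jurisdictions, scraping publicly available data is legal. However, violating a site’s Terms of Service can have civil consequences. In the well-known case hiQ Labs v. LinkedIn, the US Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, although a court later found that hiQ had breached LinkedIn’s user agreement. Laws vary by country.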
How fast can I scrape?
Depends on the site. As a rule of thumb:
- Small sites: 1 request per 2-5 seconds
- Medium sites: 1 request per 1 second
- Large sites with APIs: 5-10 requests per second (or follow API rate limits)
Why does my scraper return empty results?
Most common causes:
- The site loads content via JavaScript (use Selenium)
- The site blocks bots (add User-Agent header)
- The site requires login (use sessions or cookies)
- The HTML structure changed (re-inspect the page)
Should I use Scrapy or BeautifulSoup?
- BeautifulSoup for learning and small scripts (1-100 pages)
- Scrapy for production scrapers (1000+ pages, multiple sites)
What if a site uses CAPTCHA?
- Try to avoid triggering CAPTCHAs (slow down requests, rotate User-Agents)
- For unavoidable CAPTCHAs, consider 2Captcha or Anti-Captcha services
- For automated CAPTCHA solving, you usually need a paid service or ML model
Happy scraping! 🐍🕷️





