Website Scraper & Analyzer

A CLI tool that collects and analyzes information about a website from a single URL.

Created on: May 19, 2025
This content has been translated by AI from the original Japanese version.

I want to develop a CLI tool that takes a URL and lists a wide range of information about that website.

Main Features

Recursive Link Collection

  • Collect links that share the origin of the starting page and follow them recursively
  • Fetch each pathname only once, so URLs that share a pathname are not revisited
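
The collection rule above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: `crawlSameOrigin`, the `LinkFetcher` type, and the `maxPages` cap are hypothetical names, and the traversal uses an explicit queue instead of literal recursion to avoid deep call stacks. Link extraction is passed in as a function so the crawl logic stays independent of the browser.

```typescript
// Hypothetical signature: returns the raw href values found on a page.
type LinkFetcher = (pageUrl: string) => Promise<string[]>;

// Crawl same-origin pages starting from startUrl, visiting each pathname once.
async function crawlSameOrigin(
  startUrl: string,
  fetchLinks: LinkFetcher,
  maxPages = 500, // assumed safety cap, not part of the original design
): Promise<string[]> {
  const origin = new URL(startUrl).origin;
  const seenPaths = new Set<string>();
  const visited: string[] = [];
  const queue: string[] = [startUrl];

  while (queue.length > 0 && visited.length < maxPages) {
    const current = new URL(queue.shift() as string);
    if (current.origin !== origin) continue; // skip external links
    if (seenPaths.has(current.pathname)) continue; // same pathname: only once
    seenPaths.add(current.pathname);
    visited.push(current.origin + current.pathname);

    const hrefs = await fetchLinks(current.href);
    for (const href of hrefs) {
      try {
        // Resolve relative links against the current page.
        queue.push(new URL(href, current.href).href);
      } catch {
        // Ignore hrefs that cannot be parsed as URLs.
      }
    }
  }
  return visited;
}
```

Because deduplication is keyed on the pathname, `/about` and `/about?ref=nav` count as the same page in this sketch.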

SEO Information Analysis

  • Retrieve SEO-related information for each page
    • Title, description, keywords
    • OGP information (og:title, og:description, og:image)
    • Twitter Card information
    • Heading structure analysis (h1-h6)
    • Image alt attribute checking
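
As a rough sketch of how the collected fields might be checked, the function below audits one page's data. The `PageSeo` shape, the `auditSeo` name, and the specific rules (exactly one h1, no skipped heading levels) are assumptions made for illustration, not rules stated in this plan.

```typescript
// Hypothetical shape for the per-page data listed above.
interface PageSeo {
  title?: string;
  description?: string;
  ogImage?: string;
  headings: number[]; // heading levels in document order, e.g. [1, 2, 3, 2]
  images: { src: string; alt?: string }[];
}

// Return human-readable findings for one page.
function auditSeo(page: PageSeo): string[] {
  const issues: string[] = [];
  if (!page.title?.trim()) issues.push("missing title");
  if (!page.description?.trim()) issues.push("missing description");
  if (!page.ogImage) issues.push("missing og:image");

  // Assumed rule: a page should have exactly one h1.
  const h1Count = page.headings.filter((level) => level === 1).length;
  if (h1Count !== 1) issues.push(`expected exactly one h1, found ${h1Count}`);

  // Assumed rule: a heading may go at most one level deeper than the previous one.
  let previous = 0;
  for (const level of page.headings) {
    if (level > previous + 1) issues.push(`heading level skipped before h${level}`);
    previous = level;
  }

  // alt="" is valid for decorative images, so only a missing attribute is flagged.
  for (const image of page.images) {
    if (image.alt === undefined) issues.push(`image without alt: ${image.src}`);
  }
  return issues;
}
```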

Visual Analysis

  • Take screenshots of pages
    • Desktop viewport support (mobile viewport planned)
    • Full page capture
    • Highlighting of important sections
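
A full-page desktop capture could look roughly like this. The sketch depends only on a small `PageLike` interface that names three methods Puppeteer's `Page` provides (`setViewport`, `goto`, `screenshot`), so it can be exercised without launching a browser. `captureFullPage`, `PageLike`, and the 1280x800 viewport are assumptions for illustration. Highlighting of important sections is not covered here.

```typescript
// Minimal subset of Puppeteer's Page methods used by this sketch.
interface PageLike {
  setViewport(viewport: { width: number; height: number }): Promise<unknown>;
  goto(url: string, options?: { waitUntil?: "load" | "networkidle2" }): Promise<unknown>;
  screenshot(options: { path: string; fullPage: boolean }): Promise<unknown>;
}

// Assumed desktop viewport; the plan does not specify dimensions.
const DESKTOP_VIEWPORT = { width: 1280, height: 800 };

// Navigate to the URL and save a screenshot of the entire scrollable page.
async function captureFullPage(
  page: PageLike,
  url: string,
  outPath: string,
): Promise<string> {
  await page.setViewport(DESKTOP_VIEWPORT);
  await page.goto(url, { waitUntil: "networkidle2" });
  await page.screenshot({ path: outPath, fullPage: true });
  return outPath;
}
```

With Puppeteer installed, the `page` argument would typically come from `puppeteer.launch()` followed by `browser.newPage()`, with `browser.close()` called once all captures are done.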

Data Output

  • Automatic sitemap.xml generation
  • SEO analysis report output
  • Link structure visualization
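
For the sitemap output, a minimal generator might look like the following. `buildSitemap` and `escapeXml` are hypothetical helper names. The sketch emits only `<loc>` entries and leaves out optional fields such as `<lastmod>`.

```typescript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap.xml document from crawled URLs (deduplicated and sorted).
function buildSitemap(urls: string[]): string {
  const unique = Array.from(new Set(urls)).sort();
  const entries = unique.map((url) => `  <url><loc>${escapeXml(url)}</loc></url>`);
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}
```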

Technical Implementation

CLI Application

  • Runs from the command line
  • Runs on Node.js
  • Headless browsing using Puppeteer
  • Output in various formats like JSON and CSV
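
The JSON and CSV output could be handled by a small formatter along these lines. `Row`, `toCsv`, and `formatRows` are illustrative names, and the CSV quoting follows the common convention of doubling embedded quotes. The real tool may well use a library instead.

```typescript
type Row = Record<string, string | number>;

// Quote a field only when it contains a comma, quote, or line break.
function csvField(value: string | number): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows as CSV with a header line; missing values become empty fields.
function toCsv(rows: Row[], columns: string[]): string {
  const lines = [columns.map(csvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => csvField(row[column] ?? "")).join(","));
  }
  return lines.join("\r\n");
}

// Pick the serialization based on the requested output format.
function formatRows(rows: Row[], columns: string[], format: "json" | "csv"): string {
  return format === "json" ? JSON.stringify(rows, null, 2) : toCsv(rows, columns);
}
```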

Use Cases

  • Competitor website analysis
  • SEO audit
  • Broken link checking
  • Content inventory creation
  • Automatic sitemap generation

Security Considerations

  • Respect robots.txt
  • Recommended for analyzing your own sites
  • Preserve copyright notices
  • Exclude personal information
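
To illustrate the robots.txt point, here is a deliberately simplified check. It reads only groups addressed to `User-agent: *` and treats each `Disallow` value as a plain path prefix. `Allow` rules, wildcards, and agent-specific groups are ignored, so a real implementation would need a fuller parser. Both function names are hypothetical.

```typescript
// Collect Disallow prefixes from groups that apply to all user agents.
function disallowedPrefixes(robotsTxt: string): string[] {
  const prefixes: string[] = [];
  let appliesToAll = false;
  let readingAgents = false; // true while consecutive User-agent lines are read

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === "user-agent") {
      if (!readingAgents) appliesToAll = false; // a new group starts here
      readingAgents = true;
      if (value === "*") appliesToAll = true;
    } else {
      readingAgents = false;
      // An empty Disallow value means nothing is disallowed.
      if (field === "disallow" && appliesToAll && value !== "") prefixes.push(value);
    }
  }
  return prefixes;
}

// True when no collected Disallow prefix matches the start of the pathname.
function isPathAllowed(robotsTxt: string, pathname: string): boolean {
  return !disallowedPrefixes(robotsTxt).some((prefix) => pathname.startsWith(prefix));
}
```

A crawler would call `isPathAllowed` before fetching each page and skip any path that returns false.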

Architecture

Monorepo Structure

  • CLI package
  • Account management site (with API)
  • Shared libraries

AI Features (Paid)

  • Automatic content analysis
  • SEO optimization suggestions
  • Competitive analysis reports
  • Available only when logged in through the CLI

Future Expansion Plans

  • Performance measurement features
  • Accessibility checking
  • Multi-site comparison analysis