Content Extractor

extractContent() and extractFromHtml() extract the main text content from a web page, removing noise elements like navigation, ads, and sidebars.

Basic Usage

import { extractContent, extractFromHtml } from 'web-meta-scraper';

// From URL
const result = await extractContent('https://example.com/article');

// From HTML string (named differently so both calls can share one scope)
const fromHtml = extractFromHtml('<html>...</html>');

console.log(result.content);       // "Article text content..."
console.log(result.wordCount);     // 1234
console.log(result.language);      // "en"
console.log(result.metadata);      // { title: "Article Title", description: "..." }

Result Structure

interface ExtractContentResult {
  content: string;                                  // Extracted main text
  metadata: { title: string; description: string }; // Page title and meta description
  wordCount: number;                                // Word count (CJK-aware)
  language: string;                                 // Language from html lang attribute
}

How It Works

1. Metadata extraction: reads <title> and <meta name="description"> before any other processing.
2. Noise removal: removes <script>, <style>, <nav>, <header>, <footer>, <aside>, ad containers, sidebars, popups, and other non-content elements.
3. Content area detection: searches for the main content area using these selectors in order: <article>, <main>, [role="main"], .content, .post-content, .entry-content, .article-body, #content. Falls back to <body> if none match.
4. Text normalization: collapses whitespace and ensures block elements produce line breaks.
5. Word counting: counts whitespace-separated words for Latin scripts and individual characters for CJK (Chinese, Japanese, Korean).
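
The text normalization and word counting steps above can be illustrated with a small standalone sketch. This is not the library's internal code: normalizeText, countWords, and the CJK_CHAR character ranges are hypothetical names and assumptions chosen for illustration, and the real implementation may treat punctuation and rarer scripts differently.

```typescript
// Illustrative sketch only; these helpers are hypothetical, not web-meta-scraper internals.

// Assumed CJK ranges: Hiragana/Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

/** Collapse runs of spaces and tabs within each line and drop empty lines. */
function normalizeText(text: string): string {
  return text
    .split(/\r?\n/)
    .map((line) => line.replace(/[ \t]+/g, ' ').trim())
    .filter((line) => line.length > 0)
    .join('\n');
}

/** Count each CJK character as one word, plus whitespace-separated tokens for everything else. */
function countWords(text: string): number {
  const cjkCount = (text.match(CJK_CHAR) ?? []).length;
  // Replace CJK characters with spaces so they also act as token boundaries.
  const rest = text.replace(CJK_CHAR, ' ').trim();
  const otherCount = rest === '' ? 0 : rest.split(/\s+/).length;
  return cjkCount + otherCount;
}
```

Under these assumptions, countWords('Hello world') gives 2 and countWords('Hello 世界') gives 3, since each of the two Han characters counts as one word. Punctuation is handled naively here: a standalone punctuation mark still counts as a token.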

Use Cases

AI agents reading web page content for summarization or analysis
Content indexing for search or categorization
Readability analysis with word count and language detection
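
As one possible take on the readability use case, the sketch below turns wordCount into a rough reading-time estimate. The estimateReadingMinutes helper and the 200 words-per-minute default are assumptions for illustration, not part of web-meta-scraper. The interface is a local copy of the documented result shape so that the snippet stands alone.

```typescript
// Local copy of the documented result shape, so this sketch is self-contained.
interface ExtractContentResult {
  content: string;
  metadata: { title: string; description: string };
  wordCount: number;
  language: string;
}

// Assumed average reading speed; a hypothetical default, tune it for your audience.
const WORDS_PER_MINUTE = 200;

/** Estimate reading time in whole minutes (hypothetical helper, not a library export). */
function estimateReadingMinutes(
  result: ExtractContentResult,
  wpm: number = WORDS_PER_MINUTE,
): number {
  if (result.wordCount <= 0) return 0;
  // Round up, and report at least one minute for any non-empty page.
  return Math.max(1, Math.ceil(result.wordCount / wpm));
}
```

Because CJK text is counted per character, a single words-per-minute figure is only a rough guide. Callers could choose wpm based on result.language.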
