Content Extractor
extractContent() and extractFromHtml() extract the main text content from a web page, removing noise elements like navigation, ads, and sidebars.
Basic Usage
import { extractContent, extractFromHtml } from 'web-meta-scraper';
// From a URL
const result = await extractContent('https://example.com/article');
// From an HTML string (named differently so both declarations can share one scope)
const htmlResult = extractFromHtml('<html>...</html>');
console.log(result.content); // "Article text content..."
console.log(result.wordCount); // 1234
console.log(result.language); // "en"
console.log(result.metadata); // { title: "Article Title", description: "..." }
Result Structure
interface ExtractContentResult {
  content: string; // Extracted main text
  metadata: { title: string; description: string }; // Page title and meta description
  wordCount: number; // Word count (CJK-aware)
  language: string; // Language from html lang attribute
}
How It Works
- Metadata extraction: Extracts <title> and <meta name="description"> before processing.
- Noise removal: Removes <script>, <style>, <nav>, <header>, <footer>, <aside>, ad containers, sidebars, popups, and other non-content elements.
- Content area detection: Searches for the main content area using selectors in order: <article>, <main>, [role="main"], .content, .post-content, .entry-content, .article-body, #content. Falls back to <body> if none found.
- Text normalization: Collapses whitespace and ensures block elements produce line breaks.
- Word counting: Counts whitespace-separated words for Latin scripts and individual characters for CJK (Chinese, Japanese, Korean).
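The content area detection, text normalization, and word counting steps above can be sketched roughly as follows. This is an illustration only, not the library's actual implementation: pickContentSelector, normalizeText, and countWords are hypothetical names, the selector check is passed in as a callback so the sketch stays independent of any DOM library, and the CJK character ranges are an assumption about what "CJK-aware" covers.

```typescript
// Hypothetical helpers for illustration; not part of web-meta-scraper's API.

// Selectors tried in order, as listed in the content area detection step.
const CONTENT_SELECTORS: string[] = [
  'article', 'main', '[role="main"]', '.content',
  '.post-content', '.entry-content', '.article-body', '#content',
];

/** Return the first selector that matches, or 'body' when none does. */
function pickContentSelector(matches: (selector: string) => boolean): string {
  for (const selector of CONTENT_SELECTORS) {
    if (matches(selector)) return selector;
  }
  return 'body';
}

/** Collapse horizontal whitespace and cap blank lines, keeping line breaks. */
function normalizeText(raw: string): string {
  return raw
    .replace(/\r\n?/g, '\n')      // unify line endings
    .replace(/[ \t\f\v]+/g, ' ')  // collapse runs of spaces and tabs
    .replace(/ *\n */g, '\n')     // drop spaces around line breaks
    .replace(/\n{3,}/g, '\n\n')   // at most one blank line between blocks
    .trim();
}

// Assumed ranges: Hiragana, Katakana, Han ideographs, Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

/** Count each CJK character as one word, plus whitespace-separated tokens. */
function countWords(text: string): number {
  const cjkCount = (text.match(CJK_CHAR) || []).length;
  // Replace CJK characters with spaces so they do not glue tokens together.
  const rest = text.replace(CJK_CHAR, ' ').trim();
  const tokenCount = rest === '' ? 0 : rest.split(/\s+/).length;
  return cjkCount + tokenCount;
}
```

With a DOM available, the callback could be something like (selector) => document.querySelector(selector) !== null. Note that this sketch treats standalone punctuation as a token, which a production counter would probably filter out.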
Use Cases
- AI agents reading web page content for summarization or analysis
- Content indexing for search or categorization
- Readability analysis with word count and language detection
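As one possible take on the readability use case, the sketch below estimates reading time from wordCount. The ExtractContentResult shape is copied from the Result Structure section; estimateReadingMinutes and the default rate of 200 units per minute are assumptions made for illustration, not something the library provides.

```typescript
// Shape copied from the Result Structure section.
interface ExtractContentResult {
  content: string;
  metadata: { title: string; description: string };
  wordCount: number;
  language: string;
}

/**
 * Hypothetical helper: estimate reading time in whole minutes.
 * unitsPerMinute is an assumed reading rate; because wordCount counts
 * individual characters for CJK text, a different rate may suit those pages.
 */
function estimateReadingMinutes(
  result: ExtractContentResult,
  unitsPerMinute: number = 200,
): number {
  if (unitsPerMinute <= 0) {
    throw new RangeError('unitsPerMinute must be positive');
  }
  // Round up so any non-empty page reports at least one minute.
  return Math.ceil(result.wordCount / unitsPerMinute);
}

// Example input mirroring the values shown under Basic Usage.
const sample: ExtractContentResult = {
  content: 'Article text content...',
  metadata: { title: 'Article Title', description: '...' },
  wordCount: 1234,
  language: 'en',
};

console.log(`${sample.metadata.title}: ~${estimateReadingMinutes(sample)} min read`);
```

A caller could pick the rate based on result.language, for example by using a higher value when the lang attribute starts with zh, ja, or ko.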