cat.web: Web content classification for R

cat.web classifies web content with LLMs. It is a domain wrapper around cat.stack that adds automatic URL fetching and injects web context (source domain, content type, metadata) into the prompt.

cat.web wraps the Python catweb package via reticulate.

Installation

# From R-universe (recommended once published):
install.packages("cat.web", repos = "https://chrissoria.r-universe.dev")

# Or from a local clone:
devtools::install("path/to/cat.stack")
devtools::install("path/to/cat.web")

# Install the Python backend (run this in a shell, not in R):
# pip install cat-web
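
The Python backend can probably also be installed from within R. The sketch below assumes that the pip name `cat-web` and the module name `catweb` mentioned above are correct. `install_backend()` and `backend_available()` are hypothetical helper names, not part of cat.web.

```r
# Hypothetical helpers (not part of cat.web).
# Install the Python backend with pip into the environment reticulate uses.
install_backend <- function(envname = NULL) {
  reticulate::py_install("cat-web", envname = envname, pip = TRUE)
}

# TRUE when the Python module named in this README can be imported.
backend_available <- function() {
  reticulate::py_module_available("catweb")
}
```

Calling `backend_available()` before the first `classify()` call is one way to catch a missing backend early.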

Quick Start

Classify a list of URLs

library(cat.web)

urls <- c(
  "https://example.com/article-1",
  "https://example.com/article-2",
  "https://example.com/article-3"
)

results <- classify(
  input_data    = urls,
  categories    = c("News", "Opinion", "Tutorial"),
  source_domain = "example.com",
  content_type  = "blog post",
  api_key       = Sys.getenv("OPENAI_API_KEY")
)
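
`Sys.getenv()` returns an empty string when a variable is not set, so a missing key would otherwise only surface once the request fails. A minimal sketch of an early check, where `require_api_key()` is a hypothetical helper and not part of cat.web:

```r
# Hypothetical helper (not part of cat.web): read an API key from the
# environment and stop with a clear message if it is missing.
require_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var)  # "" when the variable is not set
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}
```

It can then be passed as `api_key = require_api_key()` in any of the calls in this section.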

Classify raw text (no fetching)

# `df` is assumed to be an existing data frame with an `article_text` column
results <- classify(
  input_data = df$article_text,
  categories = c("News", "Opinion", "Tutorial"),
  api_key    = Sys.getenv("OPENAI_API_KEY")
)
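
Since `input_data` accepts both URLs and raw text, it can help to check which entries look like URLs before sending them. How cat.web itself tells the two apart is not documented here. `looks_like_url()` is a hypothetical helper that only tests for an http(s) prefix.

```r
# Hypothetical helper (not part of cat.web): TRUE for strings that start
# with http:// or https://, FALSE otherwise.
looks_like_url <- function(x) {
  grepl("^https?://", x, ignore.case = TRUE)
}

inputs <- c(
  "https://example.com/article-1",
  "Plain article text pasted by hand."
)

# The first entry is flagged as a URL, the second is not.
url_flags <- looks_like_url(inputs)
```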

Discover categories from scraped pages

result <- extract(
  input_data    = urls,
  source_domain = "example.com",
  api_key       = Sys.getenv("OPENAI_API_KEY")
)
print(result$top_categories)

Summarize web articles

results <- summarize(
  input_data    = urls,
  source_domain = "nytimes.com",
  content_type  = "news article",
  format        = "bullets",
  api_key       = Sys.getenv("OPENAI_API_KEY")
)

Functions

Function	Description
`classify()`	Classify URLs or text into categories
`extract()`	Discover and extract categories from web content
`explore()`	Get raw category extractions for saturation analysis
`summarize()`	Summarize web articles (URLs are fetched automatically)
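
Some of these names are also exported by widely used packages, for example `dplyr::summarize()` and `tidyr::extract()`. When several such packages are attached, the one attached last wins. The sketch below uses base R's `utils::find()` to show which attached packages define a name. `who_defines()` is a hypothetical helper.

```r
# Hypothetical helper (not part of cat.web): list the attached environments
# that define a function with this name, in search-path order. An unqualified
# call uses the first entry.
who_defines <- function(fn) {
  utils::find(fn, mode = "function")
}

who_defines("summarize")

# Qualifying the call with the package name removes the ambiguity, e.g.:
# results <- cat.web::summarize(
#   input_data    = urls,
#   source_domain = "nytimes.com",
#   content_type  = "news article",
#   format        = "bullets",
#   api_key       = Sys.getenv("OPENAI_API_KEY")
# )
```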

License

MIT