Web content classification with LLMs. A domain wrapper around cat.stack that adds automatic URL fetching and web-context prompt injection (source domain, content type, metadata).
cat.web wraps the Python catweb package via reticulate.
Installation
# From R-universe (recommended once published):
install.packages("cat.web", repos = "https://chrissoria.r-universe.dev")
# Or from a local clone:
devtools::install("path/to/cat.stack")
devtools::install("path/to/cat.web")
# Install the Python backend
# pip install cat-web
Quick Start
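Every example in this section reads the API key with Sys.getenv("OPENAI_API_KEY"), which returns an empty string when the variable is unset. A minimal base-R sketch of a pre-flight check is shown here; require_api_key is a hypothetical helper and not part of cat.web.

```r
# Hypothetical helper (not part of cat.web): stop early with a clear
# message if the environment variable holding the API key is empty or unset.
require_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(sprintf("Environment variable %s is not set.", var), call. = FALSE)
  }
  invisible(key)
}

# Usage (commented out so the sketch runs without a real key):
# api_key <- require_api_key()
```

The returned value can then be passed as the api_key argument in the calls below, so a missing key fails before any pages are fetched.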
Classify a list of URLs
library(cat.web)
urls <- c(
"https://example.com/article-1",
"https://example.com/article-2",
"https://example.com/article-3"
)
results <- classify(
input_data = urls,
categories = c("News", "Opinion", "Tutorial"),
source_domain = "example.com",
content_type = "blog post",
api_key = Sys.getenv("OPENAI_API_KEY")
)
Classify raw text (no fetching)
results <- classify(
input_data = df$article_text,
categories = c("News", "Opinion", "Tutorial"),
api_key = Sys.getenv("OPENAI_API_KEY")
)
Discover categories from scraped pages
result <- extract(
input_data = urls,
source_domain = "example.com",
api_key = Sys.getenv("OPENAI_API_KEY")
)
print(result$top_categories)
Summarize web articles
results <- summarize(
input_data = urls,
source_domain = "nytimes.com",
content_type = "news article",
format = "bullets",
api_key = Sys.getenv("OPENAI_API_KEY")
)
Functions
| Function | Description |
|---|---|
| classify() | Classify URLs or text into categories |
| extract() | Discover and extract categories from web content |
| explore() | Get raw category extractions for saturation analysis |
| summarize() | Summarize web articles (URL auto-fetched) |
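Because classify() accepts either URLs or raw text, it can be useful to see which entries of a mixed input vector look like URLs before sending them. The following is a rough base-R sketch; is_url is a hypothetical helper, and the detection that cat.web or catweb performs internally may differ.

```r
# Hypothetical helper (not part of cat.web): TRUE for strings that
# begin with http:// or https://, FALSE otherwise.
is_url <- function(x) grepl("^https?://", x, ignore.case = TRUE)

inputs <- c(
  "https://example.com/article-1",
  "Plain article text pasted by hand."
)

# Split the inputs so URLs and raw text can be inspected separately.
url_inputs  <- inputs[is_url(inputs)]
text_inputs <- inputs[!is_url(inputs)]
```

Splitting the inputs this way makes it easier to confirm that only the intended entries will trigger a fetch.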