Webpage Text API Documentation

Overview

All calls to the Webpage Text API are authenticated using OAuth 2.0. Before making a call to the API, request an access token from the authentication servers. Reuse that access token until it expires, then request a new one. Send the access token with each Webpage Text API call in an Authorization: Bearer access_token header.

I recommend that bodies of POST requests be in JSON format. Include a Content-Type: application/json HTTP header with every such request.

Response bodies for all requests are in JSON format.

If an API call is successful, the response will have a status code of 200 (OK). Otherwise, it will have a 4xx or 5xx status code. A 4xx status code indicates a client-side error. A 5xx status code indicates a server-side error. These are the most likely 4xx and 5xx status codes:

400 (Bad Request): The request body is missing values or is not in the expected format. The authentication endpoint will also return this if the Client ID or Client Secret is incorrect.

401 (Unauthorized): The Webpage Text API will return this if the access token is not found, is expired, or is not valid.

404 (Not Found): The request URL is not valid.

429 (Too Many Requests): The monthly request limit or URL limit has been exceeded.

500 (Internal Server Error): A server-side error occurred. This could be the result of a temporary issue with our system. The Single Webpage Text Retrieval endpoint will return this if it is unable to retrieve or generate webpage text for the specified URL.
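
The status codes above suggest different client reactions. The following is a minimal Python sketch of one way to group them. The Outcome names and the classify_status function are hypothetical and not part of the API, and the suggested reactions are assumptions drawn from the descriptions above rather than documented client requirements.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical client-side reactions; these names are not part of the API."""
    OK = "ok"                        # 200: the call succeeded
    REFRESH_TOKEN = "refresh_token"  # 401: token missing, expired, or not valid
    OVER_LIMIT = "over_limit"        # 429: monthly request or URL limit exceeded
    FIX_REQUEST = "fix_request"      # other 4xx: client-side error
    SERVER_ERROR = "server_error"    # 5xx: server-side error


def classify_status(status: int) -> Outcome:
    """Map an HTTP status code to a suggested client reaction."""
    if status == 200:
        return Outcome.OK
    if status == 401:
        return Outcome.REFRESH_TOKEN
    if status == 429:
        return Outcome.OVER_LIMIT
    if 400 <= status < 500:
        return Outcome.FIX_REQUEST
    if 500 <= status < 600:
        return Outcome.SERVER_ERROR
    # The documentation only describes 200, 4xx, and 5xx responses.
    raise ValueError(f"unexpected status code: {status}")
```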


API Calls

Authentication

You will receive a Client ID and Client Secret as part of the provisioning process. Before making a request for webpage text, a client needs to retrieve an access token. You can request one with this endpoint:

POST https://auth.goldenhillsoftware.com/1.0/tokens

That request must include the following parameters:

client_id: The Client ID you received as part of the provisioning process.

client_secret: The Client Secret you received as part of the provisioning process.

scope: The value https://webpagetextapi.goldenhillsoftware.com/.

grant_type: The value client_credentials.

The response will be a JSON object with the following properties:

access_token: The access token to use for API calls.

expires_in: The number of seconds for which the access token is valid.

token_type: The value bearer.

scope: The value https://webpagetextapi.goldenhillsoftware.com/.

The client should use that access token until it expires, as specified by the expires_in response property.
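
One way to reuse a token until it expires is a small cache keyed on expires_in. The sketch below is illustrative only: TokenCache, fetch_token, and margin_seconds are hypothetical names, and refreshing slightly before the stated expiry is a design assumption rather than something the API requires. fetch_token stands for any function that calls the authentication endpoint and returns the parsed JSON response.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Reuse one access token until shortly before it expires (hypothetical helper)."""

    def __init__(self, fetch_token: Callable[[], dict],
                 clock: Callable[[], float] = time.monotonic,
                 margin_seconds: float = 60.0) -> None:
        self._fetch_token = fetch_token  # returns the parsed token response
        self._clock = clock              # injectable so the cache can be tested
        self._margin = margin_seconds    # assumed safety margin before expiry
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, requesting a new one once it has expired."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            response = self._fetch_token()
            self._token = response["access_token"]
            self._expires_at = now + response["expires_in"] - self._margin
        return self._token
```

With this shape, every API call asks the cache for a token, and the authentication endpoint is only contacted when the cached token is missing or about to expire.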

Sample

curl \
--header "Content-Type: application/json" \
--request POST \
--data '{ "client_id": "Your Client ID", "client_secret": "Your Client Secret", "scope": "https://webpagetextapi.goldenhillsoftware.com/", "grant_type": "client_credentials" }' \
https://auth.goldenhillsoftware.com/1.0/tokens
{
    "access_token": "Your access token",
    "expires_in": 86400,
    "token_type": "bearer",
    "scope": "https://webpagetextapi.goldenhillsoftware.com/"
}
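
For clients not using curl, the same token request can be built with Python's standard library. This is a sketch rather than an official client: build_token_request is a hypothetical helper name, and only the endpoint URL, parameter names, and values come from this documentation.

```python
import json
import urllib.request

TOKEN_URL = "https://auth.goldenhillsoftware.com/1.0/tokens"
SCOPE = "https://webpagetextapi.goldenhillsoftware.com/"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for an access token."""
    body = {
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": SCOPE,
        "grant_type": "client_credentials",
    }
    return urllib.request.Request(
        TOKEN_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen and decoding the response body as JSON should yield an object with the access_token and expires_in properties described above.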

Single Webpage Text Retrieval

Use this API call to retrieve webpage text for a single webpage URL:

POST https://webpagetextapi.goldenhillsoftware.com/1.0/retrievals

Include an Authorization header with the string Bearer access_token, replacing access_token with the access token obtained from the authentication endpoint.

This requires the following request parameter:

url: The webpage URL.

The response will be a JSON object with the following properties:

url: The URL specified in the request.

response_url: The URL from which the webpage was retrieved. This may be different from the url property if the server issued a redirect.

title: The title of the webpage. This property will not be present if no title is found.

author: The author of the webpage. This property will not be present if no author is found.

html: The HTML of the main webpage content.

Sample

curl \
--header "Content-Type: application/json" \
--header "Authorization: Bearer Your Temporary Access Token" \
--request POST \
--data '{ "url": "https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/" }' \
https://webpagetextapi.goldenhillsoftware.com/1.0/retrievals
{
    "url": "https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/",
    "response_url": "https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/",
    "title": "Unread 2.7 Improves the Experience of Reading Linked List Articles",
    "author": "John Brayton",
    "html": "\u003cp\u003eUnread 2.7 is \u003ca href=\"https://apps.apple.com/us/app/unread-2/id1363637349\"\u003eavailable from the App Store\u003c/a\u003e. This update improves the experience of reading linked list articles and adds other improvements. Additional HTML removed."
}
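
Because title and author may be absent from the response, a client should not assume those keys exist. The Python sketch below shows one way to build the request and read the response defensively. WebpageText, build_retrieval_request, and parse_webpage_text are hypothetical names chosen for illustration, not part of the API.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Optional

RETRIEVALS_URL = "https://webpagetextapi.goldenhillsoftware.com/1.0/retrievals"


@dataclass
class WebpageText:
    url: str
    response_url: str
    html: str
    title: Optional[str] = None   # not present if no title is found
    author: Optional[str] = None  # not present if no author is found


def build_retrieval_request(access_token: str, url: str) -> urllib.request.Request:
    """Build (but do not send) a Single Webpage Text Retrieval request."""
    return urllib.request.Request(
        RETRIEVALS_URL,
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


def parse_webpage_text(payload: dict) -> WebpageText:
    """Convert a decoded response object, tolerating missing title and author."""
    return WebpageText(
        url=payload["url"],
        response_url=payload["response_url"],
        html=payload["html"],
        title=payload.get("title"),
        author=payload.get("author"),
    )
```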

Bulk Webpage Text Retrieval

Use this API call to retrieve webpage text for up to 100 webpage URLs:

POST https://webpagetextapi.goldenhillsoftware.com/1.0/bulk-retrievals

Include an Authorization header with the string Bearer access_token, replacing access_token with the access token obtained from the authentication endpoint.

This requires the following request parameter:

urls: Between 1 and 100 unique URLs.
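
A client with more than 100 URLs, or with possible duplicates, needs to split its list before calling this endpoint. The following Python sketch shows one approach. batch_urls is a hypothetical helper, and treating URLs as duplicates only when their strings match exactly is an assumption, since this documentation does not say how uniqueness is determined.

```python
from typing import Iterable, Iterator, List

MAX_URLS_PER_REQUEST = 100  # upper bound stated for the urls parameter


def batch_urls(urls: Iterable[str],
               size: int = MAX_URLS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield lists of at most `size` unique URLs, preserving first-seen order."""
    unique = list(dict.fromkeys(urls))  # drops exact duplicates, keeps order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]
```

Each yielded list can be sent as the urls value of one request. An empty input yields no batches, so no request with zero URLs is produced.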

The response will be a JSON array. Each element will have the following properties:

url: The URL of the requested webpage.

status: The status of the retrieval, specified as an HTTP status code. A value of 200 indicates success. A 4xx value indicates a client-side error. A 5xx value indicates that the server was unable to fulfill the request.

result: A JSON object with the retrieved webpage text result. This property will not be present if status is not 200.

Each result object will have the following properties:

url: The URL specified in the request.

response_url: The URL from which the webpage was retrieved. This may be different from the url property if the server issued a redirect.

title: The title of the webpage. This property will not be present if no title is found.

author: The author of the webpage. This property will not be present if no author is found.

html: The HTML of the main webpage content.

Sample

curl \
--header "Content-Type: application/json" \
--header "Authorization: Bearer Your Temporary Access Token" \
--request POST \
--data '{ "urls":["https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/","https://www.goldenhillsoftware.com/2021/06/unreads-compact-and-expansive-article-list-formats/"] }' \
https://webpagetextapi.goldenhillsoftware.com/1.0/bulk-retrievals
[
    {
        "url": "https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/",
        "status": 200,
        "result": {
            "url": "https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/",
            "response_url": "https://www.goldenhillsoftware.com/2021/08/unread-27-improves-the-experience-of-reading-linked-list-articles/",
            "title": "Unread 2.7 Improves the Experience of Reading Linked List Articles",
            "author": "John Brayton",
            "html": "<p>Unread 2.7 is <a href=\"https://apps.apple.com/us/app/unread-2/id1363637349\">available from the App Store</a>. This update improves the experience of reading linked list articles and adds other improvements. Additional HTML removed."
        }
    },
    {
        "url": "https://www.goldenhillsoftware.com/2021/06/unreads-compact-and-expansive-article-list-formats/",
        "status": 200,
        "result": {
            "url": "https://www.goldenhillsoftware.com/2021/06/unreads-compact-and-expansive-article-list-formats/",
            "response_url": "https://www.goldenhillsoftware.com/2021/06/unreads-compact-and-expansive-article-list-formats/",
            "title": "Unread\u2019s Compact and Expansive Article List Formats",
            "author": "John Brayton",
            "html": "<figure><img src=\"https://www.goldenhillsoftware.com/unread26rsrcs/articlelistpost/hero.jpeg\" width=\"2000\" height=\"1125\"></figure> <p>Last week I <a href=\"https://www.goldenhillsoftware.com/2021/06/unread-26-adds-full-text-search-a-compact-article-list-option-for-iphone-and-more/\">released Unread 2.6</a> with full-text search capabilities, a compact article list option for iPhone, and more. Additional HTML removed."
        }
    }
]
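
Since each element of the bulk response carries its own status, a client has to check every element rather than relying on the overall HTTP status code. The Python sketch below separates successes from failures. split_bulk_response is a hypothetical helper name, and it relies only on the url, status, and result properties described above.

```python
from typing import Dict, List, Tuple


def split_bulk_response(elements: List[dict]) -> Tuple[Dict[str, dict], Dict[str, int]]:
    """Return (results keyed by URL, failure status codes keyed by URL)."""
    succeeded: Dict[str, dict] = {}
    failed: Dict[str, int] = {}
    for element in elements:
        if element["status"] == 200:
            # result is only present when status is 200
            succeeded[element["url"]] = element["result"]
        else:
            failed[element["url"]] = element["status"]
    return succeeded, failed
```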