I started building the Webpage Text API with these design goals:
Scalability: I need to be able to support a large number of clients and a large number of articles.
Caching: If an article is read by one client, it will likely be read by other clients. When feasible, I want to retrieve the webpage and generate its webpage text result just once.
The core of the Webpage Text API contains two types of servers:
Retriever: A Retriever is a server that takes a single article URL and generates its webpage text result.
Frontend Server: A frontend server accepts requests from Unread and from other clients. A request can contain up to 100 article URLs. The frontend server is responsible for getting results for the article URLs and combining them into a single response.
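The frontend's combining step can be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation; the method name, the result hash keyed by URL, and the injected `fetch_result` callable are all assumptions.

```ruby
# Hypothetical sketch of a Frontend Server combining results for a batch of
# article URLs into a single response. fetch_result stands in for the real
# lookup (cache check, then a request to the appropriate Retriever).
def combined_response(article_urls, fetch_result)
  raise ArgumentError, "at most 100 URLs per request" if article_urls.length > 100

  # Fetch a result for each article URL and collect them into one
  # response hash, keyed by article URL.
  article_urls.each_with_object({}) do |url, response|
    response[url] = fetch_result.call(url)
  end
end
```

Batching up to 100 URLs per request keeps the number of round trips from the client low while letting the frontend fan requests out to the appropriate Retrievers.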
Each Retriever is based on NGINX, Passenger, and a Ruby on Rails app. That app depends heavily on curb to make HTTP and HTTPS requests, and on my webpage text generation library.
Requests to the Retrievers have URL paths that incorporate the URL for which webpage text is needed.
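One plausible shape for such a path (an assumption; the `/text/` prefix and the use of URL encoding are illustrative, not the actual scheme) is to percent-encode the article URL into the path:

```ruby
require "erb"

# Hypothetical: build a Retriever request path by percent-encoding the
# article URL into the path. The "/text/" prefix is an assumption.
def retriever_path(article_url)
  "/text/#{ERB::Util.url_encode(article_url)}"
end

retriever_path("https://example.com/story?id=1")
# => "/text/https%3A%2F%2Fexample.com%2Fstory%3Fid%3D1"
```

Embedding the article URL in the request path means each distinct article maps to a distinct cacheable URL on the Retriever.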
NGINX is configured to cache responses locally. If a Retriever has a cached response, it can return it without even sending the request to Passenger or the Ruby on Rails app.
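A minimal sketch of that caching setup, assuming Passenger runs as a standalone app server behind an NGINX reverse proxy (the zone name, sizes, TTLs, and upstream port are all assumptions):

```nginx
# Hypothetical Retriever caching config. With a cache hit, NGINX serves
# the stored response without contacting Passenger or the Rails app.
http {
    proxy_cache_path /var/cache/nginx/webpagetext levels=1:2
                     keys_zone=webpagetext:10m max_size=1g inactive=24h;

    server {
        listen 80;

        location / {
            proxy_cache webpagetext;
            proxy_cache_valid 200 24h;
            proxy_pass http://127.0.0.1:3000;  # standalone Passenger / Rails
        }
    }
}
```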
Each Frontend Server is configured with an ordered list of Retrievers. A Frontend Server determines which Retriever should process a request for a specific article URL by generating a numeric hash of that article URL, and using that hash to identify a server. This ensures that all requests for webpage text for a specific article URL are processed by a single Retriever as long as the set of Retrievers does not change.
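The selection step above can be sketched as follows. This is a hypothetical illustration: CRC32 stands in for whatever hash function the real implementation uses, and the hostnames are made up.

```ruby
require "zlib"

# Hypothetical ordered list of Retrievers, as configured on a Frontend Server.
RETRIEVERS = [
  "retriever1.example.com",
  "retriever2.example.com",
  "retriever3.example.com",
].freeze

# Hash the article URL and use the hash to index into the ordered list, so
# the same URL always maps to the same Retriever.
def retriever_for(article_url)
  RETRIEVERS[Zlib.crc32(article_url) % RETRIEVERS.length]
end
```

Note that with simple modulo indexing, adding or removing a Retriever remaps most URLs to different servers, which is exactly the "as long as the set of Retrievers does not change" caveat.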
Like Retrievers, Frontend Servers are based on NGINX, Passenger, and a Ruby on Rails app. Each Frontend Server also has a memcached process that provides a second tier of caching for webpage text results for individual article URLs.
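The second caching tier amounts to a cache-aside lookup. This sketch is hypothetical: a real Frontend Server would use a memcached client (such as Dalli), but a plain `Hash` stands in here so the example is self-contained, and the method and parameter names are assumptions.

```ruby
# Hypothetical second-tier cache lookup on a Frontend Server. cache stands
# in for memcached; fetch_from_retriever stands in for the HTTP request to
# the Retriever responsible for this article URL.
def webpage_text_for(article_url, cache, fetch_from_retriever)
  cached = cache[article_url]
  return cached if cached

  # Cache miss: ask the responsible Retriever, then store the result so
  # subsequent requests for this URL are served from memcached.
  result = fetch_from_retriever.call(article_url)
  cache[article_url] = result
  result
end
```

Because Retriever selection is deterministic per URL, this tier mainly saves the round trip to the Retriever for URLs this Frontend Server has already seen.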
There is no specific limit to either the number of Retrievers or the number of Frontend Servers that can be deployed at any one time.