Relation of HTTP Caching and Proxying
Caching of HTTP resources is one possible use case of an HTTP proxy. An HTTP cache is not necessarily an HTTP proxy (for example, the private cache of a user agent is not a proxy), and an HTTP proxy is not necessarily a cache (it can serve other purposes such as authentication or traffic redirection).
Communication of Client and Proxy
A user agent can offer a method of „proxy configuration“ that allows it to determine whether, for a given URL, it should direct the request to a proxy server that runs on a specific network address, on a specific port, with a specific protocol. Requests for such URLs are then not sent directly to the web server indicated by the URL; instead, the request is transmitted to the proxy, which performs the request on behalf of the client and sends any response or error back to the client.
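The difference between a direct request and a request sent to a proxy can be illustrated with a minimal sketch. It assumes a plain HTTP/1.1 proxy, where the client places the complete URL (the absolute-form) in the request line instead of only the path. The function name build_proxy_request and the example URL are illustrative assumptions, not part of any library.

```python
from urllib.parse import urlsplit


def build_proxy_request(url: str) -> bytes:
    """Return the HTTP/1.1 request a client would send to a proxy for `url`.

    Assumption: only plain http:// URLs are handled; https:// URLs are
    normally tunneled with CONNECT instead.
    """
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.netloc:
        raise ValueError("this sketch only handles plain http:// URLs")
    lines = [
        f"GET {url} HTTP/1.1",    # absolute-form: the full URL, not just the path
        f"Host: {parts.netloc}",  # Host still names the origin server
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

The resulting bytes would be written to a TCP connection to the proxy's address and port, not to the address of the web server named in the URL.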
HTTP proxies may have but are not required to have caching ability. The most general use of an HTTP proxy is to eliminate direct connections between clients and servers for purposes of security or privacy, to reduce network load or for other technical reasons.
Several protocols can be used for proxying web requests; the most common are HTTP, HTTPS and SOCKS version 4 and 5 (an important addition of SOCKS in version 5 is support for IPv6).
The popular web proxy software Squid only supports operation as an HTTP or HTTPS proxy and would require additional software to be reachable by means of SOCKS.
User Agent Configuration
The DNS- or DHCP-based distribution mechanism WPAD and the configuration file format PAC can be used to distribute proxy configuration to certain versions of certain user agents, but many user agents ignore these settings. Central configuration management is required to deploy explicit proxy configuration to all HTTP clients across a network, and many user agents require explicit configuration.
Example: The package manager of Debian-based Linux systems, „apt“, can use an HTTP proxy for downloading software packages, but it will ignore WPAD configuration from DNS or DHCP; instead, a configuration file stating the proxy explicitly must be put on the machine running „apt“. See [Minoza 2014] for a proposal on how to pass WPAD configuration to „apt“.
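The per-URL decision that a proxy configuration expresses can be sketched as follows. PAC files themselves are written in JavaScript; this is only a Python analogue of the decision logic, returning the result strings used by PAC („DIRECT“ or „PROXY host:port“). The rule table, the host names and the function name find_proxy_for_url are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rules: (host name pattern, PAC-style result string).
# The first matching pattern wins.
RULES = [
    ("localhost", "DIRECT"),
    ("*.intranet.example", "DIRECT"),
    ("*", "PROXY proxy.example:3128"),
]


def find_proxy_for_url(url, rules=RULES):
    """Return a PAC-style result string for `url` based on its host name."""
    host = (urlsplit(url).hostname or "").lower()
    for pattern, result in rules:
        if fnmatchcase(host, pattern):
            return result
    return "DIRECT"  # no rule matched: connect without a proxy
```

A client such as „apt“ that ignores WPAD never evaluates such rules, which is why its proxy has to be stated explicitly in its own configuration.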
An HTTP proxy can be configured to restrict responses to authenticated users. If a proxy requires authentication from a client that sent a request without included proxy authentication, it replies with an HTTP status code of 407 „Proxy Authentication Required“. The reply will include an HTTP header Proxy-Authenticate that specifies method and realm of the required authentication.
Typical methods of authentication of a user agent to a web proxy are:
- Basic HTTP authentication,
- Digest HTTP authentication,
- Authentication mechanism negotiation (SPNEGO), resulting in either NTLM- or Kerberos-authentication,
- Authentication based on identities provided by directory services, where directory access can be performed using protocols such as LDAP or RADIUS.
Many authentication schemes can be used to infer a client identity from the result of the authentication procedure. A proxy can then authorize access to specific web resources based on the authenticated identity of a client.
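The 407 exchange described above can be sketched for the simplest case, Basic authentication. The sketch assumes a Proxy-Authenticate value that carries a single challenge with quoted parameters; the function names are illustrative, and the credentials in the usage example are placeholders.

```python
import base64
import re


def parse_challenge(value):
    """Split a value such as 'Basic realm="proxy"' into (scheme, parameters).

    Assumption: one challenge per header value, parameters in quoted form.
    """
    scheme, _, rest = value.strip().partition(" ")
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme.lower(), params


def basic_proxy_authorization(user, password):
    """Return the header a client adds when it repeats the request."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Proxy-Authorization": f"Basic {token}"}
```

A client that received 407 with Proxy-Authenticate: Basic realm="proxy" would call parse_challenge on that value, obtain credentials for the realm, and resend the original request with the header returned by basic_proxy_authorization.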
If the client is not configured to use a proxy but instead HTTP or HTTPS traffic is redirected to a proxy on a network level, for example by port redirection on an intermediate network node, the proxy is said to be „intercepting“ the client.
If, at network level, HTTP traffic between a client and a server is redirected to and processed by a proxy, and no indication of the proxy operation is given to client and server, the proxy is said to operate „transparently“. By the nature of their operation, transparent proxies cannot implement proxy authentication, because the client is unaware of their operation.
The popular web proxy software Squid disables proxy authentication as soon as the client-facing socket is put into intercepting mode. It does not distinguish between intercepting and transparent mode of operation.
Proxy-Related HTTP Methods
[RFC 7231], section 4.3.6 specifies the method CONNECT which is reserved for proxies to establish tunnels for end-to-end protocols, specifically HTTPS (see the section on Tunneling below for details).
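A minimal sketch of the client side of CONNECT follows. It assumes HTTP/1.1 and no proxy authentication. The request target is the authority (host and port) of the destination, and any 2xx response from the proxy indicates that the tunnel has been established. The function names are illustrative.

```python
def build_connect_request(host, port=443):
    """Return the CONNECT request asking a proxy for a tunnel to host:port."""
    authority = f"{host}:{port}"
    return f"CONNECT {authority} HTTP/1.1\r\nHost: {authority}\r\n\r\n".encode("ascii")


def tunnel_established(status_line: bytes) -> bool:
    """Return True if the proxy's status line carries a 2xx status code."""
    parts = status_line.split(None, 2)
    return (
        len(parts) >= 2
        and parts[0].startswith(b"HTTP/")
        and len(parts[1]) == 3
        and parts[1].isdigit()
        and parts[1].startswith(b"2")
    )
```

After a successful response, the client would start the TLS handshake with the destination server over the same TCP connection, and the proxy would relay the bytes in both directions without interpreting them.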
Proxy-Related HTTP Headers
Conventional proxies are required to alter the HTTP headers of requests and responses they transmit; they have to add a valid identifier of themselves and the HTTP protocol version of the message they received to the HTTP header Via, specified in [RFC 7230], section 5.7.1.
[RFC 7239] specifies the HTTP header Forwarded, which can be used to preserve the IP address of the originating HTTP client while traversing an intermediate proxy, for purposes of client identification, client-specific caching and redirecting, and application of access restrictions. It is intended to replace older, non-standardized headers with the same purpose; of these, X-Forwarded-For has gained widespread adoption. Popular HTTP proxying software supports transmission and maintenance of this header.
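How a proxy extends the Via header can be sketched as follows. According to [RFC 7230], section 5.7.1, each entry records the protocol version of the message as it was received by the proxy, followed by a host name or pseudonym of the proxy. The sketch assumes headers are held in a dictionary under the exact key „Via“; the function name add_via is illustrative.

```python
def add_via(headers: dict, received_version: str, pseudonym: str) -> dict:
    """Return a copy of `headers` with this proxy appended to Via.

    received_version: HTTP version of the message as received, e.g. "1.1".
    pseudonym: host name or pseudonym identifying this proxy.
    """
    entry = f"{received_version} {pseudonym}"
    updated = dict(headers)  # leave the caller's dictionary untouched
    existing = updated.get("Via")
    # Entries are appended in order, so the list reflects the path taken.
    updated["Via"] = f"{existing}, {entry}" if existing else entry
    return updated
```

With two proxies in sequence, the first receiving an HTTP/1.0 request and the second an HTTP/1.1 request, the origin server would see a value such as „1.0 fred, 1.1 p.example.net“, which is the example given in the RFC.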
Using X-Forwarded-For has security implications, since the response to a request that explicitly states the original client’s IP address may be personalized by a parent server to the requirements of an identity derived from that information.
Since the specification states no way to validate the information passed in X-Forwarded-For, the header can be communicated incorrectly if carelessly implemented by an intermediary, or even be forged by misbehaving clients.
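One common way to limit the effect of forged X-Forwarded-For entries is to trust only the entries appended by known proxies. The following sketch is one possible approach under stated assumptions, not a prescribed algorithm: the list of trusted networks is configured by the operator, and the helper names are hypothetical. It walks the chain from the nearest hop outward and returns the first address that does not belong to a trusted proxy.

```python
import ipaddress


def append_forwarded_for(existing, peer_ip):
    """Value a proxy forwards: the previous list plus the peer it saw."""
    return f"{existing}, {peer_ip}" if existing else peer_ip


def client_ip(header_value, peer_ip, trusted_networks):
    """Return the address of the nearest hop that is not a trusted proxy.

    header_value: received X-Forwarded-For value, or None.
    peer_ip: address of the TCP peer that delivered the request.
    trusted_networks: operator-configured networks, e.g. ["10.0.0.0/8"].
    """
    networks = [ipaddress.ip_network(n) for n in trusted_networks]
    hops = [h.strip() for h in (header_value or "").split(",") if h.strip()]
    result = peer_ip
    for candidate in reversed(hops + [peer_ip]):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            break  # unparsable entry: keep the last plausible address
        result = candidate
        if not any(addr in net for net in networks):
            break  # first hop outside the trusted proxies
    return result
```

Entries to the left of the returned address were supplied by an untrusted party and are ignored, so a client that sends its own X-Forwarded-For header cannot choose the address that is reported.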
Proxy-Related HTTP Status Codes
There is a set of HTTP status codes reserved for error conditions that occur on HTTP proxies: 407 „Proxy Authentication Required“, 502 „Bad Gateway“ (the proxy received an invalid response to the forwarded request) and 504 „Gateway Timeout“ (the forwarded request was not replied to within a configured timeout period). There is also 305 „Use Proxy“, indicating that a specific proxy, stated in the Location header of the response, should be used; this status code has been deprecated and is ignored by many user agents for security reasons.
The PROXY Protocol
The PROXY protocol, specified in [HAProxy 2017], is a TCP-based protocol and can be used when forwarding requests between intermediate caching proxies. It can encapsulate other TCP-based protocols, notably HTTP and HTTPS. It adds information about the original IP addresses and TCP ports of requesting clients and responding servers to a TCP communication that takes place between intermediate proxies.
The PROXY protocol is intended to preserve client and server IP address information across non-transparent load balancers, which break up the TCP connection between client and server, and across transparent load balancers, which modify the source or destination address of forwarded TCP packets.
Since a receiver of IP address and TCP port information has no plausible way of validating the received information, the PROXY protocol is subject to communication errors and even forgery, much like the X-Forwarded-For header (see above). Therefore, the PROXY protocol should only be enabled between intermediate proxies that are trusted to deliver appropriate information.
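The human-readable version 1 of the PROXY protocol prepends a single text line to the TCP stream before any payload. The following sketch builds and parses that line for TCP over IPv4 or IPv6. It does not cover the „UNKNOWN“ variant or the binary version 2 header, and the function names are illustrative.

```python
import ipaddress


def build_proxy_v1(src_ip, dst_ip, src_port, dst_port) -> bytes:
    """Return the PROXY v1 line a sender writes before the payload."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must use the same IP version")
    family = "TCP4" if src.version == 4 else "TCP6"
    return f"PROXY {family} {src} {dst} {src_port} {dst_port}\r\n".encode("ascii")


def parse_proxy_v1(line: bytes) -> dict:
    """Parse a PROXY v1 line; raise ValueError if it is malformed."""
    if not line.endswith(b"\r\n"):
        raise ValueError("header line must end with CRLF")
    fields = line[:-2].decode("ascii").split(" ")
    if len(fields) != 6 or fields[0] != "PROXY" or fields[1] not in ("TCP4", "TCP6"):
        raise ValueError("not a PROXY v1 TCP header")
    _, family, src, dst, src_port, dst_port = fields
    return {
        "family": family,
        "src": str(ipaddress.ip_address(src)),  # raises ValueError if invalid
        "dst": str(ipaddress.ip_address(dst)),
        "src_port": int(src_port),
        "dst_port": int(dst_port),
    }
```

The parser only checks the syntax of the line. Whether the stated addresses are true cannot be verified by the receiver, so a receiver would accept this header only on connections from senders it trusts.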
Parallel Proxy Connections
Modern user agents expect the download of a web application to comprise many different requests to many different resources. They will default to preemptively opening multiple TCP connections to remote servers. Advanced techniques such as preconnecting and prefetching (see below) amplify this client-side behavior.
If a user agent has a configured proxy that it accesses using HTTP version 1.1, that behavior can be repeated; popular web browsers can open multiple parallel connections to the proxy, just as they would to the authoritative web server. Specifically, the browser will open a CONNECT tunnel for every HTTPS connection, which, in HTTP version 1.1, occupies a TCP connection entirely.
Distributed web applications utilize content delivery networks, providing resources with a multitude of authoritative server names, circumventing per-server limitations of the user agents and potentially leading to further multiplication of simultaneously open TCP connections.
HTTP version 2 (see below) makes attempts to mitigate this problem by mandating singular connections between clients and proxies, using multiplexing of interleaved streams to perform multiple requests and tunnelings simultaneously, but usage of this protocol version by user agents to communicate to HTTP proxies has not yet gained widespread adoption (citation needed).
Web proxies, whether caching or non-caching, conventional or intercepting, visible or transparent, will be exposed to the bulk of this behavior and must be ready to process multiple TCP connections per user agent and web server as well as the sum of HTTP traffic generated by all users. On busy proxies exposed to many clients, this can expose limitations of the involved network stack. If the proxy performs caching, the operation can be demanding on memory and storage of the cache server, and both must be optimized for efficient access under parallel load.