-------------------------
Week 11 Notes for CST8165
-------------------------
-Ian! D. Allen - idallen@idallen.ca - www.idallen.com
Remember - knowing how to find out an answer is more important than
memorizing the answer. Learn to fish! RTFM! (Read The Fine Manual)
Keep up on your readings (Course Outline: average 4 hours/week homework)
Review:
------
- handling multiple simultaneous connections
- reducing indentation levels to make code readable
- testing strategies
- from Kurose/Ross:
http://teaching.idallen.com/cst8165/07f/notes/kurose/ (HTTP slides)
- includes HTTP slides showing Request and Response headers
------------------------------------------------------------------------------
HTTP - Hyper Text Transfer Protocol
----
First used in 1990
http://tools.ietf.org/html/rfc2616 (HTTP 1.1 - June 1999 - 176 pages)
- a "PULL" protocol - the receiver initiates (SMTP is a "PUSH" protocol)
HTTP design issues by Tim Berners-Lee
---------------------------
http://www.w3.org/Protocols/DesignIssues.html
Q: Why did Tim Berners-Lee choose "Internet Protocol" instead of RPC for HTTP?
Q: Name one advantage and one disadvantage of coding HTTP using RPC.
Q: Does the HTTP server need to keep state information about the client?
Q: Why is the stateless nature of HTTP a problem for such things as
search systems? How does Tim say the problems can be mitigated?
Many of Tim's original methods (e.g. "PORT") didn't make it into the
final HTTP specification.
HTTP protocol consists of Requests and Responses
------------------------------------------------
http://tools.ietf.org/html/rfc2616
Requests - Section 5
Responses - Section 6
Unlike SMTP, the HTTP protocol is much more "symmetric" - the format
of what the client sends to the server looks a lot like what the server
sends back to the client. You can both upload and download using HTTP.
An HTTP "Request" goes from client to server (from your web browser to
the remote server). A Request consists of a Request-Line, then a series
of header lines of the form "name: data" ending at an empty line (a line
with just CRLF), followed by an (often optional) body. An HTTP "Response"
comes back from the server to you (from the server to your web browser).
Unlike SMTP, the Response has the same structure as the Request: a
Status-Line, then header lines, an empty line, and a body.
Q: What is an "HTTP Request"? an "HTTP Response"?
Q: What is the format/structure of HTTP Requests and Responses?
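The structure described above can be sketched in a few lines of Python.
This is only a minimal illustration: the function name parse_message is
made up for these notes, and folded (continued) header lines are not
handled.

```python
def parse_message(raw: bytes):
    """Split one HTTP message into (start_line, headers, body).

    Works for both Requests and Responses, since both have the same
    overall shape: start line, header lines, empty line, optional body.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no empty line ending the header section")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = []
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"malformed header line: {line!r}")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body
```

The same function accepts b"GET /foobar HTTP/1.1\r\nHost: localhost\r\n\r\n"
and b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nab\n"; only the
first line differs in form.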
Sniffing Browser HTTP Requests and Responses
--------------------------------------------
Since HTTP is a text-based protocol, you can use "netcat" to connect
directly to an HTTP server, send a simple Request, and see what responses
come back. Note the need for a blank line to end the Request:
* $ nc -v google.ca 80
google.ca [64.233.161.104] 80 (www) open
* GET / HTTP/1.0
*
HTTP/1.0 302 Found
Location: http://www.google.ca/
Cache-Control: private
Set-Cookie: PREF=ID=4bacaba254d7fab1:TM=1174172556:LM=1174172556:
S=F5pnjX7gt4IYGP2n; expires=Sun, 17-Jan-2038 19:14:07 GMT;
path=/; domain=.google.com
Content-Type: text/html
Server: GWS/2.1
Content-Length: 218
Date: Sat, 17 Mar 2007 23:02:36 GMT
Connection: Keep-Alive
302 Moved
302 Moved
The document has moved
here.
Sample HTTP "HEAD" and "GET" session:
http://teaching.idallen.com/cst8165/07f/notes/http_session.txt
Q: How can I use netcat to pull a Response from a remote HTTP server?
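The netcat session above can be imitated with a short Python sketch.
The function names build_request and fetch are made up for these notes,
and the "Host:" line is an addition that the netcat example did not
send. Note the final CRLF: it is the blank line that ends the Request.

```python
import socket


def build_request(method: str, path: str, host: str) -> bytes:
    """Build a minimal HTTP/1.0 request; the last CRLF is the empty line."""
    return (f"{method} {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            "\r\n").encode("ascii")


def fetch(host: str, path: str = "/", port: int = 80,
          method: str = "GET") -> bytes:
    """Send one request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(method, path, host))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # empty read: the server closed
                break
            chunks.append(data)
    return b"".join(chunks)

# Example (needs network access):
#   print(fetch("google.ca").decode("iso-8859-1"))
```

Passing method="HEAD" to fetch() asks for the headers only.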
To see what lines a browser sends to an HTTP server, you can use Ethereal
(now named Wireshark); or, for a quick dump, start a fake netcat HTTP
server on a spare port (e.g. 55555), then start up your browser, connect
to http://localhost:55555/foobar and see what your netcat server reports:
* $ nc -v -l -p 55555 localhost # Debian/Ubuntu
* $ nc -v -l localhost 55555 # RedHat/Mandrake
connect to [127.0.0.1] from localhost [127.0.0.1] 40757
GET /foobar HTTP/1.1
Host: localhost:55555
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12)
Gecko/20060216 Debian/1.7.12-1.1ubuntu2
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,
text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-ca,en-us;q=0.9,en-gb;q=0.7,en;q=0.6,fr-ca;q=0.4,
fr-fr;q=0.3,fr;q=0.1
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
At this point, you can type into the fake HTTP server netcat session
and send HTTP Response lines back to your browser:
* HTTP/1.1 200 this is my reply to the browser
* Content-Type: text/plain
*
* ab
* cd
* ef
* gh
* ^C (interrupt)
Your browser will show the above text.
Q: How can I use netcat to show a Request from an HTTP client?
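The fake netcat server can also be sketched in Python. The function
names below are made up for these notes; port 55555 is the same spare
port used above. Unlike the hand-typed reply, this sketch sends a
Content-Length and "Connection: close" so the browser knows where the
body ends.

```python
import socket


def read_request_head(conn: socket.socket) -> bytes:
    """Read from the connection until the empty line that ends the headers."""
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:             # client closed early
            break
        data += chunk
    return data


def handle_client(conn: socket.socket) -> bytes:
    """Dump the browser's Request, then send a tiny plain-text Response."""
    head = read_request_head(conn)
    print(head.decode("iso-8859-1"))
    body = b"ab\ncd\nef\ngh\n"
    conn.sendall(b"HTTP/1.1 200 this is my reply to the browser\r\n"
                 b"Content-Type: text/plain\r\n"
                 + f"Content-Length: {len(body)}\r\n".encode("ascii")
                 + b"Connection: close\r\n\r\n" + body)
    return head


def serve_once(port: int = 55555) -> None:
    """Accept one connection on localhost and handle it."""
    with socket.create_server(("localhost", port)) as server:
        conn, _addr = server.accept()
        with conn:
            handle_client(conn)

# Example: call serve_once(), then browse to http://localhost:55555/foobar
```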
Fetching a raw web page: wget
-----------------------------
You can use "wget" to fetch the raw HTML from a web page, and options
also let you see the header lines:
$ wget http://idallen.com/
$ wget -O output_file -S http://idallen.com/
$ wget -O output_file --save-headers http://idallen.com/
$ wget --header="Host: teaching.idallen.com" http://idallen.com/
Q: How can I download the raw HTML from a web page to my current directory?
HTTP is stateless; need session tracking
----------------------------------------
Unlike protocols such as SMTP, FTP, TELNET, etc., HTTP is completely
"stateless". Nothing in the protocol links one request with another.
Any need for "state", e.g. login credentials, shopping cart data, etc.,
has to be done outside the protocol.
http://publib.boulder.ibm.com/infocenter/wchelp/v5r6m1/index.jsp?topic=/com.ibm.commerce.admin.doc/concepts/csesmsession_mgmt.htm
"Web browsers and e-commerce sites use HTTP to communicate. Since
HTTP is a stateless protocol (meaning that each command is executed
independently without any knowledge of the commands that came before
it), there must be a way to manage sessions between the browser side
and the server side."
http://java.sun.com/blueprints/qanda/client_tier/session_state.html
"What are the client-tier mechanisms for storing session state?"
- cookies
- URL rewriting
- hidden form fields
"We do not recommend storing session state directly on the client using
URL rewriting. [...] This section describes how to store session state
directly on the client for those who choose to ignore these guidelines."
"We do not recommend storing session state directly on the client
using cookies. [...] This section describes how to store session state
directly on the client for those who choose to ignore these guidelines."
- Recommendation: don't save the actual state in the cookie or URL,
save a session ID only:
http://java.sun.com/blueprints/qanda/web_tier/session_state.html
"A web container provides session management to the JSP pages and
servlets it contains by way of interface HttpSession. Typically, the
container will try to use a cookie to save user session state on the
client. If the client refuses to accept the cookie for some reason
(the user has disabled cookies, an intervening firewall filters
cookies, etc.), the container will usually try to implement session
management by using URL rewriting. URL rewriting works in cases
where cookies will not, even in browsers that don't implement
cookies, but suffer from other problems. Rewritten URLs tend to be
long and ugly, are expensive to produce for pages with many links,
and usually don't "bookmark" well. Furthermore, rewritten URLs
usually can't be used with legacy web pages, because the URLs in the
links in those pages are static."
http://java.boot.by/wcd-guide/ch04s04.html
"Given a scenario, describe which session management mechanism the
Web container could employ, how cookies might be used to manage
sessions, how URL rewriting might be used to manage sessions, and
write servlet code to perform URL rewriting."
http://www.brics.dk/~amoeller/WWW/javaweb/sessions.html
- URL rewriting
- hidden form fields
- cookies
Q: T/F The recommended way to save HTTP state is to keep your state
information in a client cookie.
Q: T/F The recommended way to save HTTP state is to save only a state
session ID in a client cookie.
Q: Why is session tracking needed on top of HTTP?
Q: What is an HTTP "session"?
Q: Name and describe briefly two of three possible ways to implement
implicit HTTP session tracking
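The "session ID only" recommendation can be sketched as follows. This
is an illustration, not how any particular web container does it: the
cookie name SESSIONID, the SESSIONS dictionary and both function names
are made up for these notes. The state stays on the server; only a
random ID travels in the cookie.

```python
import secrets
from http.cookies import SimpleCookie

# Server-side session store: session ID -> state.
# The state itself never leaves the server.
SESSIONS = {}


def start_session(state: dict) -> str:
    """Create a session; return the value for a Set-Cookie header."""
    session_id = secrets.token_hex(16)      # hard-to-guess random ID
    SESSIONS[session_id] = state
    cookie = SimpleCookie()
    cookie["SESSIONID"] = session_id
    cookie["SESSIONID"]["path"] = "/"
    return cookie["SESSIONID"].OutputString()


def lookup_session(cookie_header: str):
    """Find the state for the session named in a request's Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("SESSIONID")
    if morsel is None:
        return None
    return SESSIONS.get(morsel.value)       # None if the ID is unknown
```

The server sends the result of start_session() in a Set-Cookie header;
on each later request it passes the Cookie header to lookup_session().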
Reading the HTTP RFC 2616
-------------------------
http://tools.ietf.org/html/rfc2616
ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt
Standards: http://www.w3.org/Protocols/
Errata: http://skrb.org/ietf/http_errata.html
http://purl.org/NET/http-errata
Issues: http://greenbytes.de/tech/webdav/draft-lafon-rfc2616bis-issues.html
Mail Archives: http://lists.w3.org/Archives/Public/ietf-http-wg/
- HTTP is usually over TCP/IP, but any reliable protocol will do (p.13)
Q: Does HTTP require a reliable protocol, or can it run over something
unreliable such as UDP?
- 1.0 required separate connections per request
- 1.1 big change: allows chaining multiple requests per connection (p.14)
Q: What big change did HTTP 1.1 bring to the HTTP "one connection per
request" model of HTTP 1.0?
p.15
- ABNF extended with a "#rule" for comma-separated lists:
( *LWS element *( *LWS "," *LWS element )) becomes 1#element
- implied *LWS can appear between any adjacent tokens or strings in the grammar
Q: Describe what this ABNF HTTP rule means: 2#3("foo")
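A rough Python sketch of reading a "#rule" list follows. The function
name parse_list is made up for these notes, and the sketch assumes no
element contains a comma inside a quoted string. Empty elements are
skipped and not counted, which is how the RFC describes null elements.

```python
def parse_list(field_value: str, minimum: int = 0, maximum=None):
    """Parse an 'n#m element' list: commas with optional LWS around them."""
    elements = [item.strip(" \t") for item in field_value.split(",")]
    # Null (empty) elements are allowed but do not count.
    elements = [item for item in elements if item]
    if len(elements) < minimum:
        raise ValueError(f"need at least {minimum} elements")
    if maximum is not None and len(elements) > maximum:
        raise ValueError(f"need at most {maximum} elements")
    return elements
```

With this sketch, parse_list(value, 1) corresponds to 1#element and
parse_list(value, 2, 3) corresponds to 2#3element.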
p.15-16
- HTTP ABNF grammar is unaffected by LWS between tokens
- HTTP 1.1 lines can continue ("fold") onto multiple lines
if the continuation line begins with a space or horizontal tab
- the only CRLF allowed inside a header field value is as part of a
continuation line
- if you want a real CRLF, or a non-ISO-8859-1 character, in a header
field, encode it as RFC2047 (MIME)
Q: How can you fold a long line in HTTP 1.1?
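Unfolding continued header lines can be sketched like this (the
function name unfold_headers is made up for these notes). A line that
begins with a space or tab is joined to the previous line, with the
folding whitespace replaced by a single space.

```python
def unfold_headers(header_block: str):
    """Join continuation lines (starting with SP or HT) to the line before."""
    lines = []
    for line in header_block.split("\r\n"):
        if line[:1] in (" ", "\t") and lines:
            lines[-1] += " " + line.strip(" \t")
        elif line:
            lines.append(line)
    return lines
```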
p.17
- must double-quote special characters used in message headers
- some fields allow comments in parentheses ()
Q: What do HTTP comments look like in message headers?
- unlike SMTP, HTTP has a version number! (p.17)
- URI "absolute" vs. "relative" paths (p.19, 36):
"URIs in HTTP can be represented in absolute form or relative to
some known base URI [11], depending upon the context of their use.
The two forms are differentiated by the fact that absolute URIs
always begin with a scheme name followed by a colon." p.19
An Absolute URI starts with "http:" and a Relative URI is anything else.
Inside a web page, Relative URIs can have some forms not allowed in an
HTTP Request. For the HTTP Request, Section 5.1.2 says you only have
two real choices (the leading slash is required on the Relative URI):
Absolute URI: http://idallen.com/foo.txt
Relative URI: /foo.txt # an absolute path
- proxy servers require ("MUST") absolute URIs ("http://...") (p.36)
- note that "absolute URI" is not the same as Unix "absolute path";
- for a Request, a "relative URI" must be an "absolute path" and
start with a slash
"To allow for transition to absoluteURIs in all requests in future
versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI
form in requests, even though HTTP/1.1 clients will only generate
them in requests to proxies."
Q: Give examples of HTTP absolute and relative URIs used in Requests.
Q: Can a relative Request-URI (client Request to server) begin without
a slash, i.e. can it be a relative pathname "foo.html"? (5.1.2 p. 36)
Q: Can an HTTP client request an empty URI? (5.1.2)
Q: T/F HTTP is moving towards always using absolute URIs. (p.37)
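A simplified Python sketch of checking the Request-URI forms discussed
above. The function name is made up for these notes, only the "http"
scheme is recognized, and the "authority" form used by CONNECT is left
out.

```python
def classify_request_uri(uri: str) -> str:
    """Classify a Request-URI as 'asterisk', 'absoluteURI' or 'abs_path'."""
    if uri == "*":
        return "asterisk"          # applies to the server itself
    if uri.lower().startswith("http://"):
        return "absoluteURI"       # required when talking to a proxy
    if uri.startswith("/"):
        return "abs_path"          # the leading slash is required
    raise ValueError(f"not a valid Request-URI: {uri!r}")
```

A relative pathname such as "foo.html", or an empty string, raises
ValueError in this sketch.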
- path part of URI is case-sensitive; the host and scheme names are not (p.20)
"When comparing two URIs to decide if they match or not, a client
SHOULD use a case-sensitive octet-by-octet comparison of the entire
URIs, with these exceptions:" p.20
Q: Which parts of an absolute URI are case-sensitive?
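The comparison rule can be sketched in Python as below. The function
names are made up for these notes, and the sketch ignores the RFC's
rule about equivalent %-encoded characters. Scheme and host are
compared without regard to case, a missing port means 80, and an empty
path means "/"; everything else is case-sensitive.

```python
from urllib.parse import urlsplit


def normalize(uri: str):
    """Reduce an http URI to a tuple that can be compared directly."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()            # scheme: case-insensitive
    host = (parts.hostname or "").lower()    # host: case-insensitive
    port = parts.port or 80                  # no port given means 80
    path = parts.path or "/"                 # empty path means "/"
    return (scheme, host, port, path, parts.query)


def same_uri(a: str, b: str) -> bool:
    """True if two http URIs match under the comparison rules."""
    return normalize(a) == normalize(b)
```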
- The HTTP protocol does not place any a priori limit on the length of a URI.
- server may issue 414 (Request-URI Too Long) status (p.19)
Q: What is the maximum length of a URI, as given in the HTTP spec?
- HTTP headers can describe:
- "content encoding" - a property of the original entity (p.23)
- e.g. "gzip"
- "transfer coding" - a property of the HTTP message (p.24)
- e.g. "chunked" (transfer content in separate chunks, p.25)
- may change how the entity is transferred
Q: What is the difference between the "content encoding" header and
the "transfer coding" header?
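The difference can be seen in a small Python sketch. The function names
are made up for these notes, and chunk trailers are ignored. Here gzip
is the content coding (it belongs to the entity) and "chunked" is the
transfer coding (it only wraps the entity for the trip).

```python
import gzip


def encode_chunked(entity: bytes, chunk_size: int = 8) -> bytes:
    """Wrap an entity in the 'chunked' transfer coding."""
    out = b""
    for i in range(0, len(entity), chunk_size):
        chunk = entity[i:i + chunk_size]
        out += f"{len(chunk):x}\r\n".encode("ascii") + chunk + b"\r\n"
    return out + b"0\r\n\r\n"      # zero-size chunk ends the body


def decode_chunked(data: bytes) -> bytes:
    """Undo the 'chunked' transfer coding, giving back the entity."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size_field = data[pos:line_end].split(b";")[0]  # drop extensions
        size = int(size_field.decode("ascii").strip(), 16)
        pos = line_end + 2
        if size == 0:
            break                  # last chunk; trailers are ignored here
        body += data[pos:pos + size]
        pos += size + 2            # skip the chunk data and its CRLF
    return body
```

Decoding the transfer coding gives back the gzip'd entity unchanged; a
separate gzip.decompress() call is still needed to undo the content
coding.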
- HTTP relaxes the CRLF rule for text media in an entity body - allows
consistent CR or LF or CRLF as the line break (but not in protocol
control structures such as header fields!) - 3.7.1 p.27
Q: T/F HTTP permits a client to send just CR or LF when communicating with
an HTTP server (e.g. when sending a GET or HEAD request).
- HTTP Request/Response messages do not use SMTP "continuation" method
- message headers continue until an empty line: CRLF CRLF (p.31)
Q: T/F The same generic HTTP message type is used both to send messages
from client to server and from server to client. (section 4.1)
Q: How do HTTP clients and servers detect the end of a series of message
header fields (section 4.1)?
Q: Is the CRLF at the end of the message headers optional?
- leading empty lines preceding a Request or Response SHOULD be ignored
(section 4.1, p.31)
Q: Determine if google.ca, yahoo.ca, and facebook.com adhere to the
above leading-blank-line SHOULD clause in section 4.1, p.31
- nc -v google.ca http OR telnet google.ca http
- multiple message-header fields with the same name are allowed
- but only if the entire field-value is a comma-separated list
- should behave as if they were all combined into one long field (p.32)
Q: T/F You can always send multiple identical message header fields; the
HTTP protocol says they will be concatenated.
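Combining repeated fields can be sketched in Python like this (the
function name combine_headers is made up for these notes). The sketch
blindly joins every repeated field, so it is only correct for fields
whose value is defined as a comma-separated list.

```python
def combine_headers(headers):
    """Merge repeated fields into one comma-separated value, in order."""
    combined = {}
    for name, value in headers:
        key = name.lower()         # field names are not case-sensitive
        if key in combined:
            combined[key] += ", " + value
        else:
            combined[key] = value
    return combined
```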
- message body MUST NOT be included unless specifically allowed (p.33)
- responses to "HEAD" MUST NOT include a message body (p.33)
Q: T/F All HTTP Responses may include an optional message body.
- HTTP Request and Response messages have the same general format:
Request = Request-Line ; Section 5.1
*(( general-header ; Section 4.5
| request-header ; Section 5.3
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 4.3
Response = Status-Line ; Section 6.1
*(( general-header ; Section 4.5
| response-header ; Section 6.2
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 7.2
- "general header" fields apply to the message, not to the entity
being transferred, and they can only be extended by a protocol
version change (p.35)
- "request header fields" - section 5.3 p.38
- can only be extended with a protocol change
- "response header fields" - section 6.2 p.39
- can only be extended with a protocol change
- unknown fields are treated as "entity header" fields
- you can have custom "entity header" fields without a protocol change
Q: T/F HTTP "general header fields" can appear in both Requests and Responses
Q: T/F Unrecognized HTTP header fields are presumed to apply to the
entity being transferred; they become "entity header" fields
- unlike SMTP (HELO and helo), the HTTP "method token" (e.g. "GET") is
case-sensitive and must be UPPER CASE ONLY (p.36)
- but HTTP header field names in HTTP messages are not case-sensitive! (p.31)
Q: T/F HTTP allows the use of either "HEAD" or "head" in a Request Line
- servers MUST support at least GET and HEAD (p.36)
Q: What method tokens are the minimum required of an HTTP server?
- A big change made from HTTP 1.0 to HTTP 1.1 was the requirement
that HTTP 1.1 Requests MUST include the "Host:" header to indicate the
network location of the web server with which you want to communicate.
(5.1.2 p.37, 9.0 p.51, 14.23 p.129, 19.6.1.1 p.171)
- With the HTTP 1.1 "Host:" header, a single IP address can now serve
multiple different web sites, each of which is at the same IP address
but has a unique network location.
- the network location in an absolute URI over-rides the "Host:" header (p.38)
- an unrecognized network location MUST produce a 400 Response
Q: If a client Request contains a host name in both the URI and the
Host: header, which one has priority?
Q: T/F If a URI or "Host:" header field specifies a host name that is not
recognized on this server, the server MUST forward the request to the
other host name. (5.2 p.38)
Q: List the names of the mandatory request header field(s) for HTTP 1.1
Q: T/F If you give the host name in a URI using HTTP 1.1, you don't need
to send the Host: header field, the name in the URI is sufficient.
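The host selection rules above can be sketched in Python. The function
name effective_host and the known_hosts set are made up for these
notes; a ValueError stands in for sending a 400 Response. The headers
argument is assumed to be a dictionary with lower-case field names.

```python
from urllib.parse import urlsplit


def effective_host(request_uri, headers, version, known_hosts):
    """Pick the host a Request is aimed at; ValueError means '400'."""
    host_header = headers.get("host")
    if version == "HTTP/1.1" and host_header is None:
        # HTTP 1.1 Requests MUST carry Host:, even with an absolute URI.
        raise ValueError("400 Bad Request: HTTP/1.1 request without Host")
    if request_uri.lower().startswith("http://"):
        # The host in an absolute URI over-rides the Host: header.
        host = urlsplit(request_uri).hostname
    elif host_header is not None:
        host = host_header.split(":")[0].strip().lower()  # drop :port
    else:
        host = None                # HTTP 1.0 without Host: no name given
    if host is not None and host not in known_hosts:
        raise ValueError("400 Bad Request: unknown host")
    return host
```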
HTTP Status Code and Reason Phrase - section 6.1.1 p.39
----------------------------------
- a 3 digit Status Code, machine-readable, followed by a human Reason Phrase
- only first digit has an assigned meaning (one of five) p.40
- five "classes" of response, based on the first digit (p.40)
- 1xx: Informational - Request received, continuing process
- 2xx: Success - The action was successfully received,
understood, and accepted
- 3xx: Redirection - Further action must be taken in order to
complete the request
- 4xx: Client Error - The request contains bad syntax or cannot
be fulfilled
- 5xx: Server Error - The server failed to fulfill an apparently
valid request
Q: What are the five possible meanings of the first digit of an HTTP response?
Q: T/F The Reason Phrases given in the HTTP RFC are recommendations
only; they MAY be changed or replaced with local equivalents without
affecting the protocol.
Q: T/F HTTP 1.1 clients do not need to understand the meaning of all the
registered three-digit HTTP 1.1 status codes.
Q: T/F An HTTP client MUST understand all five classes (first digit) of
Status Codes.
Q: If an HTTP server returns an unrecognized status code to a client,
what SHOULD the client do with the response? (6.1.1 p.41)
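A small Python sketch of how a client might use the first digit. The
function names are made up for these notes, and the set of "known"
codes passed in is whatever the client happens to understand. An
unrecognized code is treated as the x00 code of its class.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the class name given by the first digit of a Status Code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid Status Code: {code}")
    return STATUS_CLASSES[code // 100]


def effective_status(code: int, known_codes) -> int:
    """Treat an unrecognized code as the x00 code of its class."""
    if code in known_codes:
        return code
    return (code // 100) * 100
```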
Entity (section 7 p.42)
------
- the "entity" is the thing being transferred, e.g. image, text, etc.
- "entity headers" give information about the entity being transferred
- may include "extension header" fields
- unrecognized extension headers SHOULD be ignored
- entity body has a length header and so is 8-bit clean (unlike SMTP)
- but a transfer coding (chunking) may have been applied to assist transit
- The sender of an HTTP 1.1 message SHOULD give the Content-Type
- but if not (and only if not), the recipient MAY guess it by inspection
(7.2.1 p.43)
Q: T/F In the HTTP 1.1 protocol, senders MUST provide the entity
Content-Type header field.
Q: T/F A recipient may over-ride the Content-Type by inspecting the
entity being transferred (or its URI).
Q: If no Content-Type is specified, what type is assumed? (7.2.1)
- the entity-Length of a message is calculated *before* transfer
encodings have been applied (i.e. it is the actual length of the
entity, regardless of how it might be altered to be transferred)
- The Content-Length header, if present, MUST represent *both* the
entity-length and the actual transfer-length. (4.4 p.33)
- You MUST NOT send a Content-Length field if a Transfer-Encoding field
is present (because the transfer coding might change the size, and the
transfer coding method itself will specify the length).
Q: T/F The Content-Length, if present, is both the real size of the item
being sent and the size of the actual data being transferred.
Persistent Connections (HTTP 1.1 - section 8.1 p.44)
---------------------------------
- a significant upgrade from HTTP 1.0 - Persistent Connections
- HTTP 1.1 connections default to persistent, even upon error (8.1.2)
- persistent TCP connections have many advantages:
- fewer TCP handshakes
- reduced CPU, memory, latency
- allow pipelining multiple requests without waiting for responses
- longer connections allow better TCP congestion control
- allows HTTP to evolve more gracefully
- errors don't cause the connection to close
- no penalty for trying a feature then dropping back to previous version
Q: T/F HTTP implementations MUST implement persistent connections. (8.1.1)
Q: T/F A persistent connection MUST drop on an error condition. (8.1.2)
Q: Describe three of four advantages of persistent TCP connections (8.1)
- a "Connection:" header field can ask for explicit connection closing:
Connection: close
Q: How can you signal the end of an HTTP 1.1 persistent connection?
Q: T/F You signal the end of an HTTP session using the same keyword as
SMTP - QUIT.
- persistent connections require that all messages have a self-defined
message length, so you know where the next message begins - you
can't just end the message by closing the connection
Q: Why do persistent connections need message lengths?
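To see why the lengths matter, here is a Python sketch that splits
several back-to-back messages read from one persistent connection. The
function name split_messages is made up for these notes, and the sketch
assumes every message uses Content-Length (a missing Content-Length is
treated as an empty body, which is a simplification).

```python
def split_messages(stream: bytes):
    """Split back-to-back HTTP messages using their Content-Length."""
    messages = []
    pos = 0
    while pos < len(stream):
        # The header section ends at the first empty line.
        head_end = stream.index(b"\r\n\r\n", pos) + 4
        head = stream[pos:head_end].decode("iso-8859-1")
        length = 0
        for line in head.split("\r\n")[1:]:
            name, _, value = line.partition(":")
            if name.strip().lower() == "content-length":
                length = int(value.strip())
        # The body is exactly 'length' bytes; the next message follows.
        messages.append(stream[pos:head_end + length])
        pos = head_end + length
    return messages
```

Without the length there would be no way to tell where one body stops
and the next Status-Line starts, short of closing the connection.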
- clients should not pipeline non-idempotent methods or non-idempotent
sequences of methods, to avoid inconsistent state if the connection drops
in the middle and the same request has to be sent again
Q: Why not pipeline non-idempotent methods? (8.1.2.2 p.46)
- HTTP does not define any time-out for persistent connections
(actually, I can't find any time-out for *anything*!)
- connection close events may happen at any time (asynchronous)
- clients SHOULD limit to 2 the number of persistent connections to a server
Premature Server Close - 8.2.4 p.50
----------------------
- an issue with Internet protocols is: if the connection drops, when
and how often do you try to get it going again? Try too often and
you may contribute to network congestion.
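- HTTP "MAY" use "binary exponential backoff" of T = R * 2**N seconds,
where R is the round-trip time to the server and N is the number of
previous retries of this request (p.50)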
Q: T/F HTTP clients MAY double their wait time on each retry against
an HTTP server.
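The backoff formula can be worked out in a few lines of Python. The
function name retry_waits is made up for these notes; R is taken to be
the round-trip time in seconds and N the number of previous retries.

```python
def retry_waits(round_trip_time: float, retries: int):
    """Return the wait T = R * 2**N for N = 0, 1, ..., retries - 1."""
    return [round_trip_time * 2 ** n for n in range(retries)]
```

For example, with R = 0.5 seconds, four tries wait 0.5, 1, 2 and 4
seconds: the wait doubles on each retry.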