Monday, January 16, 2017

Keynote: URL types

More or less everyone has an idea of what a URL is, because most web browsers display it in the address bar.
At the same time, if we dig deeper, many software engineers are not aware of all the use cases or even of the URL types.
So, here are some short notes to refresh your memory.

Syntax

Every HTTP URL conforms to the syntax of a generic URI. A generic URI is of the form:

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

  • scheme - how the resource is to be accessed: http, https, ftp, file, etc.

  • [user:password@]host[:port] or [authority part] or [server] - specifies the name of the computer where the resource is located (optionally includes an authentication section and a port).

  • path - specifies the sequence of directories leading to the target. If resource is omitted, the target is the last directory in path.

  • resource - if included, resource is the target, and is typically the name of a file.

  • query - either null or an ASCII string holding data.

  • fragment - either null or an ASCII string holding data that can be used for further processing on the resource; usually the identifier of a component within the resource.

* Check the Uniform Resource Locator - Wiki Page for details.
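
As a rough illustration of the components listed above, here is a minimal sketch that splits a URL with the standard `URL` class found in browsers and Node.js. The sample URL and the names `sampleUrl` and `parts` are made up for this example.

```javascript
// Hypothetical sample URL, used only for illustration.
const sampleUrl = new URL(
  'https://user:secret@www.example.com:8080/docs/guide.html?lang=en#install'
);

// Map the parsed fields onto the component names used in this post.
const parts = {
  scheme: sampleUrl.protocol.slice(0, -1), // 'https:' without the trailing colon
  user: sampleUrl.username,
  password: sampleUrl.password,
  host: sampleUrl.hostname,
  port: sampleUrl.port,
  path: sampleUrl.pathname,
  query: sampleUrl.search.slice(1),        // drop the leading '?'
  fragment: sampleUrl.hash.slice(1),       // drop the leading '#'
};

console.log(parts);
```

Note that `URL` reports `port` as an empty string when the port is omitted or is the default one for the scheme.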


Absolute URL

scheme://server/path/resource

Absolute URL - Contains all the information necessary to locate a resource.

The Base URL of a resource is everything up to and including the last slash in its path name:

// Absolute URL
http://www.example.com/foo/bar.html

// Base URL
http://www.example.com/foo/
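
One possible way to derive the Base URL in code is to resolve the single-dot segment against the Absolute URL. This is only a sketch; the helper name `baseUrlOf` is hypothetical.

```javascript
// Hypothetical helper: returns everything up to and including the last
// slash of the path, by resolving '.' against the given Absolute URL.
function baseUrlOf(absoluteUrl) {
  return new URL('.', absoluteUrl).href;
}

console.log(baseUrlOf('http://www.example.com/foo/bar.html'));
// → http://www.example.com/foo/
```

Resolving `.` also drops any query string and fragment, which is usually what you want for a Base URL.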

Relative URL

relative_path/resource

Relative URL - Identifies a resource relative to its context.

In most cases, working out the absolute URL from a relative URL is just a matter of concatenating the Base URL and the Relative URL. The exceptions are the special forms below:

  • . (single-dot path segment) - refers to the current directory.
  • .. (double-dot path segment) - refers to the parent directory, stripping off everything up to the previous slash in Base URL.
  • Begins with / - a Relative URL that always replaces the entire pathname of the Base URL.
  • Begins with // - a Relative URL that always replaces everything from the hostname onwards.

Resolving Relative URLs, all of which are assumed to have the Base URL http://example.com/foo/:

Relative URI      Absolute URI
bar.html          http://example.com/foo/bar.html
help/             http://example.com/foo/help/
help/rule.html    http://example.com/foo/help/rule.html
../               http://example.com/
../../../         http://example.com/
./                http://example.com/foo/
./bar.html        http://example.com/foo/bar.html
/                 http://example.com/
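
The resolutions above can be reproduced with the two-argument form of the standard `URL` constructor, which takes a relative reference and a base. This is a small sketch; the names `base`, `relatives` and `resolved` are made up for the example.

```javascript
const base = 'http://example.com/foo/';

// The relative references from the table above.
const relatives = [
  'bar.html', 'help/', 'help/rule.html',
  '../', '../../../', './', './bar.html', '/',
];

// new URL(relative, base) resolves each reference against the Base URL.
const resolved = relatives.map((rel) => new URL(rel, base).href);

relatives.forEach((rel, i) => console.log(rel, '->', resolved[i]));
```

Note that `../../../` cannot climb above the root, so it resolves to the same URL as `../`.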

* Here are the path definitions taken from the URL Specification:

A path-absolute-URL string must be / followed by a path-relative-URL string.

A path-relative-URL string must be zero or more URL-path-segment strings, separated from each other by /, and not start with /.

A URL-path-segment string must be zero or more URL units excluding / and ?; or a single-dot path segment or a double-dot path segment.


Protocol-relative URL

//server/path/resource

Protocol-relative URL - has no protocol specified, so the protocol (http or https) is taken automatically from the page that references it.

So, the "two forward slashes" are a common shorthand for "whatever protocol is being used right now".
You can often see this approach when JS libraries are included from a CDN.
By using protocol relative URLs, you can avoid implementing:

Js
// Pick the URL that matches the protocol of the current page.
let myResourceUrl;
if (window.location.protocol === 'http:') {
    myResourceUrl = 'http://example.com/resource.js';
} else {
    myResourceUrl = 'https://example.com/resource.js';
}

Resolving Protocol-relative URLs (the scheme comes from the referencing page):

Protocol-relative URI      Absolute URI
//www.example.com/         http://www.example.com/
//www.example.com:8080/    http://www.example.com:8080/
//www.google.com/          https://www.google.com/
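
To see how the scheme is inherited, the sketch below resolves protocol-relative references against pages served over http and https, using the standard `URL` constructor. The helper `resolveProtocolRelative` and the page URLs under `site.test` are assumptions made for this example.

```javascript
// Hypothetical helper: resolves a reference the way a page at pageUrl would.
function resolveProtocolRelative(reference, pageUrl) {
  return new URL(reference, pageUrl).href;
}

// Same kind of reference, different page protocols.
console.log(resolveProtocolRelative('//www.example.com:8080/', 'http://site.test/index.html'));
// → http://www.example.com:8080/
console.log(resolveProtocolRelative('//www.google.com/', 'https://site.test/index.html'));
// → https://www.google.com/
```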

Canonical URL

scheme://server/

Canonical URL - the best URL to use when there are several choices. It usually refers to the home page.

For example, to the end user these URLs look the same:

www.example.com
example.com/
www.example.com/index.html

Technically, however, all of these URLs are different. A web server could return completely different content for each of the URLs above. When Google "canonicalizes" a URL, it tries to pick the URL that seems like the best representative from that set.

So, to make sure that Google picks the URL that you want, be consistent.
Pick the url you prefer and always use that format for your internal links.
Do not make half of your URLs refer to http://example.com/ and the other half to http://www.example.com/.


Internationalized URL

scheme://internationalized_domain/encoded_path/

An Internationalized Resource Identifier (IRI) is a form of URL that includes Unicode characters.

Web and Internet software automatically converts the domain name into Punycode usable by the Domain Name System.

The URL path name can also be specified by the user in the local alphabet. If not already encoded, it is converted to UTF-8, and any characters that are not part of the basic URL character set are escaped as hexadecimal digits using percent-encoding.

URL encoding is useful not only when using local alphabets in a URL, but also when you want to pass a path or another URL as an argument:

http://www.example.com/?external_resource=http://domain/path/resource/&

// if URL encoded, becomes
http://www.example.com/?external_resource=http%3A%2F%2Fdomain%2Fpath%2Fresource%2F%26
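
Both kinds of encoding can be tried out with built-ins: `encodeURIComponent` for a URL passed as an argument, and the standard `URL` class for an internationalized host and path. This is a sketch only; the domain `münchen.de` and the path `café` are sample values chosen for illustration.

```javascript
// Encode a whole URL so that it can travel safely inside a query argument.
const nestedUrl = 'http://domain/path/resource/&';
const encodedArg = encodeURIComponent(nestedUrl);
const fullUrl = 'http://www.example.com/?external_resource=' + encodedArg;
console.log(fullUrl);

// Sample internationalized URL: http://münchen.de/café
// (written with \u escapes so the source file stays ASCII-only).
const idn = new URL('http://m\u00fcnchen.de/caf\u00e9');
console.log(idn.hostname); // Punycode form of the domain
console.log(idn.pathname); // percent-encoded UTF-8 path: /caf%C3%A9
```

`decodeURIComponent(encodedArg)` gives back the original nested URL on the receiving side.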

Normalized URL

Normalized URL - a URL transformed into a consistent form, so that it is possible to determine whether two syntactically different URLs are equivalent.

There are various types of normalization that may be performed; here are some of them:

Normalization type            Input                                       Output
Remove duplicate slashes      http:////www.example.com///foo/             http://www.example.com/foo/
Scheme & Host to lowercase    HTTP://www.Example.com/                     http://www.example.com/
Adding trailing /             http://www.example.com/foo                  http://www.example.com/foo/
Removing the default port     http://www.example.com:80/                  http://www.example.com/
Removing dot-segments         https://www.example.com/../foo/./bar.html   https://www.example.com/foo/bar.html
Decoding of Unreserved chars  http://www.example.com/%7Efoo_bar10/        http://www.example.com/~foo_bar10/
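
A few of these normalizations can be sketched in code. The standard `URL` class already lowercases the scheme and host, removes the default port and resolves dot-segments; the hypothetical helper `normalizeUrl` below adds duplicate-slash removal and decoding of unreserved characters on top of that. It deliberately does not add a trailing slash, since that can change which resource is addressed.

```javascript
// Hypothetical helper covering some of the normalization types above.
function normalizeUrl(input) {
  // Parsing lowercases scheme and host, drops the default port,
  // and removes dot-segments from the path.
  const url = new URL(input);

  // Collapse runs of slashes in the path into a single slash.
  let path = url.pathname.replace(/\/{2,}/g, '/');

  // Decode percent-encoded unreserved characters (letters, digits, - . _ ~);
  // every other escape sequence is left untouched.
  path = path.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : match;
  });

  url.pathname = path;
  return url.href;
}

console.log(normalizeUrl('HTTP://www.Example.com:80/../foo/./bar.html'));
// → http://www.example.com/foo/bar.html
```

Two URLs can then be compared by checking `normalizeUrl(a) === normalizeUrl(b)`, keeping in mind that this sketch covers only some of the normalization types.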
