As a software developer, you should have a complete, top-to-bottom understanding of how network applications work, including the technologies they are built on: browsers, HTTP, HTML, web servers, request handling, and so on. This article takes a closer look at what happens behind the scenes when you enter a URL.

1. First, you enter a URL into your browser

The first step of navigation is finding the IP address of the domain name you are visiting. The DNS lookup proceeds as follows:

Browser cache – The browser caches DNS records for some time. Interestingly, the operating system does not tell the browser how long each record may be kept, so each browser stores records for a fixed duration of its own, anywhere from 2 to 30 minutes.

OS cache – If the record is not found in the browser cache, the browser makes a system call (gethostbyname on Windows), which checks the operating system's cache.

Router cache – Next, the query reaches your router, which usually has its own DNS cache.

ISP DNS cache – The next stop is the ISP's caching DNS server; the record can usually be found here.

Recursive search – Failing all of the above, your ISP's DNS server performs a recursive search, starting from the root nameserver, through the .com top-level nameserver, down to Facebook's nameserver. In practice the DNS server will have the .com nameservers cached, so the trip to the root is usually unnecessary. The recursive DNS lookup is shown in the following figure:

One worrying thing about DNS is that an entire domain like wikipedia.org or facebook.com appears to map to just a single IP address. Fortunately, there are several ways to eliminate this bottleneck:

Round-robin DNS is a solution in which the DNS lookup returns multiple IP addresses instead of one. Facebook.com, for example, actually maps to four IP addresses.
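The OS-level step of this lookup chain can be exercised directly. Below is a minimal Python sketch (not from the original article) that asks the operating system's resolver for a name's IPv4 addresses; with round-robin DNS, one name may yield several addresses. The use of "localhost" in the demo call is just to keep the example self-contained:

```python
import socket

def resolve_ipv4(hostname):
    """Resolve a hostname to its IPv4 addresses via the OS resolver.

    The OS consults its own cache and the configured DNS servers,
    mirroring the cache hierarchy described above."""
    infos = socket.getaddrinfo(hostname, 80, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Deduplicate while preserving order; round-robin DNS may return
    # several A records for a single name.
    addresses = []
    for *_, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses

print(resolve_ipv4("localhost"))  # typically ['127.0.0.1']
```

Running the same function against a large site's domain would show the multiple addresses that round-robin DNS hands out.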
A load balancer is a piece of hardware that listens on a particular IP address and forwards requests to the servers in a cluster behind it. Large sites typically use these expensive, high-performance load balancers.

Geographic DNS improves scalability by mapping a domain name to different IP addresses depending on the user's geographic location. Because the different servers then do not need to keep shared state in sync, this works great for serving static content.

Anycast is a routing technique that maps one IP address to multiple physical hosts. Its drawback is that it does not fit well with TCP, so it is rarely used in those scenarios. Most DNS servers, however, do use Anycast to achieve efficient, low-latency DNS lookups.

Next, the browser sends a request. Because a dynamic page like the Facebook home page expires from the browser cache quickly, or even immediately, it cannot be served from the cache. So the browser sends the following request to the Facebook server:

GET http://facebook.com/ HTTP/1.1

The GET request names the URL to fetch: "http://facebook.com/". The browser identifies itself (the User-Agent header) and states what types of responses it will accept (the Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for subsequent requests.

The request also includes the cookies the browser has stored for this domain. As you probably already know, cookies are key-value pairs that track a website's state across page requests. Cookies store the login name, a session identifier assigned by the server, and some user settings. Cookies are kept as text on the client and sent back to the server with every request.

There are many tools for viewing raw HTTP requests and their responses. The author prefers Fiddler, though there are other tools such as Firebug. These tools are a great help in website optimization.
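To make the shape of such a request concrete, here is a hedged sketch that hand-builds an HTTP/1.1 GET with the headers discussed above and sends it over a raw TCP socket. Rather than contacting facebook.com, it talks to a throwaway local server, and the header values (User-Agent name, etc.) are invented for illustration:

```python
import socket
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

# A throwaway local server so the raw request below has something to
# talk to (a stand-in for the real facebook.com web server).
class QuietHandler(SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"
    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# The hand-built HTTP/1.1 GET request; header names match those above.
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}:{port}\r\n"
    "User-Agent: demo-client/0.1\r\n"   # the client identifies itself
    "Accept: text/html\r\n"             # response types it will accept
    "Accept-Encoding: gzip\r\n"         # compression it understands
    "Connection: close\r\n"             # close the TCP connection afterwards
    "\r\n"                              # blank line terminates the headers
)

with socket.create_connection((host, port), timeout=5) as sock:
    sock.sendall(request.encode("ascii"))
    raw_response = sock.makefile("rb").read()  # server closes when done

server.shutdown()
server.server_close()
status_line = raw_response.split(b"\r\n", 1)[0].decode("ascii")
print(status_line)  # HTTP/1.1 200 OK
```

A tool like Fiddler shows exactly this kind of raw request and response, just captured from a real browser instead of a script.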
Besides GET requests, there are also POST requests, which are typically used to submit forms. A GET request passes its parameters via the URL (e.g. http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, after the headers.

The picture shows the response the Facebook server sent back to the browser:

HTTP/1.1 301 Moved Permanently

The server answers the browser with a 301 permanent redirect, so that the browser visits "http://www.facebook.com/" instead of "http://facebook.com/".

Why does the server insist on the redirect instead of immediately sending the page the user wants to see? There are several interesting answers to this question.

One reason has to do with search engine rankings. If the same page is reachable at two addresses, such as http://www.igoro.com/ and http://igoro.com/, search engines may treat them as two separate sites, each with fewer incoming links and therefore a lower ranking. Search engines understand what a 301 permanent redirect means, so they attribute visits to the addresses with and without "www" to the same site.

Another reason is that multiple addresses hurt cacheability. When a page has several names, it may appear in caches several times.

Now the browser knows that "http://www.facebook.com/" is the correct address to visit, so it sends another GET request:

GET http://www.facebook.com/ HTTP/1.1

The headers carry the same meanings as in the previous request.

6. The server "processes" the request

The server receives the GET request, processes it, and returns a response. On the surface this may seem like a straightforward task, but a lot of interesting things happen along the way, even on a site as simple as the author's blog, let alone on a site with as many visitors as Facebook.

Web server software

Web server software (such as IIS or Apache) receives the HTTP request and decides which request handler should deal with it.
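The redirect dance can be sketched end to end. The toy server below (a stand-in for facebook.com; the Location URL is invented for the example) answers every request with a 301, and the client then does what a browser would do: read the status and the Location header, and visit the new address next:

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# A toy server that, like facebook.com, answers every request with a
# 301 pointing at the canonical "www" address.
class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", "http://www.example.com/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# What a browser does: inspect the status, then follow Location.
conn = HTTPConnection(host, port, timeout=5)
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # 301 http://www.example.com/
conn.close()
server.shutdown()
server.server_close()
```

A real browser would now issue its second GET to the address found in the Location header.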
A request handler is a program (written in ASP.NET, PHP, Ruby, ...) that reads the request and generates the HTML for the response. In the simplest case, the request handlers are stored in a file hierarchy that mirrors the site's address structure, so the address http://example.com/folder1/page1.aspx maps to the file /httpdocs/folder1/page1.aspx. The web server software can also be configured with explicit rules for mapping addresses to handlers, so that the public address of page1.aspx could instead be http://example.com/folder1/page1.

Request handling

The request handler reads the request, its parameters, and its cookies. It reads, and possibly updates, some data stored on the server. It then generates an HTML response.

Every dynamic website faces an interesting difficulty: how to store its data. Most small websites keep their data in a single SQL database, but sites that store a lot of data and/or handle heavy traffic have to find some way to spread the database across multiple machines. Solutions include sharding (splitting the tables across multiple databases based on the primary key), replication, and simplified databases with weakened consistency semantics.

Deferring work to batch jobs is a cheap technique for keeping data up to date. For example, Facebook must update the news feed in a timely fashion, but the data behind the "People You May Know" feature may only need to be updated nightly (that is the author's guess; how the feature really works is unknown). Batch updates let some less important data go stale, but they make the update work faster and simpler.

The figure shows the response the server generated and returned:

HTTP/1.1 200 OK

The entire response is 35kB, the bulk of it in the blob at the end, which has been trimmed here. The Content-Encoding header tells the browser that the response body is compressed with the gzip algorithm.
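As a rough illustration of sharding, the sketch below routes each row to one of several databases by hashing its primary key. The database names and shard count are invented for the example; real systems must also handle resharding when the number of machines changes:

```python
# Minimal sharding sketch: each user row lives in exactly one of
# several databases, chosen by the user's primary key.
NUM_SHARDS = 4
SHARDS = [f"users_db_{i}" for i in range(NUM_SHARDS)]  # invented names

def shard_for(user_id: int) -> str:
    """Pick the database that holds this user's row."""
    return SHARDS[user_id % NUM_SHARDS]

print(shard_for(85))    # users_db_1
print(shard_for(1024))  # users_db_0
```

Every query for a given user then goes only to that user's shard, which is what lets the data set grow past what a single machine can hold.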
After decompressing the blob, you can see the expected HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

Besides compression, the headers specify whether and how to cache the page, which cookies to set (none in this particular response), privacy information, and so on.

Note that the Content-Type header is set to "text/html". It tells the browser to render the response body as HTML rather than, say, download it as a file. The browser uses this header to decide how to interpret the response, but it also takes other factors into account, such as the extension in the URL.

The browser starts displaying the page even before it has finished reading the entire HTML document.

As the browser renders the HTML, it notices tags that require content to be fetched from other URLs, and it sends GET requests to retrieve those files. Here are a few of the URLs fetched during a visit to facebook.com:

Images
http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
…

CSS style sheets
http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
…

JavaScript files
http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
…

Each of these URLs goes through a process much like the HTML page's: the browser looks up the domain in DNS, sends the request, follows redirects, and so on.

Unlike dynamic pages, however, static files may be cached by the browser. Some of the files can be served straight from the cache, with no round trip to the server at all. The server's response tells the browser how long each static file should be kept, so the browser knows how long to cache it. In addition, each response may include an ETag header, which works like a version number for the requested resource.
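The gzip step described above is easy to simulate: compress an HTML body as the server would, then decompress it as the browser does. The tiny document here is a stand-in for the real 35kB response:

```python
import gzip

# When the response carries "Content-Encoding: gzip", the browser
# decompresses the body before parsing it as HTML.
html = b"<!DOCTYPE html><html><body>Hello</body></html>"

compressed = gzip.compress(html)        # what travels over the wire
restored = gzip.decompress(compressed)  # what the browser parses

print(restored == html)  # True
```

On a realistically sized, text-heavy page the compressed body is a fraction of the original size, which is why servers bother with this step at all.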
If the browser already holds a cached copy of the file, it sends that copy's ETag along with the request; when the version on the server still matches, the server can answer with a short "not modified" response instead of transmitting the file again.

Try to guess what "fbcdn.net" in those addresses stands for. A safe bet is "Facebook Content Delivery Network". Facebook uses a content delivery network (CDN) to serve static files: images, CSS style sheets, and JavaScript files. The files are replicated across many CDN data centers around the world.

Static content often accounts for most of a site's bandwidth, and it is easy to replicate across a CDN. Usually, websites use a third-party CDN provider; Facebook's static files, for example, are hosted by Akamai, the largest CDN provider. If you ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Interestingly, if you ping it again, a different server may respond, which shows the behind-the-scenes load balancing at work.

In the spirit of Web 2.0, the client stays in touch with the server even after the page is rendered.

Take Facebook chat as an example: it keeps communicating with the server so that the statuses of your friends, shown with lit-up or grayed-out avatars, stay up to date. To update those statuses, the JavaScript code running in the browser sends an asynchronous request to the server: a programmatically constructed GET or POST request to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of which of your friends are online.

This pattern is often called "AJAX", short for "Asynchronous JavaScript And XML", although there is no particular reason the server has to respond in XML. For example, Facebook returns snippets of JavaScript code in response to some asynchronous requests.
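Returning to the ETag revalidation described at the start of this section, the exchange can be sketched with a toy server that serves one "static file" under an invented ETag. The first request gets a full 200 response; the second presents the cached ETag in If-None-Match and earns a bodyless 304:

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

ETAG = '"v42"'  # an invented version tag for the example

class StaticHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)  # cached copy is still valid
            self.end_headers()
        else:
            body = b"body { color: blue }"
            self.send_response(200)
            self.send_header("ETag", ETAG)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), StaticHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# First fetch: full response; the "browser" remembers the ETag.
conn = HTTPConnection(host, port, timeout=5)
conn.request("GET", "/style.css")
first = conn.getresponse()
etag = first.getheader("ETag")
first.read()
conn.close()

# Revalidation: the cached ETag travels in If-None-Match.
conn = HTTPConnection(host, port, timeout=5)
conn.request("GET", "/style.css", headers={"If-None-Match": etag})
second = conn.getresponse()
second.read()
conn.close()
server.shutdown()
server.server_close()

print(first.status, second.status)  # 200 304
```

The 304 reply carries headers only, which is exactly the bandwidth saving the mechanism exists for.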
Among other things, Fiddler lets you watch the asynchronous requests sent by the browser. In fact, you can not only observe these requests passively, you can also modify and resend them. That AJAX requests are so easy to tamper with is frustrating for developers of online games that keep score. (Of course, don't cheat like that.)

Facebook chat illustrates an interesting problem with AJAX: pushing data from the server to the client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client on its own. Instead, the client has to poll the server every few seconds to ask whether there are new messages.

Long polling is an interesting technique for reducing the server load in these situations. When polled, a server with no new messages does not respond right away; it holds the request open. If a new message for the client arrives before the request times out, the server finds the pending request and returns the new message to the client as its response.
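The long-polling idea can be sketched in a few lines. In this simplified, in-process model (not Facebook's actual implementation), a queue stands in for the chat server's mailbox, and poll() holds the "request" open until a message arrives or the timeout expires:

```python
import queue
import threading
import time

# Stand-in for the chat server's per-user mailbox.
messages = queue.Queue()

def poll(timeout=30.0):
    """Hold the 'request' open; return a message, or None on timeout.

    A client that gets None simply issues the next poll immediately,
    so the server is never flooded with rapid empty-handed requests."""
    try:
        return messages.get(timeout=timeout)
    except queue.Empty:
        return None

def chat_partner():
    time.sleep(0.2)  # a friend types for a moment...
    messages.put("hey, are you there?")

threading.Thread(target=chat_partner).start()
print(poll(timeout=5))  # prints the message as soon as it arrives
```

Contrast this with plain polling, where the server answers every few seconds even when it has nothing to say; long polling converts most of those empty round trips into one held-open request.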