HTTP Data Sets for Rate/Size/Duration Dependency Study
Format (columns; one row per HTTP response):
- Size (bytes)
- Capture time (Unix timestamp of the first packet in the response)
- Duration (seconds)
- 1 iff response was not fully captured (due to boundary effects).
- Order of this response within the connection in which it was transferred
(i.e., 1st, 2nd, ...).
Files:
Notes:
- Each data set was obtained by processing a 4-hour-long trace
of TCP headers. The 2002 traces required merging two overlapping
traces, each more than 2 hours long.
- Duration is defined as the difference between the timestamps of
the first and the last packets seen for each response.
- The size and duration of responses that started before the trace
collection began or ended after it finished are not accurate, since some
of their packets were not captured. The not-fully-captured flag column
(see Format above) is intended to mark these cases.
In addition, the monitor occasionally loses packets (with a very low
probability), and that may also result in partially captured responses.
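As an illustration of how these records could be used, here is a minimal sketch that parses response rows and computes a transfer rate (size divided by duration). It assumes whitespace-separated fields in the column order listed above; the names `Response`, `parse_responses`, and `rate`, as well as the sample rows, are hypothetical and not part of the data sets. Because duration is the difference between the first and last packet timestamps, a response seen in a single packet has zero duration, so the sketch returns no rate for it.

```python
from typing import Iterable, List, NamedTuple, Optional


class Response(NamedTuple):
    size: int        # bytes
    start: float     # Unix timestamp of the first packet in the response
    duration: float  # seconds between first and last packet seen
    partial: bool    # True if the response was not fully captured
    order: int       # position within its connection (1st, 2nd, ...)


def parse_responses(lines: Iterable[str]) -> List[Response]:
    """Parse rows, assuming whitespace-separated fields in the listed order."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        size, start, duration, partial, order = fields
        records.append(
            Response(int(size), float(start), float(duration),
                     partial == "1", int(order))
        )
    return records


def rate(resp: Response) -> Optional[float]:
    """Bytes per second, or None when the rate is not meaningful."""
    if resp.partial or resp.duration <= 0:
        return None  # partially captured, or all packets share one timestamp
    return resp.size / resp.duration


# Made-up rows for illustration only.
sample = [
    "12000 1019500000.25 0.5 0 1",
    "300 1019500001.00 0.0 0 2",
    "52000 1019500002.10 2.0 1 1",
]

for r in parse_responses(sample):
    print(r.order, r.size, rate(r))
```

Dropping flagged rows before computing rates is one possible choice; an analysis could also keep them and treat their size and duration as lower bounds.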
Format (columns; one row per TCP connection):
- Size (bytes)
- Capture time (timestamp of the first packet seen for the connection)
- Duration (seconds)
- Number of responses found in the connection (>= 1).
- 1 iff connection was not fully captured
Files:
Notes:
- These data sets are intended to help us compare our data with
that used in the Zhang et al. study.
- Duration is defined as the difference between the timestamps of
the first and the last packets seen for each connection (e.g., SYN and
FIN packets).
- Connections that carried more than one response are considered
effectively persistent HTTP connections.
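The persistence rule above can be sketched as follows. This is only an illustration: it assumes each row has already been parsed into a tuple in the column order listed above, and the function name `persistent_share` and the sample rows are hypothetical.

```python
from typing import Iterable, Tuple

# (size, start, duration, n_responses, partial), in the listed column order.
ConnectionRow = Tuple[int, float, float, int, int]


def persistent_share(rows: Iterable[ConnectionRow]) -> float:
    """Fraction of connections that carried more than one response."""
    rows = list(rows)
    if not rows:
        return 0.0
    persistent = sum(1 for _, _, _, n_responses, _ in rows if n_responses > 1)
    return persistent / len(rows)


# Made-up rows for illustration only.
sample = [
    (4000, 1019500000.0, 0.3, 1, 0),
    (90000, 1019500001.0, 4.2, 5, 0),
    (1200, 1019500002.0, 0.1, 1, 0),
    (700, 1019500003.0, 0.2, 1, 1),
]

print(persistent_share(sample))
```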
Format (columns; one row per document):
- Capture time (timestamp of the first packet seen in the first response of the document)
- Duration (time in seconds between the first and the last response packets seen for the document)
- Size (total size in bytes of the responses in the document)
- Number of responses found in the document
- Number of responses that were not fully captured
- Number of connections used to download the document
- Number of servers contacted to download the document
Files:
Notes:
- A document is a set of responses downloaded by a single client
from one or more servers. Client idle times (periods in which the client is
not downloading anything) longer than 1 second define document boundaries. Our
concept of a document corresponds to that of an HTML page with multiple
embedded images that are downloaded together. Two such multi-object page
downloads are likely to be separated in time (due to user think times).
Smith et al. give a more in-depth discussion of our concept of a document
and its limitations.
- Duration is defined as the difference between the timestamp of
the first packet seen for the first response and the timestamp
of the last packet seen for the last response in the document.
- With the current heuristic, parallel document downloads
are grouped together into a single
document, and this may skew the data. I analyzed long
inter-response times within documents, and the results suggest this problem
is unlikely to affect more than 2% of the data (0.2% using a conservative
inter-response time of 10 seconds).
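The document heuristic described in these notes can be sketched roughly as follows. This is a simplified illustration, not the code used to produce the data sets: it assumes the responses of one client are given as (first packet time, last packet time) pairs, and the names `group_into_documents` and `document_duration` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (first packet time, last packet time)


def group_into_documents(responses: List[Interval],
                         idle_threshold: float = 1.0) -> List[List[Interval]]:
    """Split one client's responses wherever the client was idle
    (no response in progress) for longer than idle_threshold seconds."""
    documents: List[List[Interval]] = []
    busy_until = None  # latest end time among responses seen so far
    for start, end in sorted(responses):
        if busy_until is None or start - busy_until > idle_threshold:
            documents.append([])  # idle gap exceeded: start a new document
            busy_until = end
        else:
            busy_until = max(busy_until, end)
        documents[-1].append((start, end))
    return documents


def document_duration(doc: List[Interval]) -> float:
    """First packet of the first response to last packet of the last one."""
    return max(end for _, end in doc) - min(start for start, _ in doc)


# Made-up response intervals for a single client, for illustration only.
sample = [(0.0, 0.5), (0.2, 0.9), (1.5, 1.6), (3.0, 3.2)]

docs = group_into_documents(sample)
print(len(docs), [len(d) for d in docs])
print(len(group_into_documents(sample, idle_threshold=10.0)))
```

Note that two pages fetched in parallel by the same client overlap in time, so this grouping merges them into one document, which is the limitation mentioned above.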
Format (columns; one row per client):
- Capture time (timestamp of the first packet seen in the first response received by the client)
- Duration (sum of the durations of the documents downloaded by the client)
- Idle time (sum of the idle times observed for this client)
- Size (total size in bytes of the responses downloaded by the client)
- Number of responses downloaded by the client
- Number of responses that were not fully captured
- Number of documents downloaded by the client
- Number of connections used by the client
- Number of servers contacted by the client
Files:
Notes:
- Some of the data points in these data sets
will correspond to more than one host due to dynamic IP address assignment
(i.e., a single IP address is used by more than one host over the duration of
the traces).
Some sort of timeout (client idle time) for DHCP addresses could improve
this data set. It should be kept in mind that this artifact will mostly group
together two or more hosts using the same network medium (so our
network dependency study will not be affected).
- Idle times are defined as quiet periods (i.e., no packet seen)
within a client session (i.e., between the first and the last packet)
that lasted for more than one second.
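The idle-time definition above can be illustrated with a short sketch. It assumes the packet timestamps of one client session are available as a list of floats; the function names `idle_periods` and `total_idle_time` and the sample timestamps are hypothetical.

```python
from typing import List, Tuple


def idle_periods(packet_times: List[float],
                 min_gap: float = 1.0) -> List[Tuple[float, float]]:
    """Quiet periods between consecutive packets lasting more than min_gap."""
    times = sorted(packet_times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > min_gap]


def total_idle_time(packet_times: List[float], min_gap: float = 1.0) -> float:
    """Sum of the idle periods within the session."""
    return sum(b - a for a, b in idle_periods(packet_times, min_gap))


# Made-up packet timestamps (seconds) for one client session.
sample = [0.0, 0.5, 2.0, 2.5, 6.0]

print(idle_periods(sample))
print(total_idle_time(sample))
```

Only gaps between the first and the last packet of the session are considered, so time before the first packet or after the last one never counts as idle.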
To Do:
- Follow up with Mark Lindsey on the frequency of IP address changes
for wireless hosts.
References:
- F.D. Smith, F. Hernandez Campos, K. Jeffay, and D. Ott,
"What TCP/IP protocol headers can tell us about the web,"
in Proceedings of ACM SIGMETRICS, 2001, pp. 245-256.
This paper describes the measurement and inference techniques we use to
create the data sets linked above from packet header traces (collected
at UNC's main link).
- Yin Zhang, Lee Breslau, Vern Paxson, and Scott Shenker,
"On the Characteristics and Origins of Internet Flow Rates,"
in Proceedings of ACM SIGCOMM, 2002.
This paper studies the correlation between size, duration, and rate in network
connections.
- B. A. Mah, "An Empirical Model of HTTP Network Traffic,"
in Proceedings of IEEE INFOCOM, April 1997.
This paper presents a model of HTTP traffic that is suitable for
traffic generation in testbed and simulation environments. The model was
populated using a technique related to the one we used to produce
our data sets.
- P. Barford and M. Crovella, "An Architecture for a WWW Workload Generator,"
in Proceedings of ACM SIGMETRICS, 1998.
This paper presents a different approach to HTTP modeling (log analysis).
It is also a good example of the type of models that are considered useful
in networking.
- N. Vicari, S. Köhler, and J. Charzinski,
"The Dependence of Internet User Characteristics on Access Speed,"
25th Local Computer Networks (LCN) 2000, November 2000, Tampa, USA.
This paper compares the distributions associated with dial-up and cable modem
users.
Félix Hernández-Campos
Last modified: Fri Oct 24 09:29:20 EDT 2003