Module Halberd.clues.analysis

Utilities for clue analysis.

Functions

  diff_fields(clues) -> list
      Study differences between fields.

  ignore_changing_fields(clues)
      Tries to detect and ignore MIME fields with ever-changing content.

  get_digest(clue) -> str
      Returns the specified clue's digest.

  clusters(clues, step=3) -> tuple
      Finds clusters of clues.

  merge(clues) -> Clue
      Merges a sequence of clues into one.

  classify(seq, *classifiers) -> dict
      Classify a sequence according to one or several criteria.

  sections(classified, sects=None) -> list
      Returns sections (and their items) from a nested dict.

  deltas(xs) -> list
      Computes the differences between the elements of a sequence of integers.

  slices(start, xs) -> list of slice
      Returns slices of a given sequence separated by the specified indices.

  sort_clues(clues)
      Sorts clues according to their time difference.

  filter_proxies(clues, maxdelta=3) -> list
      Detect and merge clues pointing to a proxy cache on the remote end.

  uniq(clues) -> list
      Return a list of unique clues.

  hits(clues) -> int
      Compute the total number of hits in a sequence of clues.

  analyze(clues) -> list
      Draw conclusions from the clues obtained during the scanning phase.

  reanalyze(clues, analyzed, threshold)
      Identify and ignore changing header fields.

  _test()

Variables

  logger = Halberd.logger.getLogger()
  __package__ = 'Halberd.clues'

Function Details

diff_fields(clues)

Study differences between fields.

Parameters:
  • clues (list) - Clues to analyze.
Returns: list
Fields which were found to be different among the analyzed clues.
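
The exact header representation inside Clue objects is not shown on this page, so the following is only a rough, self-contained illustration of the idea behind diff_fields (not the module's actual code): compare parsed headers across clues and collect the names of fields whose values are not identical everywhere. The dict-based header layout is an assumption.

def changed_field_names(headers_per_clue):
    # headers_per_clue: a list of dicts mapping header names to values,
    # one dict per clue (an assumed stand-in for a Clue's parsed headers).
    names = set()
    for headers in headers_per_clue:
        names.update(headers)
    differing = []
    for name in sorted(names):
        values = set(headers.get(name) for headers in headers_per_clue)
        if len(values) > 1:
            differing.append(name)
    return differing

print(changed_field_names([
    {'Server': 'Apache', 'Date': 'Mon, 01 Jan 2024 00:00:00 GMT'},
    {'Server': 'Apache', 'Date': 'Mon, 01 Jan 2024 00:00:03 GMT'},
]))   # prints ['Date']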

ignore_changing_fields(clues)

Tries to detect and ignore MIME fields with ever-changing content.

Some servers include fields that vary with time, randomly, etc. Those fields are likely to alter the clue's digest and interfere with analyze, producing many false positives and making the scan useless. This function detects those fields and recalculates each clue's digest so the clues can be safely analyzed again.

Parameters:
  • clues (list or tuple) - Sequence of clues.
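
A minimal, self-contained sketch of the general idea described above, not the module's actual implementation: once the changing field names are known, drop them from each clue's parsed headers and recompute the digest. FakeClue and its attributes are illustration-only stand-ins; the real Clue class stores its headers and digest differently.

import hashlib

class FakeClue(object):
    # Illustration-only stand-in for Halberd's Clue class.
    def __init__(self, headers):
        self.headers = dict(headers)
        self.digest = self.compute_digest()

    def compute_digest(self):
        # Hash whatever headers remain; the real digest computation differs.
        data = repr(sorted(self.headers.items())).encode('ascii')
        return hashlib.md5(data).hexdigest()

def drop_changing_fields(clues, changing):
    # Remove ever-changing fields and refresh each clue's digest.
    for clue in clues:
        for name in changing:
            clue.headers.pop(name, None)
        clue.digest = clue.compute_digest()

a = FakeClue({'Server': 'Apache', 'Date': 'Mon, 01 Jan 2024 00:00:00 GMT'})
b = FakeClue({'Server': 'Apache', 'Date': 'Mon, 01 Jan 2024 00:00:03 GMT'})
drop_changing_fields([a, b], ['Date'])
print(a.digest == b.digest)   # True once the noisy field is gone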

get_digest(clue)

Returns the specified clue's digest.

This function is usually passed as a parameter to classify so it can separate clues according to their digest (among other fields).

Returns: str
The digest of a clue's parsed headers.

clusters(clues, step=3)

Finds clusters of clues.

A cluster is a group of at most step clues that differ from each other by no more than one second.

Parameters:
  • clues (list or tuple) - A sequence of clues to analyze
  • step (int) - Maximum difference between the time differences of the cluster's clues.
Returns: tuple
A sequence with merged clusters.
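
As a plain-integer illustration of the grouping rule sketched above (not the function's actual code, which works on Clue objects and returns the merged clusters), consecutive time differences that are at most one second apart can be collected into the same group:

def group_by_gap(diffs, max_gap=1):
    # Group sorted time differences so that consecutive members of a
    # group are at most max_gap seconds apart.
    groups = []
    for diff in sorted(diffs):
        if groups and diff - groups[-1][-1] <= max_gap:
            groups[-1].append(diff)
        else:
            groups.append([diff])
    return groups

print(group_by_gap([10, 11, 12, 40, 41, 90]))
# prints [[10, 11, 12], [40, 41], [90]]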

merge(clues)

Merges a sequence of clues into one.

A new clue will store the total count of the merged clues.

Note that each Clue has a starting count of 1:

>>> a, b, c = Clue(), Clue(), Clue()
>>> sum([x.getCount() for x in [a, b, c]])
3
>>> a.incCount(5), b.incCount(11), c.incCount(23)
(None, None, None)
>>> merged = merge((a, b, c))
>>> merged.getCount()
42
>>> merged == a
True
Parameters:
  • clues (list or tuple) - A sequence containing all the clues to merge into one.
Returns: Clue
The result of merging all the passed clues into one.

classify(seq, *classifiers)

Classify a sequence according to one or several criteria.

We store each item in a nested dictionary, using the classifiers as key generators (all of them must be callable objects).

In the following example we classify a list of clues according to their digest and their time difference.

>>> a, b, c = Clue(), Clue(), Clue()
>>> a.diff, b.diff, c.diff = 1, 2, 2
>>> a.info['digest'] = 'x'
>>> b.info['digest'] = c.info['digest'] = 'y'
>>> get_diff = lambda x: x.diff
>>> classified = classify([a, b, c], get_digest, get_diff)
>>> digests = classified.keys()
>>> digests.sort()  # We sort these so doctest won't fail.
>>> for digest in digests:
...     print digest
...     for diff in classified[digest].keys():
...         print ' ', diff
...         for clue in classified[digest][diff]:
...             if clue is a: print '    a'
...             elif clue is b: print '    b'
...             elif clue is c: print '    c'
...
x
  1
    a
y
  2
    b
    c
Parameters:
  • seq (list or tuple) - A sequence to classify.
  • classifiers (list or tuple) - A sequence of callables which return specific fields of the items contained in seq.
Returns: dict
A nested dictionary in which the keys are the fields obtained by applying the classifiers to the items in the specified sequence.

sections(classified, sects=None)

Returns sections (and their items) from a nested dict.

See also: classify

Parameters:
  • classified (dict) - Nested dictionary.
  • sects (list) - List of results. It should not be specified by the user.
Returns: list
A list of lists where each item is a subsection of the nested dictionary.
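
Continuing the classify example, one plausible reading of this function is that it walks the nested dictionary down to its innermost lists. A small, self-contained sketch of that behaviour, which is an assumption rather than the module's actual code:

def leaf_lists(classified, out=None):
    # Walk a nested dict such as the one produced by classify and
    # collect its innermost lists.
    if out is None:
        out = []
    for value in classified.values():
        if isinstance(value, dict):
            leaf_lists(value, out)
        else:
            out.append(value)
    return out

nested = {'x': {1: ['a']}, 'y': {2: ['b', 'c']}}
print(sorted(leaf_lists(nested)))
# prints [['a'], ['b', 'c']]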

deltas(xs)

Computes the differences between the elements of a sequence of integers.

>>> deltas([-1, 0, 1])
[1, 1]
>>> deltas([1, 1, 2, 3, 5, 8, 13])
[0, 1, 1, 2, 3, 5]
Parameters:
  • xs (list) - A sequence of integers.
Returns: list
A list of differences between consecutive elements of xs.

slices(start, xs)

Returns slices of a given sequence separated by the specified indices.

If we wanted to get the slices necessary to split range(20) into sub-sequences of 5 items each, we'd do:

>>> seq = range(20) 
>>> indices = [5, 10, 15]
>>> for piece in slices(0, indices):
...     print seq[piece]
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11, 12, 13, 14]
[15, 16, 17, 18, 19]
Parameters:
  • start (int) - Index of the first element of the sequence we want to partition.
  • xs (list) - Sequence of indexes where 'cuts' must be made.
Returns: list of slice
A sequence of slice objects suitable for splitting a list as specified.

filter_proxies(clues, maxdelta=3)

Detect and merge clues pointing to a proxy cache on the remote end.

Parameters:
  • clues (list) - Sequence of clues to analyze
  • maxdelta (int) - Maximum difference allowed between a clue's time difference and the previous one.
Returns: list
Sequence where all irrelevant clues pointing to proxy caches have been filtered out.

uniq(clues)

Return a list of unique clues.

This is needed when merging clues coming from different sources. Clues with the same time difference and digest are not discarded; instead, they are merged into one clue with the aggregated number of hits.

Parameters:
  • clues (list) - A sequence containing the clues to analyze.
Returns: list
Filtered sequence of clues in which no two clues share the same digest and time difference.
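
The behaviour described above (clues sharing a digest and time difference are combined and their hit counts added up) can be illustrated with plain tuples; the real function operates on Clue objects:

def merge_duplicates(records):
    # records: (digest, time_diff, hit_count) tuples standing in for
    # clues; entries sharing (digest, time_diff) are combined and their
    # hit counts summed.
    totals = {}
    for digest, diff, count in records:
        key = (digest, diff)
        totals[key] = totals.get(key, 0) + count
    return [(digest, diff, count)
            for (digest, diff), count in sorted(totals.items())]

print(merge_duplicates([('x', 1, 3), ('x', 1, 2), ('y', 2, 5)]))
# prints [('x', 1, 5), ('y', 2, 5)]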

hits(clues)

Compute the total number of hits in a sequence of clues.

Parameters:
  • clues (list) - Sequence of clues.
Returns: int
Total hits.
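
Given the getCount accessor shown earlier in the merge doctest, the total is presumably just the sum of the per-clue counts; a one-line sketch under that assumption:

def total_hits(clues):
    # Assumes the getCount() accessor seen in the merge doctest above.
    return sum(clue.getCount() for clue in clues)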

analyze(clues)

Draw conclusions from the clues obtained during the scanning phase.

Parameters:
  • clues (list) - Unprocessed clues obtained during the scanning stage.
Returns: list
Coherent list of clues identifying real web servers.

reanalyze(clues, analyzed, threshold)

Identify and ignore changing header fields.

After the initial analysis one must check that there aren't as many real servers as obtained clues. If there were, it could be a sign that something is wrong: each clue differs from the others because of one or more MIME header fields that change unexpectedly.

Parameters:
  • clues (list) - Raw sequence of clues.
  • analyzed (list) - Result from the first analysis phase.
  • threshold (float) - Minimum clue-to-real-server ratio needed to trigger field inspection.
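
A heuristic sketch of the check described above, for illustration only: when almost every clue maps to its own "real server", an ever-changing header field is suspected and field inspection should kick in. Exactly how the real reanalyze combines len(clues), len(analyzed) and threshold is an assumption here.

def looks_like_changing_fields(clues, analyzed, threshold):
    # True when the number of apparent real servers is close to the
    # number of raw clues, which hints at a noisy header field rather
    # than genuinely different servers.
    if not clues:
        return False
    return len(analyzed) >= len(clues) * threshold

print(looks_like_changing_fields(range(10), range(9), 0.7))   # prints True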