
API

The package contains the main classes and functions used to analyze the emails.

Abstraction

In information technology, abstraction is the process of hiding implementation details from the user; it is one of the fundamental concepts of object-oriented programming (OOP).

Here we use abstraction to hide the complexity of the email analysis process from the user and to provide a simple interface to the package. The following code shows the core concept of this package:

from spamanalyzer.analyzer import MailAnalyzer

analyser = MailAnalyzer(wordlist)
analysis = analyser.analyze(email_path)  # asynchronous analysis will be supported in the future
analysis

analysis.is_spam()
We instantiate the MailAnalyzer class and pass the wordlist to it. Then we call the analyze method to get the analysis of the email; in this way we can also parallelize the analysis of multiple emails.
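As a rough illustration of the parallelization mentioned above, the following sketch fans a list of paths out to a thread pool. The helper name `analyze_many` is hypothetical and not part of the package; it accepts any callable with the same shape as the `analyze` method and assumes that callable is safe to invoke from several threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def analyze_many(
    analyze: Callable[[str], T],
    email_paths: Iterable[str],
    max_workers: int = 4,
) -> List[T]:
    """Run `analyze` on every path in a thread pool (hypothetical helper).

    Results are returned in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, email_paths))
```

With the objects from the snippet above, a call could look like `analyze_many(analyser.analyze, ["first.eml", "second.eml"])`.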

Date

A date object used to store the date of the email and to perform some checks on it.

The checks determine whether the date is valid, that is, whether it is in the RFC 2822 format and its timezone is valid:

  • RFC 2822: specifies the format of the date in the mail headers, in the form Day, DD Mon YYYY HH:MM:SS TZ. It is not the only format used in headers, but it is the most common, so it is the one we use to check whether the date is valid.
  • TZ: specifies the timezone of the date. We include this check because malicious emails often behave oddly: it is not uncommon to see a nonexistent timezone in the mail headers (valid timezones range from -12 to +14).
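The two checks above can be sketched with a regular expression. This is only an illustration under stated assumptions, not the package's implementation: the pattern, the function names, and the choice to accept only numeric offsets such as `+0200` are all hypothetical.

```python
import re

# Day, DD Mon YYYY HH:MM:SS TZ, with TZ as a numeric offset such as +0200.
# Assumption: named zones such as "GMT" are not handled in this sketch.
RFC2822_PATTERN = re.compile(
    r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun), "
    r"(\d{1,2}) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"(\d{4}) "
    r"(\d{2}):(\d{2}):(\d{2}) "
    r"([+-])(\d{2})(\d{2})$"
)


def is_rfc2822_formatted(raw_date: str) -> bool:
    """Return True if the string matches Day, DD Mon YYYY HH:MM:SS TZ."""
    return RFC2822_PATTERN.match(raw_date.strip()) is not None


def timezone_hours(raw_date: str) -> int:
    """Return the timezone offset in whole hours, or 0 if it is not found."""
    match = RFC2822_PATTERN.match(raw_date.strip())
    if match is None:
        return 0
    sign = -1 if match.group(8) == "-" else 1
    return sign * int(match.group(9))


def is_tz_valid(raw_date: str) -> bool:
    """Return True if the offset lies in the range [-12, 14]."""
    return -12 <= timezone_hours(raw_date) <= 14
```

In this sketch a date with an offset such as `+1900` matches the format but fails the timezone check, which is the kind of inconsistency the text describes.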

day: int property

Get the day of the date.

hour: int property

Get the hour of the date.

minutes: int property

Get the minutes of the date.

month: int property

Get the month of the date.

seconds: int property

Get the seconds of the date.

timezone: int property

Get the timezone of the date.

Returns:

Name Type Description
int int The timezone of the date; if the timezone is not found, it returns 0

year: int property

Get the year of the date. It raises a ValueError if the year is less than 1971 since the first email was sent in 1971.

See the history of email to know more about the first email sent.

Raises:

Type Description
ValueError If the year is less than 1971

is_RFC2822_formatted()

Check if the date is in the RFC2822 format.

is_tz_valid() cached

Check if the timezone is valid. The timezone is valid if it is in the range [-12, 14].

Domain dataclass

A Domain is a class representing an internet domain; through it you can get information about the target domain.

The constructor resolves any domain alias to the real domain name. Common domain names are often aliases for more complex server names that would be difficult for users to remember. Since the socket module has no direct method to resolve domain aliases, we chain the gethostbyname and gethostbyaddr methods. This makes the instantiation of the class slower, but it is the only way to get the real domain name.
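The chained lookup described above might look roughly like the sketch below. The function name `resolve_alias`, the injectable `forward` and `reverse` parameters (there so the function can be exercised without network access), and the fallback to the original name on failure are assumptions made for illustration, not the package's actual behavior.

```python
import socket
from typing import Callable


def resolve_alias(
    name: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> str:
    """Resolve a domain alias to the host name reported by reverse DNS.

    Hypothetical helper: name -> IP address -> primary host name.
    """
    try:
        ip_addr = forward(name)
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        canonical, _aliases, _addresses = reverse(ip_addr)
        return canonical
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        # Assumption: keep the original name when a lookup fails.
        return name
```

Both lookups are blocking network calls, which is consistent with the note above that instantiation becomes slower.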

from_ip(ip_addr) async classmethod

Create a Domain object from an IP address. It translates the IP address to its domain name via the socket.gethostbyaddr method.

Parameters:

Name Type Description Default
ip_addr str the targeted IP address required

Returns:

Name Type Description
Domain Self the domain obtained from the IP address

from_string(domain_str) classmethod

Instantiate a Domain object from a string; it is a wrapper around the self.__init__ method.

Parameters:

Name Type Description Default
domain_str str a string containing a domain to be parsed required

Returns:

Name Type Description
Domain Self the domain obtained from the string

get_ip_address() async

Translate the domain name to its IP address by querying the DNS server.

Returns:

Name Type Description
str str the IP address of the domain

Note: this method is async since it performs a network request
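One way to perform such a lookup without blocking the event loop is asyncio's own resolver hook. The sketch below is a standalone function written under that assumption; it is not the method's actual body, and restricting the lookup to IPv4 is an illustrative choice.

```python
import asyncio
import socket


async def get_ip_address(hostname: str) -> str:
    """Resolve a host name to an IPv4 address without blocking the event loop."""
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is (address, port).
    return infos[0][4][0]
```

From synchronous code, the coroutine can be driven with `asyncio.run(get_ip_address("example.com"))`.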

is_subdomain(domain)

Is the domain a subdomain of the given domain?

Parameters:

Name Type Description Default
domain Domain the reference domain required

Returns:

Name Type Description
bool bool True if the domain is a subdomain of the given domain, False otherwise

Raises:

Type Description
TypeError if the given object is not a Domain

Note: a domain is a subdomain of itself

is_superdomain(domain)

Is the domain a superdomain of the given domain?

Parameters:

Name Type Description Default
domain Domain the reference domain required

Returns:

Name Type Description
bool bool True if the domain is a superdomain of the given domain, False otherwise

Raises:

Type Description
TypeError if the given object is not a Domain

Note: a domain is a superdomain of itself

relation(domain)

Define the relation between two domains.

Parameters:

Name Type Description Default
domain Domain the domain to compare with required

Returns:

Name Type Description
DomainRelation DomainRelation the relation between the two domains

Raises:

Type Description
TypeError if the given object is not a Domain
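The three comparison methods above can be illustrated on plain strings by comparing dot-separated labels from the right. This is a sketch only: the functions work on `str` rather than Domain objects, and the member names of the `DomainRelation` enum shown here are hypothetical, since the page does not list them.

```python
from enum import Enum


class DomainRelation(Enum):
    # Hypothetical member names, chosen for illustration only.
    EQUAL = "equal"
    SUBDOMAIN = "subdomain"
    SUPERDOMAIN = "superdomain"
    UNRELATED = "unrelated"


def _labels(name: str) -> list:
    """Split a domain name into lowercase labels, ignoring a trailing dot."""
    return name.lower().rstrip(".").split(".")


def is_subdomain(candidate: str, reference: str) -> bool:
    """True if `candidate` ends with all the labels of `reference`."""
    cand, ref = _labels(candidate), _labels(reference)
    return len(cand) >= len(ref) and cand[-len(ref):] == ref


def is_superdomain(candidate: str, reference: str) -> bool:
    """Mirror of is_subdomain: `reference` must be a subdomain of `candidate`."""
    return is_subdomain(reference, candidate)


def relation(first: str, second: str) -> DomainRelation:
    """Classify how `first` relates to `second`."""
    sub = is_subdomain(first, second)
    sup = is_superdomain(first, second)
    if sub and sup:
        return DomainRelation.EQUAL
    if sub:
        return DomainRelation.SUBDOMAIN
    if sup:
        return DomainRelation.SUPERDOMAIN
    return DomainRelation.UNRELATED
```

Comparing whole labels rather than raw string suffixes keeps a name such as `badexample.com` from being treated as a subdomain of `example.com`.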

MailAnalysis dataclass

A summary of the analysis of a mail.

attachments: dict[str, bool] instance-attribute

It is a dictionary containing a detailed analysis of the mail's attachments. It contains the following keys:

Key Type Description
has_attachments bool flag that indicates if the mail has attachments
attachment_is_executable bool flag that indicates if the mail has an attachment in executable format

body: dict[str, bool | float] instance-attribute

It is a dictionary containing a detailed analysis of the mail's body. It contains the following keys:

Key Type Description
contains_html bool flag that indicates if the body contains an html tag
contains_script bool flag that indicates if the body contains a script tag or a callback function
forbidden_words_percentage float the rate of forbidden words in the body of the mail, it is a float between 0 and 1
has_links bool flag that indicates if the body contains an url
has_mailto bool flag that indicates if the body contains a mailto link
https_only bool flag that indicates if the body contains only https links
contains_form bool flag that indicates if the body contains a form tag
has_images bool flag that indicates if the body contains an image
is_uppercase bool flag that indicates if more than 60% of the body is in uppercase
text_polarity float the polarity of the body, it is a float between -1 and 1
text_subjectivity float the subjectivity of the body, it is a float between 0 and 1
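A few of the body flags in the table above can be approximated with small pure functions. The sketch below is an assumption about how such checks could be written, not the package's code: the function names, the tokenization, the decision to measure uppercase over letters only, and the treatment of a body without links are all illustrative choices.

```python
import re


def is_mostly_uppercase(text: str, threshold: float = 0.6) -> bool:
    """True if more than `threshold` of the letters are uppercase."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    upper = sum(1 for ch in letters if ch.isupper())
    return upper / len(letters) > threshold


def forbidden_words_rate(text: str, forbidden: set) -> float:
    """Fraction of words in `text` found in `forbidden` (between 0 and 1)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in forbidden)
    return hits / len(words)


def https_only(text: str) -> bool:
    """True if every URL in `text` uses https.

    Assumption: a body with no links at all also returns True.
    """
    urls = re.findall(r"https?://\S+", text, flags=re.IGNORECASE)
    return all(url.lower().startswith("https://") for url in urls)
```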

file_path: str instance-attribute

The path of the file analyzed.

headers: dict instance-attribute

It is a dictionary containing a detailed analysis of the mail's headers. It contains the following keys:

Key Type Description
has_spf bool flag that indicates if the mail has a SPF header
has_dkim bool flag that indicates if the mail has a DKIM header
has_dmarc bool flag that indicates if the mail has a DMARC header
auth_warn bool flag that indicates if the mail has an Authentication-Warning header
domain_matches bool flag that indicates if the domain of the sender matches the first domain in the Received headers
has_suspect_subject bool flag that indicates if the mail's subject contains a suspicious word or a gappy word (e.g. H*E*L*L*O)
subject_is_uppercase bool flag that indicates if the mail's subject is in uppercase
send_date Date the date when the mail was sent, if the mail has no Date header, it is None
received_date Date the date when the mail was received, if the mail has no date in the Received header, it is None

  • has_spf: True if the mail has an SPF (Sender Policy Framework) header. SPF is a standard to prevent email spoofing: the SPF record is a TXT record containing a policy that specifies which mail servers are allowed to send email from a given domain.
  • has_dkim: True if the mail has a DKIM (DomainKeys Identified Mail) header. The DKIM signature is a digital signature added to an email message to verify that the message has not been altered since it was signed.
  • has_dmarc: True if the mail has a DMARC (Domain-based Message Authentication, Reporting & Conformance) header. The DMARC record is a type of DNS record that helps email receivers determine whether an email is legitimate.
  • auth_warn: True if the mail has an Authentication-Warning header. The Authentication-Warning header is used to indicate that the message has been modified in transit.
  • domain_matches: True if the domain of the sender matches the first domain in the Received headers.
  • has_suspect_subject: True if the mail's subject contains a suspicious word or a gappy word (e.g. H*E*L*L*O).
  • subject_is_uppercase: True if the mail's subject is in uppercase.
  • send_date: the date when the mail was sent, as a Date object; None if the mail has no Date header.
  • received_date: the date when the mail was received, as a Date object; None if the mail has no date in the Received header.
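To make the idea of a "gappy word" concrete, the following sketch flags subjects in which single letters are separated by symbols, as in H*E*L*L*O, or which contain a word from a suspicious list. The pattern, the function name, and the word list passed in are hypothetical; the real check may be stricter or looser (this pattern would, for instance, also flag an abbreviation such as U.S.A).

```python
import re

# A letter followed by at least two (symbol, letter) pairs, e.g. H*E*L*L*O.
GAPPY_WORD = re.compile(r"\b[A-Za-z](?:[^A-Za-z0-9\s][A-Za-z]){2,}\b")


def has_suspect_subject(subject: str, suspicious_words: set) -> bool:
    """True if the subject has a gappy word or a word from `suspicious_words`."""
    if GAPPY_WORD.search(subject):
        return True
    words = re.findall(r"[a-z]+", subject.lower())
    return any(word in suspicious_words for word in words)
```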

SpamAnalyzer

Analyze a mail and return a MailAnalysis object; essentially, it is a factory of MailAnalysis objects.

The MailAnalyzer object provides two methods to analyze a mail:

  • analyze: analyzes a mail from a file and returns a MailAnalysis object containing a description of the headers, body and attachments of the mail
  • get_domain: gets the domain of the mail from the headers and returns a Domain object

The core of the analysis is the analyze method, which uses the MailParser class (from the mailparser library) to parse the mail. The analysis is based on separate checks for the headers, body and attachments, and each check is implemented in a separate function: this makes the analysis modular and easy to extend in future versions.
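The modular layout described above can be sketched as a table of independent check functions, each producing one section of the result. To stay self-contained this example parses the mail with the standard library's `email` package instead of mailparser, and the function names, header names, and executable suffix list are assumptions made for illustration.

```python
from email import message_from_string
from email.message import Message

# Illustrative list, not the package's own.
EXECUTABLE_SUFFIXES = (".exe", ".bat", ".cmd", ".scr")


def inspect_headers(msg: Message) -> dict:
    """Header checks; the header names used here are assumptions."""
    return {
        "has_spf": "Received-SPF" in msg,
        "has_dkim": "DKIM-Signature" in msg,
        "auth_warn": "X-Authentication-Warning" in msg,
    }


def inspect_attachments(msg: Message) -> dict:
    """Attachment checks based on the file names of the message parts."""
    names = [part.get_filename() for part in msg.walk() if part.get_filename()]
    return {
        "has_attachments": bool(names),
        "attachment_is_executable": any(
            name.lower().endswith(EXECUTABLE_SUFFIXES) for name in names
        ),
    }


# Adding a new family of checks means adding one entry here.
CHECKS = {"headers": inspect_headers, "attachments": inspect_attachments}


def analyze(raw_mail: str) -> dict:
    """Parse the mail once and run every registered check on it."""
    msg = message_from_string(raw_mail)
    return {section: check(msg) for section, check in CHECKS.items()}
```

Keeping each check as a function of the parsed message means a new section, such as body checks, can be added without touching the existing ones.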

classify_multiple_input(mails)

Classify a list of mails.

Parameters:

Name Type Description Default
mails list[MailAnalysis] a list of mails to be classified required

Returns:

Name Type Description
list List[bool] a list of boolean values, True if the mail is spam, False otherwise

is_spam(email)

Determine if the email is spam based on the analysis of the mail.