API¶
The package contains the main classes and functions used to analyze the emails.
Abstraction¶
In information technology, abstraction is the process of hiding implementation details from the user, and it is one of the fundamental concepts of object-oriented programming (OOP).
Here we use abstraction to hide the complexity of the email analysis process from the user and to provide a simple interface to the package. The following code shows the core concept of this package:
```python
from spamanalyzer.analyzer import MailAnalyzer

analyser = MailAnalyzer(wordlist)
analysis = analyser.analyze(email_path)  # asynchronous analysis will be supported in the future
analysis
analysis.is_spam()
```
First we create an instance of the MailAnalyzer class and pass the wordlist to it. Then we call the analyze method to get the analysis of the email; in this way we can also parallelize the analysis of multiple emails.
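As a rough illustration of that last point, the sketch below fans one analysis per file out to a thread pool. It does not use the package itself: `analyze_stub` is a hypothetical stand-in for `MailAnalyzer.analyze`, and `analyze_many` is an illustrative helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_stub(email_path: str) -> dict:
    """Hypothetical stand-in for MailAnalyzer.analyze: returns a tiny summary."""
    return {"file_path": email_path, "is_eml": email_path.endswith(".eml")}


def analyze_many(paths, analyze=analyze_stub, max_workers=4):
    """Run one analysis per path in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, paths))


results = analyze_many(["a.eml", "b.txt"])
```

Passing the analysis callable as a parameter keeps the sketch independent of how the real analyzer is constructed.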
Date
¶
A date object used to store the date of the email and to perform some checks on it.
The checks determine whether the date is valid and in the correct format. The date is valid if it is in the RFC2822 format and if the timezone is valid:
- RFC2822: specifies the format of the date in the headers of the mail, in the form Day, DD Mon YYYY HH:MM:SS TZ. It is not the only format used in the headers, but it is the most common, so it is the one we use to check if the date is valid.
- TZ: specifies the timezone of the date. We included this check because malicious emails often behave oddly: it is not uncommon to see a nonexistent timezone in the headers of the mail (valid timezones range from -12 to +14).
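The two checks can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `parse_rfc2822` and `tz_in_range` are hypothetical helper names, and `email.utils.parsedate_to_datetime` stands in for whatever parser the package actually uses.

```python
from email.utils import parsedate_to_datetime


def parse_rfc2822(raw: str):
    """Return (datetime, UTC offset in hours), or None if the date cannot be parsed."""
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    offset = parsed.utcoffset()
    # Assumption: a missing offset is treated as 0, mirroring the timezone property.
    hours = 0.0 if offset is None else offset.total_seconds() / 3600
    return parsed, hours


def tz_in_range(hours: float) -> bool:
    """A timezone is considered valid when it lies in [-12, 14]."""
    return -12 <= hours <= 14


ok = parse_rfc2822("Mon, 20 Nov 1995 19:12:08 -0500")
odd = parse_rfc2822("Mon, 20 Nov 1995 19:12:08 +1500")
```

Here `ok` carries an offset inside the accepted range, while `odd` parses successfully but fails the range check.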
day: int
property
¶
Get the day of the date.
hour: int
property
¶
Get the hour of the date.
minutes: int
property
¶
Get the minutes of the date.
month: int
property
¶
Get the month of the date.
seconds: int
property
¶
Get the seconds of the date.
timezone: int
property
¶
Get the timezone of the date.
Returns:
Name | Type | Description |
---|---|---|
int | int | The timezone of the date; if the timezone is not found it returns 0 |
year: int
property
¶
Get the year of the date. It raises a ValueError if the year is less than 1971, since the first email was sent in 1971. See history of email to know more about the first email sent.
Raises:
Type | Description |
---|---|
ValueError | If the year is less than 1971 |
is_tz_valid()
cached
¶
The timezone is valid if it is in the range [-12, 14]
Domain
dataclass
¶
A Domain is a class representing an internet domain; from it you can get information about the target domain.
The constructor resolves any domain alias to the real domain name. Common domain names are often aliases for more complex server names that would be difficult for users to remember. Since the socket module has no direct method to resolve domain aliases, we chain the gethostbyname and gethostbyaddr methods. This makes the instantiation of the class slower, but it is the only way to get the real domain name.
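The chaining described above can be sketched as follows. This is an illustration under assumptions, not the constructor's actual code: `resolve_real_name` is a hypothetical helper, the two resolvers are injectable so the function can be exercised without network access, and falling back to the input name on failure is a choice made for this sketch.

```python
import socket


def resolve_real_name(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Follow an alias: name -> IP address -> primary host name."""
    try:
        ip_addr = forward(name)
        # gethostbyaddr returns (primary host name, alias list, address list).
        primary_name, _aliases, _addresses = reverse(ip_addr)
    except OSError:
        # Assumption: fall back to the given name when DNS resolution fails.
        return name
    return primary_name
```

Calling `resolve_real_name("www.example.com")` with the default resolvers performs two DNS lookups, which is where the extra instantiation cost comes from.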
from_ip(ip_addr)
async
classmethod
¶
Create a Domain object from an IP address. It translates the IP address to its domain name via the socket.gethostbyaddr method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ip_addr | str | the targeted IP address | required |
Returns:
Name | Type | Description |
---|---|---|
Domain | Self | the domain obtained from the IP address |
from_string(domain_str)
classmethod
¶
Instantiate a Domain object from a string; it is a wrapper of the self.__init__ method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain_str | str | a string containing a domain to be parsed | required |
Returns:
Name | Type | Description |
---|---|---|
Domain | Self | the domain obtained from the string |
get_ip_address()
async
¶
Translate the domain name to its IP address by querying the DNS server.
Returns:
Name | Type | Description |
---|---|---|
str | str | the IP address of the domain |
Note: this method is async since it performs a network request
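One common way to keep a blocking DNS lookup from stalling the event loop is to run it in a worker thread. The sketch below shows that pattern under assumptions: the function is a free-standing illustration rather than the real method, the `resolver` parameter is hypothetical, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import socket


async def get_ip_address(name: str, resolver=socket.gethostbyname) -> str:
    """Resolve name in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(resolver, name)


def fake_resolver(name: str) -> str:
    """Offline stand-in for a DNS lookup, using a documentation-range address."""
    return "192.0.2.7"


address = asyncio.run(get_ip_address("example.org", resolver=fake_resolver))
```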
is_subdomain(domain)
¶
Is the domain a subdomain of the given domain?
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | Domain | the reference domain | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the domain is a subdomain of the given domain, False otherwise |
Raises:
Type | Description |
---|---|
TypeError | if the given object is not a Domain |
Note: a domain is a subdomain of itself
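A subdomain test of this kind is usually done on whole labels rather than raw string suffixes, so that `badexample.com` is not mistaken for a subdomain of `example.com`. The sketch below works on plain strings and is only an assumption about how such a check could look; `labels` and this `is_subdomain` function are illustrative, not the class's methods.

```python
def labels(name: str) -> list[str]:
    """Split a domain name into lowercase labels, ignoring a trailing dot."""
    return name.lower().rstrip(".").split(".")


def is_subdomain(candidate: str, reference: str) -> bool:
    """True if candidate ends with all labels of reference (a domain matches itself)."""
    cand, ref = labels(candidate), labels(reference)
    return len(cand) >= len(ref) and cand[-len(ref):] == ref
```

The superdomain test is the same comparison with the arguments swapped.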
is_superdomain(domain)
¶
Is the domain a superdomain of the given domain?
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | Domain | the reference domain | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the domain is a superdomain of the given domain, False otherwise |
Raises:
Type | Description |
---|---|
TypeError | if the given object is not a Domain |
Note: a domain is a superdomain of itself
relation(domain)
¶
Define the relation between two domains.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | Domain | the domain to compare with | required |
Returns:
Name | Type | Description |
---|---|---|
DomainRelation | DomainRelation | the relation between the two domains |
Raises:
Type | Description |
---|---|
TypeError | if the given object is not a Domain |
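The members of DomainRelation are not listed here, so the sketch below uses a hypothetical `Relation` enum with four illustrative values to show how a single comparison could classify a pair of domain names. It operates on plain strings and is not the package's implementation.

```python
from enum import Enum


class Relation(Enum):
    """Hypothetical stand-in for DomainRelation; member names are assumptions."""
    EQUAL = "equal"
    SUBDOMAIN = "subdomain"
    SUPERDOMAIN = "superdomain"
    DIFFERENT = "different"


def relation(first: str, second: str) -> Relation:
    """Classify how first relates to second by comparing trailing labels."""
    a = first.lower().rstrip(".").split(".")
    b = second.lower().rstrip(".").split(".")
    if a == b:
        return Relation.EQUAL
    if len(a) > len(b) and a[-len(b):] == b:
        return Relation.SUBDOMAIN
    if len(b) > len(a) and b[-len(a):] == a:
        return Relation.SUPERDOMAIN
    return Relation.DIFFERENT
```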
MailAnalysis
dataclass
¶
A summary of the analysis of a mail.
attachments: dict[str, bool]
instance-attribute
¶
It is a dictionary containing a detailed analysis of the mail's attachments. It contains the following keys:
Key | Type | Description |
---|---|---|
has_attachments | bool | flag that indicates if the mail has attachments |
attachment_is_executable | bool | flag that indicates if the mail has an attachment in executable format |
body: dict[str, bool | float]
instance-attribute
¶
It is a dictionary containing a detailed analysis of the mail's body. It contains the following keys:
Key | Type | Description |
---|---|---|
contains_html | bool | flag that indicates if the body contains an html tag |
contains_script | bool | flag that indicates if the body contains a script tag or a callback function |
forbidden_words_percentage | float | the rate of forbidden words in the body of the mail, it is a float between 0 and 1 |
has_links | bool | flag that indicates if the body contains a URL |
has_mailto | bool | flag that indicates if the body contains a mailto link |
https_only | bool | flag that indicates if the body contains only https links |
contains_form | bool | flag that indicates if the body contains a form tag |
has_images | bool | flag that indicates if the body contains an image |
is_uppercase | bool | flag that indicates if more than 60% of the body is in uppercase |
text_polarity | float | the polarity of the body, it is a float between -1 and 1 |
text_subjectivity | float | the subjectivity of the body, it is a float between 0 and 1 |
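To make two of these keys concrete, here is a minimal sketch of how a forbidden-word rate and an uppercase flag could be computed. The tokenization rule, the choice to measure uppercase over letters only, and both function names are assumptions made for illustration; the package may compute these values differently.

```python
import re


def forbidden_words_rate(body: str, forbidden: set[str]) -> float:
    """Fraction of words in body that appear in the forbidden set (0 to 1)."""
    words = re.findall(r"[a-z']+", body.lower())
    if not words:
        return 0.0
    return sum(word in forbidden for word in words) / len(words)


def is_mostly_uppercase(body: str, threshold: float = 0.6) -> bool:
    """True if more than threshold of the letters in body are uppercase."""
    letters = [char for char in body if char.isalpha()]
    if not letters:
        return False
    return sum(char.isupper() for char in letters) / len(letters) > threshold
```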
file_path: str
instance-attribute
¶
The path of the file analyzed.
headers: dict
instance-attribute
¶
It is a dictionary containing a detailed analysis of the mail's headers. It contains the following keys:
Key | Type | Description |
---|---|---|
has_spf | bool | flag that indicates if the mail has an SPF header |
has_dkim | bool | flag that indicates if the mail has a DKIM header |
has_dmarc | bool | flag that indicates if the mail has a DMARC header |
auth_warn | bool | flag that indicates if the mail has an Authentication-Warning header |
domain_matches | bool | flag that indicates if the domain of the sender matches the first domain in the Received headers |
has_suspect_subject | bool | flag that indicates if the mail's subject contains a suspicious word or a gappy word (e.g. H*E*L*L*O) |
subject_is_uppercase | bool | flag that indicates if the mail's subject is in uppercase |
send_date | Date | the date when the mail was sent; if the mail has no Date header, it is None |
received_date | Date | the date when the mail was received; if the Received header has no date, it is None |
- has_spf: it is True if the mail has an SPF header (Sender Policy Framework), a standard to prevent email spoofing. The SPF record is a TXT record that contains a policy that specifies which mail servers are allowed to send email from a specified domain.
- has_dkim: it is True if the mail has a DKIM header (DomainKeys Identified Mail). The DKIM signature is a digital signature that is added to an email message to verify that the message has not been altered since it was signed.
- has_dmarc: it is True if the mail has a DMARC header (Domain-based Message Authentication, Reporting & Conformance). The DMARC record is a type of DNS record that is used to help email receivers determine whether an email is legitimate or not.
- auth_warn: it is True if the mail has an Authentication-Warning header. The Authentication-Warning header is used to indicate that the message has been modified in transit.
- domain_matches: it is True if the domain of the sender matches the first domain in the Received headers.
- has_suspect_subject: it is True if the mail's subject contains a suspicious word or a gappy word (e.g. H*E*L*L*O).
- subject_is_uppercase: it is True if the mail's subject is in uppercase.
- send_date: it is the date when the mail was sent, as a Date object; if the mail has no Date header, it is None.
- received_date: it is the date when the mail was received, as a Date object; if the Received header has no date, it is None.
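Several of the flags above amount to checking whether a given header is present. The sketch below shows that idea with the standard `email` module. The header names it looks for (`Received-SPF`, `DKIM-Signature`, a `dmarc=` result inside `Authentication-Results`, and `X-Authentication-Warning`) are assumptions about where this information commonly appears, not a statement of what the package inspects, and `header_flags` is a hypothetical helper.

```python
from email import message_from_string


def header_flags(raw_mail: str) -> dict:
    """Derive a few presence flags from the headers of a raw message."""
    msg = message_from_string(raw_mail)
    # Header lookups on Message objects are case-insensitive.
    auth_results = " ".join(msg.get_all("Authentication-Results", [])).lower()
    subject = msg.get("Subject", "")
    return {
        "has_spf": "Received-SPF" in msg,
        "has_dkim": "DKIM-Signature" in msg,
        "has_dmarc": "dmarc=" in auth_results,
        "auth_warn": "X-Authentication-Warning" in msg,
        "subject_is_uppercase": subject.isupper(),
    }


sample = (
    "Received-SPF: pass\n"
    "DKIM-Signature: v=1; a=rsa-sha256\n"
    "Authentication-Results: mx.example.org; dmarc=pass\n"
    "Subject: BUY NOW\n"
    "\n"
    "body text\n"
)
flags = header_flags(sample)
```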
SpamAnalyzer
¶
Analyze a mail and return a MailAnalysis object; essentially it is a factory of MailAnalysis.
The MailAnalyzer object provides two methods to analyze a mail:
- analyze to analyze a mail from a file; it returns a MailAnalysis object containing a description of the headers, body and attachments of the mail
- get_domain to get the domain of the mail from the headers; it returns a Domain object
The core of the analysis is the analyze method, which uses the MailParser class (from the mailparser library) to parse the mail.
The analysis is based on separate checks for the headers, body and attachments, and each check is implemented in a separate function: this makes the analysis modular and easy to extend in future versions.
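The modular layout described above can be pictured as a table of independent check functions, one per section of the mail. The sketch below is purely illustrative: the check functions, the `CHECKS` registry, and the dictionary-based mail are hypothetical and far simpler than the package's real checks.

```python
def check_headers(mail: dict) -> dict:
    """Toy header check: is there a non-empty subject?"""
    return {"has_subject": bool(mail.get("subject"))}


def check_body(mail: dict) -> dict:
    """Toy body check: does the body mention an http(s) link?"""
    return {"has_links": "http" in mail.get("body", "")}


def check_attachments(mail: dict) -> dict:
    """Toy attachment check: is the attachment list non-empty?"""
    return {"has_attachments": bool(mail.get("attachments"))}


# Adding a new kind of check only requires registering one more function here.
CHECKS = {
    "headers": check_headers,
    "body": check_body,
    "attachments": check_attachments,
}


def analyze(mail: dict) -> dict:
    """Run every registered check and collect the results per section."""
    return {section: check(mail) for section, check in CHECKS.items()}


report = analyze({"subject": "Hi", "body": "see https://example.org", "attachments": []})
```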
classify_multiple_input(mails)
¶
Classify a list of mails.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mails | list[MailAnalysis] | a list of mails to be classified | required |
Returns:
Name | Type | Description |
---|---|---|
list | List[bool] | a list of boolean values, True if the mail is spam, False otherwise |
is_spam(email)
¶
Determine if the email is spam based on the analysis of the mail.