I've had the same problem over the last few months: emails advertising t-shirts and pills.
Another kind of spam like this uses Google Docs or SharePoint to add a huge CC list.
Defending against them is hard, though, so let me share what I know about spam filtering (I run an email forwarding service, https://mailwip.com, and have to deal with spam a lot).
When an email comes from Gmail/Hotmail (or any popular free email service) itself, it's harder to detect spam, especially if the email is in a non-English language.
There are a few ways to flag spam:
- Look at the IP address:
- Look at the structure of the email: does it follow best practices, such as having both HTML and plain-text parts, correct MIME encoding, etc.
- Look at the headers of the email: no weird headers, no "bad" IPs in the Received chain
- Look at attachment file types, and virus-scan those attachments
- Finally, tokenize the content of the email to find similar emails that were flagged as spam
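To make the structure, header and attachment checks above a bit more concrete, here is a rough sketch using Python's standard `email` module. The blocklist, the extension list and the function names are made up for illustration; a real filter would use proper reputation data and many more signals.

```python
import re
from email import message_from_string

# Hypothetical values, for illustration only.
BAD_IPS = {"203.0.113.7"}
RISKY_EXTENSIONS = (".exe", ".scr", ".js")

# Matches bracketed IPv4 literals such as "[203.0.113.7]" in Received headers.
IPV4_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(msg):
    """Collect bracketed IPv4 addresses from every Received header."""
    ips = []
    for header in msg.get_all("Received", []):
        ips.extend(IPV4_RE.findall(str(header)))
    return ips


def spam_signals(raw):
    """Return a list of simple, non-conclusive spam signals for a raw email."""
    msg = message_from_string(raw)
    signals = []

    # Structure: an HTML part with no plain-text alternative.
    types = {part.get_content_type() for part in msg.walk()}
    if "text/html" in types and "text/plain" not in types:
        signals.append("html-without-plain-text")

    # Headers: a known-bad IP anywhere in the Received chain.
    if any(ip in BAD_IPS for ip in received_ips(msg)):
        signals.append("bad-ip-in-received-chain")

    # Attachments: flag risky file extensions (a virus scan would go here too).
    for part in msg.walk():
        filename = part.get_filename()
        if filename and filename.lower().endswith(RISKY_EXTENSIONS):
            signals.append("risky-attachment:" + filename)

    return signals
```

Each signal on its own is weak, so in practice they would be combined into a score rather than used as a hard reject.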
When the email comes from their own IPs, sent out by Gmail itself, the format looks good, DKIM/SPF all pass, and it's the first email of its kind, then the only way to flag it as spam is by analyzing the content. And if the email is in a non-English language, it's harder to analyze. If not enough people have flagged similar emails as spam to train the naive Bayes filter, we're out of luck.
The long CC list looks like a legitimate indicator of spam, but libraries and schools have a tradition of sending emails that CC an entire class or department, sometimes even BCC, which makes the email look very suspicious (undisclosed recipients). Yet those are legitimate emails, so the CC list alone can't easily be used as a spam indicator.
Yet, at the same time, legitimate emails from your own server get flagged because of low reputation or because the previous owner has a history of sending spam...
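For the content analysis part, here is a minimal sketch of a naive Bayes token filter with add-one smoothing, in plain Python. The class and method names are hypothetical and the tokenizer is deliberately crude; it is only meant to show why a filter with no training data for a given language has nothing to go on.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesFilter:
    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message flagged as 'spam' or 'ham'."""
        self.token_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        """Positive means 'looks like spam', negative means 'looks like ham'."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        totals = {label: sum(c.values()) for label, c in self.token_counts.items()}
        # Prior from how many messages of each class were seen (smoothed).
        score = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for token in tokenize(text):
            if token not in vocab:
                continue  # never-seen tokens carry no evidence either way
            p_spam = (self.token_counts["spam"][token] + 1) / (totals["spam"] + len(vocab))
            p_ham = (self.token_counts["ham"][token] + 1) / (totals["ham"] + len(vocab))
            score += math.log(p_spam / p_ham)
        return score
```

If every token in a message is unseen (say, a language nobody has flagged yet), the score is just the prior, which is exactly the "out of luck" case described above.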