Add EML parser #249

Merged

Conversation

wneessen
Owner

This PR introduces an EML parser to go-mail. It allows reading generic EML data from a file, a string, or a reader into a go-mail Msg struct. It supports all types of message parts and encodings. It should be able to recognize multipart messages as well as attachments and embeds (inline attachments).

This PR closes #145

wneessen and others added 30 commits September 15, 2023 13:16
Added two new functions `EMLToMsg` and `readEML` to the `mail` package. `EMLToMsg` function opens and parses a .eml file and returns a pre-filled Msg pointer. `readEML` opens an EML file and uses net/mail to parse the header and body. These changes are made to provide support for EML file parsing, which is a common requirement in many email-based applications.
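The header-and-body parsing described in this commit can be sketched with the standard library alone. This is a minimal illustration, not go-mail's `readEML`: the helper name `readHeaderAndBody` and its return values are assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/mail"
	"strings"
)

// readHeaderAndBody is a hypothetical helper (not go-mail's readEML). It parses
// raw EML data with net/mail and returns the header plus the undecoded body.
func readHeaderAndBody(raw string) (mail.Header, string, error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return nil, "", fmt.Errorf("failed to parse EML: %w", err)
	}
	body, err := io.ReadAll(msg.Body)
	if err != nil {
		return nil, "", fmt.Errorf("failed to read EML body: %w", err)
	}
	return msg.Header, string(body), nil
}

func main() {
	raw := "From: alice@example.com\r\n" +
		"To: bob@example.com\r\n" +
		"Subject: Hello\r\n" +
		"\r\n" +
		"This is the body.\r\n"
	header, body, err := readHeaderAndBody(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Subject:", header.Get("Subject"))
	fmt.Print(body)
}
```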
The diff modifies how the email library handles the extraction of the mime media type from an email header. It uses the mime.ParseMediaType function to parse the content type header. The function gives back the media type as a string and a mapping of different associated parameters. This mapping was previously just printed, but now the charset parameter is also used for setting the charset of the email if it exists.
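A short sketch of the `mime.ParseMediaType` usage described here: the media type comes back as a string and the parameters as a map, from which the charset can be read if present. The helper name `contentTypeAndCharset` is hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"mime"
)

// contentTypeAndCharset is a hypothetical helper. It returns the media type
// and the charset parameter (empty if the header does not carry one).
func contentTypeAndCharset(headerValue string) (string, string, error) {
	mediaType, params, err := mime.ParseMediaType(headerValue)
	if err != nil {
		return "", "", fmt.Errorf("failed to parse content type: %w", err)
	}
	return mediaType, params["charset"], nil
}

func main() {
	mediaType, charset, err := contentTypeAndCharset("text/plain; charset=UTF-8")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mediaType, charset)
}
```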
We can now parse simple mails (multipart is not working yet). The existing implementation was made more efficient by refactoring the EML file parsing and header extraction mechanism. Added 'strings' and 'bytes' packages to facilitate these changes. Previously, headers and body were parsed separately, which was unnecessarily complex and increased the chance of errors. Now, with the new function 'readEML' and the helper function 'parseEMLBodyParts', we are able to parse headers and body together, which not only simplifies the code but also increases its reliability. Specifically, 'bytes.Buffer' now helps us capture the body while parsing, which removes the need for separate handling. Additionally, certain headers like 'charset' and body types are also accounted for in the new implementation, enhancing the completeness of information extracted from EML files.
Added "References" header field to cover more potential use cases and enhance versatility. This field will allow applications to track series of related messages.
Test for "References" field has also been added for validation.
Also included are string methods for Content-type objects with relevant tests, ensuring accurate string conversion. Unnecessary duplicate method of string conversion for Charset has been removed to streamline the code and improve readability.
Renamed field 'Mime10' to 'MIME10' across multiple files for canonical representation and consistency with standard MIME naming format in the protocol.
Added support for quoted-printable encoding in the email parser. The change adds a case for 'EncodingQP' and the related conversions so that the message body is read correctly and the encoding is set on the message. This widens the range of email content the parser can handle.
Implemented base64 encoding support in the email parser. This addition allows the parser to read and decode base64 encoded emails.
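The two transfer encodings added so far (quoted-printable and base64) can both be handled by wrapping the body reader in the matching standard-library decoder. The following is a sketch under that assumption; `decodeBody` is a hypothetical name and the PR's actual code paths may differ.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"mime/quotedprintable"
	"strings"
)

// decodeBody is a hypothetical helper. It picks a decoder based on the
// Content-Transfer-Encoding value and returns the decoded body. Any other
// encoding value is passed through unchanged.
func decodeBody(encoding, body string) (string, error) {
	var reader io.Reader = strings.NewReader(body)
	switch strings.ToLower(encoding) {
	case "quoted-printable":
		reader = quotedprintable.NewReader(reader)
	case "base64":
		reader = base64.NewDecoder(base64.StdEncoding, reader)
	}
	decoded, err := io.ReadAll(reader)
	if err != nil {
		return "", fmt.Errorf("failed to decode %s body: %w", encoding, err)
	}
	return string(decoded), nil
}

func main() {
	for _, sample := range [][2]string{
		{"quoted-printable", "Gr=C3=BC=C3=9Fe"},
		{"base64", "SGVsbG8="},
	} {
		decoded, err := decodeBody(sample[0], sample[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(decoded)
	}
}
```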
This commit changes how error values are used and improves the string comparison for encoding types in EML file parsing. It ensures files are closed after read operations to avoid resource leaks. Error messages are now dynamic for improved error reporting, and function comments have been made more descriptive.
Added two new methods `EMLToMsgFromString` and `EMLToMsgFromReader` in "eml.go". They allow EML parsing directly from a given string and a reader object, so users can parse EML documents without first writing them to a file.
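One common way to offer file, string, and reader entry points is to funnel all of them into a single reader-based parser. The sketch below illustrates that design with standard-library calls only; the `subjectFrom...` functions are hypothetical stand-ins and do not reproduce go-mail's `EMLToMsg...` signatures or its `Msg` type.

```go
package main

import (
	"fmt"
	"io"
	"net/mail"
	"os"
	"strings"
)

// subjectFromReader is the single parsing routine in this sketch.
func subjectFromReader(r io.Reader) (string, error) {
	msg, err := mail.ReadMessage(r)
	if err != nil {
		return "", fmt.Errorf("failed to parse EML: %w", err)
	}
	return msg.Header.Get("Subject"), nil
}

// subjectFromString wraps the string in a reader and delegates.
func subjectFromString(eml string) (string, error) {
	return subjectFromReader(strings.NewReader(eml))
}

// subjectFromFile opens the file, delegates, and reports a close error
// if parsing itself succeeded.
func subjectFromFile(path string) (subject string, err error) {
	file, err := os.Open(path)
	if err != nil {
		return "", fmt.Errorf("failed to open EML file: %w", err)
	}
	defer func() {
		if closeErr := file.Close(); closeErr != nil && err == nil {
			err = closeErr
		}
	}()
	return subjectFromReader(file)
}

func main() {
	subject, err := subjectFromString("Subject: Example mail\r\n\r\nBody\r\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(subject)
}
```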
A new test `TestEMLToMsgFromString` was added to "eml_test.go". This test asserts the proper functionality of `EMLToMsgFromString` method that allows us to parse EMLs directly from a string input. This test is a necessary part of ensuring the functionality and reliability of our EML parsing process.
The test emails in the eml_test.go file have been updated with more diverse fields, including variations of encoding types. These changes help improve the robustness of our parser tests by evaluating its function with a wider range of email structures. Tests including quoted-printable and base64 encoded emails have been added.
Added `time` import in the eml_test.go and added two new test use-cases: `exampleMailPlainNoEncInvalidDate` and `exampleMailPlainNoEncNoDate`. The `exampleMailPlainNoEncInvalidDate` is used to check if the parser can correctly handle email with invalid date. Meanwhile, `exampleMailPlainNoEncNoDate` checks if the parser can correctly add the current date to an email that didn't specify a date. This will improve the parser's resilience and flexibility in handling various email scenarios.
The list of common content types in encoding.go has been revised. The type "multipart/alternative" has been added and the order of types has been adjusted for consistency with net/smtp upstream.
This commit introduces the ability to handle multipart messages within the eml.go file. It reads individual parts of multipart messages, sets the encoding and content for each part, and implements error handling for potential issues like a missing boundary tag or difficulties acquiring the next part of a multipart message.
The variable names "mbbuf", "mt", and "par" have been renamed to "bodybuf", "mediatype", and "params" respectively, for clarification. Moreover, the multipart parsing block within the parseEMLBodyParts function was extracted into its own function, parseEMLMultipartAlternative, for improved code structure and readability.
Extended the settings for content type and charset from headers. Also, refactored the handling of encoding types - 'QP' and 'B64' - within the mail header and body parsing sections. The process of handling encoding for plain type mail specifically is now encapsulated in a new function, parseEMLBodyPlain. These changes enhance code readability, maintainability, and error handling efficiency.
Refactored the processing of multipart encoding to be robust and easily maintainable. The changes include setting 'QP' encoding as default when the Content-Transfer-Encoding header is empty, accounting for the removal of this header by the standard Go multipart package. Also, parser functions for content type and charset are now independently handling the headers, replacing the split-string approach, thus improving efficiency and code readability.
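The header removal mentioned here is documented behavior of `mime/multipart`: when a part declares `Content-Transfer-Encoding: quoted-printable`, `Reader.NextPart` hides that header and decodes the body transparently. The sketch below shows why an empty header can be read as quoted-printable; `decodedPart` and `readPartEncodings` are hypothetical names, and the default itself follows the commit message rather than any rule of the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"mime/multipart"
	"strings"
)

// decodedPart is a hypothetical container for this sketch.
type decodedPart struct {
	Encoding string
	Body     string
}

// readPartEncodings reads every part and records its transfer encoding.
// An empty header is treated as quoted-printable, because NextPart removes
// that header after wrapping the part in a quoted-printable decoder.
func readPartEncodings(body, boundary string) ([]decodedPart, error) {
	reader := multipart.NewReader(strings.NewReader(body), boundary)
	var parts []decodedPart
	for {
		part, err := reader.NextPart()
		if errors.Is(err, io.EOF) {
			return parts, nil
		}
		if err != nil {
			return nil, err
		}
		encoding := part.Header.Get("Content-Transfer-Encoding")
		if encoding == "" {
			encoding = "quoted-printable"
		}
		data, err := io.ReadAll(part)
		if err != nil {
			return nil, err
		}
		parts = append(parts, decodedPart{Encoding: encoding, Body: string(data)})
	}
}

func main() {
	body := "--XYZ\r\nContent-Type: text/plain\r\n" +
		"Content-Transfer-Encoding: quoted-printable\r\n\r\nGr=C3=BC=C3=9Fe\r\n" +
		"--XYZ\r\nContent-Type: text/html\r\n" +
		"Content-Transfer-Encoding: base64\r\n\r\nPGI+SGk8L2I+\r\n" +
		"--XYZ--\r\n"
	parts, err := readPartEncodings(body, "XYZ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, part := range parts {
		fmt.Printf("%s: %s\n", part.Encoding, part.Body)
	}
}
```

Note that only the quoted-printable part arrives decoded; the base64 part still carries its header and its encoded body, so it needs an explicit decoder.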
Introduced "multipart/mixed" and "multipart/related" content types in encoding.go and updated msgwriter.go to accommodate these. Adjustments made in related tests for these new types. Additionally, removed unnecessary print statements and improved multipart alternative parsing in eml.go.
wneessen and others added 25 commits May 27, 2024 10:59
Variable names in eml.go have been refactored for better readability and understanding. Abbreviations have been expanded into meaningful names, and complex object names have been made simpler, making it easier to understand their role within the codebase. Clearer variable names will improve maintainability and ease future development. This is a follow-up to #179, which didn't consider this branch.
Introduced a new test, `TestEMLToMsgFromFile`, to validate the functions responsible for EML message parsing. This complements the existing `EMLToMsgFromString` test, holding them accountable for subject and encoding accuracy. Also, a temporary directory is now created for testing File-related operations in isolation.
The commit includes extraction of blocks of code related to EML message encoding and content-type parsing into their own separate functions. By doing so, it improves code readability and maintainability.
The code is refactored to improve multipart parsing in EML. The `parseEMLMultipartAlternative` function is updated to `parseEMLMultipart` for more general utilization. This involves iterating through the parts of a multipart message until content disposition is found and appended. A new function `parseMultiPartHeader` is introduced to parse multipart header and handle charset more sensibly.
The EML parsing has been refactored to separate the handling of attachments and embeds into a new helper function. This improves the organization of the code, makes it easier to understand and helps to better manage error handling and resource closing.
The content print statement in eml.go was removed to optimize code readability and performance. In addition, several assertions in the test cases of eml_test.go were corrected for string formatting errors and a new test case was added for handling emails with attachments. These changes aim to enhance the robustness of tests for email encoding and decoding operations.
This update expands the EML parser to support multipart/related content types. It also includes relevant error handling and creates a specific routine for parsing multipart/related parts separately. Furthermore, adjustments were made to avoid processing headers unnecessarily when TypeMultipartMixed is used. The diff also shows some refactoring for clearer error messages and cleaner code.
The new test ensures that the EMLToMsgFromString function properly handles an EML that contains embedded content. The expected subject content and number of embedded objects are checked to confirm correct parsing.
The update adds a case to the switch clause in eml.go for properly handling unknown content types. An error will now be returned when the media type of the body to be parsed is not recognized, increasing the robustness of the system.
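The default case described here can be pictured as a switch over the parsed media type. This is a sketch only: `classifyBody` is a hypothetical function, the list of handled types is limited to those named elsewhere in this PR, and the error text is illustrative.

```go
package main

import (
	"fmt"
	"mime"
)

// classifyBody is a hypothetical helper. It maps a Content-Type header to a
// parsing strategy and returns an error for any media type it does not know.
func classifyBody(contentType string) (string, error) {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", fmt.Errorf("failed to parse content type: %w", err)
	}
	switch mediaType {
	case "text/plain":
		return "plain", nil
	case "multipart/alternative", "multipart/mixed", "multipart/related":
		return "multipart", nil
	default:
		return "", fmt.Errorf("failed to parse body, unknown content type: %s", mediaType)
	}
}

func main() {
	for _, contentType := range []string{
		"text/plain; charset=UTF-8",
		"multipart/mixed; boundary=XYZ",
		"application/x-unknown",
	} {
		kind, err := classifyBody(contentType)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(kind)
	}
}
```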
New failing tests have been added to 'eml_test.go' to account for a variety of error situations, such as broken FROM, TO, headers, bodies, and unknown or unsupported content types. Improving the robustness of test coverage helps identify potential issues and ensure the resilience and correctness of the code.
The previous separate parsing of EML headers and body parts has been refactored into a single function, parseEML. This change simplifies the operations in the readEML and makes the code cleaner by reducing repetition.
Refactored the way EML files are tested so that errors are handled more efficiently. Temporary directory and file creation, as well as file writing, have been moved to a helper function named 'stringToTempFile'. Moreover, additional test cases were added to ensure proper parsing failure for various types of email-related errors.
The error message previously referenced a constant 'HeaderTo' which might not always be the header being parsed. The commit replaces this with 'addrHeader', significantly improving the accuracy of error messages.
The commit modifies the parseMultiPartHeader function to handle optional fields accurately. The delimiter was changed from "; " to ";" and whitespace is being trimmed from the start of optional fields to ensure correct splitting and mapping.
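The effect of splitting on ";" and trimming leading whitespace can be seen in a small sketch. `splitHeaderFields` is a hypothetical function, not the PR's `parseMultiPartHeader`, and it deliberately ignores edge cases such as semicolons inside quoted values.

```go
package main

import (
	"fmt"
	"strings"
)

// splitHeaderFields is a hypothetical helper. It splits a header value on ";"
// and trims surrounding whitespace, so "a;b=c" and "a; b=c" give the same
// result. Optional fields without "=" are skipped.
func splitHeaderFields(value string) (string, map[string]string) {
	fields := strings.Split(value, ";")
	primary := strings.TrimSpace(fields[0])
	optional := make(map[string]string)
	for _, field := range fields[1:] {
		key, val, found := strings.Cut(strings.TrimSpace(field), "=")
		if !found {
			continue
		}
		optional[strings.ToLower(key)] = strings.Trim(val, `"`)
	}
	return primary, optional
}

func main() {
	for _, value := range []string{
		"text/plain; charset=UTF-8",
		`text/plain;charset="UTF-8";  format=flowed`,
	} {
		primary, optional := splitHeaderFields(value)
		fmt.Println(primary, optional["charset"], optional["format"])
	}
}
```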
The newly added function, WithContentID, allows for setting the Content-ID header for the File. This provides enhanced handling and differentiation of files.
The updated code adds base64 encoding support to email attachments and inline content in eml.go. It does this by introducing a new dataReader which uses a base64 decoder if the content transfer encoding is base64. With this update, attachments with base64 content will be correctly decoded when processed.
The EML parser now includes logic to manage 'multipart/alternative' content types. This adjustment is made within the section handling 'multipart/related' parts, allowing for better handling and parsing of varying content types.
The comment in the eml.go file was extended to include the possibility of 'Multipart/alternative' parts. Previously, it only mentioned 'Multipart/related' parts. The actual code functionality remains unchanged; this is purely a clarification in the documentation.
The debug print statement that outputs the content type of the email has been removed from eml.go. This change improves code cleanliness and avoids unnecessary console output in production.
This commit introduces tests for different email scenarios including emails with no boundaries, emails with attachments, emails with embedded items, and multipart mixed emails. The tests ensure that the code correctly returns all expected parts of the email (e.g., embeds, attachments, parts) and that it correctly processes the email subject.
In the eml.go and file.go files, the function WithContentID has been renamed to WithFileContentID. This aligns more accurately with the function purpose, which is to set the content ID for a File object.
The change introduces a new unit test, TestFile_WithContentID, in file_test.go. This test aims to verify the correct function of the WithFileContentID option by using differing scenarios, with assertions for error control and validation of expected content.
The README has been updated to show that go-mail now supports outputting a message as an EML file and parsing an EML file into a go-mail message. This addition enhances the flexibility and control over the email content and format.
@wneessen wneessen linked an issue Jun 28, 2024 that may be closed by this pull request
Added conditional statements to handle potential failures when writing the EML string to the temporary files during testing. These updates ensure test failures due to write errors are accurately reflected in test results. Also, a minor fix is implemented on file permission in the os.WriteFile function.
codecov bot commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 86.95652% with 33 lines in your changes missing coverage. Please review.

Project coverage is 85.21%. Comparing base (d8a3d0b) to head (7e4bb00).

| Files | Patch % | Lines |
|---|---|---|
| eml.go | 86.41% | 18 Missing and 15 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #249      +/-   ##
==========================================
+ Coverage   85.01%   85.21%   +0.20%     
==========================================
  Files          24       25       +1     
  Lines        1808     2056     +248     
==========================================
+ Hits         1537     1752     +215     
- Misses        160      178      +18     
- Partials      111      126      +15     


@wneessen wneessen merged commit ffdea83 into main Jun 28, 2024
28 of 29 checks passed
@wneessen wneessen deleted the feature/145_add-eml-parser-to-generate-msg-from-eml-files branch June 28, 2024 12:11
Development

Successfully merging this pull request may close these issues.

Add eml parser to generate Msg from .eml files
1 participant