-
-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add EML parser #249
Merged
wneessen
merged 61 commits into
main
from
feature/145_add-eml-parser-to-generate-msg-from-eml-files
Jun 28, 2024
Merged
Add EML parser #249
wneessen
merged 61 commits into
main
from
feature/145_add-eml-parser-to-generate-msg-from-eml-files
Jun 28, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Added two new functions `EMLToMsg` and `readEML` to the `mail` package. `EMLToMsg` function opens and parses a .eml file and returns a pre-filled Msg pointer. `readEML` opens an EML file and uses net/mail to parse the header and body. These changes are made to provide support for EML file parsing, which is a common requirement in many email-based applications.
The diff modifies how the email library handles the extraction of the mime media type from an email header. It uses the mime.ParseMediaType function to parse the content type header. The function gives back the media type as a string and a mapping of different associated parameters. This mapping was previously just printed, but now the charset parameter is also used for setting the charset of the email if it exists.
We can no parse simple mails (multipart is not working yet). The existing implementation was made more efficient by refactoring the EML file parsing and header extraction mechanism. Added 'strings' and 'bytes' packages to facilitate these changes. Previously, headers and body were parsed separately which was unnecessarily complex and increased the chance of errors. Now, with the new function 'readEML' and the helper function 'parseEMLBodyParts', we are able to parse headers and body together which not only simplifies the code but also increases its reliability. Specifically, 'bytes.Buffer' now helps us capture body while parsing, which removes need for separate handling. Additionally, certain headers like 'charset' and body types are also accounted for in the new implementation, enhancing the completeness of information extracted from EML files.
Added "References" header field to cover more potential use cases and enhance versatility. This field will allow applications to track series of related messages. Test for "References" field has also been added for validation. Also included are string methods for Content-type objects with relevant tests, ensuring accurate string conversion. Unnecessary duplicate method of string conversion for Charset has been removed to streamline the code and improve readability.
Renamed field 'Mime10' to 'MIME10' across multiple files for canonical representation and consistency with standard MIME naming format in the protocol."
Added support for quoted-printable encoding in email parser to increase its functionality. The change includes a case handling feature for 'EncodingQP' and related conversions to allow for proper message body reading and encoding setting. This improves the robustness and the scope of email content types that the parser can handle."
Implemented base64 encoding support in the email parser. This addition allows the parser to read and decode base64 encoded emails.
This commit changes the usage of error value and improves the string comparison for encoding types in EML file parsing. It ensures file closure after read operations to avoid memory leaks. Error messages are made dynamic for improved error reporting. Comments on function has also been made more descriptive.
Added two new methods `EMLToMsgFromString` and `EMLToMsgFromReader` in "eml.go". They allow EML parsing directly from a given string and a reader object, increasing overall functionality and versatility of the EML parsing process. This will enable the users to parse EML documents more flexibly."
A new test `TestEMLToMsgFromString` was added to "eml_test.go". This test asserts the proper functionality of `EMLToMsgFromString` method that allows us to parse EMLs directly from a string input. This test is a necessary part of ensuring the functionality and reliability of our EML parsing process.
The test emails in the eml_test.go file have been updated with more diverse fields, including variations of encoding types. These changes help improve the robustness of our parser tests by evaluating its function with a wider range of email structures. Tests including quoted-printable and base64 encoded emails have been added.
Added `time` import in the eml_test.go and added two new test use-cases: `exampleMailPlainNoEncInvalidDate` and `exampleMailPlainNoEncNoDate`. The `exampleMailPlainNoEncInvalidDate` is used to check if the parser can correctly handle email with invalid date. Meanwhile, `exampleMailPlainNoEncNoDate` checks if the parser can correctly add the current date to an email that didn't specify a date. This will improve the parser's resilience and flexibility in handling various email scenarios.
The list of common content types in encoding.go has been revised. The type "multipart/alternative" has been added and the order of types has been adjusted for consistency with net/smtp upstream.
This commit introduces the ability to handle multipart messages within the eml.go file. It reads individual parts of multipart messages, sets the encoding and content for each part, and implements error handling for potential issues like a missing boundary tag or difficulties acquiring the next part of a multipart message.
The variable names "mbbuf", "mt", and "par" have been renamed to "bodybuf", "mediatype", and "params" respectively, for clarification. Moreover, the multipart parsing block within the parseEMLBodyParts function was extracted into its own function, parseEMLMultipartAlternative, for improved code structure and readability.
Extended the settings for content type and charset from headers. Also, refactored the handling of encoding types - 'QP' and 'B64' - within the mail header and body parsing sections. The process of handling encoding for plain type mail specifically is now encapsulated in a new function, parseEMLBodyPlain. These changes enhance code readability, maintainability, and error handling efficiency.
Refactored the processing of multipart encoding to be robust and easily maintainable. The changes include setting 'QP' encoding as default when the Content-Transfer-Encoding header is empty, accounting for the removal of this header by the standard Go multipart package. Also, parser functions for content type and charset are now independently handling the headers, replacing the split-string approach, thus improving efficiency and code readability.
Introduced "multipart/mixed" and "multipart/related" content types in encoding.go and updated msgwriter.go to accommodate these. Adjustments made in related tests for these new types. Additionally, removed unnecessary print statements and improved multipart alternative parsing in eml.go.
Variable names in eml.go have been refactored for better readability and understanding. Shortened abbreviations have been expanded into meaningful names, and complex object names have been made simpler, making it easier to understand their role within the codebase. Cooperative variable names will improve maintainability and ease future development. This is a follow up to #179 which didn't consider this branch.
…erate-msg-from-eml-files
Introduced a new test, `TestEMLToMsgFromFile`, to validate the functions responsible for EML message parsing. This complements the existing `EMLToMsgFromString` test, holding them accountable for subject and encoding accuracy. Also, a temporary directory is now created for testing File-related operations in isolation.
The commit includes extraction of blocks of code related to EML message encoding and content-type parsing into their own separate functions. By doing so, it improves code readability and maintainability.
The code is refactored to improve multipart parsing in EML. The `parseEMLMultipartAlternative` function is updated to `parseEMLMultipart` for more general utilization. This involves iterating through the parts of a multipart message until content disposition is found and appended. A new function `parseMultiPartHeader` is introduced to parse multipart header and handle charset more sensibly.
The EML parsing has been refactored to separate the handling of attachments and embeds into a new helper function. This improves the organization of the code, makes it easier to understand and helps to better manage error handling and resource closing.
The content print statement in eml.go was removed to optimize code readability and performance. In addition, several assertions in the test cases of eml_test.go were corrected for string formatting errors and a new test case was added for handling emails with attachments. These changes aim to enhance the robustness of tests for email encoding and decoding operations.
This update expands the EML parser to support multipart/related content types. It also includes relevant error handling and creates a specific routine for parsing multipart/related parts separately. Furthermore, adjustments were made to avoid processing headers unnecessarily when TypeMultipartMixed is used. The diff also shows some refactoring for clearer error messages and cleaner code.
The new test ensures that the EMLToMsgFromString function properly handles an EML that contains embedded content. The expected subject content and number of embedded objects are checked to confirm correct parsing.
The update adds a case to the switch clause in eml.go for properly handling unknown content types. An error will now be returned when the media type of the body to be parsed is not recognized, increasing the robustness of the system.
New failing tests have been added to 'eml_test.go' to account for a variety of error situations, such as broken FROM, TO, headers, bodies, and unknown or unsupported content types. Improving the robustness of test coverage helps identify potential issues and ensure the resilience and correctness of the code.
The previous separate parsing of EML headers and body parts has been refactored into a single function, parseEML. This change simplifies the operations in the readEML and makes the code cleaner by reducing repetition.
Refactored the way EML files are tested, the errors are now handled more efficiently. Temporary directory and file creation, as well as file writing, have been moved to a helper function named 'stringToTempFile'. Moreover, additional test cases were added to ensure proper parsing failure for various types of email-related errors.
The error message previously referenced a constant 'HeaderTo' which might not always be the header being parsed. The commit replaces this with 'addrHeader', significantly improving the accuracy of error messages.
The commit modifies the parseMultiPartHeader function to handle optional fields accurately. The delimiter was changed from "; " to ";" and whitespace is being trimmed from the start of optional fields to ensure correct splitting and mapping.
The newly added function, WithContentID, allows for setting the Content-ID header for the File. This provides enhanced handling and differentiation of files.
The updated code adds base64 encoding support to email attachments and inline content in eml.go. It does this by introducing a new dataReader which uses a base64 decoder if the content transfer encoding is base64. With this update, attachments with base64 content will be correctly decoded when processed.
The EML parser now includes logic to manage 'multipart/alternative' content types. This adjustment is made within the section handling 'multipart/related' parts, allowing for better handling and parsing of varying content types.
The comment in the eml.go file was extended to include the possibility of 'Multipart/alternative' parts. Previously, it only mentioned 'Multipart/related' parts. The actual code functionality remains unchanged, this is purely a clarification in the documentation.
The debug print statement that outputs the content type of the email has been removed from eml.go. This change improves code cleanliness and avoids unnecessary console output in production.
This commit introduces tests for different email scenarios including emails with no boundaries, emails with attachments, emails with embedded items, and multipart mixed emails. The tests ensure that the code correctly returns all expected parts of the email (e.g., embeds, attachments, parts) and that it correctly processes the email subject.
In the eml.go and file.go files, the function WithContentID has been renamed to WithFileContentID. This aligns more accurately with the function purpose, which is to set the content ID for a File object.
The change introduces a new unit test, TestFile_WithContentID, in file_test.go. This test aims to verify the correct function of the WithFileContentID option by using differing scenarios, with assertions for error control and validation of expected content.
The README has been updated to show that go-mail now supports outputting a message as an EML file and parsing an EML file into a go-mail message. This addition enhances the flexibility and control over the email content and format.
Added conditional statements to handle potential failures when writing the EML string to the temporary files during testing. These updates ensure test failures due to write errors are accurately reflected in test results. Also, a minor fix is implemented on file permission in the os.WriteFile function.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #249 +/- ##
==========================================
+ Coverage 85.01% 85.21% +0.20%
==========================================
Files 24 25 +1
Lines 1808 2056 +248
==========================================
+ Hits 1537 1752 +215
- Misses 160 178 +18
- Partials 111 126 +15 ☔ View full report in Codecov by Sentry. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR introduces an EML parser to go-mail. It allows to read generic EML data from a file, a string or a reader into a go-mail
Msg
struct. It supports all types of message parts and encodings. It should be able to recognize Mulitpart messages as well as attachments and embeds (inline attachments).This PR closes #145