Unlocking Chemical Volatility: How the volcalc R Package is Streamlining Scientific Research

Kristina Riemer, director of the CCT Data Science Team at the University of Arizona, and Eric Scott, Scientific Programmer and Educator in the CCT Data Science Team, the developers behind the volcalc package, discuss the motivation and development of this innovative tool designed to automate the calculation of chemical compound volatilities.

Author: R Consortium

Published: September 23, 2024

The R Consortium recently interviewed Kristina Riemer, director of the CCT Data Science Team at the University of Arizona, and Eric Scott, Scientific Programmer and Educator in the CCT Data Science Team, the developers behind the volcalc package, to discuss the motivation and development of this innovative tool designed to automate the calculation of chemical compound volatilities. volcalc streamlines the process by allowing users to input a compound and quickly receive its volatility information, eliminating the need for time-consuming manual calculations. Initially created to assist Dr. Laura Meredith in managing a large database of volatile compounds, volcalc has since grown into a more versatile tool under Eric’s leadership, now supporting a wider range of researchers.

Kristina and Eric share insights into the challenges they faced, including managing dependencies, integrating with CRAN and Bioconductor, and refining complex molecular identification methods. They also discuss future enhancements, such as incorporating temperature-specific volatility calculations and expanding the package’s functionality to estimate other compound characteristics. This project was funded by the R Consortium.

Could you share what motivated the development of the volcalc package and how it aligns with the broader goals of the R ecosystem, particularly in scientific computing?

Kristina: I was heavily involved in the initial development of volcalc, and later on, Eric took over the project. We developed volcalc because we began collaborating with Dr. Laura Meredith, who was compiling a database of volatile chemical compounds. At the time, she had around 300 compounds, and her students manually gathered details for each one by examining their representations and calculating various associated values. This process was tedious and prone to errors, so we thought there must be a more efficient and automated way to handle it.

That’s when we came up with the idea of creating a pipeline where someone could input a compound and quickly receive its volatility information, eliminating the need for all the manual labor. The goal of volcalc was to turn a process that took months for 300 compounds into one that could return information for thousands of compounds in far less time.

Eric: volcalc was initially developed specifically for a project where the researchers were mainly interested in chemical compounds from the KEGG database (Kyoto Encyclopedia of Genes and Genomes). When I joined the team and learned about the project, I was thrilled because, as a chemical ecologist, I saw its potential. However, I also recognized a limitation: the tool only worked with the KEGG database. This was a drawback because many researchers, including food scientists and others who work with similar compounds, might not find their compounds in that specific database.

This realization inspired me to apply for the R Consortium grant. We saw a significant opportunity to expand volcalc, making it more flexible and applicable to a wider range of researchers. We also wanted to improve its integration within the R ecosystem by adding features like returning the file path of a molecule representation after downloading it, so it could be easily piped into subsequent steps. These enhancements aimed to make the tool more versatile and user-friendly for a broader audience.

What were the most significant challenges you faced during the development of the initial version of volcalc, and how did you overcome them?

Kristina: One of the most challenging aspects of developing volcalc, which continues to be an issue, is managing dependencies. Specifically, we rely heavily on a command-line program to handle much of the processing. Early on, we struggled with how to enable users to run volcalc without needing to install this program on their own computers, as many of our users aren’t familiar with that kind of setup. I spent a lot of time trying to create a reproducible environment using Binder, but I was never able to get it fully working. Even today, there are still issues related to managing these dependencies, which Eric can elaborate on further.

It was incredibly important to have Eric on this project because I don’t have a strong background in chemistry. His ability to come in and figure out some of the intricate details that would have taken me much longer to grasp was a huge advantage. The more we can collaborate with domain experts, the better our results will be.

Eric: One thing that has helped with the dependency challenges is that we’ve started building volcalc on R-Universe, which means binaries are available there. While it’s not on CRAN yet, having these binaries on R-Universe makes installation a bit easier. However, we’ve faced some challenges with dependencies, particularly because two of them are from Bioconductor. We didn’t originally aim to develop this package for Bioconductor, which uses S4 objects and has different standards than CRAN. Our goal was to get it on CRAN, but our first submission was rejected because the license field for the Bioconductor package wasn’t formatted to CRAN’s liking. These differences between Bioconductor and CRAN have created barriers, even though the authors of the Bioconductor package have been very responsive. Their package works fine on Bioconductor, but it doesn’t meet CRAN’s criteria, which has been a frustrating challenge.

Another major challenge in developing volcalc relates to the method we use for estimating volatility. This method involves counting the numbers of different functional groups on molecules—such as hydroxyl groups or sulfur atoms—and assigning coefficients to them. To do this programmatically, we use something called SMARTS, which is essentially like regular expressions but for molecular structures. Regular expressions for text are already challenging, but SMARTS is even more complex because it deals with three-dimensional molecules.

Before I joined the group, the first version of volcalc had most of these functional groups figured out, but not all. I spent a significant amount of time trying to develop SMARTS strings to match additional molecules. Moving forward, I hope that if we implement new versions, we can get help from the community to refine these SMARTS strings, as there are likely people out there who are more skilled at it than I am.

The original project proposal mentions expanding volcalc to work with any chemical compound with a known structure. What are the key technical challenges you anticipate in achieving this goal?

Eric: This task turned out to be less difficult than I initially expected, but let me explain. In the original version of volcalc, before we received the R Consortium funding, the main function started with a KEGG ID—an identifier specific to the KEGG database. The function would download a MOL file, which is a text representation of a molecule corresponding to that ID. It would then identify and count the functional groups in the molecule, and finally, calculate the volatility based on those counts.

The major change we needed to implement to make volcalc more versatile was to decouple these steps. In the current version of volcalc, the functionality to download a MOL file from KEGG is still available, but it’s now separate from the main function that calculates volatility. This means that the inputs for calculating volatility can now be any MOL file, not just ones from KEGG. The file can come from any database, be exported from other software, or even be downloaded manually. Additionally, the tool now supports SMILES, which is another, simpler text-based representation of molecules.
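The decoupling Eric describes can be sketched in a few lines. The Python below is an illustration only, not volcalc's R code: `read_structure`, `estimate_volatility`, and the `source` labels are hypothetical names, and the counting and scoring steps are passed in as stand-in functions.

```python
from pathlib import Path


def read_structure(value, source):
    """Return (format, text) for a molecule given as a MOL file path or a SMILES string."""
    if source == "mol_path":
        return "mol", Path(value).read_text()
    if source == "smiles":
        return "smiles", value.strip()
    raise ValueError(f"unsupported source: {source!r}")


def estimate_volatility(value, source, count_groups, score):
    """Run the remaining steps on any structure, wherever it came from."""
    fmt, text = read_structure(value, source)
    counts = count_groups(fmt, text)  # count functional groups
    return score(counts)              # turn the counts into a volatility estimate
```

Because fetching a file is no longer part of the calculation, a downloader for any database only has to hand back a file path that can be passed along to the next step.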

There are various ways to represent chemicals in text, including another format called InChI. The Bioconductor packages we use, ChemmineR and ChemmineOB, have the ability to translate from InChI and other types of chemical representations. However, that feature isn’t available on Windows. So, I decided to keep volcalc focused on SMILES and MOL files. I believe that chemists and other researchers should be able to obtain data in one of these two formats, or use another tool to translate their data into these formats. I didn’t want to overload volcalc with the responsibility of being a chemical representation translator, as that didn’t seem like its primary purpose.

Can you walk us through the process of implementing the SIMPOL algorithm within the volcalc package?

Kristina: The algorithm itself is fairly simple; it’s just basic math. You need to input some constants, the mass of the compound, and the counts of the functional groups we discussed earlier. Writing the code for this was straightforward and not particularly challenging.

Eric: Each functional group has a coefficient associated with it, which is multiplied by the number of times that group appears in the molecule. These values are then summed up, and the mass of the molecule is factored in as well. The challenging part wasn’t the algorithm itself, which is straightforward—just multiplying by coefficients and adding them up. The real difficulty was interpreting what the authors of the algorithm meant by each of the functional groups. Some were oddly specific, like how the hydroxyl group that is part of a nitrophenol group isn’t supposed to count toward the total number of hydroxyl groups. I spent a lot of time poring over the paper, particularly one table, to fully understand how they defined each group. That interpretation was the hardest part.
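The arithmetic Kristina and Eric describe fits in a short Python sketch. Everything numeric here is a placeholder rather than a published SIMPOL.1 value, the function names are hypothetical, and the way the molar mass enters (an ideal-gas conversion from vapor pressure to a mass concentration) is an assumption, since the interview does not spell it out.

```python
import math

GAS_CONSTANT = 8.2057e-5  # m^3 atm / (K mol)


def log10_vapor_pressure(counts, coefficients, intercept):
    """Multiply each group count by its coefficient, sum, and add a constant."""
    missing = sorted(set(counts) - set(coefficients))
    if missing:
        raise KeyError(f"no coefficient for groups: {missing}")
    return intercept + sum(n * coefficients[group] for group, n in counts.items())


def log10_mass_concentration(log10_p_atm, molar_mass, temp_k=293.15):
    """Assumed role of mass: ideal-gas conversion from atm to micrograms per m^3."""
    return log10_p_atm + math.log10(1e6 * molar_mass / (GAS_CONSTANT * temp_k))


# Placeholder numbers for illustration, NOT the published SIMPOL.1 values.
coefficients = {"carbon": -0.5, "hydroxyl": -2.0}
log_p = log10_vapor_pressure({"carbon": 2, "hydroxyl": 1}, coefficients, intercept=2.0)
print(log_p)  # 2.0 + 2 * (-0.5) + 1 * (-2.0) = -1.0
```

The hard part Eric mentions sits upstream of these functions: deciding what goes into `counts`, for example whether a hydroxyl that belongs to a nitrophenol should be tallied as a hydroxyl at all.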

What future functionalities or expansions do you see as crucial for volcalc, especially in the context of evolving research needs in chemoinformatics?

Eric: Right now, we’re working on allowing users to specify different temperatures. The paper that describes the SIMPOL.1 method includes equations for how the coefficients of each functional group change with temperature. These changes aren’t always linear, and the contributions of functional groups can shift in importance as the temperature varies. This is an important feature to include because the version of volcalc we currently have uses coefficients calculated at 20°C, based on a table from the original paper. To accommodate other temperatures, we need to integrate another table that provides equations for calculating these coefficients based on temperature, and that’s what we’re working on.
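As a rough illustration of a temperature-dependent coefficient, the sketch below assumes the four-term form b(T) = B1/T + B2 + B3·T + B4·ln(T), which is our reading of the SIMPOL.1 paper and should be checked against it. The parameter values are placeholders, and `coefficient_at` is a hypothetical helper, not part of volcalc.

```python
import math


def coefficient_at(temp_k, b1, b2, b3, b4):
    """Assumed four-term form: b(T) = B1/T + B2 + B3*T + B4*ln(T), with T in kelvin."""
    if temp_k <= 0:
        raise ValueError("temperature must be in kelvin and positive")
    return b1 / temp_k + b2 + b3 * temp_k + b4 * math.log(temp_k)


def celsius_to_kelvin(temp_c):
    """Convert degrees Celsius to kelvin."""
    return temp_c + 273.15


# Placeholder parameters for one hypothetical group, NOT values from the paper.
params = {"b1": -700.0, "b2": 0.5, "b3": 0.0, "b4": 0.0}
for temp_c in (0, 20, 40):
    print(temp_c, round(coefficient_at(celsius_to_kelvin(temp_c), **params), 3))
```

Even with only the 1/T term switched on, the coefficient changes nonlinearly with temperature, which is why a single table at 20°C cannot simply be rescaled.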

Another key feature we want to leave room for in the future is the ability to add other methods for estimating volatility. SIMPOL.1 is just one type of group contribution method, but there are other approaches described in various papers that use different functional groups, equations, and coefficients. The basic idea remains the same: count the functional groups in a molecule, apply an equation, and estimate volatility. We’re trying to structure the code in a way that makes it easy to incorporate additional methods later, even if we don’t add them right away. I think these are the most important features we’re focusing on right now.
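One common way to leave room for additional estimation methods is a small registry keyed by method name. The Python sketch below shows that pattern under stated assumptions: it is not how volcalc is organized internally, and `METHODS`, `register_method`, `estimate`, and the toy method are all hypothetical.

```python
METHODS = {}


def register_method(name):
    """Decorator that files an estimator under a method name."""
    def decorator(func):
        METHODS[name] = func
        return func
    return decorator


@register_method("toy_method_a")
def toy_method_a(counts):
    # Placeholder coefficients, not taken from any published method.
    return 2.0 - 0.5 * counts.get("carbon", 0) - 2.0 * counts.get("hydroxyl", 0)


def estimate(counts, method="toy_method_a"):
    """Look up the requested method and apply it to the group counts."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; available: {sorted(METHODS)}")
    return METHODS[method](counts)
```

Adding a second group contribution method then means registering one more function; the code that calls `estimate` does not change.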

Kristina: In the near term, we’re focused on the features Eric mentioned, but looking further ahead, I could see volcalc expanding to estimate other characteristics of compounds beyond just volatility. While I’m not a chemistry expert or a chemical ecologist, I imagine that those interested in volatility might also be interested in other compound characteristics that currently lack automated tools for estimation. So, it’s possible the package could evolve to include those features.

That said, one of the things I appreciate about the R package ecosystem is that it allows for specialized tools. Since anyone can build what they need, we don’t end up with massive, overly complex packages that try to do everything and become difficult to maintain. It might be better to keep volcalc focused and leave room for separate packages to handle additional functionality. This way, the tools remain manageable and easier to maintain in the long run.

How has it been working with the R Consortium? Would you recommend applying for an ISC grant to other R developers?

Kristina: The application process was straightforward, and I found the grant format to be very practical. It was focused on milestones and product development, which is refreshing compared to many academic research grants that tend to avoid specific deliverables. I highly recommend considering this grant. I believe people often overlook smaller funding sources, but even small amounts can make a big impact on the work you’re doing.

Eric: The first time I applied for an R Consortium grant was as a grad student, and I strongly encourage trainees to apply as well. It was a great experience for me because I could do it independently—my advisor wasn’t involved as one of the authors, and it wasn’t a complex process like applying for an NSF grant. It was straightforward and really rewarding. The only tricky part was figuring out the payment process, but that’s something people can work out.

I’ve noticed there seem to be fewer projects in recent years, and I don’t think it’s due to a lack of funding. It seems like fewer people are applying, which is why I especially encourage others to give it a shot. From what I’ve seen, there’s a very good chance of getting funded if you apply right now.

People should be creative and think broadly about how their project can benefit the broader R community. This doesn’t mean you need to develop the next big thing like R-Universe or CRAN. It can be something smaller, like a package that other R users will find helpful. For example, with our project, volcalc, our main goal was to encourage chemists—who usually use point-and-click software—to start using R. That was enough of a contribution to the R community to get funded. So, I really encourage people to think creatively about what “benefiting the R community” can mean.

About ISC Funded Projects

A major goal of the R Consortium is to strengthen and improve the infrastructure supporting the R Ecosystem. We seek to accomplish this by funding projects that will improve both technical infrastructure and social infrastructure.
