My obsession comes from my years of jumping back and forth between Linux (programming, servers), Windows (CAD), and MacOS (personal laptop), where each system has different de facto filename standards. As such, I eventually converged towards simple and draconian filenames to ensure easy data scraping whenever I needed to build automation.

Now, I understand there are times when this level of rigidity doesn't fully make sense. One example is in mechanical engineering when working with a PDM system (and non-developer stakeholders). Common practice there is to have your filenames follow a fixed schema. Yes, there are spaces, but as long as the schema is enforced, the database will still be easily machine parsable.

If we extend this line of thought from filenames to a generalized "path" or "resource location" (e.g., URLs, data column headings, data labels), strict and consistent schemas are important, as the devil is in the details. And non-alphanumeric characters are evil.

As previously discussed in the temperature and humidity data analysis post, the Extract and Transform steps of an ETL workflow typically apply some data cleaning. String validation and sanitization ensures that a string meets a set of criteria (validation) or modifies it to conform to a schema (sanitization). As discussed above, there are plenty of important situations where incoming strings (e.g., data labels, paths, filenames) may not conform to a standard and behave unexpectedly.

As seen in our previous post, the data logger tried to be fancy and used the (evil) degree symbol (°C) and the percent sign (%RH) for the temperature and humidity column headers, respectively. Unfortunately, this is not uncommon, and while Excel parses these strings normally, we can't always guarantee how different systems or programs will react. For example, in LaTeX the percent sign is the special character for comments and thus needs to be escaped (\%), else you'll have a bad day.

Stripping non-alphanumeric characters is a simple and useful step for many data processing applications. So let's take a look at how to efficiently clean strings in Python.

```python
# use filter to keep just alphanumeric characters
filter(str.isalnum, ugly_string)

# or use filter to return alphanumeric and whitespace characters
clean_string = filter(lambda x: x.isalnum() or x.isspace(), ugly_string)

# filter returns a lazy iterator, so we need to re-join the string
clean_string = "".join(clean_string)
clean_string
'Temperature C'
```

The same logic can be wrapped into reusable functions, either with `filter` or with a list comprehension:

```python
def clean_filter(s: str) -> str:
    return "".join(filter(lambda x: x.isalnum() or x.isspace(), s))

def clean_list_comprehension(s: str) -> str:
    return "".join([c for c in s if c.isalnum() or c.isspace()])
```

Our final approach is to use Regular Expressions (regex). Regex is truly powerful and can be used in a variety of situations and across all programming languages. Even though it has a relatively difficult learning curve, it's definitely something that most developers and engineers should try to understand and get familiar with. However, in my humble opinion, depending on the context, it's overkill.

The core component of this approach is the regex pattern. For this example, we'll be using `[^A-Za-z0-9 ]+`. As explained by a regex testing website, this pattern includes the following elements:

- `[^ ]`: Negated set that matches any character NOT in the set.
- `A-Z`: Matches uppercase alphabetic characters in the given range.
- `a-z`: Matches lowercase alphabetic characters in the given range.
- `0-9`: Matches numeric characters in the given range.
- `+`: Matches one or more of the previous element (i.e., the negated set).

The problem with regex is that it's too powerful and too flexible. The Pythonic approaches shown above are easy to read, easy to debug, and their scope is easy to understand. Regex requires the developer to test the expression (unless you're some kind of regex wizard) and check for corner cases. While this post's example is quite trivial, not all scenarios will be as forgiving.
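The regex approach can be sketched with `re.sub`. A minimal version, assuming the pattern `[^A-Za-z0-9 ]+` (reconstructed from the elements listed above) and a `clean_regex` helper name of my own:

```python
import re

# assumed pattern: any run of characters outside A-Z, a-z, 0-9 and space
PATTERN = re.compile(r"[^A-Za-z0-9 ]+")

def clean_regex(s: str) -> str:
    # replace each run of disallowed characters with the empty string
    return PATTERN.sub("", s)

print(clean_regex("Temperature (°C)"))  # -> Temperature C
```

Compiling the pattern once with `re.compile` avoids re-parsing it on every call, which matters when cleaning many headers in a loop.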
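The validation/sanitization distinction can be sketched as a pair of helpers; `is_clean` and `sanitize` are my own illustrative names, not from the post:

```python
def is_clean(s: str) -> bool:
    # validation: report whether the string already meets the schema
    return all(c.isalnum() or c.isspace() for c in s)

def sanitize(s: str) -> str:
    # sanitization: modify the string so it conforms to the schema
    return "".join(c for c in s if c.isalnum() or c.isspace())

is_clean("Temperature (°C)")   # False: parentheses and the degree sign are not allowed
sanitize("Temperature (°C)")   # 'Temperature C'
```

In an ETL pipeline, validation is the cheap guard you run everywhere, while sanitization is the transform you apply once at the ingestion boundary.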
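The LaTeX point cuts the other way: sometimes sanitization means escaping rather than stripping. A minimal sketch; the `escape_latex` helper and its character table are my own, not from the post, and cover only a few common specials:

```python
# hypothetical escape table for a few common LaTeX special characters
LATEX_SPECIALS = {"%": r"\%", "&": r"\&", "#": r"\#", "_": r"\_", "$": r"\$"}

def escape_latex(s: str) -> str:
    # swap each special character for its escaped form, pass the rest through
    return "".join(LATEX_SPECIALS.get(c, c) for c in s)

escape_latex("Humidity (%RH)")  # the % comment character becomes \%
```

This preserves the original label for human readers while keeping LaTeX from treating the rest of the line as a comment.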