Audio Signal Processing for Music Applications
October 1st, 2016
This Coursera mooc was offered by Universitat Pompeu Fabra, Barcelona (and Stanford University), with Xavier Serra as the main instructor.
I originally signed up for this mooc maybe 2 years ago? I started into it, but it seemed more intense than I had first thought, and I had other responsibilities at the time, so I left it. Having taken it a second time, I'm glad I did. I'm glad I put in the effort and commitment required. It was my first real introduction to signal processing, and to the engineer's way of thinking in particular. It pushed me outside of my comfort zone and has expanded my academic perspective. I'm glad and grateful.
As I'm writing this preamble, I have the advantage of hindsight. I'm able to say this course is broken down into three parts: Weeks 1-4 cover the theory and tools needed to do audio signal processing, namely the fast Fourier transform; Weeks 5-7 build more specific theoretical models of audio analysis/synthesis; finally, Weeks 8-10 are more about applications.
- Week 0 - Introduction
- Week 1 - Overview
- Week 2 - Discrete Fourier Transform
- Week 3 - Fourier Transform Properties
- Week 4 - Short-time Fourier Transform
- Week 5 - Sinusoidal Model
- Week 6 - Harmonic Model
- Week 7 - Sinusoidal plus Residual Modeling
- Week 8 - Sound Transformations
- Week 9 - Semantic Description
- Week 10 - Concluding Topics
Week 0 - Introduction
October 5th, 2016
I decided to title this week 0 because it's a lot like the first class of a course in university: you pick up your syllabus, learn about your instructor(s) and class expectations, get an overview of the course, that kind of stuff. I do like how they've chosen to structure the mooc videos, breaking them down into theory, demonstration, and programming. It's nice, because you have the theory, then you can see it in action with demonstrations, then you get an overview of the programming assignment.
This audio signal processing mooc is a 10 week course. I'll try to get it done in 2!
Week 1 - Overview
October 7th, 2016
Week 1 was a pretty standard week 1 as far as moocs go. It was more of a conceptual overview: What is audio signal processing? Analog vs. digital signals, that sort of thing, ending with basic math prerequisites.
The applications, such as compression, transformation, synthesis, and semantic description, seem interesting and definitely offer motivation, but looking ahead they're still pretty far off; it seems we have a lot of theory to get through first.
Week 2 - Discrete Fourier Transform
October 10th, 2016
As I proclaimed previously, I am looking to get through this course in two weeks. I think this is possible because I had initially taken this course about two years ago. I had to "drop the class," as they say, but I skimmed the videos at the time, trying to get a feel for what signal processing is all about and to find a narrative to ease me into things this time.
This week was a lot of heavy theory: the discrete Fourier transform, the DFT equation, complex exponentials, the scalar product in the DFT, DFTs of complex sinusoids, and the inverse DFT. I don't feel it's rushed, but everything builds on this later on, so it can't be taken for granted now. It's worth doing the assignments, as they get you to process what you've learned even further.
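Just to process the definitions a bit further myself, here's a toy numpy sketch of the DFT and inverse DFT as sums of projections onto complex exponentials. This is my own naive illustration (not the course's sms-tools code, and a real implementation would use the FFT):

```python
import numpy as np

def dft(x):
    """Naive DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N).
    Each bin is a scalar product of the signal with a complex exponential."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

def idft(X):
    """Inverse DFT: rebuild the signal as a sum of complex exponentials."""
    N = len(X)
    k = np.arange(N)
    return np.array([np.sum(X * np.exp(2j * np.pi * k * n / N)) for n in range(N)]) / N

# The DFT of a complex sinusoid at bin 3 is a single spectral peak at k = 3,
# and the inverse DFT recovers the original samples.
N = 16
x = np.exp(2j * np.pi * 3 * np.arange(N) / N)
X = dft(x)
```

That little round trip (sinusoid in, one peak out, signal back again) is basically this whole week in a few lines.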
Week 3 - Fourier Transform Properties
October 13th, 2016
Now that we know how the DFT is defined, this week we continue our exploration, looking at Fourier transform properties. In particular, we learned about linearity, shift, symmetry, convolution, and energy conservation, with a definition of decibels. We also explored best practices for actually using the DFT, such as phase unwrapping and zero-padding, not to mention the fast Fourier transform (FFT) as an optimization, with its own best practice of zero-phase windowing.
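Zero-phase windowing was the least obvious of these for me, so here's a small numpy sketch of the trick as I understand it (my own illustration with made-up sizes, not the course code): the windowed frame is split and wrapped around so its centre sits at sample 0 of the zero-padded FFT buffer, which keeps the phase spectrum from being tilted by a half-window delay.

```python
import numpy as np

M = 31                      # odd analysis-window size
N = 64                      # FFT size; the difference is zero-padding
w = np.hanning(M)
x = np.cos(2 * np.pi * 5 * np.arange(M) / M) * w   # windowed frame

hM1 = (M + 1) // 2          # samples in the first half (16)
hM2 = M // 2                # samples in the second half (15)

# Zero-phase windowing: wrap the frame so its centre lands at index 0,
# with the zero-padding sitting in the middle of the buffer.
buf = np.zeros(N)
buf[:hM1] = x[hM2:]         # second half of the frame goes at the start
buf[-hM2:] = x[:hM2]        # first half wraps around to the end
X = np.fft.fft(buf)
```

The wrap is equivalent to zero-padding the frame and circularly shifting it left by half a window, which is how I finally convinced myself it's legitimate.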
I won't lie, I don't fully get all of what's going on here. I'm trying to keep up; I understand each individual point, but the bigger picture hasn't quite come together for me yet.
Week 4 - Short-time Fourier Transform
October 18th, 2016
I've been rushing through the last few weeks because I could, but I finally had to slow down this week. After two weeks with the DFT, somehow the short-time Fourier transform (STFT) felt alien, foreign, and forced. Individually, I understood the STFT equation, analysis windows, FFT size and hop size, the time-frequency compromise, and the inverse STFT, but I also felt like I didn't really understand why we're specializing in an alternative discrete transform as compared with the general variety we just learned about. This part the instructors didn't explain very well.
After researching it a bit further, I've found the perfect example: spectrograms from linguistic analysis. I've taken some linguistics in the past, and in the phonetic analysis component we studied a visual representation of speech sounds called a spectrogram. It's like this: if we apply the DFT to short periods of time along such a recording, we can analyse each small interval of time and determine the frequencies for that region. So when you're looking at a spectrogram, it's like you're looking at a 3D plot from above, where the time axis runs rightward as the recording continues, and each time segment maps outward, showing the prevalence of certain frequencies over others as those regions appear darker and more defined.
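To tie my spectrogram story back to the math, here's a bare-bones magnitude STFT in numpy (my own sketch with arbitrary parameters, not the course's sms-tools implementation): slide a window along the recording, take the DFT of each frame, and stack the magnitude spectra into the 2D array that a spectrogram displays.

```python
import numpy as np

def stft_mag(x, win_size=256, hop=128):
    """Magnitude STFT: one row per frame (time), one column per bin (frequency)."""
    w = np.hanning(win_size)
    frames = []
    for start in range(0, len(x) - win_size + 1, hop):
        frames.append(np.abs(np.fft.rfft(x[start:start + win_size] * w)))
    return np.array(frames)

# A steady 1 kHz tone sampled at 8 kHz: every frame peaks at the same bin,
# so the spectrogram would show one horizontal stripe.
fs = 8000
t = np.arange(fs) / fs
S = stft_mag(np.sin(2 * np.pi * 1000 * t))
```

Plotting `20 * np.log10(S.T + 1e-12)` with time rightward and frequency upward gives exactly the spectrogram picture from phonetics.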
I dunno, maybe this sounds obvious to you, but I was never trained as an engineer, and I never had any in person teacher explain these things to me, so if you don't know the story in advance, it's not obvious until it is, you know? Anyway, that's my experience it seems. At least I'm starting to understand how this all works now!
Week 5 - Sinusoidal Model
October 22nd, 2016
This week was a little boring for me. I've taken enough math and physics in the past to know how sinewaves work, how they correspond closely to pure tones in nature, and how they form the basis of more complex sounds. In any case, we specifically went into the sinusoidal model equation, sinewaves in a spectrum, sinewaves as spectral peaks, time-varying sinewaves in a spectrogram, and sinusoidal synthesis.
The bigger thing for me (in terms of my current interest) is how this relates back to everything else and how it'll be applied coming up. I don't really see it yet; I'll just have to wait. Other than that, the homework assignments have been pretty time-consuming and challenging so far.
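One piece I could make concrete for myself, though, is "sinewaves as spectral peaks". Here's a toy numpy sketch (my own, with made-up thresholds; the course's sms-tools does this more carefully, with parabolic interpolation): find local maxima of the magnitude spectrum above a threshold, and each one is a candidate sinusoid.

```python
import numpy as np

def find_peaks(mag_db, threshold=-20):
    """Bins where the magnitude spectrum is a local maximum above a threshold."""
    above = mag_db[1:-1] > threshold
    left = mag_db[1:-1] > mag_db[:-2]
    right = mag_db[1:-1] > mag_db[2:]
    return np.nonzero(above & left & right)[0] + 1

# Two sines at exact bin frequencies should show up as exactly two peaks.
fs, N = 8000, 1024
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
mag = np.abs(np.fft.rfft(x * np.hanning(N)))
mag_db = 20 * np.log10(mag / mag.max() + 1e-12)
peaks = find_peaks(mag_db)   # bins 64 (500 Hz) and 192 (1500 Hz)
```

Feeding those peak frequencies and amplitudes back into sine generators is, as far as I can tell, the whole sinusoidal synthesis loop in miniature.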
Week 6 - Harmonic Model
October 26th, 2016
This audio signal processing mooc is taking longer than expected. I admit it.
The hardest part for me, I feel, is that engineers use the exact same math but "invent" entirely new lexicons to describe the same concepts. They don't do it to be difficult; they do it because embedded within these variant languages of theirs are alternative connotations, all of which shape an alternative narrative, and that makes a difference.
It's hard for me because I have to learn an entirely new language and figure out a hidden, implicit story, even though it's applied to the same math I already know. Having to discern and differentiate all these subtleties is slowing me down, especially because I am trying to learn it thoroughly and well.
Anyway, our content this week was regarding harmonic model equation, sinusoids-partials-harmonics, monophonic/polyphonic signals, harmonic detection, fundamental frequency detection.
I'm familiar with harmonics in a general sense, having taken some linguistics previously which overlapped with what is taught here. That helps for sure. Regardless, I think I'm starting to see where we're going with this and how it applies to signal processing in general. I mean, being able to model voices is a definite application for audio manipulation in Audacity, for example, so I can see how those sound transformation tools might actually be designed and built. Then again, maybe I should wait a little longer and see.
Week 7 - Sinusoidal plus Residual Modeling
November 1st, 2016
This week we looked at the stochastic model, stochastic approximation of sounds, sinusoidal/harmonic plus residual model, residual subtraction, sinusoidal/harmonic plus stochastic model, stochastic model of residual.
Okay! Things are finally starting to click for me here. The way I see it, we take the harmonic model and use it to approximate a common sound such as a human voice or a musical instrument. But this harmonic model (or the basic sinusoidal model) doesn't approximate everything within the recorded sound. So we take that original recording and "subtract" our harmonic approximation, and we're left with a recording of the remaining sounds (the residual). It seems our physical oral/nasal cavities create echoes which we take for granted as part of the sounds we make when speaking. Furthermore, there are many instruments (such as the trombone) with similar echo chambers which end up creating percussive sounds that form actual sound content, not just noise, even if we take it for granted.
So it's like we've stratified our signal. We have an original sound recording, full of complex sounds, and as engineers we are looking to tame that complexity by peeling back its layers through successive approximations. We subtract the harmonic layer and we're left with a percussive residual layer, which we then approximate using a stochastic model.
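Here's a trivially small numpy sketch of the subtraction idea, using a synthetic "recording" I made up (real residual extraction subtracts the synthesized harmonics frame by frame, which sms-tools handles for you):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)

# Fake "recording": two harmonics of 200 Hz plus breathy noise.
harmonic_part = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
noise_part = 0.1 * rng.standard_normal(len(t))
recording = harmonic_part + noise_part

# If the analysis recovers the harmonic layer well, subtracting its
# resynthesis peels that layer off and leaves (approximately) the
# residual, which the stochastic model then approximates.
residual = recording - harmonic_part
```

In this toy case the subtraction is exact, so the residual is literally the noise we mixed in; with a real recording the harmonic estimate is imperfect, which is why the residual gets its own stochastic approximation.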
I'm really starting to like audio signal processing!
Week 8 - Sound Transformations
November 6th, 2016
So far I've learned a lot of details in this course that I know I will need more practice with. Otherwise, I'm starting (I think) to become comfortable with the bigger picture of what's going on and what we're actually doing when we say we're doing audio signal processing. I feel I may be able to pass this mooc now (this time around, we'll see), but I still have to admit that this engineer's way of signal processing, as a whole language and methodology unto itself, is gonna take some time to get used to. This of course is why it matters that I start learning all of this now rather than later :)
This week was all about application. I loved it! I had so much fun seeing just how all this theory pays off, not to mention now having an overall idea of how sound software like Audacity actually works under the hood, so to speak.
Our topics for this week:
- Short-time Fourier transform - filtering, morphing.
- Sinusoidal model - time and frequency scaling.
- Harmonic plus residual model - pitch transposition.
- Harmonic plus stochastic model - time stretching, morphing.
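As a toy illustration of the pitch-transposition idea (my own sketch, with invented amplitudes; the real models analyze the harmonics from a recording rather than synthesizing them from scratch): once a sound is represented as harmonics of a fundamental, transposing it is just scaling every harmonic frequency by the same factor and resynthesizing.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
amps = [1.0, 0.5, 0.25]      # made-up harmonic amplitudes

def synth(f0):
    """Resynthesize a harmonic sound from a fundamental and amplitude list."""
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t) for k, a in enumerate(amps))

original = synth(220.0)                    # A3
transposed = synth(220.0 * 2 ** (7 / 12))  # up a perfect fifth (~329.6 Hz)
```

The timbre (the relative harmonic amplitudes) stays the same while the pitch moves, which is exactly what a naive sample-rate change would not give you.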
Week 9 - Semantic Description
November 12th, 2016
It seems like things are starting to wind down in terms of the heavy content of the course. I'm glad, haha. I'm always happy to learn more, but sometimes you do need a break from the heavy math. Even me.
This week we discussed sound/music description. Given our tools for modeling and analyzing sound, we already have a lot to work with in finding measures, metrics, and general ways to compare and contrast audio signals and recordings: spectral-based audio features, and descriptions of sound/music events and collections. Beyond that, I'd say this week has been the weakest in the course so far. We didn't really go in depth in any way; we just explored the possibilities.
The fact that this week was slow did give me the opportunity to return to some of the concepts introduced earlier in the course, namely the analysis window used when implementing the STFT. In particular, just for clarity, the windows introduced in this course have been: rectangular, Hann (often called "hanning"), Hamming, Blackman, and Blackman-Harris.
I mean, I get how these windows are applied, and analytically I get why they're applied, but the whole concept of applying them still lacks intuition; it still lacks a story for me. Fortunately, in this ninth week, our instructor Xavier Serra describes filter banks, which are another variety of smoothing window. He describes their application as changing the perception of the signal to more closely match human hearing. We're better at hearing subtle changes in pitch at lower frequencies than at higher ones, and I finally get it! This whole windowing aspect of signal processing is not about changing the typological context; rather, it's about changing the typological distribution, and we do it to change our perception of the data. A change in perspective in the discrete formal-language world differs from one in the continuous formal-language world, and now I see how. For example, in a vector space (a discrete structure) a change in perspective is a change of basis. In the continuous world it's about changing distributions. Interesting!
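Here's the little numpy experiment that helped the window story sink in for me (my own code and numbers, not from the course): each window trades main-lobe width against sidelobe level, and you can measure the highest sidelobe straight off a zero-padded spectrum.

```python
import numpy as np

def highest_sidelobe_db(w, zp=16):
    """Highest sidelobe (dB below the peak) of a window's magnitude spectrum."""
    mag = np.abs(np.fft.rfft(w, len(w) * zp))      # zero-pad for a smooth curve
    mag_db = 20 * np.log10(mag / mag.max() + 1e-12)
    i = 1
    while i < len(mag_db) - 1 and mag_db[i + 1] < mag_db[i]:
        i += 1                                     # walk down the main lobe
    return mag_db[i:].max()                        # best of what's left

M = 64
rect_sl = highest_sidelobe_db(np.ones(M))      # about -13 dB
hann_sl = highest_sidelobe_db(np.hanning(M))   # about -31 dB
```

Seeing the rectangular window's sidelobes sit roughly 18 dB higher than Hann's made the "why bother windowing at all" question finally feel answered: lower sidelobes mean less spectral leakage smearing one frequency's energy over its neighbours.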
Well I say "I get it", but really it's just the first stepping stone intuition, but at least I finally have a starting point to a story I otherwise don't yet know.
Week 10 - Concluding Topics
November 19th, 2016
This week we had a review of the class, and we took a look beyond audio signal processing for music applications.
We didn't go heavy into anything, but I was happy with this week, as we got to see some really interesting areas of research and development, as well as cutting-edge applications such as Vocaloid!
It turns out the Music Technology Group (MTG), the people who made this mooc, collaborated with Yamaha to create Vocaloid. They even reference Hatsune Miku, haha. If you don't know who that celebrity is, please watch Kids React. It's really cool finding out I got my introductory signal processing lessons from such a group.
Since the course is winding down, I will critique (or complain) that the quality of this course could be a bit better in terms of narrative and pedagogical design. Engineers aren't always known to be natural communicators. Regardless, it's not like it was horrible either; I'm not trying to overstate the complaint. They definitely provide lots of quality reference material for those who want to delve deeper into the landscape. I'll also admit it was harder than I thought it'd be, and I learned a lot. I was gonna share my final music composition, but it turned out really weird and aesthetically unpleasing, haha. I tried to over-complicate things, as always. I won't be quitting my day-job (or my night-job) any time soon to become a music composer, I'll tell you that.
I guess all that's left to say is: Yeah! I'm finally finishing my audio signal processing for music applications mooc! All in all, I'm glad I took this course.